Design and Implementation of a System for Exploring Semi-Structured Datasets



Thesis type
  • Bachelor
Status Finished
Submitted in 2017
Proposal on 23. Sep 2016 16:00
Proposal room Seminarraum I5
Not all data is structured like the tables of an RDBMS. Big Data applications in particular process data from various sources in heterogeneous formats. Every day, an enormous and unprecedented amount of data is generated in all kinds of formats, e.g., spreadsheets, XML files, and text. Although the formats differ, the data usually has some kind of structure that can be exploited to build semi-structured datasets.

Information extracted from semi-structured datasets can be turned into valuable insights for decision making. To achieve this, the data has to be made available in a system that supports efficient data exploration and query processing.

The thesis has two parts. The first part analyzes and compares existing tools and libraries for exploring and managing semi-structured data, for instance Elasticsearch, Lucene, Tika, Solr, and Jackrabbit.

In the second part, one to three of these systems should be chosen, based on this comparison, for the implementation of a proof-of-concept (POC). The POC should use data from the mi-Mappa project, in which patents and publications are analyzed to build profiles of researchers.

