
Informatik 5
Information Systems
Prof. Dr. M. Jarke


Prof. Dr. M. Jarke
RWTH Aachen
Informatik 5
Ahornstr. 55
D-52056 Aachen
Tel +49/241/8021501
Fax +49/241/8022321

Comparison of Open Source Big Data Integration tools

Thesis type
  • Bachelor
Status
  • Open

The BDMM research group at Informatik 5 is developing a data lake system in which data from various sources is to be made available. An important component of the data lake system is the data extractor, which accesses the data sources. As many different source systems (DBMS, file formats, web services, ...) should be supported, several wrappers for these data sources need to be implemented, which is a tedious, labor-intensive task.
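Such a wrapper layer can be pictured as a common interface that every source-specific implementation fulfills, so the extractor never has to know which kind of source it is reading. The following Python sketch is purely illustrative; the names (SourceWrapper, CsvWrapper, extract) are assumptions for this example, not part of the actual data lake system:

```python
# Illustrative sketch of a data-source wrapper abstraction.
# All class and function names are assumptions, not an existing API.
import csv
import io
from abc import ABC, abstractmethod
from typing import Iterator

class SourceWrapper(ABC):
    """Uniform access to one kind of data source."""

    @abstractmethod
    def read_records(self) -> Iterator[dict]:
        """Yield each record of the source as a flat dict."""

class CsvWrapper(SourceWrapper):
    """Wrapper for CSV input; further wrappers would cover
    DBMS, streams, web services, and other file formats."""

    def __init__(self, text: str):
        self._text = text

    def read_records(self) -> Iterator[dict]:
        # csv.DictReader maps the header row onto every data row.
        yield from csv.DictReader(io.StringIO(self._text))

def extract(wrapper: SourceWrapper) -> list:
    """The extractor depends only on the abstract interface."""
    return list(wrapper.read_records())

records = extract(CsvWrapper("id,name\n1,Aachen\n2,Cologne\n"))
print(records)
```

Because `extract` is written against the abstract interface, supporting a new source system only means implementing one more subclass, which is exactly the effort the frameworks compared in this thesis aim to save.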

However, many open source systems that provide access to different data sources are already available. These include data integration systems (e.g., Talend, Pentaho), data mining systems (e.g., KNIME), big data processing systems (e.g., Apache Spark), metadata frameworks (e.g., Apache MetaModel), etc. It would therefore be more efficient to use such an existing solution instead of implementing a new one from scratch.

Therefore, the goal of this thesis is a comparison of data extraction frameworks (wrappers) that are available as open source. The following criteria are important for the evaluation of the frameworks:

  • License type: the license should allow extension, adaptation, and modification of the existing source code in any kind of project (e.g., commercial, open source, academic).
  • Functionality: the framework should already provide a large collection of wrappers for state-of-the-art data management systems. Different types of data sources should be supported (e.g., classical DBMS, streams, web services, files, ...). Configuring a new data source should not require too much overhead.
  • Extensibility: the framework should provide an easy way to add new wrappers for new data sources.
  • Active development: there should be recent activity that extends or supports the development of the framework. An active community (e.g., mailing lists, forums) should also be present.
  • Extension: extend the existing framework and integrate it into the running data lake architecture.
  • Wrapper creation: create a wrapper for the heterogeneous data formats that are given as input to the Data Integration Framework.
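The extensibility criterion above can be made concrete with a small registry sketch: adding support for a new source type should amount to a single registration step rather than changes throughout the extractor. All names below are illustrative assumptions, not the API of any of the frameworks under comparison:

```python
# Illustrative sketch of an extensible wrapper registry.
# All names are assumptions made for this example.
WRAPPERS = {}

def register(source_type):
    """Class decorator that registers a wrapper for one source type."""
    def decorator(cls):
        WRAPPERS[source_type] = cls
        return cls
    return decorator

@register("csv")
class CsvWrapper:
    def __init__(self, location):
        self.location = location

@register("jdbc")
class JdbcWrapper:
    def __init__(self, location):
        self.location = location

def open_source(source_type, location):
    """Look up the registered wrapper and instantiate it."""
    try:
        return WRAPPERS[source_type](location)
    except KeyError:
        raise ValueError(f"no wrapper registered for {source_type!r}")

print(type(open_source("csv", "data.csv")).__name__)  # CsvWrapper
```

A framework scores well on this criterion if a new data source needs little more than the equivalent of one such registered class.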

To evaluate these issues in detail, a prototypical implementation of a data extraction process is required.
