







Towards an Intelligent Data Lake System for Heterogeneous Data Sources

Year 2016

Big Data remains a defining challenge of our time and still poses many open research problems, especially regarding the variety of data. The high diversity of data sources often results in information silos: collections of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake (DL) systems have recently been proposed as a solution to this problem. However, because the term "Data Lake" was coined in practice, a clear definition, a statement of the research challenges involved, and a system implementation are still missing. To fill this gap, in this paper we describe the key features of data lakes and define research objectives for them, followed by a discussion of the methodologies to fulfill these objectives. The development of data lakes is a complex problem that includes research challenges such as incremental, on-demand schema management, query rewriting, and data quality management. Our vision for a data lake system is based on a generic and extensible architecture with a unified yet flexible metadata model. The system facilitates ingestion, storage, and metadata management over heterogeneous data sources. We propose a data lake framework that extracts, matches, and summarizes structural and semantic metadata while providing data quality measurement. It also provides users with a unified interface for query processing and data exploration, with embedded query rewriting engines supporting structured and semi-structured data.


35th International Conference on Conceptual Modeling (ER), PhD Symposium


