Big Data Integration and Quality
The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. This seminar will discuss recent results in Big Data research, e.g., Big Data integration, Data Quality, Big Data Analytics, Text Mining, and Data Stream Processing.
Data quality problems arise frequently when data is integrated from disparate sources. In the context of Big Data applications, data quality is becoming more important because of the unprecedented volume, large variety, and high velocity of the data. The challenges caused by the volume and velocity of Big Data have been addressed by many research projects and commercial solutions and can be partially solved by modern, scalable data management systems. However, variety remains a daunting challenge for Big Data Integration and also requires special methods for data quality management. Variety (or heterogeneity) exists at several levels: at the instance level, the same entity might be described with different attributes; at the schema level, the data is structured with various schemas; and at the level of the modeling language, different data models can be used (e.g., relational, XML, or a document-oriented JSON representation). This can lead to data quality issues such as inconsistency, poor understandability, or incompleteness.
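As a small illustration of these levels of heterogeneity, the following Python sketch describes one entity in three data models and checks the unified result for incompleteness and inconsistency. All records, field names, and helper functions are hypothetical and chosen only for this example; it is a minimal sketch, not a recommended integration method.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records describing the same customer in three data models.
relational_row = {"id": 17, "name": "Ada Lovelace", "city": "London"}
xml_record = "<customer><id>17</id><fullName>Ada Lovelace</fullName></customer>"
json_record = '{"id": 17, "name": "A. Lovelace", "address": {"city": "London"}}'

def from_xml(text):
    """Map the XML record onto the common schema (id, name, city)."""
    root = ET.fromstring(text)
    fields = {child.tag: child.text for child in root}
    return {
        "id": int(fields["id"]),
        "name": fields.get("fullName"),  # schema-level difference: other attribute name
        "city": fields.get("city"),      # absent in this source
    }

def from_json(text):
    """Map the nested JSON document onto the common schema."""
    doc = json.loads(text)
    return {
        "id": doc["id"],
        "name": doc.get("name"),
        "city": doc.get("address", {}).get("city"),
    }

def missing_fields(record):
    """Completeness check: attributes with no value in this record."""
    return sorted(key for key, value in record.items() if value is None)

def conflicting_fields(records):
    """Consistency check: attributes whose known values disagree across sources."""
    conflicts = []
    for key in records[0]:
        values = {r[key] for r in records if r[key] is not None}
        if len(values) > 1:
            conflicts.append(key)
    return sorted(conflicts)

unified = [relational_row, from_xml(xml_record), from_json(json_record)]
print([missing_fields(r) for r in unified])  # [[], ['city'], []]
print(conflicting_fields(unified))           # ['name']
```

The XML source turns out to be incomplete (no city), and the sources disagree on the name, which only becomes visible once the three representations are mapped onto a common schema.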
The heterogeneity of data sources in the Big Data era requires new integration approaches that can handle the large volume and speed of the generated data as well as its variety and quality. Traditional ‘schema first’ approaches, as used in the relational world with data warehouse systems and ETL (Extract-Transform-Load) processes, are inappropriate for a flexible and dynamically changing data management landscape. The requirement for pre-defined, explicit schemas is a limitation that has drawn the interest of many developers and researchers to NoSQL data management systems, as these systems are intended to provide data management features for large amounts of schema-less data. Nevertheless, a one-size-fits-all Big Data system is unlikely to meet all the requirements placed on data management systems today. Instead, multiple classes of systems, each optimized for specific requirements or hardware platforms, will co-exist in a data management landscape.
Thus, heterogeneity and data quality are challenges for many Big Data applications. In some applications, limited data quality of individual data items does not cause serious problems when a huge amount of data is aggregated. Often, however, data quality problems in data sources are revealed only when these sources are integrated with other information.
Detailed knowledge of database systems or data management is required (e.g., acquired by attending the lectures "Databases and Information Systems" and "Implementation of Databases").
The first introductory meeting, where you will get your topic and all information about the organization of the seminar, will take place on 29.07.2016 at 1:30 pm in room 5053.2 (B-IT Research School, opposite AH6).