11th International Workshop on Quality in DataBases

Special Focus: Big Data Integration and Quality

In conjunction with VLDB 2016

Published Proceedings

The proceedings of the workshop have now been published with RWTH Publications
and can be found following this link.

About the Workshop

Data quality problems arise frequently when data is integrated from disparate sources. In the context of Big Data applications, data quality is becoming more important because of the unprecedented volume, large variety, and high velocity. The challenges caused by the volume and velocity of Big Data have been addressed by many research projects and commercial solutions and can be partially solved by modern, scalable data management systems. However, variety remains a daunting challenge for Big Data Integration and also requires special methods for data quality management. Variety (or heterogeneity) exists at several levels: at the instance level, the same entity might be described with different attributes; at the schema level, the data is structured with various schemas; and at the level of the modeling language, different data models can be used (e.g., relational, XML, or a document-oriented JSON representation). This can lead to data quality issues regarding consistency, understandability, or completeness.
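
As a small illustration of this variety, the following sketch shows the same entity once in a relational schema and once as a JSON document; the entity, attribute names, and values are hypothetical and serve only to make the instance-, schema-, and model-level differences concrete.

    # Hypothetical example: the same "patient" entity in two data models.
    # Attribute names and values are made up for illustration only.
    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Relational representation: a fixed, explicit schema.
    conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, birth_date TEXT)")
    conn.execute("INSERT INTO patient VALUES (1, 'Jane Doe', '1980-05-01')")

    # Document-oriented JSON representation of the same entity: different
    # attribute names ('dob' instead of 'birth_date') and nested structure.
    patient_doc = json.dumps({
        "patientId": 1,
        "name": {"given": "Jane", "family": "Doe"},
        "dob": "1980-05-01",
    })

    print(conn.execute("SELECT * FROM patient").fetchall())
    print(json.loads(patient_doc))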

The heterogeneity of data sources in the Big Data Era requires new integration approaches which can handle the large volume and speed of the generated data as well as the variety and quality of the data. Traditional ‘schema first’ approaches, as in the relational world with data warehouse systems and ETL (Extract-Transform-Load) processes, are inappropriate for a flexible and dynamically changing data management landscape. The requirement for pre-defined, explicit schemas is a limitation which has drawn the interest of many developers and researchers to NoSQL data management systems, as these systems promise data management features for large amounts of schema-less data. Nevertheless, a one-size-fits-all Big Data system is unlikely to meet all the requirements placed on data management systems today. Instead, multiple classes of systems, optimized for specific requirements or hardware platforms, will co-exist in a data management landscape.

Thus, heterogeneity and data quality are challenges for many Big Data applications. While in some applications limited data quality of individual data items does not cause serious problems when a huge amount of data is aggregated, data quality problems in data sources are often revealed by the integration of these sources with other information. Data quality has been defined as ‘fitness for use’; thus, if data is used in another context than originally planned, data quality might become an issue. Similar observations have also been made for data warehouses, which led to a separate research area on data warehouse quality.

The workshop QDB 2016 aims at discussing recent advances and challenges in data quality management in database systems, and focuses especially on problems related to Big Data Integration and Big Data Quality. The workshop will provide a forum for the presentation of research results, a panel discussion, and a keynote talk.


Program

Room:     Boardroom

09:00 - 09:30       Welcome and Introduction

Christoph Quix (Fraunhofer FIT & RWTH Aachen University),
Rihan Hai (RWTH Aachen University)

09:30 - 10:30       Session 1: Data Quality

Data Quality for Semantic Interoperable Electronic Health Records
Shivani Batra (Jaypee Institute of Information Technology University),
Shelly Sachdeva (Jaypee Institute of Information Technology University)

Communicating Data Quality in On-Demand Curation
Poonam Kumari (University at Buffalo), Said Achmiz,
Oliver Kennedy (University at Buffalo)


10:30 - 11:00       Coffee Break

11:00 - 12:30       Session 2: Keynote Talk & Discussion

Data Glitches = Constraint Violations – Empirical Explanations
Divesh Srivastava (AT&T Labs)

Open Discussion: More than 20 Years of Data Quality Research – What are the major results?
All participants

12:30 - 14:00       Lunch

14:00 - 15:30       Session 3: Data Cleaning & Integration

Data Cleaning in the Wild: Reusable Curation Idioms from a Multi-Year SQL Workload
Shrainik Jain (University of Washington),
Bill Howe (University of Washington)

DIRA: Data Integration to Return Ranked Alternatives
Reham I. Abdel Monem (Cairo University),
Ali H. El-Bastawissy (University, Giza),
Mohamed M. Elwakil (Innopolis University)

Towards Rigorous Evaluation of Data Integration Systems - It's All About the Tools
Boris Glavic (Illinois Institute of Technology)

15:30 - 16:00       Coffee Break

16:00 - 17:30       Session 4: Data Exploration & Exchange

Three Semi-Automatic Advisors for Data Exploration
Thibault Sellam (CWI)

Graph-based Exploration of Non-graph Datasets
Udayan Khurana (IBM Research)

Data Quality Management in Data Exchange Platforms – An Approach for the Industrial Data Space in Germany
Christoph Quix (Fraunhofer FIT)

Keynote

Speaker:      Divesh Srivastava, AT&T Labs Research.
Divesh Srivastava is the head of Database Research at AT&T Labs-Research. He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). His research interests and publications span a variety of topics in data management. He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.
Title: Data Glitches = Constraint Violations – Empirical Explanations
Abstract: Data glitches are unusual observations that do not conform to data quality expectations, be they semantic or syntactic, logical or statistical. By naively applying integrity constraints, potentially large amounts of data could be flagged as being violations. Ignoring or repairing significant amounts of the data could fundamentally bias the results and conclusions drawn from analyses. In the context of Big Data, where large volumes and varieties of data from disparate sources are integrated, it is likely that significant portions of these violations are actually legitimate usable data. We conjecture that empirical glitch explanations – concise characterizations of subsets of violating data – could be used to (a) identify legitimate data and release them back into the pool of clean data, thereby reducing cleaning-related statistical distortion of the data; and (b) refine existing integrity constraints and generate improved domain knowledge. We present a few real-world case studies in support of our conjecture, outline scalable techniques to address the challenges of discovering explanations, and demonstrate the utility of the explanations in reclaiming over 99% of the violating data.
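
To make the notion of an empirical glitch explanation more tangible, here is a toy sketch of the general flavour: among rows that violate a constraint, look for an attribute/value pair that concisely characterizes most of the violations. This is an illustrative assumption about the idea, not the speaker's actual technique, and all data in it is invented.

    # Toy sketch: find a concise characterization of constraint-violating rows.
    # The data and the single-attribute "explanation" are purely illustrative.
    from collections import Counter

    violations = [
        {"source": "sensor_A", "unit": "F"},
        {"source": "sensor_B", "unit": "F"},
        {"source": "sensor_C", "unit": "C"},
    ]

    # Count how many violating rows each (attribute, value) pair covers.
    counts = Counter((attr, row[attr]) for row in violations for attr in row)
    (attr, value), covered = counts.most_common(1)[0]
    print(f"{attr} = {value!r} covers {covered}/{len(violations)} violations")
    # e.g. "unit = 'F' covers 2/3 violations": perhaps Fahrenheit readings are
    # legitimate data that merely violate a Celsius range constraint.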

Accepted Papers

Paper1      Data Cleaning in the Wild: Reusable Curation Idioms from a Multi-Year SQL Workload
     Shrainik Jain, Bill Howe
Paper2      Communicating Data Quality in On-Demand Curation
     Poonam Kumari, Said Achmiz, Oliver Kennedy
Paper3      Data Quality for Semantic Interoperable Electronic Health Records
     Shivani Batra, Shelly Sachdeva
Paper4      DIRA: Data Integration to Return Ranked Alternatives
     Reham I. Abdel Monem, Ali H. El-Bastawissy, Mohamed M. Elwakil

Abstracts

Data Quality for Semantic Interoperable Electronic Health Records
Shivani Batra (Jaypee Institute of Information Technology University),
Shelly Sachdeva (Jaypee Institute of Information Technology University)


The current study considers an example from the healthcare domain from a Big Data perspective to address issues related to data quality. The healthcare domain frequently demands timely semantic exchange of data residing at disparate sources, which aids in providing support for remote medical care and reliable decision making. However, an efficient semantic exchange needs to address challenges such as data misinterpretation, distinct definitions and meanings of the underlying medical concepts, and the adoption of distinct schemas. The current research aims to provide an application framework that supports syntactic, structural, and semantic interoperability to resolve various issues related to the semantic exchange of electronic health record data. It introduces a new generic schema which is capable of capturing any type of data without the need to modify the existing schema. Moreover, the proposed schema handles sparse and heterogeneous data efficiently. The proposed generic schema is built on top of a relational database management system (RDBMS) to provide high consistency and availability of data. To analyse the proposed schema in depth with respect to the timeliness parameter of data quality, experiments have been performed on two flavours of RDBMS, namely row-oriented (MySQL) and column-oriented (MonetDB). The results favour the adoption of a column-oriented RDBMS over a row-oriented RDBMS for the various tasks performed in the current research for timely access to data stored in the proposed generic schema.
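
The abstract does not spell out the generic schema itself; one common pattern for storing arbitrary, sparse attributes on top of an RDBMS without schema changes is an entity-attribute-value (EAV) layout. The following minimal sketch assumes such a layout; the table and attribute names are hypothetical and need not match the authors' actual design.

    # Hypothetical EAV-style generic schema for sparse, heterogeneous
    # health-record attributes; names and values are illustrative only.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE ehr_attributes (
            record_id   INTEGER NOT NULL,   -- the health record (entity)
            attribute   TEXT    NOT NULL,   -- e.g. 'blood_pressure_systolic'
            value       TEXT,               -- stored generically as text
            recorded_at TEXT,               -- timestamp, relevant for timeliness
            PRIMARY KEY (record_id, attribute, recorded_at)
        )
    """)

    # Sparse data: different records carry different attributes, and no
    # schema change is needed when a new attribute appears.
    conn.executemany(
        "INSERT INTO ehr_attributes VALUES (?, ?, ?, ?)",
        [
            (1, "blood_pressure_systolic", "120", "2016-09-05T09:00"),
            (1, "heart_rate", "72", "2016-09-05T09:00"),
            (2, "blood_glucose", "5.4", "2016-09-05T10:30"),
        ],
    )

    # Reconstruct one record by pivoting its attribute rows back into a map.
    rows = conn.execute(
        "SELECT attribute, value FROM ehr_attributes WHERE record_id = ?", (1,)
    ).fetchall()
    print(dict(rows))  # {'blood_pressure_systolic': '120', 'heart_rate': '72'}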

Communicating Data Quality in On-Demand Curation
Poonam Kumari (University at Buffalo),
Said Achmiz, Oliver Kennedy (University at Buffalo)


On-demand curation (ODC) tools like Paygo, KATARA, and Mimir allow users to defer expensive curation effort until it is necessary. In contrast to classical databases that do not respond to queries over potentially erroneous data, ODC systems instead answer with guesses or approximations. The quality and scope of these guesses may vary, and it is critical that an ODC system be able to communicate this information to an end-user. The central contribution of this paper is a preliminary user study evaluating the cognitive burden and expressiveness of three representations of "attribute-level" uncertainty. The study shows (1) insignificant differences in the time taken for users to interpret the types of uncertainty tested, and (2) that small changes in formatting can trigger a significant change in the way people interpret and react to data. Ultimately, we show that a set of guidelines and best practices for representing uncertainty will be necessary for ODC tools to be effective. This paper represents the first steps towards establishing such guidelines.

Data Cleaning in the Wild: Reusable Curation Idioms from a Multi-Year SQL Workload
Shrainik Jain (University of Washington),
Bill Howe (University of Washington)


In this work-in-progress paper, we extract a set of curation idioms from a five-year corpus of hand-written SQL queries collected from a Database-as-a-Service platform called SQLShare. The idioms we discover in the corpus include structural manipulation tasks (e.g., vertical and horizontal recomposition), schema manipulation tasks (e.g., column renaming and reordering), and value manipulation tasks (e.g., manual type coercion, null standardization, and arithmetic transformations). These idioms suggest that users find SQL to be an appropriate language for certain data curation tasks, but we find that applying these idioms in practice is sufficiently awkward to motivate a set of new services to help automate cleaning and curation tasks. We present these idioms, the workload from which they were derived, and the features they motivate in SQL to help automate tasks. Looking ahead, we describe a generalized idiom recommendation service that can automatically apply appropriate transformations, including cleaning and curation, on data ingest.
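
To give a concrete flavour of the idioms named above, the sketch below shows column renaming, null standardization, and manual type coercion expressed directly in SQL. The queries and data are invented for illustration and are not taken from the SQLShare workload.

    # Hypothetical examples of curation idioms expressed in SQL (via sqlite3):
    # column renaming, null standardization, and manual type coercion.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw (station TEXT, temp TEXT)")
    conn.executemany("INSERT INTO raw VALUES (?, ?)",
                     [("A", "21.5"), ("B", "N/A"), ("C", "-999")])

    cleaned = conn.execute("""
        SELECT
            station AS site,                                   -- column renaming
            CAST(NULLIF(NULLIF(temp, 'N/A'), '-999') AS REAL)  -- null standardization
                AS temperature_c                               --   and type coercion
        FROM raw
    """).fetchall()
    print(cleaned)  # [('A', 21.5), ('B', None), ('C', None)]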

DIRA: Data Integration to Return Ranked Alternatives
Reham I. Abdel Monem (Cairo University),
Ali H. El-Bastawissy (University, Giza),
Mohamed M. Elwakil (Innopolis University)


Data integration (DI) is the process of collecting the data needed for answering a query from distributed and heterogeneous data sources and providing users with a unified form of this data. Data integration is strictly tied to data quality due to two main challenges: first, providing users with high-quality query results; second, identifying and resolving value conflicts on the same real-world objects efficiently and in the shortest time. In our work, we focus on providing users with high-quality query results. The quality of a query result can be enhanced by evaluating the quality of the data sources and retrieving results from the significant ones only. Data quality measures are used not only for determining the significant data sources but also for ranking data integration results according to the user-required quality and presenting them in a reasonable time. In this paper, we perform an experiment that shows a mechanism to calculate and store a set of quality measures at different granularities through a new data integration framework called Data Integration to Return Ranked Alternatives (DIRA). These quality measures are used to select the most significant data sources and to produce top-k query results according to the query types that we propose. A validation of DIRA using the Transaction Processing Performance Council (TPC) benchmark variant TPC-DI shows how our framework improves the returned query results.
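
A minimal sketch of the general idea described in the abstract follows: score each source on a few quality measures, keep the significant ones, and rank them. The measures, weights, and threshold are illustrative assumptions, not DIRA's actual parameters.

    # Illustrative sketch only: weighted source-quality scoring and top-k
    # selection; the measures, weights, and threshold are assumptions.
    def source_score(measures, weights):
        """Weighted aggregate of per-source quality measures (all in [0, 1])."""
        return sum(weights[m] * measures[m] for m in weights)

    sources = {
        "S1": {"completeness": 0.95, "accuracy": 0.90, "timeliness": 0.80},
        "S2": {"completeness": 0.60, "accuracy": 0.70, "timeliness": 0.95},
        "S3": {"completeness": 0.40, "accuracy": 0.50, "timeliness": 0.30},
    }
    weights = {"completeness": 0.4, "accuracy": 0.4, "timeliness": 0.2}

    # Rank sources, keep the "significant" ones above a quality threshold,
    # and answer the query from the top-k of those only.
    ranked = sorted(((name, source_score(m, weights)) for name, m in sources.items()),
                    key=lambda pair: pair[1], reverse=True)
    significant = [(name, score) for name, score in ranked if score >= 0.6]
    top_k = significant[:2]
    print(top_k)  # [('S1', 0.9...), ('S2', 0.71)]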

Towards Rigorous Evaluation of Data Integration Systems - It's All About the Tools
Boris Glavic (Illinois Institute of Technology)

Given the maturity of the data integration field, it is surprising that rigorous empirical evaluations of research ideas are so scarce. We identify a major roadblock for empirical work - the lack of tools that aid a researcher in generating the inputs and gold standard outputs for their integration tasks in a controlled, effective, and repeatable manner. In this talk, I will give an overview of our efforts for developing such tools and highlight how they have been used for streamlining the empirical evaluation of a wide variety of integration systems. Particularly, the talk will focus on two systems: iBench and BART. iBench is a metadata generator that can be used to evaluate a wide range of integration tasks (data exchange, mapping creation, mapping composition, and schema evolution, among many others). The system permits control over the size and characteristics of the metadata it generates (schemas, constraints, and mappings). BART (Benchmarking Algorithms for data Repairing and Translation) is a scalable system for introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. The presentation will include a short live demonstration of both systems.

Three Semi-Automatic Advisors for Data Exploration
Thibault Sellam (CWI)

In data exploration, users query a database to discover its content. Typically, explorers operate by trial and error. They write a query, observe the results and reiterate. When the data is small, this approach is perfectly acceptable. But what if the database contains 100s of columns and 100,000s of tuples? During this talk, I will introduce Blaeu, Claude and Ziggy, three "advisors" for data exploration. The main idea is to use simple machine learning models to help users navigate the space of all possible queries and views. I will present practical use cases, discuss the main ideas behind each assistant and describe open research problems.

Graph-based Exploration of Non-graph Datasets
Udayan Khurana (IBM Research)

Graphs or networks provide a powerful abstraction to view and analyze relationships among different entities present in a dataset. However, much of the data of interest to analysts and data scientists resides in non-graph forms such as relational databases, JSON, XML, CSV and text. The effort and skill required to identify and extract the relevant graph representation from data is often prohibitive and limits a wider adoption of graph-based analysis of non-graph data. In this paper, we demonstrate our system, called GraphViewer, for accelerated graph-based exploration and analysis. It automatically discovers relevant graphs implicit within a given non-graph dataset using a set of novel rule-based and data-driven techniques, and optimizes their extraction and storage. It computes several node- and graph-level metrics and detects anomalous entities in data. Finally, it summarizes the results to support interpretation by a human analyst. While the system automates the computationally intensive aspects of the process, it is engineered to leverage human domain expertise and instincts to fine-tune the data exploration process.
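
As a small illustration of the underlying idea of extracting an implicit graph from non-graph data (not GraphViewer's actual rules or techniques), the sketch below turns a tiny, invented set of paper-author rows into a co-authorship graph: rows that share a key value become linked nodes.

    # Hypothetical illustration: derive a co-authorship graph from flat,
    # tabular (non-graph) data; rows sharing a key become linked nodes.
    # The columns and the linking rule are assumptions for illustration.
    from collections import defaultdict
    from itertools import combinations

    # Flat rows as they might come from a CSV file or a relational table.
    rows = [
        {"paper": "P1", "author": "Alice"},
        {"paper": "P1", "author": "Bob"},
        {"paper": "P2", "author": "Bob"},
        {"paper": "P2", "author": "Carol"},
    ]

    # Group authors by paper, then connect authors who share a paper.
    papers = defaultdict(set)
    for row in rows:
        papers[row["paper"]].add(row["author"])

    edges = set()
    for authors in papers.values():
        edges.update(combinations(sorted(authors), 2))

    print(sorted(edges))  # [('Alice', 'Bob'), ('Bob', 'Carol')]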

Data Quality Management in Data Exchange Platforms – An Approach for the Industrial Data Space in Germany
Christoph Quix (Fraunhofer FIT)

Data quality plays an important role in data marketplaces, as a value is assigned to the data and customers pay for the received data. It is known that data quality problems arise especially in data integration projects, when data (from one organization) is used in a different context than originally planned. This problem is aggravated in a setting where data is exchanged between different organizations, as in a data marketplace. In addition, data consumers expect high data quality as they pay for the data. Research in data quality has derived many issues from quality management for classical products and transferred them to the case of data management. An open question is how results from quality assurance and pricing models in the classical product world can be transferred to data. In this talk, we will review the state of the art in the area of data quality management and pricing in data marketplaces and report on the initiative "Industrial Data Space" in Germany, in which an open platform for data exchange between industrial organizations is being developed.


Publication

The proceedings of the workshop will be published online as a volume of the CEUR Workshop Proceedings (ISSN 1613-0073), a well-known website for publishing workshop proceedings. It is indexed by the major publication portals, such as Citeseer, DBLP and Google Scholar.
Furthermore, the authors of the best papers of the workshop will be invited to submit an extended version of their work to a special issue of the ACM Journal of Data and Information Quality.

Important Dates

  • Submission (Extended): June 3, 2016
  • Notification: July 1, 2016
  • Camera-Ready Version: July 15, 2016
  • Workshop Date: September 5, 2016

Program Chair


List of Topics

Big Data Quality

  • Data quality in Big Data integration
  • Data quality models
  • Data quality in data streams
  • Data quality management for Big Data systems
  • Data cleaning, deduplication, record linkage
  • Big Data Provenance, Auditing

Big Data Integration

  • Big Data systems for data integration
  • Real-time (On-the-fly) data integration
  • Graph-based algorithms for Big Data integration
  • Integration and analytics over large-scale data stores
  • Data integration for data lakes
  • Efficiency and optimization opportunities in Big Data Integration
  • Data Stream Integration

Management of Heterogeneous Data

  • Query processing, indexing and storage for heterogeneous data
  • Information retrieval over semi-structured or unstructured data
  • Efficient index structures for keyword queries
  • Query processing of keyword queries
  • Data visualization for heterogeneous data
  • Management of heterogeneous graph structures
  • Knowledge discovery, clustering, data mining for heterogeneous Data

Schema and Metadata Management

  • Innovative algorithms and systems for "Schema-on-Read"
  • Schema inference in semi-structured data
  • Pay-as-you-go schema definition
  • Schema & graph summarization techniques
  • Metadata models for Big Data
  • Schema matching for Big Data

Submission Guidelines

QDB welcomes full paper submissions of original and previously unpublished research.
All submissions will be peer-reviewed and, once accepted, will be included in the workshop proceedings.

Workshop Organizers

Local and Publicity Chair


Supported By


Call for Papers

The Call for Papers can be downloaded here.