About the Workshop
Data quality problems arise frequently when data is integrated from disparate sources. In the context
of Big Data applications, data quality becomes even more important because of the unprecedented volume,
variety, and velocity of the data. The challenges caused by the volume and velocity of Big Data have been addressed
by many research projects and commercial solutions and can be partially solved by modern,
scalable data management systems. Variety, however, remains a daunting challenge for Big
Data Integration and also requires special methods for data quality management. Variety (or heterogeneity) exists at several levels: at the instance level, the same
entity might be described with different attributes; at the schema level, the data is structured according to
various schemas; and at the level of the modeling language, different data models can be
used (e.g., relational, XML, or a document-oriented JSON representation). This can lead to data quality
issues affecting consistency, understandability, or completeness.
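To make these levels of heterogeneity concrete, the small, purely hypothetical example below shows the same customer entity as a relational-style flat record and as a document-oriented JSON record; all names and values are invented for illustration.

```python
# Hypothetical example: one customer entity described by two sources that
# differ in data model, schema, and attributes.

# Source A: relational-style flat record (e.g., exported from an RDBMS)
customer_relational = {
    "cust_id": 4711,
    "name": "Jane Doe",
    "zip": "52062",          # address split into atomic columns
    "city": "Aachen",
}

# Source B: document-oriented JSON representation of the same entity
customer_json = {
    "id": "C-4711",                      # different key name and value format
    "fullName": "Doe, Jane",             # different naming convention
    "address": {"postalCode": "52062"},  # nested structure, city missing
    "email": "jane.doe@example.org",     # attribute absent in source A
}

# Integrating the two sources surfaces typical quality issues:
#  - consistency:       "Jane Doe" vs. "Doe, Jane", 4711 vs. "C-4711"
#  - completeness:      city missing in B, email missing in A
#  - understandability: does "zip" denote the same concept as "postalCode"?
```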
The heterogeneity of data sources in the Big Data era requires new integration approaches which
can handle the large volume and velocity of the generated data as well as its variety and quality.
Traditional ‘schema first’ approaches, as in the relational world with data warehouse systems and
ETL (Extract-Transform-Load) processes, are inappropriate for a flexible and dynamically changing
data management landscape. The requirement for pre-defined, explicit schemas is a limitation
that has drawn the interest of many developers and researchers to NoSQL data management systems,
as these systems promise to manage large amounts of schema-less data.
Nevertheless, a one-size-fits-all Big Data system is unlikely to meet all the requirements
placed on data management systems today. Instead, multiple classes of systems, optimized
for specific requirements or hardware platforms, will co-exist in the data management landscape.
Thus, heterogeneity and data quality are challenges for many
Big Data applications. While in some applications limited quality of individual data items
does not cause serious problems when huge amounts of data are aggregated, data quality problems
in data sources are often revealed when these sources are integrated with other information. Data
quality has been defined as ‘fitness for use’; thus, if data is used in a context other than the one originally
planned, data quality might become an issue. Similar observations have also been made for data
warehouses, which led to a separate research area on data warehouse quality.
The workshop QDB 2016 aims at discussing recent advances and challenges in data quality management in database
systems, focusing especially on problems related to Big Data
Integration and Big Data Quality. The workshop will provide a forum for the presentation of research results, a panel
discussion, and a keynote presentation.
Abstracts
Data Quality for Semantic Interoperable Electronic Health Records
Shivani Batra (Jaypee Institute of Information Technology University),
Shelly Sachdeva (Jaypee Institute of Information Technology University)
The current study considers an example from the healthcare domain from a Big Data
perspective to address issues related to data quality. The healthcare domain
frequently demands timely semantic exchange of data residing at disparate
sources, which aids in providing support for remote medical care and reliable decision
making. However, efficient semantic exchange needs to address challenges such as
data misinterpretation, differing definitions and meanings of the underlying medical concepts,
and the adoption of distinct schemas. The current research aims to provide an application
framework that supports syntactic, structural, and semantic interoperability to resolve
various issues related to the semantic exchange of electronic health record data. It
introduces a new generic schema which is capable of capturing any type of data without
the need to modify the existing schema. Moreover, the proposed schema handles sparse and
heterogeneous data efficiently. The proposed generic schema is built on top of a
relational database management system (RDBMS) to provide high consistency and
availability of data. To analyze the proposed schema in depth with respect to the timeliness
dimension of data quality, experiments have been performed on two flavours of RDBMS,
namely row-oriented (MySQL) and column-oriented (MonetDB). The results favour the
adoption of a column-oriented RDBMS over a row-oriented RDBMS for the various tasks
performed in the current research, enabling timely access to data stored in the proposed generic schema.
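The abstract does not spell out the generic schema itself; a common way to obtain a schema that captures arbitrary, sparse attributes on top of an RDBMS is an entity-attribute-value (EAV) layout. The sketch below assumes such a layout purely for illustration (table and column names are hypothetical) and may well differ from the authors' actual design.

```python
import sqlite3

# Minimal EAV-style sketch of a generic schema on top of an RDBMS.
# Assumed layout for illustration only; not the authors' actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ehr_observation (
        patient_id  TEXT NOT NULL,   -- entity
        attribute   TEXT NOT NULL,   -- clinical concept name
        value       TEXT,            -- value stored generically as text
        recorded_at TEXT             -- timestamp, relevant for timeliness
    )
""")

# Sparse, heterogeneous records fit without any schema modification:
conn.executemany(
    "INSERT INTO ehr_observation VALUES (?, ?, ?, ?)",
    [
        ("p1", "blood_pressure_systolic", "120",        "2016-06-01T10:00"),
        ("p1", "heart_rate",              "72",         "2016-06-01T10:00"),
        ("p2", "allergy",                 "penicillin", "2016-06-02T09:30"),
    ],
)

# Reassembling a patient's record is a selection over the narrow table;
# column-oriented engines such as MonetDB tend to scan it efficiently.
for attribute, value in conn.execute(
    "SELECT attribute, value FROM ehr_observation WHERE patient_id = ?", ("p1",)
):
    print(attribute, value)
```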
Communicating Data Quality in On-Demand Curation
Poonam Kumari (University at Buffalo),
Said Achmiz, Oliver Kennedy (University at Buffalo)
On-demand curation (ODC) tools like Paygo, KATARA, and Mimir allow users to defer
expensive curation effort until it is necessary. In contrast to classical databases
that do not respond to queries over potentially erroneous data, ODC systems instead
answer with guesses or approximations. The quality and scope of these guesses may
vary and it is critical that an ODC system be able to communicate this information
to an end-user. The central contribution of this paper is a preliminary user study
evaluating the cognitive burden and expressiveness of three representations of "attribute-level"
uncertainty. The study shows (1) insignificant differences in time taken for users
to interpret the types of uncertainty tested, and (2) that small changes in formatting
can trigger a significant change in the way people interpret and react to data. Ultimately,
we show that a set of guidelines and best practices for representing uncertainty
will be necessary for ODC tools to be effective. This paper represents the first
steps towards establishing such guidelines.
Data Cleaning in the Wild: Reusable Curation Idioms from a Multi-Year SQL Workload
Shrainik Jain (University of Washington),
Bill Howe (University of Washington)
In this work-in-progress paper, we extract a set of curation idioms from a five-year corpus of
hand-written SQL queries collected from a Database-as-a-Service platform called SQLShare.
The idioms we discover in the corpus include structural manipulation tasks (e.g., vertical and horizontal
recomposition), schema manipulation tasks (e.g., column renaming and reordering), and value manipulation
tasks (e.g., manual type coercion, null standardization, and arithmetic transformations).
These idioms suggest that users find SQL to be an appropriate language for certain data curation tasks,
but we find that applying these idioms in practice is sufficiently awkward to motivate a set of new
services to help automate cleaning and curation tasks. We present these idioms, the workload from which
they were derived, and the features they motivate in SQL to help automate tasks. Looking ahead,
we describe a generalized idiom recommendation service that can automatically apply appropriate
transformations, including cleaning and curation, on data ingest.
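As a rough illustration of the idioms listed above, the hypothetical query below shows what column renaming, null standardization, manual type coercion, and an arithmetic transformation typically look like in hand-written SQL; it is not a query taken from the SQLShare workload.

```python
import sqlite3

# Hypothetical curation idioms expressed as plain SQL over a toy table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_measurements (site TEXT, temp TEXT, depth_ft TEXT)")
conn.executemany(
    "INSERT INTO raw_measurements VALUES (?, ?, ?)",
    [("A", "12.5", "10"), ("B", "n/a", "20"), ("C", "13.1", "")],
)

curated = conn.execute("""
    SELECT
        site AS station,                               -- column renaming
        CASE WHEN temp IN ('n/a', '') THEN NULL
             ELSE CAST(temp AS REAL) END  AS temp_c,   -- null standardization + type coercion
        CASE WHEN depth_ft = '' THEN NULL
             ELSE CAST(depth_ft AS REAL) * 0.3048
        END                               AS depth_m   -- arithmetic transformation
    FROM raw_measurements
""").fetchall()
print(curated)  # [('A', 12.5, 3.048), ('B', None, 6.096), ('C', 13.1, None)]
```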
DIRA: Data Integration to Return Ranked Alternatives
Reham I. Abdel Monem (Cairo University),
Ali H. El-Bastawissy (University, Giza),
Mohamed M. Elwakil (Innopolis University)
Data integration (DI) is the process of collecting the data needed for answering
a query from distributed and heterogeneous data sources and providing users with
a unified form of this data. Data integration is tightly coupled with data quality
due to two main data integration challenges: first, providing users with high-quality
query results; second, identifying and resolving conflicts among values describing the same real-world
objects efficiently and in the shortest time. In our work, we focus on providing users
with high-quality query results.
The quality of a query result can be enhanced by evaluating the quality of the data sources and
retrieving results from the significant ones only. Data quality measures are used not only for
determining the significant data sources but also for ranking data integration results according
to the user-required quality and presenting them in a reasonable time. In this paper, we perform an
experiment that shows a mechanism to calculate and store a set of quality measures at different
granularities through a new data integration framework called Data Integration to Return ranked
Alternatives (DIRA). These quality measures are used to select the most significant data
sources and to produce top-k query results according to the query types that we propose. Validating
DIRA using the Transaction Processing Performance Council (TPC) data integration benchmark,
TPC-DI, will show how our framework improves the returned query results.
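The sketch below illustrates only the general idea of quality-driven source selection: score each source on a few quality dimensions, keep the most significant ones, and answer the query from those. The measures, weights, and aggregation are invented for illustration and do not reproduce DIRA's actual granularities or ranking method.

```python
# Toy illustration of quality-driven source selection (not DIRA's algorithm).
sources = {
    "src_a": {"completeness": 0.95, "accuracy": 0.90, "timeliness": 0.80},
    "src_b": {"completeness": 0.60, "accuracy": 0.85, "timeliness": 0.95},
    "src_c": {"completeness": 0.88, "accuracy": 0.70, "timeliness": 0.60},
}

# Hypothetical user-required weights for the query at hand.
weights = {"completeness": 0.5, "accuracy": 0.3, "timeliness": 0.2}

def quality_score(measures):
    """Weighted aggregate of the per-source quality measures."""
    return sum(weights[dim] * val for dim, val in measures.items())

# Keep only the k most significant sources for answering the query.
k = 2
ranked = sorted(sources, key=lambda s: quality_score(sources[s]), reverse=True)
print(ranked[:k])  # ['src_a', 'src_c'] under these example weights
```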
Towards Rigorous Evaluation of Data Integration Systems - It's All About the Tools
Boris Glavic (Illinois Institute of Technology)
Given the maturity of the data integration field, it is surprising that rigorous empirical
evaluations of research ideas are so scarce. We identify a major roadblock for empirical
work: the lack of tools that aid a researcher in generating the inputs and gold-standard
outputs for their integration tasks in a controlled, effective, and repeatable
manner. In this talk, I will give an overview of our efforts for developing such tools and
highlight how they have been used for streamlining the empirical evaluation of a wide variety
of integration systems. Particularly, the talk will focus on two systems: iBench and BART.
iBench is a metadata generator that can be used to evaluate a wide range of integration
tasks (data exchange, mapping creation, mapping composition, schema evolution, among many others).
The system permits control over the size and characteristics of the metadata it generates
(schemas, constraints, and mappings). BART (Benchmarking Algorithms for data Repairing and
Translation) is a scalable system for introducing errors into clean databases for the
purpose of benchmarking data-cleaning algorithms. The presentation will include a short
live demonstration of both systems.
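BART itself is not shown here; the snippet below only sketches the underlying idea of injecting errors into clean data while recording a ground truth against which cleaning algorithms can later be scored. The function name, error types, and data are hypothetical.

```python
import copy
import random

def inject_errors(clean_rows, error_rate=0.2, seed=42):
    """Corrupt a copy of the rows and return (dirty rows, ground truth)."""
    rng = random.Random(seed)
    dirty = copy.deepcopy(clean_rows)
    ground_truth = []                       # (row index, column, original value)
    for i, row in enumerate(dirty):
        for column in row:
            if rng.random() < error_rate:
                original = row[column]
                # Two simple error types: a missing value or a typo.
                row[column] = None if rng.random() < 0.5 else str(original) + "x"
                ground_truth.append((i, column, original))
    return dirty, ground_truth

clean = [{"name": "Alice", "city": "Berlin"}, {"name": "Bob", "city": "Aachen"}]
dirty, truth = inject_errors(clean)
# A data-cleaning algorithm runs on `dirty`; its repairs are scored against `truth`.
```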
Three Semi-Automatic Advisors for Data Exploration
Thibault Sellam (CWI)
In data exploration, users query a database to discover its
content. Typically, explorers operate by trial and error: they write a query, observe the results, and iterate.
When the data is small, this approach is perfectly acceptable. But what if the
database contains 100s of columns and 100,000s of tuples?
During this talk, I will introduce Blaeu, Claude and Ziggy, three "advisors" for data exploration.
The main idea is to use simple machine learning models to help users navigate the space of all possible
queries and views. I will present practical use cases, discuss the main ideas behind each assistant
and describe open research problems.
Graph-based Exploration of Non-graph Datasets
Udayan Khurana (IBM Research)
Graphs or networks
provide a powerful abstraction to view and analyze relationships
among different entities present in a dataset. However, much of the
data of interest to analysts and data scientists resides in
non-graph forms such as relational databases, JSON, XML, CSV and
text. The effort and skill required to identify and extract
the relevant graph representation from the data are often prohibitive
and limit a wider adoption of graph-based analysis of non-graph
data. In this paper, we demonstrate our system, GraphViewer,
for accelerated graph-based exploration and analysis. It
automatically discovers relevant graphs implicit within a given
non-graph dataset using a set of novel rule-based and data-driven
techniques, and optimizes their extraction and storage. It computes
several node- and graph-level metrics and detects anomalous entities
in data. Finally, it summarizes the results to support
interpretation by a human analyst. While the system automates the
computationally intensive aspects of the process, it is engineered
to leverage human domain expertise and instincts to fine-tune the
data exploration process.
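As a purely hypothetical sketch of the basic idea of lifting a graph out of non-graph data, the snippet below turns the values of two CSV columns into nodes and each row into an edge, then computes a simple node-level metric; GraphViewer's rule-based and data-driven graph discovery is, of course, considerably more elaborate.

```python
import csv
import io
from collections import defaultdict

# Toy CSV standing in for a non-graph dataset (values are invented).
raw_csv = """author,paper
Smith,P1
Jones,P1
Smith,P2
Lee,P3
"""

# Treat column values as nodes and each row as an (author, paper) edge.
adjacency = defaultdict(set)
for row in csv.DictReader(io.StringIO(raw_csv)):
    adjacency[row["author"]].add(row["paper"])
    adjacency[row["paper"]].add(row["author"])

# A simple node-level metric (degree) helps spot unusually connected entities.
degrees = {node: len(neighbours) for node, neighbours in adjacency.items()}
print(sorted(degrees.items(), key=lambda item: -item[1]))
```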
Data Quality Management in Data Exchange Platforms – An Approach for the Industrial Data Space in Germany
Christoph Quix (Fraunhofer FIT)
Data quality plays an
important role in data marketplaces as a value is assigned to the
data and customers pay for the received data. It is known that data
quality problems arise especially in data integration projects, when
data (from one organization) is used in a different context than
originally planned. This problem is aggravated in a setting where
data is exchanged between different organizations as in a data
marketplace. In addition, data consumers expect a high data quality
as they pay for the data. Research in data quality has derived many
of its concepts from quality management for classical products and
transferred them to the case of data management. An open question is
how results from quality assurance and pricing models in the
classical product world can be transferred to data. In this talk, we
will review the state of the art in the area of data quality
management and pricing in data marketplaces and report on the
initiative "Industrial Data Space" in Germany, in which open
platform for data exchange between industrial organizations is being
developed.