DBIS

Virtual Graph-based Data Access in Heterogeneous Distributed-Analytics Environments

August 29th, 2024

The goal of this thesis is to evaluate the applicability of Virtual Graphs in a Distributed Analytics environment.

Thesis Type
  • Bachelor
Status
Running
Presentation room
Seminar room I5 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Laurenz Neumann
Contact
laurenz.neumann@dbis.rwth-aachen.de

Distributed Analytics (DA) describes a paradigm of data analysis for distributed datasets, in which the data resides at the data owners' facilities and the analysis algorithm is shipped to the data. Instead of aggregating the data at a central location, only the partial results of the analysis are aggregated. The feasibility of this concept has already been demonstrated, e.g. in the medical domain [https://www.mdpi.com/2076-3417/12/9/4336].
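
The following minimal sketch illustrates the idea of shipping the analysis to the data and aggregating only partial results. It is an illustration under simplifying assumptions, not the infrastructure used in the thesis: the station names, the in-memory lists standing in for local datasets, and the functions `local_analysis` and `run_distributed` are all hypothetical.

```python
from typing import Callable

# Hypothetical stations: each list stands in for data that never leaves its owner.
STATIONS = {
    "hospital_a": [4.0, 6.0],
    "hospital_b": [10.0],
    "hospital_c": [2.0, 3.0, 5.0],
}

# A partial result is a (sum, count) pair; raw values are never returned.
Partial = tuple[float, int]


def local_analysis(values: list[float]) -> Partial:
    """Runs at the data owner's site and returns only an aggregate."""
    return (sum(values), len(values))


def run_distributed(task: Callable[[list[float]], Partial]) -> float:
    """Apply the task at every station, then combine the partial results centrally."""
    partials = [task(data) for data in STATIONS.values()]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


global_mean = run_distributed(local_analysis)
```

Here the central party only ever sees one sum and one count per station, yet it can still compute the mean over all records.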

Virtual Graphs (VGs) describe the concept of providing non-materialized views, in the form of a linked data graph, on non-linked data sources. Such a concept may ease data access when analyzing heterogeneous data sources in a DA environment. It could complement linked-data-based data sources, such as those using the FHIR standard for medical data, by providing a uniform interface to data stored across multiple institutions, regardless of the underlying structure of the data storage at each institution.
The vision here is that researchers creating analysis tasks in a DA setting do not need to concern themselves with the way data is stored in the different facilities, but rather just utilise a standardised ontology or vocabulary to access the data they need.
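
As a toy illustration of a non-materialized graph view, the sketch below answers a simple triple pattern by translating it into SQL at query time, so no triples are ever stored. It is a strongly simplified assumption of how such a view could look and does not reflect how any existing virtual graph system works internally: the `patient` table, the `http://example.org/` namespace, the `MAPPING` dictionary, and the `triples` function are hypothetical.

```python
import sqlite3
from typing import Iterator, Optional

# Hypothetical local, non-linked data source at one institution.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, birth_year INTEGER)")
conn.executemany("INSERT INTO patient VALUES (?, ?)", [(1, 1980), (2, 1975)])

EX = "http://example.org/"  # hypothetical vocabulary namespace

# Hypothetical mapping: predicate IRI -> (table, key column, value column).
# Names in this mapping are trusted configuration, not user input.
MAPPING = {
    EX + "birthYear": ("patient", "id", "birth_year"),
}


def triples(predicate: str, subject: Optional[str] = None) -> Iterator[tuple[str, str, object]]:
    """Answer the pattern (subject or ?s, predicate, ?o) by querying SQL on demand."""
    table, key_col, value_col = MAPPING[predicate]
    sql = f"SELECT {key_col}, {value_col} FROM {table}"
    params: tuple = ()
    if subject is not None:
        # Recover the primary key from the subject IRI, e.g. .../patient/1 -> 1.
        sql += f" WHERE {key_col} = ?"
        params = (int(subject.rsplit("/", 1)[1]),)
    sql += f" ORDER BY {key_col}"
    for key, value in conn.execute(sql, params):
        yield (f"{EX}{table}/{key}", predicate, value)


all_birth_years = list(triples(EX + "birthYear"))
one_patient = list(triples(EX + "birthYear", subject=EX + "patient/2"))
```

A researcher would only need to know the vocabulary term `birthYear`; each institution could supply its own mapping to whatever local schema it uses.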

Thesis Goals:

  • Review existing solutions for creating Virtual Graphs and identify a candidate suitable for application in an existing DA-Infrastructure, e.g. Ontop VKG [https://ontop-vkg.org/].
  • Implement proof-of-concept software, based on containerisation, which securely integrates VG-based access into an existing DA-Infrastructure.
  • Investigate whether VG-based data access can enable additional benefits in the DA domain, e.g. by providing an additional layer of data security or privacy, or by fulfilling regulatory requirements such as creating a log of data access.

Initial Literature:

  • "Virtual Knowledge Graphs: An Overview of Systems and Use Cases", Xiao et al. 2019
  • "Efficient SPARQL-to-SQL with R2RML mappings", Rodríguez-Muro et al. 2015

Prerequisites:
  • Knowledge of semantic web standards such as RDF or SPARQL
  • Proficiency in software development with a modern language such as Python, Go or Rust
  • Experience with containerisation technologies such as Docker