The goal of this thesis is to evaluate the applicability of so-called Virtual Graphs in a Distributed Analytics environment.
Thesis Type |
Status | Open
Presentation room | Seminar room I5 6202
Supervisor(s) | Stefan Decker
Advisor(s) | Laurenz Neumann
Contact | laurenz.neumann@dbis.rwth-aachen.de
The goal of this thesis is to evaluate the applicability of so-called Virtual Graphs in a Distributed Analytics environment. Distributed Analytics (DA) is a paradigm for analysing distributed datasets in which the data remains at the data owners' facilities and the analysis algorithm is shipped to the data. Instead of aggregating the data at a central location, only the partial results of the analysis are aggregated. The feasibility of this concept has already been demonstrated, e.g. in the medical domain [https://www.mdpi.com/2076-3417/12/9/4336].
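The paradigm can be illustrated with a minimal sketch, assuming the analysis is a simple mean. The station names, values and function names below are hypothetical and are not taken from any existing DA infrastructure; the point is only that each station returns an aggregate and the central step never sees raw records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialResult:
    """Aggregate statistics that leave a station; raw records never do."""
    total: float
    count: int


def local_analysis(records: list[float]) -> PartialResult:
    """Runs at the data owner's facility on data that stays on site."""
    return PartialResult(total=sum(records), count=len(records))


def aggregate(partials: list[PartialResult]) -> float:
    """Runs centrally and only sees the partial results."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    if count == 0:
        raise ValueError("no records at any station")
    return total / count


# Hypothetical stations with made-up values (assumption for illustration).
stations = {
    "hospital_a": [4.0, 6.0],
    "hospital_b": [10.0],
}

partials = [local_analysis(records) for records in stations.values()]
global_mean = aggregate(partials)
```

In a real deployment the call to `local_analysis` would typically be packaged, for example as a container, and executed at each facility; here everything runs in one process for brevity.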
Virtual Graphs (VGs) provide non-materialized views, in the form of a linked data graph, over non-linked data sources. Such a concept may ease data access when analyzing heterogeneous data sources in a DA environment. It could complement linked-data-based data sources, such as those following the FHIR standard for medical data, by providing a uniform interface to data stored across multiple institutions, regardless of how each institution stores its data.
The vision here is that researchers creating analysis tasks in a DA setting do not need to concern themselves with the way data is stored in the different facilities, but rather just utilise a standardised ontology or vocabulary to access the data they need.
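As a rough illustration of this vision, the sketch below exposes two differently structured sources through the same vocabulary term. All names, the `http://example.org/` IRIs and the data are assumptions made for the example. It is also a deliberate simplification: the triples are generated lazily on each request, whereas systems such as Ontop rewrite SPARQL queries into SQL using declared mappings.

```python
from typing import Callable, Iterable, Iterator, Optional

Triple = tuple[str, str, str]

# Hypothetical shared vocabulary term (assumption for illustration).
HAS_AGE = "http://example.org/vocab#age"


def map_site_a(rows: Iterable[dict]) -> Iterator[Triple]:
    """Site A stores dictionaries with the keys 'pid' and 'age_years'."""
    for row in rows:
        subject = f"http://example.org/a/patient/{row['pid']}"
        yield (subject, HAS_AGE, str(row["age_years"]))


def map_site_b(rows: Iterable[tuple]) -> Iterator[Triple]:
    """Site B stores (identifier, age) tuples."""
    for identifier, age in rows:
        subject = f"http://example.org/b/patient/{identifier}"
        yield (subject, HAS_AGE, str(age))


class VirtualGraph:
    """Graph view that is computed on access and never stored."""

    def __init__(self, source: Iterable, mapping: Callable[[Iterable], Iterator[Triple]]):
        self._source = source
        self._mapping = mapping

    def triples(self, s: Optional[str] = None, p: Optional[str] = None,
                o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in self._mapping(self._source):
            if all(want is None or want == have
                   for want, have in zip((s, p, o), triple)):
                yield triple


# Made-up records in two different local formats.
graph_a = VirtualGraph([{"pid": 1, "age_years": 34}, {"pid": 2, "age_years": 58}], map_site_a)
graph_b = VirtualGraph([("x9", 41)], map_site_b)

# The analysis code is identical for both sites: it only knows HAS_AGE.
ages_a = [int(o) for _, _, o in graph_a.triples(p=HAS_AGE)]
ages_b = [int(o) for _, _, o in graph_b.triples(p=HAS_AGE)]
```

The analysis code refers only to the shared term `HAS_AGE`; knowledge of the local storage layout is confined to the per-site mapping functions.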
Thesis Goals:
- Review existing solutions for creating Virtual Graphs, e.g. Ontop VKG [https://ontop-vkg.org/], and identify a candidate suitable for application in an existing DA infrastructure.
- Implement a proof-of-concept, based on containerisation, that securely integrates VG-based access into an existing DA infrastructure.
- Investigate whether VG-based data access can enable additional benefits in the DA domain, e.g. providing an additional layer of data security or privacy, or fulfilling regulatory requirements such as keeping a log of data access.
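The data access log mentioned in the last goal could, for instance, be produced by a thin wrapper around the query interface. This is only a sketch under assumptions: the class, the requester identifier, the query string and the stand-in backend are all hypothetical, and a real solution would need tamper-resistant storage for the log.

```python
import json
from datetime import datetime, timezone
from typing import Callable


class AuditedEndpoint:
    """Wraps a query function and records every access before running it."""

    def __init__(self, run_query: Callable[[str], list], log: list):
        self._run_query = run_query
        self._log = log

    def query(self, requester: str, query_text: str) -> list:
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "requester": requester,
            "query": query_text,
        }
        # Write the log entry first so that failed queries are recorded too.
        self._log.append(json.dumps(entry))
        return self._run_query(query_text)


def fake_backend(query_text: str) -> list:
    """Stand-in for a real VG query engine (assumption for illustration)."""
    return [("row", 1)]


access_log: list = []
endpoint = AuditedEndpoint(fake_backend, access_log)
result = endpoint.query("analysis-task-42", "SELECT ?age WHERE { ?p ex:age ?age }")
```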
Initial Literature:
- "Virtual Knowledge Graphs: An Overview of Systems and Use Cases", Xiao et al. 2019
- "Efficient SPARQL-to-SQL with R2RML mappings", Rodríguez-Muro et al. 2015
Prerequisites:
- Knowledge of semantic web standards such as RDF or SPARQL
- Proficiency in software development with a modern language such as Python, Go or Rust
- Experience with containerisation technologies such as Docker