
DBIS

Declarative Decentralized Analytics Workflows for FAIR Data Sharing and Utilization

July 10th, 2025


This thesis addresses the limitations of current Distributed Analytics Architectures by developing a declarative approach to model and automate cross-institutional analysis workflows. It aims to implement a secure, low-complexity architecture that enhances reproducibility and extensibility, and to evaluate its effectiveness against existing methods.

Thesis Type
  • Master
Status
Running
Presentation room
Seminar room I5 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Laurenz Neumann
Contact
laurenz.neumann@dbis.rwth-aachen.de

Decentralized Analytics is a paradigm that allows for the consumption of data across institutional boundaries while protecting the data sovereignty of the data holders. The core idea is to move the algorithm to the data holders and aggregate results in an incremental or parallel approach, rather than collecting the data at a central location. Initial applications have shown that this approach can make data accessible that is normally locked behind institutional borders. This enables, for example, training a machine learning model to detect melanomas using data that is typically not available outside a hospital.
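The incremental aggregation idea can be illustrated with a minimal sketch. All names and the choice of a mean computation are illustrative assumptions, not part of any existing platform; the point is that only aggregates, never raw records, leave an institution.

```python
from typing import Iterable

# Hypothetical local analysis step: each institution computes a partial
# result (count, sum) on its own data; raw records never leave the site.
def local_count_and_sum(data: Iterable[float]) -> tuple[int, float]:
    records = list(data)
    return len(records), sum(records)

def incremental_mean(institutions: list[list[float]]) -> float:
    """Visit institutions one after another, carrying only the running
    aggregate (count, total) from site to site."""
    count, total = 0, 0.0
    for records in institutions:
        n, s = local_count_and_sum(records)
        count += n
        total += s
    return total / count

# Example: three institutions with private record sets.
print(incremental_mean([[1.0, 2.0], [3.0], [4.0, 5.0]]))  # 3.0
```

A parallel variant would instead run `local_count_and_sum` at all sites concurrently and combine the partial results at the end; the privacy property is the same.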

However, current architectures limit the reproducibility of such analysis workflows: they lack abstractions, so users need technical knowledge of the underlying implementation details, and a human in the loop must manage all the details of an analysis workflow through the software.

This thesis addresses these challenges by investigating the development of a decentralized analytics architecture in which cross-institutional analysis workflows can be declaratively modeled. Drawing inspiration from declarative container orchestration systems, this research aims to define an abstract specification of concepts in the decentralized analytics domain, enabling automated and trustworthy execution on distributed datasets.

For example, in Kubernetes you can describe how a deployment should look, including aspects such as the number of replicas and which volumes to mount. Similarly, you should be able to describe an analysis workflow in terms of which datasets at which institutions to use, the order in which institutions are visited, and so on. The software then creates the necessary resources and schedules the execution. While there is initial work on descriptive metadata for such analysis architectures, it has not yet been investigated how to utilize declarative metadata for this task.
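To make the analogy concrete, such a declarative workflow description might be sketched as follows. This is a hypothetical manifest loosely modeled on a Kubernetes resource; all field names (`image`, `route`, `aggregation`, the institution and dataset identifiers) are illustrative assumptions, not an existing schema.

```python
from dataclasses import dataclass

@dataclass
class DatasetRef:
    institution: str   # where the data resides
    dataset: str       # identifier of the dataset at that institution

@dataclass
class AnalysisWorkflow:
    """Declarative description of a decentralized analysis workflow:
    the user states the desired outcome, a scheduler derives the steps."""
    name: str
    image: str                        # container image holding the algorithm
    route: list[DatasetRef]           # order in which institutions are visited
    aggregation: str = "incremental"  # "incremental" or "parallel"

    def validate(self) -> None:
        if not self.route:
            raise ValueError("workflow must visit at least one institution")
        if self.aggregation not in ("incremental", "parallel"):
            raise ValueError(f"unknown aggregation mode: {self.aggregation}")

# The user declares only the desired workflow; resource creation and
# scheduling would be handled by the architecture.
wf = AnalysisWorkflow(
    name="melanoma-study",
    image="registry.example.org/melanoma-train:1.0",
    route=[DatasetRef("hospital-a", "skin-images"),
           DatasetRef("hospital-b", "skin-images")],
)
wf.validate()
print(wf.name, "->", [r.institution for r in wf.route])
```

In a real system such a manifest would more likely be written in a serialization format such as YAML, as in Kubernetes; the dataclass form here just makes the structure and its validation explicit.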

The core motivation is that such declarative analysis tasks can be easily shared for reproducibility. Further benefits include the automated definition of such workflows, for example, with Large Language Models (LLMs). Additionally, this approach allows for further automation of parts of the workflow and creates a base for future extensions.

Another challenge limiting the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of current architectures is their reliance on a central server. While there are initial studies on P2P decentralized analysis architectures and their challenges, no such system has been implemented yet. Hence, the architecture should be designed with P2P in mind, without needing a single point of orchestration.

Thesis Goals:

  • Identify Limitations: Identify limitations in existing Distributed Analytics Architectures.
  • Model Abstractions: Develop sufficient abstractions of concepts to enable a declarative description of distributed analytics workflows.
  • Investigate Secure Methods: Investigate methods to enable secure peer-to-peer (P2P) Distributed Analytics while maintaining low complexity in running the workflows.
  • Implement and Evaluate: Implement an initial version of a Distributed Analytics architecture capable of executing such workflows. Evaluate its capabilities in comparison to existing approaches, ensuring the software remains extensible while limiting complexity.

Initial Literature:

  • Welten et al.: A privacy-preserving distributed analytics platform for health care data
  • Beyan et al.: Distributed analytics on sensitive medical data: the personal health train
  • Welten, Neumann et al.: DAMS: A distributed analytics metadata schema
  • Existing descriptive Metadata schema for the Personal Health Train based on the
    paper above: https://schema.padme-analytics.de/

Prerequisites:
  • Knowledge of how to create modern network applications incorporating existing
    technologies such as Docker.
  • Motivation to independently work on research problems.
  • Nice to have: Experience with declarative infrastructure (e.g., Kubernetes,
    Terraform) to have an initial understanding of declarative technologies.

Related Projects:
SAFERWATER