

Prof. Dr. S. Decker
RWTH Aachen
Informatik 5
Ahornstr. 55
D-52056 Aachen
Tel +49/241/8021501
Fax +49/241/8022321






Data Curation for Model Training using Representation Learning and Feature Embeddings

Thesis type
  • Master
Status Running

In the context of this Master Thesis, the student should develop an automated data curation service for distributed analytics.

In recent years, as newer technologies have evolved around the healthcare ecosystem, more and more data have been generated. 

Advanced analytics could leverage the data collected from numerous sources, whether held by healthcare institutions or generated by individuals themselves via apps and devices. This could lead to innovations in the treatment and diagnosis of diseases, improve the care given to patients, and empower citizens to participate in decision-making regarding their own health and well-being. However, the sensitive nature of health data prevents healthcare organizations from sharing it.

The Personal Health Train (PHT) is a novel approach that aims to establish a distributed data analytics infrastructure enabling the (re)use of distributed healthcare data, while data owners stay in control of their own data.

The main principle of the PHT is that data remains in its original location: analytical tasks visit the data sources and are executed there. The PHT provides a distributed, flexible approach to using data in a network of participants, incorporating the FAIR principles.
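The principle above can be sketched in a few lines. This is a minimal illustration, not part of any PHT implementation; the names (`Station`, `run_task`, `average_age_task`) are illustrative assumptions.

```python
# Sketch of the PHT principle: the analysis task travels to the data,
# executes locally at each station, and only aggregates leave the station.

def average_age_task(records):
    """An analytical task: returns only an aggregate, never raw rows."""
    ages = [r["age"] for r in records]
    return {"n": len(ages), "mean_age": sum(ages) / len(ages)}

class Station:
    """A data provider: the data stays here; visiting tasks run locally."""
    def __init__(self, records):
        self._records = records  # never leaves the station

    def run_task(self, task):
        return task(self._records)

stations = [
    Station([{"age": 34}, {"age": 51}]),
    Station([{"age": 46}, {"age": 29}, {"age": 62}]),
]

# The "train" visits each station; only aggregate results are collected.
results = [s.run_task(average_age_task) for s in stations]
total_n = sum(r["n"] for r in results)
pooled_mean = sum(r["mean_age"] * r["n"] for r in results) / total_n
print(total_n, pooled_mean)  # prints: 5 44.4
```

The raw records never cross station boundaries; the coordinator only ever sees counts and means, which is the property the PHT infrastructure is designed to guarantee.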


It facilitates the responsible use of sensitive and/or personal data by adopting international principles and regulations.


This Master Thesis focuses on the creation of an automated data curation service for raw data.


In a distributed analytics ecosystem, the various data providers expose their data in different and inconsistent ways. For every data type, the data interfaces have to be adjusted for every train, which complicates the training of distributed (machine learning) algorithms. One solution is the provision of pre-processed data in a pre-defined encoding, e.g., a feature embedding.

The goal of this thesis is the design of an automated data curation tool which transforms the raw data into an encoded representation, e.g., a feature space embedding or another encoding. Algorithms visiting such a station then use the already encoded data for model training instead of working on the raw data.
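As a rough sketch of what such a curation step could look like at one station, the snippet below one-hot encodes categorical raw records into fixed-length vectors. The encoding technique, the feature space, and all names are assumptions for illustration, not the tool to be designed in the thesis.

```python
# Sketch of a curation step at one station: raw records are transformed
# once into fixed-length feature vectors, so visiting algorithms train on
# the encoded data instead of the raw data.

def build_encoder(feature_values):
    """Create a one-hot encoder from an ordered list of (field, value) pairs."""
    index = {fv: i for i, fv in enumerate(feature_values)}

    def encode(record):
        vec = [0.0] * len(index)
        for field, value in record.items():
            pos = index.get((field, value))
            if pos is not None:  # unknown values are simply dropped
                vec[pos] = 1.0
        return vec

    return encode

# A pre-defined feature space, agreed upon ahead of time.
feature_space = [("sex", "f"), ("sex", "m"), ("smoker", "no"), ("smoker", "yes")]
encode = build_encoder(feature_space)

raw = [{"sex": "f", "smoker": "no"}, {"sex": "m", "smoker": "yes"}]
curated = [encode(r) for r in raw]
print(curated)  # [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
```

In practice the thesis would consider richer encodings (e.g., learned embeddings), but the pattern is the same: the station exposes `curated` vectors rather than `raw` records.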

These encodings should be based on common techniques and should also follow privacy guidelines. Furthermore, a protocol should be developed to harmonise the embeddings across different data providers. The goal is that every encoding is offered at every station, so that model training is based on the same feature space.
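One conceivable shape for such a harmonisation protocol is sketched below: each station announces its local feature vocabulary, the union is ordered deterministically, and every station then encodes against the same shared feature space. This is an assumption for illustration, not a specified PHT protocol.

```python
# Sketch of a harmonisation protocol: union the per-station vocabularies
# into one deterministically ordered feature space, so that identical
# encodings are produced at every station.

def harmonise(vocabularies):
    """Merge per-station vocabularies into one sorted, shared feature space."""
    shared = set()
    for vocab in vocabularies:
        shared.update(vocab)
    return sorted(shared)  # deterministic order -> identical vector layout

# Each station's locally observed (field, value) pairs.
station_a = {("sex", "f"), ("sex", "m")}
station_b = {("sex", "f"), ("smoker", "yes")}

feature_space = harmonise([station_a, station_b])
print(feature_space)  # [('sex', 'f'), ('sex', 'm'), ('smoker', 'yes')]
```

A real protocol would also have to address privacy (exchanging vocabularies can itself leak information) and versioning of the agreed feature space, which is exactly the design space this thesis would explore.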


If you are interested in this thesis or a related topic, or have additional questions, please do not hesitate to get in touch.


Prerequisites
  • In-depth knowledge of machine learning and general statistics
  • In-depth knowledge of containerisation technologies, preferably Docker

