Applying a Curiosity-Module to training on rare events

May 25th, 2022

Thesis Type
  • Bachelor
Sascha Welten

Rare events or features occur, as the name suggests, very infrequently. The resulting highly skewed datasets complicate the discovery of new insights, because model training suffers from the inherent statistical problems of such data. These limitations particularly affect important medical research directions in which no single hospital has a sufficiently large dataset. Rare diseases (RDs) are a typical class of diseases in this context. In the European Union, an RD is defined as a disease with a prevalence below 5 cases per 10,000 individuals (for more information, see the German Ministry of Health). Additionally, RDs are characterised by extreme heterogeneity and complexity, which complicates finding and applying uniform treatments [3]. Therefore, RDs are usually treated in specialised medical centres (so-called RD centres) staffed by experts for selected RDs. Such RD centres are distributed over Europe; for some RDs, only a few such centres exist worldwide. Due to the nature of RDs, they are neither adequately diagnosed nor reported and documented in a single hospital. Most patients experience an odyssey, moving from doctor to doctor in the hope of getting clarification. As a result, records of RD patients are typically highly distributed among different hospitals and RD centres. In general, we can identify three main challenges arising during the treatment of possible RD patients [2, 3].

First, the rarity of RDs causes late or even absent diagnoses, which stem from a lack of expertise and limited access to (new) therapies. Second, due to the lack of experts, RD patients often visit several geographically distributed RD centres to arrange individual consultations. Hence, patient data is horizontally distributed over several RD centres and needs to be reconciled before it can be analysed. One solution is to centralise the patient records in a single RD registry (one specific to each class of RD). However, centralising sensitive data raises privacy concerns and costs each RD centre its data sovereignty.

One way to mitigate the impact of imbalanced data on the model's predictive performance is to weight each data instance according to its “novelty”. In other words, for each feature representation, we have to define how “new” the instance is to the model during training. The hypothesis is that novel data instances drive model training, whereas the model benefits little from instances it has already seen.

One approach to measuring novelty is the use of so-called curiosity modules, which have already been applied in Reinforcement Learning.
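To make the idea of novelty weighting more concrete, the following is a minimal sketch and not the method to be developed in the thesis. It replaces a learned curiosity module with a much simpler stand-in: the reconstruction error of a low-rank linear model (PCA computed with NumPy). Instances that the model reconstructs poorly are treated as novel and receive a larger training weight. The function name `novelty_weights`, the parameter `n_components`, and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def novelty_weights(X, n_components=2, eps=1e-8):
    """Hypothetical helper: novelty = squared reconstruction error of a
    rank-`n_components` PCA model, normalised to an average weight of 1.

    This is a simple stand-in for a curiosity module, not a learned one.
    """
    Xc = X - X.mean(axis=0)                      # centre the features
    # Rows of Vt are the principal directions of the centred data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                      # shape (d, k)
    reconstruction = Xc @ V @ V.T                # project onto top-k subspace
    novelty = np.sum((Xc - reconstruction) ** 2, axis=1)
    return novelty / (novelty.mean() + eps)


# Synthetic, highly imbalanced data (illustrative assumption):
# 500 "common" instances vary mainly along the first two features,
# 10 "rare" instances are shifted along the last feature.
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 0.1, 0.1, 0.1])
rare = rng.normal(size=(10, 5)) * 0.1 + np.array([0.0, 0.0, 0.0, 0.0, 4.0])

X = np.vstack([common, rare])
is_rare = np.concatenate([np.zeros(500, dtype=bool), np.ones(10, dtype=bool)])

weights = novelty_weights(X, n_components=2)

print("mean weight, common instances:", weights[~is_rare].mean())
print("mean weight, rare instances:  ", weights[is_rare].mean())
```

In this sketch the rare instances end up with a clearly larger average weight than the common ones. Such weights could then be passed to a training routine that accepts per-instance weights, for example the `sample_weight` argument that many scikit-learn estimators accept in `fit`. A curiosity module as used in Reinforcement Learning would instead learn its predictor during training, so that novelty decreases for instances the model has already seen.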

The objective of this thesis is to investigate the behaviour of a curiosity module in the rare event setting to improve the model training for rare event detection.
First, the behaviour of the curiosity module should be investigated on centralised data sets.
In a second phase, the setting becomes more realistic: the behaviour should be investigated on decentralised data.
In a more advanced approach, the module should be applied to dynamic data streams.
The goal is to gain an overall understanding of the curiosity module’s behaviour on rare events (e.g. rare diseases) across various settings and requirements.

If you are interested in this thesis, do not hesitate to contact us via



Knowledge about Machine Learning