
Bridging the gap between design and deployment of statistical analyses in Distributed Analytics

May 11th, 2022

Thesis Type
  • Bachelor
Status
Finished
Supervisor(s)
Advisor(s)
Sascha Welten
Yongli Mou
Contact
welten@dbis.rwth-aachen.de
mou@dbis.rwth-aachen.de

With the increasing amount of data being generated and made accessible, data analytics can foster innovation and reveal valuable knowledge. Especially in the healthcare domain, data analytics promises improvements in clinical decision making, diagnostics, and the selection of treatment plans for patients. Consequently, data analytics is seen as playing a key role in improving quality and efficiency while reducing cost in the European healthcare system. Beyond healthcare, sectors such as business, chemical engineering, and education have also recognized and begun to use the capabilities offered by data analytics.

Traditionally, the workflow for data analysis consists of three steps: first, data is collected from several sources; second, the collected data is stored in a central location; third, the centralized data is used to execute the analysis. However, storing all data in a central location poses several challenges. For example, due to the exponential growth of data, the resulting data volume might not allow central storage, or central storage might simply be too expensive. Moreover, because of the large data volumes involved, transferring the data from its sources to the central location creates obstacles such as bottlenecks in the analysis. Besides these technical challenges, regulations such as the GDPR in the EU or the Data Protection Act in the UK prohibit or limit the centralization of personal data due to privacy concerns. This issue is especially pressing for health-related data because of its sensitive nature.
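
For illustration, a minimal Python sketch of this centralized workflow could look as follows. The file names, the blood_pressure column, and the mean computation are hypothetical placeholders, not taken from any system mentioned in this posting.

# Illustrative sketch of the traditional, centralized workflow described above.
# The data sources and the analysis (a simple mean) are hypothetical placeholders.
import pandas as pd

# Step 1: collect data from several (here: fictitious) sources.
sources = ["hospital_a.csv", "hospital_b.csv", "hospital_c.csv"]
frames = [pd.read_csv(path) for path in sources]

# Step 2: centralize the data in one location (here: a single DataFrame).
central_data = pd.concat(frames, ignore_index=True)

# Step 3: execute the analysis on the centralized data.
print(central_data["blood_pressure"].mean())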

One proposed solution to these problems is so-called distributed analytics (DA). At its core, DA reverses the traditional paradigm by bringing the analysis code to the data, thereby eliminating the need for data centralization. Implementations of this idea, such as DataSHIELD (DS), the Personal Health Train (PHT), or Vantage6, have already demonstrated the feasibility of this approach.
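
For contrast, the following simplified sketch illustrates the DA idea of bringing the code to the data: an analysis function visits each data source ("station") and only aggregated values leave it. The station names, the analysis_task function, and the incremental mean are illustrative assumptions and do not reflect the actual APIs of DataSHIELD, the PHT, or Vantage6.

# Simplified, purely illustrative sketch of the DA paradigm: the analysis code
# visits each station and only aggregated results leave the station.
import pandas as pd

def analysis_task(local_data: pd.DataFrame, state: dict) -> dict:
    """Runs at a station; returns only aggregates, never raw records."""
    state["sum"] = state.get("sum", 0.0) + local_data["blood_pressure"].sum()
    state["count"] = state.get("count", 0) + len(local_data)
    return state

# The raw data stays at the stations; only `state` travels between them.
stations = ["hospital_a.csv", "hospital_b.csv", "hospital_c.csv"]
state: dict = {}
for station in stations:
    local_data = pd.read_csv(station)  # in reality, loaded by the station itself
    state = analysis_task(local_data, state)

print("Global mean:", state["sum"] / state["count"])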

Because the data remains at its sources and is not directly accessible to the developer, fundamental software-development approaches such as prototyping become very challenging when developing algorithms for DA systems. This issue also becomes apparent when considering the typical phases of the software development lifecycle: in current DA architectures, there is a gap between developing the analysis algorithm and actually deploying it. Moreover, the design of the statistical analysis itself poses challenges because of the overhead involved in acquiring information about the data schemas.

A straightforward solution to the aforementioned problems would be to collect data samples from each participating data source. This would allow developers to view the schemas as well as to test their algorithms. However, this approach is not privacy-preserving and therefore conflicts with one of the core benefits of DA systems. Consequently, it does not constitute a viable option.

Overall, the lack of straightforward access to data schema information and of testing capabilities for analysis algorithms constitutes an important gap in DA systems that is not present in their centralized counterpart. This gap currently raises the usability threshold of decentralized approaches and marks an important omission in the research on DA systems.

This thesis aims to close the gaps mentioned above. To reach this goal, different DA systems are first assessed and compared to find common ground. With this common ground in mind, the following research questions will be evaluated:

  • Whether privacy-preserving testing of analysis algorithms can be enabled in DA systems (a purely illustrative sketch follows this list)
  • How the barriers to accessing schema information can be lowered to alleviate the resulting overhead and thereby bring the design of distributed analysis algorithms on par with its centralized counterpart
  • Whether the correctness of existing statistical analyses can be ensured without executing them in the target environment
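
To make these questions more tangible, the following sketch shows one conceivable, purely illustrative way to test an analysis without access to real data: synthetic records are generated from declared schema information and fed to the analysis locally. The schema, column names, value ranges, and the mock_station_data helper are hypothetical and not necessarily the approach taken in this thesis or the linked playground implementation.

# Purely illustrative: generate mock records from published schema information
# so an analysis algorithm can be exercised locally, without real patient data.
import random
import pandas as pd

schema = {
    "age":            {"type": "int",   "min": 18,   "max": 90},
    "blood_pressure": {"type": "float", "min": 80.0, "max": 200.0},
}

def mock_station_data(schema: dict, rows: int = 100) -> pd.DataFrame:
    """Generates synthetic records that match the schema, not the real data."""
    columns = {}
    for name, spec in schema.items():
        if spec["type"] == "int":
            columns[name] = [random.randint(spec["min"], spec["max"]) for _ in range(rows)]
        else:
            columns[name] = [random.uniform(spec["min"], spec["max"]) for _ in range(rows)]
    return pd.DataFrame(columns)

# The analysis code can now be tested locally before being deployed to a DA system.
print(mock_station_data(schema)["blood_pressure"].mean())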

If you are interested in this thesis, do not hesitate to contact us via welten@dbis.rwth-aachen.de.

The implementation can be found at: https://github.com/PADME-PHT/playground


Prerequisites:

  • Containerisation technologies
  • Basic understanding of machine learning