Is there data? Repository approaches to simplify the search for reusable research data

October 4th, 2023

The discovery of scientific knowledge has always been a continuous and rapidly growing process. As data-driven research generates ever larger numbers of datasets, researchers and institutions have become increasingly aware of the importance of making research data openly accessible. Making the underlying data freely accessible contributes to reproducibility and transparency in research and fosters public trust in scientific discovery. In this light, data repositories are becoming increasingly relevant as a medium for sharing reusable data and thus for connecting data providers with data consumers. This thesis investigates approaches for data repositories to enhance collaborative, data-driven research, with a special focus on improving data discoverability.

Thesis Type
  • Master
Sandra Geisler
Soo-Yon Kim

Over the years, numerous publications and funding agencies have introduced mandates and incentives to establish data sharing as a habit and to maximise research potential. The scientific community is working towards overcoming both the technical and the cultural shortcomings of the present ecosystem. Data repositories, libraries and other data centres are educating researchers and enforcing data management practices that make research data fit for sharing and reuse.

Reusing research data is not a simple activity but a complex process that involves finding and accessing a dataset, examining its relevance, and determining how to integrate it with one’s research. These steps differ considerably depending on the nature of the reuse, which may be comparative or integrative. In the search for reusable data, even if researchers assume that data has been shared, where can they find it? Experienced researchers may be familiar with suitable repositories, but a substantial amount of time and effort still has to be invested in simply discovering data that exists.

Scientific dataset discovery refers to the whole process, from defining a data need to assessing the suitability of a candidate dataset. Despite the limited literature on this subject, there is a common understanding that dataset discovery for reuse differs from finding publications or similar content. Research data discovery must consider the semantic and entity relationships within a dataset. While existing research focuses on discovering publicly available datasets, many enterprises also have a critical need for data discovery systems. For them, providing flexible, project-wide or organisation-wide access to the datasets generated by their employees is of utmost concern, because it promotes faster development cycles and maximises innovation.

This thesis aims to analyse existing approaches to dataset discovery and their limitations in order to make recommendations for a dataset discovery platform in the context of large-scale projects. The platform should offer the following benefits:

G1. Improve the visibility of datasets across the institutes of large-scale projects.
G2. Facilitate cross-discipline research within large-scale projects.

Based on these goals, the research questions are framed as follows:

RQ1. What are the components of large-scale projects and their limitations with respect to dataset discovery?
RQ2. How can data stewards extend available infrastructure to simplify the discovery of research assets?