DBIS

Formalizing Early-Stage Data Science Requirements for an LLM-Based Data Acquisition Agent

February 9th, 2026

This thesis investigates how to formally represent early-stage data science requirements and how to support the automation of early-stage data science through an LLM-based agent.

Thesis Type
  • Bachelor
Status
Open
Presentation room
Seminar room I5 6202
Supervisor(s)
Sandra Geisler
Advisor(s)
Soo-Yon Kim
Contact
kim@dbis.rwth-aachen.de

Overview

In many data science projects, the first step is to translate a real-world problem into a concrete analytical task and a set of data requirements. This includes defining the prediction or analysis target, the relevant data attributes, the applicable constraints, and how suitable data can be obtained. For example, a clinician may want to train an algorithm for the early detection of a disease. This goal must be translated into an exact clinical event definition, a list of required clinical attributes, constraints such as privacy or security, and a plan for where such data can be sourced. In practice, this phase is largely manual and often consumes substantial time and coordination. Recent research suggests that Large Language Models (LLMs) could automate parts of this phase. This thesis investigates how to formally represent early-stage data science requirements and how to support the automation of early-stage data science through an LLM-based agent.

 

Research Questions

  • What information must be represented to describe early-stage data science requirements (e.g., target, data attributes, constraints, acquisition options)?
  • How can this information be formalized in a structured and machine-interpretable way?
  • How can such a formalization be operationalized by an agent to generate concrete data acquisition artifacts?
  • How can the expressiveness and usefulness of the formalization be evaluated?
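
To make the first two questions more concrete, the following Python sketch shows one possible shape such a formalization could take. It is an illustration only, not the formalization the thesis is expected to produce: the class names `Constraint` and `RequirementSpec`, the field names, and the example values (`disease_x_onset`, `heart_rate`, `lab_value_a`) are all hypothetical and simply mirror the four information types listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Constraint:
    """A single constraint on the data (hypothetical structure)."""
    kind: str         # e.g. "privacy" or "security"
    description: str  # free-text explanation of the constraint


@dataclass
class RequirementSpec:
    """Hypothetical container for early-stage data science requirements."""
    target: str  # prediction or analysis target
    attributes: List[str] = field(default_factory=list)
    constraints: List[Constraint] = field(default_factory=list)
    acquisition_options: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the specification so that an agent could consume it."""
        return json.dumps(asdict(self), indent=2)

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Assumed example values, loosely following the clinical scenario above.
spec = RequirementSpec(
    target="disease_x_onset",
    attributes=["heart_rate", "lab_value_a"],
    constraints=[Constraint("privacy", "de-identified records only")],
)
print(spec.to_json())
print(spec.missing_fields())
```

A machine-readable structure of this kind would let an agent detect which parts of a requirement are still unspecified (here, the acquisition options) before it attempts to generate acquisition artifacts. Whether a flat schema like this is expressive enough is exactly what the last research question asks.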

 

Methodology

  • Design a structured formalization for task and data requirements.
  • Review related work on LLM-based agents and data discovery.
  • Implement a proof-of-concept LLM-based agent that operationalizes the formalization and generates acquisition artifacts for a sample use case.
  • Evaluate the formalization and the agent output, e.g., against a predefined ground truth and against baseline methods (such as a prompt-only approach).
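
As a rough illustration of the evaluation step, the sketch below compares a list of attributes proposed by an agent with a predefined ground-truth list using set-based precision, recall, and F1. This is only one assumed way to score the output; the function name `attribute_scores`, the case-insensitive exact matching, and the example attribute names are hypothetical, and the thesis may well choose different metrics.

```python
from typing import Dict, Iterable


def attribute_scores(predicted: Iterable[str], expected: Iterable[str]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 for proposed data attributes.

    Matching is exact after trimming whitespace and lowercasing (an assumption;
    semantic matching of attribute names would need a different approach).
    """
    pred = {a.strip().lower() for a in predicted}
    gold = {a.strip().lower() for a in expected}
    hits = len(pred & gold)

    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Assumed example: agent output versus a hand-written ground truth.
agent_output = ["age", "lab_value_a", "heart_rate"]
ground_truth = ["Age", "heart_rate", "blood_pressure", "lab_value_a"]
print(attribute_scores(agent_output, ground_truth))
```

The same function could be applied to the output of a prompt-only baseline, so that both approaches are scored against the same ground truth.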

 

Tasks

  • Literature review
  • Formalization of requirements
  • Implementation of agent prototype
  • Experimental evaluation

 

Initial Literature

If you are interested in this thesis, please email the thesis advisor with your CV and transcript of records.


Prerequisites:
  • Experience in data science
  • Knowledge of, or strong interest in, LLMs
  • Interest in data management
  • Preferred: Experience with Python or a similar language