Formalizing Early-Stage Data Science Requirements for an LLM-Based Data Acquisition Agent

February 9th, 2026

This thesis investigates how to formally represent early-stage data science requirements and how to support the automation of early-stage data science through an LLM-based agent.

Thesis Type	Bachelor
Status	Running
Presentation room	Seminar room I5 6202
Supervisor(s)	Sandra Geisler
Advisor(s)	Soo-Yon Kim
Contact	kim@dbis.rwth-aachen.de

Overview

In many data science projects, the first step is to translate a real-world problem into a concrete analytical task and a set of data requirements. This includes defining the prediction or analysis target, the relevant data attributes, the applicable constraints, and how suitable data can be obtained. For example, a clinician may be interested in training an algorithm for the early detection of a disease. This goal must be translated into an exact clinical event definition, a list of required clinical attributes, constraints such as privacy or security, and a plan for where such data can be sourced. In practice, this phase is largely manual and often consumes substantial time and coordination effort. Recent research indicates that Large Language Models (LLMs) have the potential to automate parts of this phase. This thesis investigates how to formally represent early-stage data science requirements and how to support the automation of early-stage data science through an LLM-based agent.

Research Questions

What information must be represented to describe early-stage data science requirements (e.g., target, data attributes, constraints, acquisition options)?
How can this information be formalized in a structured and machine-interpretable way?
How can such a formalization be operationalized by an agent to generate concrete data acquisition artifacts?
How can the expressiveness and usefulness of the formalization be evaluated?
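
To make the first two questions more concrete, the following is a minimal sketch of what a structured, machine-interpretable representation could look like. It is not the formalization to be developed in the thesis. The class name DataRequirements, its four fields (taken from the examples named in the first question), and all example values are hypothetical and serve only as an illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class DataRequirements:
    """Hypothetical container for early-stage data science requirements."""

    target: str                                   # prediction or analysis target
    attributes: List[str] = field(default_factory=list)           # required data attributes
    constraints: List[str] = field(default_factory=list)          # e.g. privacy or security
    acquisition_options: List[str] = field(default_factory=list)  # where data could be sourced

    def to_json(self) -> str:
        """Serialize the requirements so that an agent or tool can read them."""
        return json.dumps(asdict(self), indent=2)


# Illustrative values only; they are assumptions, not part of the thesis topic.
example = DataRequirements(
    target="early detection of a disease (exact clinical event definition to be fixed)",
    attributes=["age", "heart_rate", "lab_values"],
    constraints=["privacy", "security"],
    acquisition_options=["hospital information system", "public clinical dataset"],
)

print(example.to_json())
```

A serialized form such as this JSON output is one possible input for an agent. A richer formalization, for example one with typed attributes or explicit relations between constraints and data sources, may well turn out to be necessary.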

Methodology

Design a structured formalization for task and data requirements.
Review related work on LLM-based agents and data discovery.
Implement a proof-of-concept LLM-based agent that operationalizes the formalization and generates acquisition artifacts for a sample use case.
Evaluate the formalization and the agent output, e.g., against a predefined ground truth and against baseline methods (such as a prompt-only approach).
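
As one possible way to carry out the evaluation step, the attributes proposed by the agent could be compared with a predefined ground truth using set-based precision, recall, and F1. The sketch below assumes that attributes can be matched by normalized name. The function attribute_scores and all attribute lists are hypothetical, and the actual evaluation design is part of the thesis.

```python
from typing import Dict, Iterable


def attribute_scores(predicted: Iterable[str], ground_truth: Iterable[str]) -> Dict[str, float]:
    """Compare two attribute lists by normalized name (assumption: exact name match)."""
    pred = {a.strip().lower() for a in predicted}
    truth = {a.strip().lower() for a in ground_truth}
    true_positives = len(pred & truth)

    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example data for illustration only.
ground_truth = ["age", "heart_rate", "lactate", "temperature"]
agent_output = ["Age", "heart_rate", "blood_pressure"]
prompt_only_baseline = ["age"]

agent_scores = attribute_scores(agent_output, ground_truth)
baseline_scores = attribute_scores(prompt_only_baseline, ground_truth)

print("agent:   ", agent_scores)
print("baseline:", baseline_scores)
```

Exact name matching is a deliberately simple choice. In practice, synonyms and different levels of granularity would probably require a more tolerant matching scheme.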

Tasks

Literature review
Formalization of requirements
Implementation of agent prototype
Experimental evaluation

Initial Literature

Rahman, M. et al. (2025): LLM-Based Data Science Agents: A Survey of Capabilities, Challenges, and Future Directions. https://doi.org/10.48550/arXiv.2510.04023
Sancricca, C. (2024): DIANA: a Knowledge-driven Framework for Data-centric AI. https://ceur-ws.org/Vol-3651/PhDW-4.pdf

If you are interested in this thesis, please email the thesis advisor and attach your CV and transcript of records.

Prerequisites:

Experience in data science
Knowledge of or strong interest in LLMs
Interest in data management
Preferred: Experience in Python or a similar programming language

DBIS