Automated Data Processing and Knowledge Discovery for Time Series Data utilizing Large Language Models

February 14th, 2024

Large language models (LLMs) have proven able to assist diverse users with a variety of individual tasks via intuitive, natural conversations. This thesis discusses the use of LLMs to perform (semi-)automated data processing and analysis on time series data. One major goal is to reduce the dependency on expert knowledge, allowing more people to manipulate data and gain useful insights.

Thesis Type
  • Master
Presentation room
Seminar room I5 6202
Stefan Decker

The demand for data processing and analysis across various domains and levels of expertise is a significant reason for the growth of automated data processing solutions. However, some challenges remain, such as the need to understand the associated products and the restriction to the functionality they provide.
In the scope of ENGAGE-D (Enhancing Gaia-X Data Access through Interactive SD Generation and Discovery using LLMs and Automating Data Processing for Time Series Analysis), a project funded by Fraunhofer FIT, we plan to investigate the capabilities of modern LLMs to automate data processing and analysis for time series data. In particular, we evaluate the ability of different LLMs both to generate program code for automated pipelines and to support the generation of methodically optimized solutions.


    • Identifying the potential of LLMs to correctly generate executable program code for predefined data processing and analysis tasks (e.g., data wrangling, clustering)

    • Identifying the potential of LLMs to extract general facts and representative characteristics of time series data

    • Identifying the potential of LLMs for task-associated problem-solving including
        • Process pipeline generation

        • Method selection and optimization


    • Design and implementation of an evaluation framework based on LangChain

    • Creation of meaningful test cases for the evaluation of predefined scenarios based on ground truth information

    • Potential implementation of LLM-related optimizations (e.g., prompt engineering)
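
To illustrate what evaluating generated code against ground truth could look like, here is a minimal sketch in plain Python. It is not the project's framework and does not use LangChain. The names `GENERATED_CODE`, `TEST_CASE`, `run_candidate`, `matches_ground_truth` and `evaluate` are hypothetical, and the code string is a hand-written stand-in for what an LLM might return for a simple data wrangling task (filling missing values in a series with the mean of the observed values).

```python
# Hypothetical stand-in for code returned by an LLM. A real framework would
# obtain this string from a model call instead of hard-coding it.
GENERATED_CODE = '''
def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]
'''

# Hypothetical test case: an input series plus its ground truth result.
TEST_CASE = {
    "func_name": "fill_missing",
    "series": [1.0, None, 3.0, None, 5.0],
    "expected": [1.0, 3.0, 3.0, 3.0, 5.0],
}


def run_candidate(code, func_name, series):
    """Execute candidate code in a fresh namespace and call the named function.

    Returns (ok, result_or_error). Note that exec() provides no sandboxing,
    so untrusted model output should only be run in an isolated environment.
    """
    namespace = {}
    try:
        exec(code, namespace)
        return True, namespace[func_name](list(series))
    except Exception as exc:  # syntax errors, missing function, runtime errors
        return False, repr(exc)


def matches_ground_truth(result, expected, tol=1e-9):
    """Compare a numeric result list with the expected values element-wise."""
    if not isinstance(result, list) or len(result) != len(expected):
        return False
    try:
        return all(abs(r - e) <= tol for r, e in zip(result, expected))
    except TypeError:  # non-numeric entries, e.g. None left in the result
        return False


def evaluate(code, test_case):
    """Report whether the code ran and whether its output was correct."""
    ok, result = run_candidate(code, test_case["func_name"], test_case["series"])
    correct = ok and matches_ground_truth(result, test_case["expected"])
    return {"executable": ok, "correct": correct}


if __name__ == "__main__":
    print(evaluate(GENERATED_CODE, TEST_CASE))
```

Separating "executable" from "correct" mirrors the distinction in the goals above: a model may produce code that runs without error but still fails to match the ground truth.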


    • Basic skills in Python and common data science libraries (e.g., numpy, pandas, sklearn)

    • Basic knowledge of data science and machine learning


Interested in working on this topic for your thesis? Write an email to