
DBIS

Generating Synthetic Training Data with LLMs for Sentiment Analysis

January 23rd, 2026

Sentiment analysis models detect emotion in text but need retraining for each new context. Large Language Models (LLMs) are increasingly used to generate training data, but performance is still limited. We aim to improve it by creating a structured framework for LLM-driven data synthesis.

Thesis Type
  • Master
Status
Open
Presentation room
Seminar room I5 6202
Supervisor(s)
Sandra Geisler
Stefan Decker
Advisor(s)
Maximilian Kißgen
Soo-Yon Kim
Contact
kissgen@dbis.rwth-aachen.de
kim@dbis.rwth-aachen.de

Sentiment analysis models can detect emotion in text, but their performance depends on
context. For example, “That’s a bold move!” could be positive when praising a politician’s healthcare
initiative, but negative when a student shows up unprepared for their final exam. And while models may
excel in one language, they often fail in others. Models therefore need to be retrained or fine-tuned
for each new context; however, training data is often scarce or non-existent.

Recently, researchers have turned to Large Language Models (LLMs) to generate synthetic
training data. However, a structured framework for LLM-driven data synthesis is still lacking.

Your task for this thesis is threefold:
1. Develop an end-to-end framework for generating synthetic training data in text classification,
including an extensive requirements analysis step.
2. Using the framework, choose and implement an existing approach for data synthesis with
LLMs in sentiment analysis or related domains.
3. Fine-tune an existing base model with the synthetic data and compare the base and fine-tuned
models on benchmarks.
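
To give a rough idea of what such a pipeline might involve, here is a minimal Python sketch of two of its pieces: label-conditioned prompting for synthesis and a macro-F1 score for the benchmark comparison. It is only an illustration under stated assumptions, not a prescribed design: the names `Example`, `build_prompt`, `synthesize`, `macro_f1` and the stand-in `fake_llm` are hypothetical, the prompt wording is invented, and a real setup would call an actual LLM and fine-tune a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Example:
    """One synthetic training example (hypothetical structure)."""
    text: str
    label: str


def build_prompt(label: str, context: str, n: int) -> str:
    # Assumed prompt wording; the thesis would derive this from the
    # requirements analysis rather than hard-code it.
    return (
        f"Write {n} short texts from the context '{context}' "
        f"that express a {label} sentiment. One text per line."
    )


def synthesize(
    generate: Callable[[str], str],
    labels: Sequence[str],
    context: str,
    n_per_label: int,
) -> List[Example]:
    """Query the generator once per label and parse one example per line."""
    data: List[Example] = []
    for label in labels:
        raw = generate(build_prompt(label, context, n_per_label))
        for line in raw.splitlines():
            line = line.strip()
            if line:  # skip empty lines in the model output
                data.append(Example(text=line, label=label))
    return data


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; always returns two canned lines.
    return "Sample text A\nSample text B\n"


synthetic = synthesize(fake_llm, ["positive", "negative"], "student forum posts", 2)
print(len(synthetic), "synthetic examples generated")
```

The same `macro_f1` function could be applied to the predictions of the base model and of the model fine-tuned on `synthetic`, both evaluated on the same held-out benchmark, to make the comparison in step 3 concrete.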

Preliminary Sources:

  • https://www.sciencedirect.com/science/article/pii/S0167811622000477
  • https://aclanthology.org/2024.findings-naacl.246/
  • https://www.nature.com/articles/s41598-024-60210-7
  • https://openreview.net/forum?id=ws5phQki00 (LLM-based Synthesis in Stance Detection)

If you’re interested in the thesis, please contact the advisors with your CV and a transcript of grades.


Prerequisites:
  • Knowledge of or strong interest in machine learning and LLMs
  • Fluency in Python or a related programming language
  • Preferred: Experience with social data science or the RWTH HPC pipeline