Generating Synthetic Training Data with LLMs for Sentiment Analysis

January 23rd, 2026

Sentiment analysis models detect emotion in text but need retraining for each new context. Large Language Models (LLMs) are increasingly used to generate training data, but performance is still limited. We aim to improve it by creating a structured framework for LLM-driven data synthesis.

Thesis Type	Master
Status	Running
Presentation room	Seminar room I5 6202
Supervisor(s)	Sandra Geisler, Stefan Decker
Advisor(s)	Maximilian Kißgen, Soo-Yon Kim
Contact	kissgen@dbis.rwth-aachen.de, kim@dbis.rwth-aachen.de

Sentiment analysis models can detect emotion in text, but their performance depends on
context. For example, “That’s a bold move!” could be positive when praising a politician’s healthcare
initiative, but negative when a student shows up unprepared for their final exam. And while models may
excel in one language, they often fail in others. Models therefore need to be retrained or fine-tuned
for each new context, but training data is often scarce or non-existent.

Recently, researchers have turned to Large Language Models (LLMs) to generate synthetic
training data. However, a structured framework for LLM-driven data synthesis is still lacking.

Your task for this thesis is threefold:
1. Develop an end-to-end framework for generating synthetic training data in text classification,
including an extensive requirements analysis step.
2. With the framework, adapt an existing approach for data synthesis with
LLMs in sentiment analysis or related domains.
3. Fine-tune an existing base model with the synthetic data and compare it on benchmarks.
Preliminary Sources:

https://www.sciencedirect.com/science/article/pii/S0167811622000477
https://aclanthology.org/2024.findings-naacl.246/
https://www.nature.com/articles/s41598-024-60210-7
https://openreview.net/forum?id=ws5phQki00 (LLM-based Synthesis in Stance Detection)

If you’re interested in the thesis, please contact the advisors and include your CV and a grade record.

Prerequisites:

Knowledge of or strong interest in machine learning and LLMs
Fluency in Python or a related programming language
Preferred: Experience with social data science or the RWTH HPC pipeline
