Sentiment analysis models detect emotion in text but need retraining for each new context. Large Language Models (LLMs) are increasingly used to generate the required training data, but performance is still limited. We aim to improve it by creating a structured framework for LLM-driven data synthesis.
Thesis Type |
Status | Open
Presentation room | Seminar room I5 6202
Supervisor(s) | Sandra Geisler, Stefan Decker
Advisor(s) | Maximilian Kißgen, Soo-Yon Kim
Contact | kissgen@dbis.rwth-aachen.de, kim@dbis.rwth-aachen.de
Sentiment analysis models are able to detect emotion in text, but their performance depends on
context. For example, “That’s a bold move!” could be positive when praising a politician’s healthcare
initiative, but negative if a student shows up unprepared for their final exam. And while models may
excel in one language, they often fail in others. Models therefore need to be retrained or fine-tuned
for each new context; however, training data is often scarce or non-existent.
Recently, researchers have turned to Large Language Models (LLMs) to generate synthetic
training data. However, a structured framework for LLM-driven data synthesis is still lacking.
Your task for this thesis is threefold:
1. Develop an end-to-end framework for generating synthetic training data in text classification,
including an extensive requirements analysis step.
2. With the framework, choose and implement an existing approach for data synthesis with
LLMs in sentiment analysis or related domains.
3. Fine-tune an existing base model with the synthetic data and compare the base and fine-tuned models on benchmarks.
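To make the three tasks more concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not a proposed design for the thesis. Everything in it is an assumption made for illustration: `stub_llm` returns canned sentences in place of a real LLM call, a small Naive Bayes classifier stands in for fine-tuning a base model, the always-positive baseline stands in for an unadapted base model, and the four-sentence benchmark is made-up toy data.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    text: str
    label: str


# Canned outputs standing in for LLM generations (illustrative toy data).
CANNED = {
    "positive": ["what a bold and brilliant move",
                 "great initiative, well done",
                 "i love this plan"],
    "negative": ["what a reckless and careless move",
                 "terrible idea, badly done",
                 "i hate this plan"],
}


def build_prompt(context: str, label: str, n: int) -> str:
    """Hypothetical prompt template; a real one would follow from the requirements analysis."""
    return f"Write {n} short statements with {label} sentiment about: {context}."


def stub_llm(prompt: str, n: int) -> list[str]:
    """Placeholder for an LLM call: picks canned text based on the requested sentiment."""
    label = "positive" if "positive sentiment" in prompt else "negative"
    return CANNED[label][:n]


def synthesize(context: str, label: str, n: int,
               llm: Callable[[str, int], list[str]]) -> list[Example]:
    """Task 2 (sketch): turn LLM generations into labeled training examples."""
    return [Example(text, label) for text in llm(build_prompt(context, label, n), n)]


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(data: list[Example]) -> Callable[[str], str]:
    """Stand-in for fine-tuning: a Naive Bayes classifier with add-one smoothing."""
    class_counts = Counter(ex.label for ex in data)
    word_counts = {label: Counter() for label in class_counts}
    for ex in data:
        word_counts[ex.label].update(tokenize(ex.text))
    vocab = {tok for counts in word_counts.values() for tok in counts}

    def predict(text: str) -> str:
        scores = {}
        for label, n_docs in class_counts.items():
            total = sum(word_counts[label].values())
            score = math.log(n_docs / len(data))
            for tok in tokenize(text):
                if tok in vocab:  # out-of-vocabulary tokens are ignored
                    score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    return predict


def accuracy(predict: Callable[[str], str], benchmark: list[Example]) -> float:
    return sum(predict(ex.text) == ex.label for ex in benchmark) / len(benchmark)


def run_pipeline() -> tuple[float, float]:
    """Task 3 (sketch): compare an unadapted baseline with the model trained on synthetic data."""
    context = "political commentary"
    synthetic = (synthesize(context, "positive", 3, stub_llm)
                 + synthesize(context, "negative", 3, stub_llm))
    benchmark = [Example("a brilliant initiative", "positive"),
                 Example("a reckless idea", "negative"),
                 Example("i love it", "positive"),
                 Example("i hate it", "negative")]
    baseline = lambda text: "positive"  # stand-in for a base model without adaptation
    tuned = train_naive_bayes(synthetic)
    return accuracy(baseline, benchmark), accuracy(tuned, benchmark)


if __name__ == "__main__":
    base_acc, tuned_acc = run_pipeline()
    print(f"baseline accuracy: {base_acc:.2f}, after training on synthetic data: {tuned_acc:.2f}")
```

Passing the generator in as a callable keeps the synthesis step independent of any particular LLM, so different synthesis approaches could be swapped in and compared on the same benchmark.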
Preliminary Sources:
- https://www.sciencedirect.com/science/article/pii/S0167811622000477
- https://aclanthology.org/2024.findings-naacl.246/
- https://www.nature.com/articles/s41598-024-60210-7
- https://openreview.net/forum?id=ws5phQki00 (LLM-based Synthesis in Stance Detection)
If you’re interested in the thesis, please contact the advisors with a CV and a grade record.
Prerequisites:
- Knowledge of or strong interest in machine learning and LLMs
- Fluency in Python or a related programming language
- Preferred: Experience with social data science or the RWTH HPC pipeline