While recent advancements in natural language processing have been largely driven by increasingly powerful large language models (LLMs), the role of data quality in fine-tuning these models remains underexplored. This thesis addresses the often-overlooked but critical aspect of data-centric AI by investigating how different types and levels of data degradation affect the performance of fine-tuned LLMs on tasks such as summarization and question answering. Unlike model-centric approaches that focus on architectural improvements, this work systematically introduces controlled degradations, such as spelling errors, grammatical mistakes, and semantic noise, into high-quality datasets. The goal is to quantify how these degradations impact downstream performance and identify which types of data noise are most detrimental. The results aim to inform best practices in dataset curation and reinforce the importance of data quality in building robust, task-specific LLM applications.
Thesis Type |
Status | Open
Presentation room | Seminar room I5 6202
Supervisor(s) | Sandra Geisler
Advisor(s) | Soo-Yon Kim
Contact | kim@dbis.rwth-aachen.de
Tasks:
- Identify and define relevant data quality dimensions and metrics: Conduct a literature review to select appropriate types of data degradation (e.g., spelling errors, grammar violations, semantic inconsistencies) and identify or define metrics for quantifying degradation severity (see the severity-metric sketch after this list).
- Implement a data degradation pipeline: Develop a system to introduce varying levels of controlled noise into clean text datasets across the identified quality dimensions, enabling fine-grained experimentation (a minimal pipeline sketch follows this list).
- Fine-tune and evaluate LLMs under degraded conditions: Fine-tune selected LLMs on the degraded datasets for specific NLP tasks (e.g., summarization, question answering), and evaluate their performance using task-relevant metrics (e.g., ROUGE, BLEU, F1) to assess how sensitive model performance is to changes in data quality (fine-tuning and evaluation sketches follow this list).
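As a starting point for the degradation pipeline and the severity metrics, the following Python sketch shows one possible way to inject character-level spelling noise at a controllable rate and to quantify severity as a character error rate. The noise model, function names, and thresholds are illustrative assumptions, not a prescribed design for the thesis.

```python
import random


def add_spelling_noise(text: str, rate: float, seed: int = 0) -> str:
    """Corrupt roughly `rate` of the alphabetic characters by swapping,
    deleting, or replacing them (one simple, assumed noise model)."""
    rng = random.Random(seed)
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["swap", "delete", "replace"])
            if op == "swap" and i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
            elif op == "delete":
                chars[i] = ""
            else:
                chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def character_error_rate(clean: str, noisy: str) -> float:
    """Levenshtein distance between clean and noisy text, normalised by
    the clean text length -- one candidate severity metric."""
    m, n = len(clean), len(noisy)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if clean[i - 1] == noisy[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, 1)


if __name__ == "__main__":
    clean = "The quick brown fox jumps over the lazy dog."
    for rate in (0.05, 0.15, 0.30):
        noisy = add_spelling_noise(clean, rate)
        print(f"rate={rate:.2f}  CER={character_error_rate(clean, noisy):.3f}  {noisy}")
```

Grammar violations and semantic inconsistencies would need their own injection functions (e.g., rule-based agreement errors or entity swaps), but the same rate-plus-severity-metric pattern applies.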
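Fine-tuning itself could follow a standard Hugging Face transformers workflow. The sketch below is a minimal example under assumptions: a small sequence-to-sequence model ("t5-small" purely as a placeholder), a single toy summarization pair whose input has been degraded, and default hyperparameters; all of these would be replaced by the actual models and datasets chosen in the thesis.

```python
# A minimal fine-tuning sketch, assuming the Hugging Face transformers/datasets
# stack; the model name, toy data, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Degraded source documents paired with clean target summaries (toy data).
data = Dataset.from_dict({
    "document": ["summarize: Teh studdy examins how noisy trainng data afects model qality."],
    "summary": ["The study examines how noisy training data affects model quality."],
})

def preprocess(batch):
    # Tokenize the degraded inputs and the clean targets.
    enc = tokenizer(batch["document"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              truncation=True, max_length=64)["input_ids"]
    return enc

tokenized = data.map(preprocess, batched=True, remove_columns=data.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="finetune-degraded",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

In the actual experiments, one such run per degradation type and severity level would yield the family of models to be compared.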
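For evaluation, ROUGE can be computed with the Hugging Face evaluate library, and a simple token-overlap F1 (as commonly used for extractive QA) can be implemented directly. The prediction and reference strings below are placeholders; only the metric calls illustrate the intended comparison between models trained on clean versus degraded data.

```python
# Evaluation sketch under assumptions: `evaluate` (plus `rouge_score`) is
# installed, and the prediction/reference lists are toy placeholders.
from collections import Counter

import evaluate  # pip install evaluate rouge_score


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Placeholder outputs from two fine-tuned models (clean vs. degraded data).
references = ["the study examines data quality in llm fine-tuning"]
preds_clean = ["the study examines data quality in llm fine-tuning"]
preds_noisy = ["the study examins data qality in llm finetuning"]

rouge = evaluate.load("rouge")
print("ROUGE clean:", rouge.compute(predictions=preds_clean, references=references))
print("ROUGE noisy:", rouge.compute(predictions=preds_noisy, references=references))
print("QA F1 clean:", token_f1(preds_clean[0], references[0]))
print("QA F1 noisy:", token_f1(preds_noisy[0], references[0]))
```

Plotting these scores against the degradation severity measured during data generation would give the sensitivity curves the thesis aims to quantify.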