{"id":6224,"date":"2025-06-03T12:34:46","date_gmt":"2025-06-03T10:34:46","guid":{"rendered":"https:\/\/dbis.rwth-aachen.de\/dbis\/?p=6224"},"modified":"2026-06-22T12:17:35","modified_gmt":"2026-06-22T10:17:35","slug":"applications-closed-analyzing-the-effect-of-data-quality-on-the-performance-of-fine-tuned-large-language-models","status":"publish","type":"post","link":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/2025\/applications-closed-analyzing-the-effect-of-data-quality-on-the-performance-of-fine-tuned-large-language-models\/","title":{"rendered":"Analyzing the Effect of Data Quality on the Performance of Fine-Tuned Large Language Models"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">While recent advancements in natural language processing have been largely driven by increasingly powerful large language models (LLMs), the role of data quality in fine-tuning these models remains underexplored. This thesis addresses the often-overlooked but critical aspect of data-centric AI by investigating how different types and levels of data degradation affect the performance of fine-tuned LLMs on tasks such as summarization and question answering. Unlike model-centric approaches that focus on architectural improvements, this work systematically introduces controlled degradations, such as spelling errors, grammatical mistakes, and semantic noise, into high-quality datasets. The goal is to quantify how these degradations impact downstream performance and identify which types of data noise are most detrimental. The results aim to inform best practices in dataset curation and reinforce the importance of data quality in building robust, task-specific LLM applications.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While recent advancements in natural language processing have been largely driven by increasingly powerful large language models (LLMs), the role of data quality in fine-tuning these models remains underexplored. This thesis addresses the often-overlooked but critical aspect of data-centric AI by investigating how different types and levels of data degradation affect the performance of fine-tuned [&hellip;]<\/p>\n","protected":false},"author":42,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[21],"tags":[],"class_list":["post-6224","post","type-post","status-publish","format-standard","hentry","category-thesis"],"acf":[],"_links":{"self":[{"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/posts\/6224","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/users\/42"}],"replies":[{"embeddable":true,"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/comments?post=6224"}],"version-history":[{"count":3,"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/posts\/6224\/revisions"}],"predecessor-version":[{"id":7229,"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/posts\/6224\/revisions\/7229"}],"wp:attachment":[{"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/media?parent=6224"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/categories?post=6224"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dbis.rwth-aachen.de\/dbis\/index.php\/wp-json\/wp\/v2\/tags?post=6224"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}