DBIS

Ontology-Based Data Augmentation with LLMs for Narrative Classification

May 12th, 2026

Narrative classification identifies stories via NLP, but models often lack generalizability. While LLMs are already used to augment data for other text tasks, their application to narrative classification remains exploratory. This thesis investigates whether an ontology-based LLM-agent framework incorporating specific data characteristics improves synthetic training data quality.

Thesis Type
  • Bachelor
  • Master
Status
Running
Presentation room
Seminar room I5 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Maximilian Kißgen
Contact
kissgen@dbis.rwth-aachen.de

Narrative classification refers to the use of automated, data-driven methods such as natural language processing and machine learning to identify and categorize the types of stories that circulate in social media, news, political discourse, and other large text corpora. As in other text classification fields, however, models tend to be too specific to the topics of their training data and generalize poorly beyond them. Over the last two years, large language models (LLMs) have increasingly been used to generate synthetic training data or augment existing datasets for text classification tasks such as stance detection and sentiment analysis. Their use in narrative classification, however, has so far been studied only in a limited and largely exploratory manner. In addition, prior work often treats augmentation as a generic generation problem and does not explicitly account for data characteristics such as narrative style, label distribution, length, complexity, or domain-specific linguistic patterns. This thesis therefore investigates whether an LLM-agent framework that incorporates such data characteristics can improve the quality and usefulness of augmented training data for narrative classification.

The first goal of the thesis is to define an ontology of data properties in narrative classification that can be used to characterize existing pre-labeled data.
The second goal is to design and implement an augmentation pipeline, built on a pre-existing approach, that generates task-relevant narrative data points from the characterized pre-labeled data.
For the evaluation, a pre-existing training method is applied to the resulting data, and the resulting model’s performance is assessed against established benchmarks and alternative augmentation strategies.
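
To make the first two goals more concrete, the following is a minimal Python sketch of how pre-labeled data might be characterized and how that characterization might steer an augmentation prompt. It is an illustration under stated assumptions, not the thesis design: the names DatasetProfile, profile_dataset, rarest_label, and build_augmentation_prompt are hypothetical, only two of the data characteristics mentioned above (label distribution and length) are covered, and no actual LLM is called.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class DatasetProfile:
    """Hypothetical summary of two data characteristics of a labeled corpus."""
    label_distribution: dict[str, float]  # share of samples per label
    mean_length_words: float              # mean whitespace-token count


def profile_dataset(samples: list[tuple[str, str]]) -> DatasetProfile:
    """Characterize (text, label) pairs by label shares and mean length."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(label for _, label in samples)
    total = len(samples)
    distribution = {label: n / total for label, n in counts.items()}
    lengths = [len(text.split()) for text, _ in samples]
    return DatasetProfile(distribution, mean(lengths))


def rarest_label(profile: DatasetProfile) -> str:
    """Pick the least frequent label as a candidate for augmentation."""
    return min(profile.label_distribution, key=profile.label_distribution.get)


def build_augmentation_prompt(profile: DatasetProfile, target_label: str) -> str:
    """Build a plain-text instruction for an LLM (assumed prompt wording)."""
    return (
        f"Write one text expressing the narrative '{target_label}'. "
        f"Aim for about {round(profile.mean_length_words)} words."
    )


# Toy data for illustration only; labels and texts are made up.
samples = [
    ("a b c", "narrative_x"),
    ("a b c d e", "narrative_x"),
    ("a", "narrative_y"),
]
profile = profile_dataset(samples)
prompt = build_augmentation_prompt(profile, rarest_label(profile))
```

In this sketch the underrepresented label is chosen as the generation target, which is only one possible way to use label distribution. Characteristics such as narrative style or complexity would need richer descriptors than simple counts.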

Research Questions

1. How can an ontology for narrative classification data and an LLM-agent framework for data augmentation be designed and implemented?
2. How well does a model trained with LLM-augmented narrative data perform on established benchmarks?
3. Does the labelling of data characteristics lead to improved performance in data augmentation?

 

Sources

https://arxiv.org/pdf/2512.03582
https://dl.acm.org/doi/full/10.1145/3717867.3717868
https://arxiv.org/pdf/2402.11621
https://ieeexplore.ieee.org/abstract/document/11080380 (LLM-based Synthesis in Text Classification)
https://openreview.net/forum?id=ws5phQki00 (LLM-based Synthesis in Stance Detection)


Prerequisites:

• Programming skills, especially in Python and preferably PyTorch
• Basic experience with machine learning and natural language processing
• Familiarity with ontologies, large language models and prompt-based methods
• Knowledge of text classification and ideally narrative classification
• Experience working with datasets, model training, and evaluation in an experimental setting