This research investigates hallucination in vision-language models, focusing on how the attention mechanism contributes to hallucinations and how it might be used to mitigate them. The work explores how attention layers influence the integration of visual and textual information and identifies techniques for reducing inaccurate or irrelevant outputs. A key research question is how attention mechanisms can be adjusted or improved to reduce hallucination in vision-language models, thereby improving reliability in applications such as image captioning and visual question answering.
Thesis Type |
Student | Jan Ebigt
Status | Running
Presentation room | Seminar room I5 6202
Supervisor(s) | Stefan Decker
Advisor(s) | Yongli Mou, Sulayman K. Sowe
Contact | mou@dbis.rwth-aachen.de, sowe@dbis.rwth-aachen.de
Background
Vision-language models are designed to interpret and describe visual inputs in natural language, enabling applications such as image captioning, object recognition, and visual question answering. However, these models often suffer from hallucination: they generate outputs that are unfaithful to the visual input. Hallucination can arise from over-reliance on patterns learned from training data, misalignment between the textual and visual modalities, or limitations in the attention layers that guide the model’s focus. This project examines how attention mechanisms affect these hallucinations, aiming to refine the alignment between modalities and thereby improve model accuracy and relevance.
Objectives
- Investigate how attention layers contribute to hallucination in vision-language models.
- Develop techniques to reduce hallucination by adjusting attention mechanisms.
- Evaluate the effectiveness of attention-modification strategies on vision-language model outputs.
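To make the evaluation objective concrete, the sketch below computes a simple object-level hallucination rate for a caption: the fraction of mentioned objects that do not appear in the image annotations. It is only an illustration, similar in spirit to object-hallucination metrics such as CHAIR, and not the evaluation protocol of this thesis. The function name `object_hallucination_rate` and the example object lists are assumptions made for this example, and the extraction of object mentions from a caption is assumed to have happened beforehand.

```python
def object_hallucination_rate(mentioned_objects, ground_truth_objects):
    """Fraction of mentioned objects that are absent from the image annotations.

    Hypothetical helper for illustration only; both arguments are
    iterables of object names.
    """
    mentioned = {obj.lower() for obj in mentioned_objects}
    truth = {obj.lower() for obj in ground_truth_objects}
    if not mentioned:
        # A caption that mentions no objects cannot hallucinate any.
        return 0.0
    hallucinated = mentioned - truth
    return len(hallucinated) / len(mentioned)


# Made-up example: the caption mentions a bench that is not annotated in the image.
rate = object_hallucination_rate(
    ["dog", "frisbee", "bench"],
    ["dog", "frisbee", "grass"],
)
```

Comparing this rate before and after an attention modification gives one rough indication of whether the modification helps, although a full evaluation would presumably also need to track caption quality.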
Tasks
- Analyze existing vision-language models with a focus on attention layers.
- Experiment with various modifications to attention mechanisms to reduce hallucination.
- Assess the impact of attention adjustments on the quality of model outputs in tasks like image captioning and visual question answering.
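As a rough illustration of the first two tasks, the sketch below measures how much of one query's attention falls on image tokens and then applies one possible adjustment: adding a constant bias to the image-token logits before the softmax. This is a minimal pure-Python sketch under stated assumptions, not the method developed in this thesis. The names `image_attention_mass` and `boost_image_logits`, the parameter `alpha`, and the toy logits are all hypothetical; in a real model the logits would come from an attention layer in PyTorch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def image_attention_mass(weights, is_image_token):
    """Sum of attention weights assigned to image tokens (analysis step)."""
    return sum(w for w, is_img in zip(weights, is_image_token) if is_img)


def boost_image_logits(logits, is_image_token, alpha):
    """Add log(alpha) to image-token logits (hypothetical modification step).

    After the softmax, this is equivalent to multiplying the unnormalized
    weights of image tokens by alpha and renormalizing.
    """
    bias = math.log(alpha)
    return [x + bias if is_img else x for x, is_img in zip(logits, is_image_token)]


# Toy attention logits for one query over two text tokens and two image tokens.
logits = [2.0, 1.0, 0.5, 0.5]
is_image_token = [False, False, True, True]

before = image_attention_mass(softmax(logits), is_image_token)
boosted = boost_image_logits(logits, is_image_token, alpha=2.0)
after = image_attention_mass(softmax(boosted), is_image_token)
```

With `alpha` greater than 1, `after` is larger than `before`, meaning the query attends more strongly to the image. Whether such a shift actually reduces hallucination is exactly the kind of question the tasks above are meant to answer.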
Requirements
- Deep knowledge of deep learning and large language models
- Programming language: Python (PyTorch, Transformers, etc.)