DBIS

On the Analysis and Mitigation of Hallucination in Vision-Language Models

October 17th, 2024

This research investigates hallucination in vision-language models, focusing on the role of the attention mechanism in both causing and potentially mitigating hallucinations. The work explores how attention layers influence the integration of visual and textual information and identifies techniques for reducing the generation of inaccurate or irrelevant outputs. The central research question is how attention mechanisms can be adjusted or improved to reduce hallucination in vision-language models, thereby improving reliability in applications such as image captioning and visual question answering.

Thesis Type
  • Bachelor
Student
Jan Ebigt
Status
Running
Presentation room
Seminar room I5 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Yongli Mou
Sulayman K. Sowe
Contact
mou@dbis.rwth-aachen.de
sowe@dbis.rwth-aachen.de

Background

Vision-language models are designed to interpret and describe visual inputs in natural language, enabling applications such as image captioning, object recognition, and visual question answering. However, these models often suffer from hallucination, generating outputs that are unfaithful to the visual input. Hallucination arises due to over-reliance on learned patterns from training data, misalignment between textual and visual modalities, or limitations in attention layers responsible for guiding the model’s focus. This project delves into how attention mechanisms affect these hallucinations, aiming to refine the alignment between modalities to improve model accuracy and relevance.

Objectives

  • Investigate how attention layers contribute to hallucination in vision-language models.
  • Develop techniques to reduce hallucination by adjusting attention mechanisms.
  • Evaluate the effectiveness of attention-modification strategies on vision-language model outputs.

Tasks

  • Analyze existing vision-language models with a focus on attention layers.
  • Experiment with various modifications to attention mechanisms to reduce hallucination.
  • Assess the impact of attention adjustments on the quality of model outputs in tasks like image captioning and visual question answering.
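One simple family of attention modifications that the tasks above could explore is biasing attention logits toward visual tokens before the softmax, so that generation is grounded in the image rather than in language priors. The sketch below is illustrative only, in plain PyTorch and not tied to any particular model; the `visual_mask` convention and the `bias` parameter are assumptions for this example, not the thesis method.

```python
import torch
import torch.nn.functional as F

def reweighted_attention(query, keys, values, visual_mask, bias=1.0):
    """Scaled dot-product attention with an additive boost on visual tokens.

    visual_mask: boolean tensor marking which key positions are image tokens
                 (broadcastable to the logit shape).
    bias:        added to the logits of visual positions before the softmax;
                 a positive value shifts attention mass toward the image,
                 an illustrative knob for studying hallucination.
    """
    d = query.size(-1)
    # Standard attention logits: (..., q_len, k_len)
    logits = query @ keys.transpose(-2, -1) / d ** 0.5
    # Boost visual positions only; text positions keep their original logits.
    logits = torch.where(visual_mask, logits + bias, logits)
    weights = F.softmax(logits, dim=-1)
    return weights @ values, weights

# Toy usage: 10 key positions, the first 6 being image tokens.
q = torch.randn(1, 4, 16)
k = torch.randn(1, 10, 16)
v = torch.randn(1, 10, 16)
mask = torch.zeros(1, 1, 10, dtype=torch.bool)
mask[..., :6] = True
out, attn = reweighted_attention(q, k, v, mask, bias=2.0)
```

Comparing the attention weights at `bias=0` (the unmodified baseline) against a positive bias shows how much probability mass shifts onto the visual tokens, which is one way the evaluation task could quantify the effect of such an adjustment.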

Prerequisites

  • In-depth knowledge of deep learning and large language models
  • Programming in Python (PyTorch, Transformers, etc.)