Knowledge-augmented multiple-choice question answering (MCQA) aims to improve robustness and factual grounding by integrating external structured knowledge (e.g., knowledge graphs) into language-model-based decision making. Current high-performing systems typically retrieve a local subgraph relevant to a question and its candidate answers, then combine pretrained language representations with explicit graph reasoning modules. This thesis investigates an alternative representation path: instead of processing retrieved knowledge graph (KG) subgraphs as symbolic triples with graph neural networks, the subgraphs are deterministically rendered into a compact 2D “visual graph” representation and encoded with a vision backbone. The resulting visual KG evidence is fused with an encoder-only language model via attention-based cross-modal interaction. The core research question is whether a visually encoded KG can preserve decision-relevant relational structure and support competitive knowledge-augmented MCQA performance on CommonsenseQA and OpenBookQA (optionally extending to MedQA-USMLE).
Thesis Type |
Student | Ishita Vashist
Status | Running
Presentation room | Seminar room I5 - 6202
Supervisor(s) | Stefan Decker
Advisor(s) | Yixin Peng
Contact | peng@dbis.rwth-aachen.de
Background
Reproducible end-to-end training of language models is increasingly important for understanding how model behavior emerges from architectural and optimization choices. Tiny models in the approximate 100M-parameter regime provide a realistic experimental setting because they allow controlled comparisons without requiring large-scale infrastructure. MiniMind is designed as a lightweight training framework for small language models and is therefore a suitable basis for this thesis.
Recent work on Attention Residuals argues that standard residual accumulation treats all earlier layer contributions uniformly, which can lead to hidden-state growth and weaker control over depth-wise information flow. To improve scalability, the paper proposes Block AttnRes, where layers are partitioned into blocks and only block summaries are used for cross-layer aggregation. While this idea has shown promising results in larger-scale settings, its effect on tiny language models trained from scratch under matched budgets remains largely unexplored. This thesis addresses that gap through a controlled comparison between standard residual connections and Block AttnRes within the same training pipeline and evaluation setting.
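The contrast between standard residual accumulation and block-level aggregation can be illustrated with a toy sketch. This is a schematic interpretation only, not the paper's actual formulation: hidden states are plain float vectors, each "layer" is a stand-in for an attention/MLP sublayer, and the block summary is assumed (for illustration) to be the mean of the layer outputs within a block.

```python
# Toy contrast: standard residuals vs. a block-summary variant
# (hypothetical simplification of Block AttnRes, not the paper's method).

def layer(x, w):
    """Toy 'layer': scales each element; stands in for attention/MLP."""
    return [w * v for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def standard_residual(x, weights):
    # Every layer output is added directly onto the running hidden state,
    # so all earlier contributions accumulate uniformly.
    for w in weights:
        x = add(x, layer(x, w))
    return x

def block_residual(x, weights, block_size):
    # Layers are grouped into blocks; only a block summary (here, the
    # mean of the layer outputs in the block) is added across blocks.
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        outs = [layer(x, w) for w in block]
        summary = [sum(vals) / len(block) for vals in zip(*outs)]
        x = add(x, summary)
    return x
```

Even in this toy form, the two variants diverge: with two layers of weight 0.5 and a single block, the standard path applies the second layer to an already-updated state, while the block path adds only one averaged summary.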
Tasks
a) Baseline reproduction
- Set up the end-to-end training pipeline based on MiniMind for a tiny causal language model, including tokenizer preparation, pretraining, instruction tuning, and downstream fine-tuning.
- Run and document a baseline with standard residual connections.
- Implement a reproducible CommonsenseQA evaluation pipeline, including logging of seeds, configurations, checkpoints, and dataset versions.
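The reproducibility requirement in the last step can be sketched as a small run-snapshot helper: fix the RNG seed and record a hash of the configuration so that any checkpoint can be traced back to its exact settings. The function name and record fields are illustrative, not part of the MiniMind API.

```python
import hashlib
import json
import random

def run_config_snapshot(config, seed):
    """Seed the RNG and return a loggable record of this run
    (hypothetical helper; field names are illustrative).

    Hashing the sorted JSON of (seed, config) gives a short stable
    identifier to attach to checkpoints and evaluation logs.
    """
    random.seed(seed)
    record = {"seed": seed, "config": config}
    blob = json.dumps(record, sort_keys=True)
    record["config_hash"] = hashlib.sha256(blob.encode()).hexdigest()[:12]
    return record
```

In a real pipeline the same idea extends to framework seeds (e.g. `torch.manual_seed`) and to recording dataset versions alongside the hash.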
b) Implementation of Block AttnRes
- Implement Block Attention Residuals in the MiniMind backbone.
- Partition layers into blocks and aggregate information via block-level representations.
- Ensure that the standard residual and Block AttnRes variants are comparable under matched core hyperparameters and training budgets.
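The matched-budget requirement can be enforced mechanically before launching a run: compare the two variants' configurations on the core hyperparameters and refuse to proceed if any differ. The key names below are assumptions for illustration, not MiniMind's actual config schema.

```python
# Core hyperparameters that must be identical across the two variants
# (illustrative names, not MiniMind's actual config keys).
CORE_KEYS = ("n_layers", "d_model", "lr", "batch_size", "train_steps")

def matched(cfg_a, cfg_b, keys=CORE_KEYS):
    """Return the core hyperparameters on which the two variants differ;
    an empty dict means the comparison is budget-matched."""
    return {k: (cfg_a.get(k), cfg_b.get(k))
            for k in keys if cfg_a.get(k) != cfg_b.get(k)}
```

A pre-flight check like `assert not matched(base_cfg, attnres_cfg)` makes accidental budget mismatches fail loudly instead of silently biasing the comparison.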
c) Controlled training and evaluation
- Train both model variants under aligned settings.
- Fine-tune both variants on CommonsenseQA.
- Compare dev/test accuracy and efficiency-related measures such as memory usage and training cost.
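The comparison step reduces to two small pieces of bookkeeping, sketched below: accuracy over CommonsenseQA's five-way choices, and per-metric deltas between the two variants. Metric key names are assumptions for illustration.

```python
def accuracy(preds, golds):
    """Fraction of items where the predicted choice (one of A-E)
    matches the gold label."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def compare(baseline, variant,
            metrics=("dev_acc", "test_acc", "peak_mem_gb")):
    """Per-metric difference, variant minus baseline
    (hypothetical metric keys; positive means the variant is higher)."""
    return {m: variant[m] - baseline[m] for m in metrics}
```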
d) Analysis and ablations
- Analyze training stability and representation behavior across depth.
- Perform a qualitative error analysis of CommonsenseQA predictions.
- Conduct optional ablations by varying block size or number of blocks and studying their effect on performance and efficiency.
- Optionally include a lightweight visualization component to illustrate block-level residual aggregation during inference.
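For the block-size ablation, it helps to enumerate the valid settings up front. The sketch below assumes, for simplicity, that only block sizes dividing the layer count evenly are admitted; whether uneven blocks should also be tested is a design choice for the ablation.

```python
def ablation_grid(n_layers, candidate_block_sizes):
    """Enumerate (block_size, n_blocks) settings for the ablation,
    keeping only sizes that divide the layer count evenly
    (a simplifying assumption for this sketch)."""
    grid = []
    for bs in candidate_block_sizes:
        if n_layers % bs == 0:
            grid.append({"block_size": bs, "n_blocks": n_layers // bs})
    return grid
```

For a 12-layer backbone this yields block sizes 2, 3, and 4 from the candidates {2, 3, 4, 5}, with 5 rejected as a non-divisor.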
References
- Kimi Team et al.: Attention Residuals. arXiv preprint arXiv:2603.15031 (2026).
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention Is All You Need. In: Advances in Neural Information Processing Systems 30 (2017).
- He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016).
- Talmor, A., Herzig, J., Lourie, N., Berant, J.: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158 (2019).
- Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 4444–4451 (2017).
- Gong, J.: MiniMind. GitHub repository. Available at: https://github.com/jingyaogong/minimind, last accessed 2026/04/16.
Prerequisites
- Strong Python skills and experience with data processing and experiment management.
- Familiarity with PyTorch and Transformer training.
- Basic understanding of language modeling, supervised fine-tuning, and evaluation methods.
- Interest in neural architecture analysis, especially residual connections and attention mechanisms.