
DBIS

Training a Tiny LLM with Block Attention Residuals on CommonsenseQA

April 23rd, 2026


Thesis Type
  • Bachelor
Student
Ishita Vashist
Status
Running
Presentation room
Seminar room I5 - 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Yixin Peng
Contact
peng@dbis.rwth-aachen.de

Background

Reproducible end-to-end training of language models is increasingly important for understanding how model behavior emerges from architectural and optimization choices. Tiny models in the approximate 100M-parameter regime provide a realistic experimental setting because they allow controlled comparisons without requiring large-scale infrastructure. MiniMind is designed as a lightweight training framework for small language models and is therefore a suitable basis for this thesis.

Recent work on Attention Residuals argues that standard residual accumulation treats all earlier layer contributions uniformly, which can lead to hidden-state growth and weaker control over depth-wise information flow. To improve scalability, the paper proposes Block AttnRes, where layers are partitioned into blocks and only block summaries are used for cross-layer aggregation. While this idea has shown promising results in larger-scale settings, its effect on tiny language models trained from scratch under matched budgets remains largely unexplored. This thesis addresses that gap through a controlled comparison between standard residual connections and Block AttnRes within the same training pipeline and evaluation setting.
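The paper's exact formulation is not reproduced here, but the core idea of block-level aggregation can be sketched as follows. This is an illustrative reading, not the reference implementation: `nn.Linear` layers stand in for Transformer layers, and the softmax-weighted mixing of block summaries is a simplifying assumption.

```python
import torch
import torch.nn as nn

class BlockAttnResSketch(nn.Module):
    """Illustrative sketch of Block AttnRes: layers are grouped into blocks,
    each block emits a summary state, and later blocks aggregate over earlier
    block summaries instead of accumulating every layer's residual.
    (Hypothetical reading; not the paper's exact formulation.)"""

    def __init__(self, d_model: int, n_layers: int, block_size: int):
        super().__init__()
        assert n_layers % block_size == 0
        self.block_size = block_size
        # stand-ins for Transformer layers
        self.layers = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_layers)
        )
        # scores each block summary for softmax-weighted aggregation
        self.summary_attn = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        summaries = []  # block-level summaries for cross-layer aggregation
        h = x
        for i, layer in enumerate(self.layers):
            h = h + layer(h)  # standard within-block residual
            if (i + 1) % self.block_size == 0:  # block boundary
                summaries.append(h)
                if len(summaries) > 1:
                    # aggregate earlier block summaries, not per-layer states
                    s = torch.stack(summaries, dim=0)
                    w = torch.softmax(self.summary_attn(s), dim=0)
                    h = h + (w * s).sum(dim=0)
        return h
```

The contrast with standard residuals is that between blocks only the stacked summaries participate in aggregation, bounding the number of cross-layer terms by the number of blocks rather than the number of layers.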


Tasks

a) Baseline reproduction

  • Set up the end-to-end training pipeline based on MiniMind for a tiny causal language model, including tokenizer preparation, pretraining, instruction tuning, and downstream fine-tuning.
  • Run and document a baseline with standard residual connections.
  • Implement a reproducible CommonsenseQA evaluation pipeline, including logging of seeds, configurations, checkpoints, and dataset versions.
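One way to make the evaluation pipeline reproducible is to seed the RNGs and write a per-run manifest. A minimal stdlib sketch follows; the file layout and field names are illustrative assumptions, not MiniMind conventions, and a real pipeline would also seed `torch` and `numpy` and record checkpoint paths and dataset versions.

```python
import json
import random
import hashlib
from pathlib import Path

def log_run_manifest(out_dir: str, config: dict, seed: int) -> Path:
    """Seed the RNG and write a JSON manifest so a CommonsenseQA run can be
    reproduced later (illustrative sketch; field names are assumptions)."""
    random.seed(seed)  # a real pipeline would also seed torch and numpy
    manifest = {
        "seed": seed,
        "config": config,
        # a config hash lets runs be deduplicated and compared at a glance
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:12],
    }
    path = Path(out_dir) / "run_manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2))
    return path
```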

b) Implementation of Block AttnRes

  • Implement Block Attention Residuals in the MiniMind backbone.
  • Partition layers into blocks and aggregate information via block-level representations.
  • Ensure that the standard residual and Block AttnRes variants are comparable under matched core hyperparameters and training budgets.
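Comparability under matched budgets can be checked mechanically before training. The sketch below compares trainable parameter counts of the two variants; the 1% tolerance is an arbitrary illustrative threshold, not a requirement from the task description.

```python
import torch.nn as nn

def param_count(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def assert_matched(baseline: nn.Module, variant: nn.Module,
                   tol: float = 0.01) -> None:
    """Fail fast if the Block AttnRes variant deviates from the baseline
    by more than `tol` (relative), keeping the comparison honest."""
    a, b = param_count(baseline), param_count(variant)
    if abs(a - b) / max(a, 1) > tol:
        raise ValueError(f"parameter counts differ: baseline={a}, variant={b}")
```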

c) Controlled training and evaluation

  • Train both model variants under aligned settings.
  • Fine-tune both variants on CommonsenseQA.
  • Compare dev/test accuracy and efficiency-related measures such as memory usage and training cost.
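Dev/test accuracy for a causal LM on CommonsenseQA is commonly computed by scoring each answer choice's tokens under the model and picking the choice with the highest log-likelihood. A minimal sketch, assuming the model is a callable returning logits of shape `(1, seq, vocab)`; the actual MiniMind interface may differ.

```python
import torch
import torch.nn.functional as F

def score_choice(model, input_ids: torch.Tensor, answer_start: int) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens in
    `input_ids[answer_start:]` (question and choice concatenated)."""
    with torch.no_grad():
        logits = model(input_ids.unsqueeze(0))         # (1, seq, vocab)
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # next-token predictions
    targets = input_ids[1:]                            # shifted targets
    token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_lp[answer_start - 1:].sum().item()    # answer tokens only

def predict(model, encoded_choices: list[tuple[torch.Tensor, int]]) -> int:
    """Return the index of the highest-scoring (input_ids, answer_start) pair."""
    scores = [score_choice(model, ids, start) for ids, start in encoded_choices]
    return max(range(len(scores)), key=scores.__getitem__)
```

Summed log-likelihood favors shorter choices; normalizing by answer length is a common variant worth reporting alongside it.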

d) Analysis and ablations

  • Analyze training stability and representation behavior across depth.
  • Perform a qualitative error analysis of CommonsenseQA predictions.
  • Conduct optional ablations by varying block size or number of blocks and studying their effect on performance and efficiency.
  • Optionally include a lightweight visualization component to illustrate block-level residual aggregation during inference.

References

  1. Kimi Team et al.: Attention Residuals. arXiv preprint arXiv:2603.15031 (2026).
  2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention Is All You Need. In: Advances in Neural Information Processing Systems 30 (2017).
  3. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016).
  4. Talmor, A., Herzig, J., Lourie, N., Berant, J.: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158 (2019).
  5. Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 4444–4451 (2017).
  6. Gong, J.: MiniMind. GitHub repository. Available at: https://github.com/jingyaogong/minimind, last accessed 2026/04/16.

Prerequisites
  • Strong Python skills and experience with data processing and experiment management.
  • Familiarity with PyTorch and Transformer training.
  • Basic understanding of language modeling, supervised fine-tuning, and evaluation methods.
  • Interest in neural architecture analysis, especially residual connections and attention mechanisms.