
DBIS

Training a Tiny LLM with Block Attention Residuals on CommonsenseQA

April 23rd, 2026


Thesis Type
  • Bachelor
Student
Ishita Vashist
Status
Running
Presentation room
Seminar room I5 - 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Yixin Peng
Contact
peng@dbis.rwth-aachen.de

Background

Reproducible end-to-end training of language models is increasingly important for understanding how model behavior emerges from architectural and optimization choices. Tiny models in the approximate 100M-parameter regime provide a realistic experimental setting because they allow controlled comparisons without requiring large-scale infrastructure. MiniMind is designed as a lightweight training framework for small language models and is therefore a suitable basis for this thesis.

Recent work on Attention Residuals argues that standard residual accumulation treats all earlier layer contributions uniformly, which can lead to hidden-state growth and weaker control over depth-wise information flow. To improve scalability, the paper proposes Block AttnRes, where layers are partitioned into blocks and only block summaries are used for cross-layer aggregation. While this idea has shown promising results in larger-scale settings, its effect on tiny language models trained from scratch under matched budgets remains largely unexplored. This thesis addresses that gap through a controlled comparison between standard residual connections and Block AttnRes within the same training pipeline and evaluation setting.
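The paper's exact formulation is not reproduced here, but the core idea of block-level aggregation can be sketched as follows. This is an illustrative reading, not the reference implementation: `nn.Linear` layers stand in for Transformer layers, and the softmax-weighted mixing of block summaries is a simplifying assumption.

```python
import torch
import torch.nn as nn

class BlockAttnResSketch(nn.Module):
    """Illustrative sketch of Block AttnRes: layers are grouped into blocks,
    each block emits a summary state, and later blocks aggregate over earlier
    block summaries instead of accumulating every layer's residual.
    (Hypothetical reading; not the paper's exact formulation.)"""

    def __init__(self, d_model: int, n_layers: int, block_size: int):
        super().__init__()
        assert n_layers % block_size == 0
        self.block_size = block_size
        # stand-ins for Transformer layers
        self.layers = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_layers)
        )
        # scores each block summary for softmax-weighted aggregation
        self.summary_attn = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        summaries = []  # block-level summaries for cross-layer aggregation
        h = x
        for i, layer in enumerate(self.layers):
            h = h + layer(h)  # standard within-block residual
            if (i + 1) % self.block_size == 0:  # block boundary
                summaries.append(h)
                if len(summaries) > 1:
                    # aggregate earlier block summaries, not per-layer states
                    s = torch.stack(summaries, dim=0)
                    w = torch.softmax(self.summary_attn(s), dim=0)
                    h = h + (w * s).sum(dim=0)
        return h
```

The contrast with standard residuals is that between blocks only the stacked summaries participate in aggregation, bounding the number of cross-layer terms by the number of blocks rather than the number of layers.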


Tasks

a) Baseline reproduction

  • Set up the end-to-end training pipeline based on MiniMind for a tiny causal language model, including tokenizer preparation, pretraining, instruction tuning, and downstream fine-tuning.
  • Run and document a baseline with standard residual connections.
  • Implement a reproducible CommonsenseQA evaluation pipeline, including logging of seeds, configurations, checkpoints, and dataset versions.
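One way to make the evaluation pipeline reproducible is to seed the RNGs and write a per-run manifest. A minimal stdlib sketch follows; the file layout and field names are illustrative assumptions, not MiniMind conventions, and a real pipeline would also seed `torch` and `numpy` and record checkpoint paths and dataset versions.

```python
import json
import random
import hashlib
from pathlib import Path

def log_run_manifest(out_dir: str, config: dict, seed: int) -> Path:
    """Seed the RNG and write a JSON manifest so a CommonsenseQA run can be
    reproduced later (illustrative sketch; field names are assumptions)."""
    random.seed(seed)  # a real pipeline would also seed torch and numpy
    manifest = {
        "seed": seed,
        "config": config,
        # a config hash lets runs be deduplicated and compared at a glance
        "config_hash": hashlib.sha256(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:12],
    }
    path = Path(out_dir) / "run_manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2))
    return path
```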

b) Implementation of Block AttnRes

  • Implement Block Attention Residuals in the MiniMind backbone.
  • Partition layers into blocks and aggregate information via block-level representations.
  • Ensure that the standard residual and Block AttnRes variants are comparable under matched core hyperparameters and training budgets.
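Comparability under matched budgets can be checked mechanically before training. The sketch below compares trainable parameter counts of the two variants; the 1% tolerance is an arbitrary illustrative threshold, not a requirement from the task description.

```python
import torch.nn as nn

def param_count(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def assert_matched(baseline: nn.Module, variant: nn.Module,
                   tol: float = 0.01) -> None:
    """Fail fast if the Block AttnRes variant deviates from the baseline
    by more than `tol` (relative), keeping the comparison honest."""
    a, b = param_count(baseline), param_count(variant)
    if abs(a - b) / max(a, 1) > tol:
        raise ValueError(f"parameter counts differ: baseline={a}, variant={b}")
```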

c) Controlled training and evaluation

  • Train both model variants under aligned settings.
  • Fine-tune both variants on CommonsenseQA.
  • Compare dev/test accuracy and efficiency-related measures such as memory usage and training cost.
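Dev/test accuracy for a causal LM on CommonsenseQA is commonly computed by scoring each answer choice's tokens under the model and picking the choice with the highest log-likelihood. A minimal sketch, assuming the model is a callable returning logits of shape `(1, seq, vocab)`; the actual MiniMind interface may differ.

```python
import torch
import torch.nn.functional as F

def score_choice(model, input_ids: torch.Tensor, answer_start: int) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens in
    `input_ids[answer_start:]` (question and choice concatenated)."""
    with torch.no_grad():
        logits = model(input_ids.unsqueeze(0))         # (1, seq, vocab)
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # next-token predictions
    targets = input_ids[1:]                            # shifted targets
    token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_lp[answer_start - 1:].sum().item()    # answer tokens only

def predict(model, encoded_choices: list[tuple[torch.Tensor, int]]) -> int:
    """Return the index of the highest-scoring (input_ids, answer_start) pair."""
    scores = [score_choice(model, ids, start) for ids, start in encoded_choices]
    return max(range(len(scores)), key=scores.__getitem__)
```

Summed log-likelihood favors shorter choices; normalizing by answer length is a common variant worth reporting alongside it.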

d) Analysis and ablations

  • Analyze training stability and representation behavior across depth.
  • Perform a qualitative error analysis of CommonsenseQA predictions.
  • Conduct optional ablations by varying block size or number of blocks and studying their effect on performance and efficiency.
  • Optionally include a lightweight visualization component to illustrate block-level residual aggregation during inference.

References

  1. Kimi Team et al.: Attention Residuals. arXiv preprint arXiv:2603.15031 (2026).
  2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention Is All You Need. In: Advances in Neural Information Processing Systems 30 (2017).
  3. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016).
  4. Talmor, A., Herzig, J., Lourie, N., Berant, J.: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158 (2019).
  5. Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 4444–4451 (2017).
  6. Gong, J.: MiniMind. GitHub repository. Available at: https://github.com/jingyaogong/minimind, last accessed 2026/04/16.

Prerequisites
  • Strong Python skills and experience with data processing and experiment management.
  • Familiarity with PyTorch and Transformer training.
  • Basic understanding of language modeling, supervised fine-tuning, and evaluation methods.
  • Interest in neural architecture analysis, especially residual connections and attention mechanisms.