Knowledge-augmented multiple-choice question answering (MCQA) aims to improve robustness and factual grounding by integrating external structured knowledge (e.g., knowledge graphs) into language-model-based decision making. Current high-performing systems typically retrieve a local subgraph relevant to a question and its candidate answers, then combine pretrained language representations with explicit graph reasoning modules. This thesis investigates an alternative representation path: instead of processing retrieved knowledge graph (KG) subgraphs as symbolic triples with graph neural networks, the subgraphs are deterministically rendered into a compact 2D “visual graph” representation and encoded with a vision backbone. The resulting visual KG evidence is fused with an encoder-only language model via attention-based cross-modal interaction. The core research question is whether a visually encoded KG can preserve decision-relevant relational structure and support competitive knowledge-augmented MCQA performance on CommonsenseQA and OpenBookQA (optionally extending to MedQA-USMLE).
Thesis Type |
Student | Shiwei Luo
Status | Running
Presentation room | Seminar room I5 - 6202
Supervisor(s) | Stefan Decker
Advisor(s) | Yixin Peng
Contact | peng@dbis.rwth-aachen.de
Background
Recent KG-augmented QA methods such as GreaseLM and QA-GNN tightly couple pretrained language encoders with subgraph retrieval and graph message passing. These systems repeatedly exchange information between text and graph representations, enabling two-way grounding and improved performance on commonsense MCQA benchmarks. However, the dominant paradigm remains graph-centric: KGs are represented as triples/subgraphs and processed with graph-specific computation, which increases architectural complexity and can reduce modularity when swapping encoders, fusion mechanisms, or reasoning components.
In parallel, work in vision-language and document understanding has shown that structured information can be consumed directly from pixels through transformer-based vision encoders, sometimes avoiding brittle intermediate symbolization pipelines. This motivates exploring whether KG subgraphs—when rendered into a deterministic 2D layout with controlled density—can be encoded efficiently by a vision model and fused with text via cross-modal attention. Complementary recent work in “visual graph” reasoning suggests that representing graph structure as images can be beneficial, but this has not been systematically studied in the specific setting of KG-augmented MCQA with standard QA benchmarks.
The thesis hypothesis is not that vision models inherently solve graph reasoning, but that a carefully designed rendering + fusion pipeline may provide a competitive and simpler alternative to explicit GNN reasoning in knowledge-augmented MCQA, and can clarify when the visual representation is beneficial or detrimental.
Tasks
a) Baseline reproduction and experimental alignment. Reproduce strong KG-augmented MCQA baselines under comparable experimental conditions.
- Reproduce GreaseLM on CommonsenseQA and OpenBookQA (optionally MedQA-USMLE), including KG subgraph retrieval and evaluation protocol alignment.
- Implement and validate the end-to-end baseline pipeline (data preprocessing, retrieval, training, inference), and document deviations (software versions, seeds, compute constraints).
- Establish reference comparisons relevant to this thesis (LM-only, and at least one representative KG+LM baseline such as QA-GNN or MHGRN where feasible).
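For the LM-only reference comparison in (a), the scoring format could follow the standard encoder-only recipe: encode each (question, choice) pair separately and take the argmax over per-choice scores. The sketch below uses a dummy `score_pair` as a stand-in for the actual model head (the pair format and helper names are illustrative assumptions, not the baseline's exact interface):

```python
# Sketch of an MCQA scoring loop for an encoder-only LM baseline.
# `score_pair` is a placeholder for the real model (e.g., an encoder with a
# classification head over the [CLS] token); here it is a dummy heuristic.

def format_pair(question: str, choice: str) -> str:
    """Concatenate question and candidate answer into one input sequence."""
    return f"[CLS] {question} [SEP] {choice} [SEP]"

def score_pair(text: str) -> float:
    # Placeholder: a real system would return the model's plausibility logit.
    return float(len(text) % 7)

def answer_mcqa(question: str, choices: list[str]) -> str:
    """Score each (question, choice) pair independently and take the argmax."""
    scores = [score_pair(format_pair(question, c)) for c in choices]
    return choices[max(range(len(choices)), key=scores.__getitem__)]
```

The same per-choice scoring loop carries over to the proposed model in (c); only the scorer changes, which keeps baseline and proposed system directly comparable.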
b) KG-to-image representation design space. Define and implement a deterministic mapping from retrieved KG subgraphs (triples) to 2D images suitable for a vision encoder.
- Specify a rendering schema (node/edge glyphs, textual labels vs. label-free encoding, layout strategy, and handling of directionality / relation types).
- Ensure determinism and comparability: identical subgraphs yield identical images; scaling rules constrain density and resolution across datasets.
- Implement a small set of controlled rendering variants to support later ablations (e.g., different layout algorithms or edge-encoding choices).
c) Proposed architecture: encoder-only text model + image encoder + attention fusion. Build a multimodal knowledge-augmented MCQA model with attention-based fusion between text and visual KG evidence.
- Text side: define a prompt and scoring format for MCQA with an encoder-only language model.
- Vision side: encode KG-rendered images using a ViT/CLIP-style vision encoder to obtain visual embeddings/tokens.
- Fusion: implement a cross-modal interaction mechanism (e.g., cross-attention or bottleneck-style fusion) to condition answer scoring on visual KG evidence.
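The fusion step in (c) can be sketched as single-head cross-attention in which text tokens act as queries over the visual KG tokens. Learned Q/K/V projections, multiple heads, residual connections, and layer norm are deliberately omitted; this is only the core interaction, not the final fusion module:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(text: list[list[float]],
                    visual: list[list[float]]) -> list[list[float]]:
    """Single-head cross-attention: each text token (query) attends over
    visual KG tokens (keys/values), yielding one fused vector per text token.
    A real fusion layer would add learned projections and residuals."""
    d = len(visual[0])
    fused = []
    for q in text:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in visual]
        w = softmax(scores)
        fused.append([sum(wj * v[i] for wj, v in zip(w, visual))
                      for i in range(d)])
    return fused
```

Because the attention weights sum to one, each fused vector is a convex combination of the visual tokens; bottleneck-style fusion would instead route both modalities through a small set of shared latent tokens.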
d) Training, evaluation, and analysis. Train/evaluate the proposed approach and characterize when visual KG encoding helps.
- Evaluate on standard MCQA splits/protocols for CommonsenseQA and OpenBookQA (optionally MedQA-USMLE), reporting the primary benchmark metric(s) used by the datasets.
- Ablations: rendering variants; fusion variants; subgraph size limits; robustness under noisy/missing retrieval.
- Error analysis: categorize failure modes (e.g., relational errors, missing edges, label ambiguity, layout-induced information loss) and relate them to representation choices.
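For the robustness ablation in (d), noisy or missing retrieval could be simulated by deterministically dropping a fraction of the retrieved triples before rendering. The helper below is one such sketch (the function name and seeding scheme are assumptions for illustration):

```python
import random

def drop_edges(triples: set[tuple[str, str, str]],
               frac: float, seed: int = 0) -> set[tuple[str, str, str]]:
    """Simulate noisy/missing retrieval by removing a fraction of triples.

    Sorting before the seeded shuffle makes the result a pure function of
    (triples, frac, seed), so every ablation run sees the same corruption.
    """
    rng = random.Random(seed)
    kept = sorted(triples)
    rng.shuffle(kept)
    n_keep = round(len(kept) * (1 - frac))
    return set(kept[:n_keep])
```

Sweeping `frac` while holding the seed fixed then gives a controlled curve of accuracy versus retrieval degradation, separating rendering-induced losses from retrieval-induced ones.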
References
- Zhang, X., Bosselut, A., Yasunaga, M., Ren, H., Liang, P., Manning, C.D., Leskovec, J.: GreaseLM: Graph REASoning Enhanced Language Models for Question Answering. In: Proc. International Conference on Learning Representations (ICLR) (2022).
- Yasunaga, M., Ren, H., Bosselut, A., Liang, P., Leskovec, J.: QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. In: Proc. NAACL-HLT 2021, pp. 535–546 (2021). doi:10.18653/v1/2021.naacl-main.45
- Feng, Y., Chen, X., Lin, B.Y., Wang, P., Yan, J., Ren, X.: Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering. In: Proc. EMNLP 2020, pp. 1295–1309 (2020). doi:10.18653/v1/2020.emnlp-main.99
- Liu, W., Zhou, P., Zhao, Z., Wang, Z., Ju, Q., Deng, H., Wang, P.: K-BERT: Enabling Language Representation with Knowledge Graph. In: Proc. AAAI 2020, pp. 2901–2908 (2020).
- Talmor, A., Herzig, J., Lourie, N., Berant, J.: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In: Proc. NAACL-HLT 2019, pp. 4149–4158 (2019). doi:10.18653/v1/N19-1421
- Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In: Proc. EMNLP 2018, pp. 2381–2391 (2018). doi:10.18653/v1/D18-1260
- Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In: Proc. AAAI 2017, pp. 4444–4451 (2017).
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. In: Proc. ICLR (2021).
- Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. In: Proc. ICML 2021, pp. 8748–8763 (2021).
- Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S.R., Xiong, C., Hoi, S.C.H.: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In: Proc. NeurIPS 2021, pp. 9694–9705 (2021).
- Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., Sun, C.: Attention Bottlenecks for Multimodal Fusion. In: Proc. NeurIPS 2021, pp. 14200–14213 (2021).
- Kim, G., Hong, T., Yim, M., Nam, J.Y., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: OCR-Free Document Understanding Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. LNCS, vol. 13688, pp. 498–517. Springer, Cham (2022). doi:10.1007/978-3-031-19815-1_29
- Lee, K., Joshi, M., Turc, I.R., Hu, H., Liu, F., Eisenschlos, J.M., Khandelwal, U., Shaw, P., Chang, M.-W., Toutanova, K.: Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding. In: Proc. ICML 2023. PMLR, vol. 202, pp. 18893–18912 (2023).
- Wei, Y., Fu, S., Jiang, W., Zhang, Z., Zeng, Z., Wu, Q., Kwok, J.T., Zhang, Y.: GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning. In: Proc. NeurIPS (2024).
Prerequisites
- Strong foundations in deep learning and NLP (Transformers, encoder-only language models, fine-tuning)
- PyTorch and modern training/evaluation tooling (reproducibility, experiment tracking)
- Multimodal modeling basics (vision encoders, cross-modal attention / fusion)
- Knowledge-augmented QA and KGs (subgraph retrieval, commonsense KGs such as ConceptNet, MCQA evaluation)