
DBIS

Deep Table-Structure Integration for LLM-based Semantic Table Understanding

February 19th, 2026

This thesis investigates how Large Language Models (LLMs) can be equipped with a deeper, architecture-level understanding of tabular data, going beyond “tables-as-serialized-text” toward tables-as-structured objects that expose row/column topology, header semantics, cell neighborhoods, and inter-cell dependencies to the model in a principled way [1,2,8]. The target setting is Semantic Table Interpretation (STI) as studied in the SemTab challenge, focusing on three standard downstream tasks: Cell Entity Annotation (CEA), Column Type Annotation (CTA), and Column Property Annotation (CPA) [3,4].

The work will be developed and evaluated primarily using the MammoTab 25 benchmark (Wikipedia-scale tables annotated against Wikidata) and SemTab-style evaluation protocols [6,7,14].

Thesis Type
  • Master
Student
Kehao Li
Status
Running
Presentation room
Seminar room I5 - 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Yixin Peng
Contact
peng@dbis.rwth-aachen.de

Background

Tables are one of the most common formats for representing real-world knowledge, but table meaning is distributed across multiple layers: cell values, headers, schema cues, row/column context, and often a global table intent [1]. Realistic web/Wikipedia tables introduce additional challenges such as ambiguity, aliasing, missing context, and NIL mentions [6,7].

While recent LLM-based approaches show strong potential on table tasks, many still rely on surface-level serialization (flattening the 2D structure into a 1D prompt), which underutilizes table topology and makes it harder to maintain global consistency across many cells, especially under long-context constraints [1,2,13]. This motivates exploring structure-aware model designs and pretraining/fine-tuning strategies that explicitly encode tabular signals into the model pipeline [8,10].
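
To make the contrast concrete, the minimal Python sketch below shows the kind of surface-level serialization referred to above; the table content and the markdown-style format are purely illustrative, and real pipelines differ in formatting and truncation details.

```python
# Illustrative sketch only: flattening a 2-D table into a 1-D prompt string.
# After serialization, row/column topology is implicit in the token sequence,
# which is the limitation structure-aware designs aim to address.

def serialize_table(headers: list[str], rows: list[list[str]]) -> str:
    """Serialize a table into a markdown-like prompt (one common choice)."""
    lines = ["| " + " | ".join(headers) + " |"]
    lines.append("|" + " --- |" * len(headers))
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = serialize_table(
        headers=["Player", "Country", "Club"],
        rows=[["Lionel Messi", "Argentina", "Inter Miami"],
              ["Kylian Mbappé", "France", "Real Madrid"]],
    )
    print(prompt)  # plain text: column membership must be re-inferred by the LLM
```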


Tasks

a) Baseline reproduction

  • Reproduce and analyze baselines:
    • TURL (structure-aware encoder + table-specific pretraining) [8]
    • TableLlama (instruction-tuned LLM with long-context) [9]

b) Understand and analyze two pretraining / adaptation paradigms

  1. Structured encoder + MER + MLM (from TURL). Study TURL’s structure-aware encoder and its pretraining objectives, especially Masked Entity Recovery (MER) and Masked Language Modeling (MLM) for learning contextualized table representations and entity-aware semantics [8]. A minimal MER-style sketch is given after this list.
  2. Instruction tuning + long-context fine-tuning. Analyze how instruction tuning aligns models to follow task directives [10], and how long-context fine-tuning methods (e.g., LongLoRA-style context extension) support large or complex tables [11]. A minimal LoRA-based fine-tuning sketch also follows this list.
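
The two sketches below illustrate the paradigms in simplified form. The first is a minimal MER-style masking objective loosely following the idea in TURL [8]; it is not TURL’s actual implementation, and the encoder interface, entity vocabulary, and masking rate are placeholders.

```python
import torch
import torch.nn as nn

class MERHead(nn.Module):
    """Recovers masked cell entities from contextualized cell representations."""

    def __init__(self, hidden_dim: int, num_entities: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_entities)  # score each entity in the vocabulary

    def forward(self, cell_states: torch.Tensor, entity_ids: torch.Tensor,
                mask_prob: float = 0.2) -> torch.Tensor:
        # cell_states: (batch, n_cells, hidden) cell vectors from a structure-aware encoder
        # entity_ids:  (batch, n_cells) gold entity indices (e.g., Wikidata QIDs mapped to ints)
        mask = torch.rand(entity_ids.shape, device=entity_ids.device) < mask_prob
        logits = self.proj(cell_states)                  # (batch, n_cells, num_entities)
        # Compute the loss only on masked cells (assumes at least one cell is masked).
        return nn.functional.cross_entropy(logits[mask], entity_ids[mask])
```

The second sketch covers the instruction-tuning and long-context side, assuming Hugging Face transformers and peft. It uses plain LoRA plus naive RoPE scaling for context extension; LongLoRA’s shifted sparse attention [11] is not reproduced, and the base model id, hyperparameters, and prompt format are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder base model id

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    rope_scaling={"type": "linear", "factor": 4.0},  # naive context extension (key names vary by version)
)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Instruction-style training example for a CTA query (the format is illustrative).
example = (
    "### Instruction: Annotate the semantic type of the first column.\n"
    "### Table:\n| Player | Country |\n| Lionel Messi | Argentina |\n"
    "### Response:"
)
inputs = tokenizer(example, return_tensors="pt")
```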

c) Adapt a decoder-only model and transfer the two paradigms to STI tasks

  • Modify a decoder-only base model (e.g., DeepSeek-V3.2 or a comparable modern decoder-only LLM) by extending the embedding layer to inject table-structure representations (e.g., row/column embeddings, header roles, and cell neighborhoods), enabling tighter structural fusion at the architecture level [12]; a minimal embedding-fusion sketch follows this list.
  • Apply both paradigms from (b) to this adapted decoder-only model, then fine-tune/evaluate on:
    • CEA (cell → entity in Wikidata)
    • CTA (column → semantic type/class)
    • CPA (column pair → semantic relation/property)

    following SemTab task definitions and protocols [3,4].
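
The sketch below illustrates one way such embedding-level fusion could look: each token receives row, column, and header-role embeddings added to its word embedding before the transformer stack. This is a simplified assumption, not DeepSeek-V3.2’s actual architecture, and all sizes and hook points are placeholders.

```python
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    """Adds row/column/role embeddings to the LM's word embeddings."""

    def __init__(self, word_emb: nn.Embedding, max_rows: int = 256,
                 max_cols: int = 64, num_roles: int = 4):
        super().__init__()
        d = word_emb.embedding_dim
        self.word_emb = word_emb                    # reuse the base model's embedding table
        self.row_emb = nn.Embedding(max_rows, d)    # row index of each token
        self.col_emb = nn.Embedding(max_cols, d)    # column index of each token
        self.role_emb = nn.Embedding(num_roles, d)  # e.g., header / data cell / caption / other

    def forward(self, input_ids, row_ids, col_ids, role_ids):
        # All id tensors share the shape (batch, seq_len); non-table tokens can
        # be mapped to a reserved index 0 so the structural terms stay neutral.
        return (self.word_emb(input_ids)
                + self.row_emb(row_ids)
                + self.col_emb(col_ids)
                + self.role_emb(role_ids))

# Usage sketch: compute the fused embeddings outside the base model and pass
# them via the `inputs_embeds` argument that Hugging Face causal-LM models accept:
#   embeds = struct_emb(input_ids, row_ids, col_ids, role_ids)
#   outputs = base_model(inputs_embeds=embeds, attention_mask=attention_mask)
```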

d) Train & evaluate on MammoTab using standard metrics

  • Train and evaluate on MammoTab 25 (and its SemTab track variants when applicable) using the standard STI metrics Precision, Recall, and F1 [3,6,7]; a minimal scoring sketch follows this list.
  • Perform targeted error analysis (e.g., ambiguity, NIL handling, long tables, and cross-cell consistency) aligned with MammoTab’s documented challenges [6,7].
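
As a reference point, the sketch below shows micro-averaged SemTab-style scoring for CEA, where precision is computed over submitted annotations and recall over all target cells. Exact scorer details (NIL handling, redirects, normalization) follow the official MammoTab/SemTab tooling and are omitted here.

```python
def cea_prf(gold: dict, pred: dict) -> tuple[float, float, float]:
    """gold/pred map (table_id, row, col) -> Wikidata QID string."""
    correct = sum(1 for key, qid in pred.items() if gold.get(key) == qid)
    precision = correct / len(pred) if pred else 0.0   # over submitted annotations
    recall = correct / len(gold) if gold else 0.0      # over all target cells
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {("t1", 1, 0): "Q937"}       # e.g., a cell linked to Albert Einstein
    pred = {("t1", 1, 0): "Q937"}
    print(cea_prf(gold, pred))          # (1.0, 1.0, 1.0)
```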

Optional: When feasible, prepare the thesis outcomes for participation in the SemTab Challenge, with the goal of contributing a paper to ISWC 2026 [3,15].


References

  1. Wu, X., Ritter, A., Xu, W.: Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges. arXiv:2508.00217 (2025). DOI: 10.48550/arXiv.2508.00217.
  2. Sui, Y., Zhou, M., Zhou, M., Han, S., Zhang, D.: Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study. In: Proc. of the 17th ACM Int. Conf. on Web Search and Data Mining (WSDM 2024). ACM (2024). DOI: 10.1145/3616855.3635752.
  3. SemTab 2025: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (CEA/CTA/CPA). Challenge website (2025). Available at: https://sem-tab-challenge.github.io/2025/ (accessed 19 Feb 2026).
  4. Hassanzadeh, O., Cremaschi, M., D’Adda, F., Azanzi, F.J., Petit BIKIM, J., Jiménez-Ruiz, E.: Results of SemTab 2025. In: Proc. of Ontology Matching 2025 (OM 2025), CEUR Workshop Proceedings, Vol. 4144, pp. 216–220. CEUR-WS.org (2025).
  5. Hassanzadeh, O., Abdelmageed, N., Cremaschi, M., Cutrona, V., D’Adda, F., Efthymiou, V., Kruit, B., Lobo, E., Mihindukulasooriya, N., Pham, N.H.: Results of SemTab 2024. In: Proc. of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2024), CEUR Workshop Proceedings, Vol. 3889, pp. 1–11. CEUR-WS.org (2024).
  6. Cremaschi, M., Belotti, F., D’Souza, J., Palmonari, M.: MammoTab 25: A Large-Scale Dataset for Semantic Table Interpretation – Training, Testing, and Detecting Weaknesses. In: Garijo, D. et al. (eds) The Semantic Web – ISWC 2025. Lecture Notes in Computer Science, Vol. 16141, pp. 131–148. Springer, Cham (2026). DOI: 10.1007/978-3-032-09530-5_8.
  7. Marzocchi, M., Cremaschi, M., Pozzi, R., Avogadro, R., Palmonari, M.: MammoTab: A Giant and Comprehensive Dataset for Semantic Table Interpretation. In: Proc. of SemTab 2022, CEUR Workshop Proceedings, Vol. 3320, pp. 28–33. CEUR-WS.org (2022).
  8. Deng, X., Sun, H., Lees, A., Wu, Y., Yu, C.: TURL: Table Understanding through Representation Learning. Proc. VLDB Endow. 14(3), 307–319 (2021). DOI: 10.14778/3430915.3430921.
  9. Zhang, T., Yue, X., Li, Y., Sun, H.: TableLlama: Towards Open Large Generalist Models for Tables. In: Proc. of NAACL 2024, pp. 5940–5963 (2024). DOI: 10.18653/v1/2024.naacl-long.335.
  10. Ouyang, L. et al.: Training Language Models to Follow Instructions with Human Feedback. In: Advances in Neural Information Processing Systems 35 (NeurIPS 2022) (2022). Also: arXiv:2203.02155.
  11. Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., Jia, J.: LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. In: ICLR 2024 (2024). arXiv:2309.12307. DOI: 10.48550/arXiv.2309.12307.
  12. DeepSeek: DeepSeek-V3.2 Release. DeepSeek API Docs (1 Dec 2025). Available at: https://api-docs.deepseek.com/news/news251201 (accessed 19 Feb 2026). Also: DeepSeek-V3.2 technical report, arXiv:2512.02556 (2025).
  13. Chen, S.-A., Miculicich, L., Eisenschlos, J.M., Wang, Z., Wang, Z., Chen, Y., Fujii, Y., Lin, H.-T., Lee, C.-Y., Pfister, T.: TableRAG: Million-Token Table Understanding with Language Models. arXiv:2410.04739 (2024).
  14. Vrandečić, D., Krötzsch, M.: Wikidata: A Free Collaborative Knowledge Base. Commun. ACM 57(10), 78–85 (2014). DOI: 10.1145/2629489.
  15. ISWC 2026: International Semantic Web Conference 2026 (Bari, Italy; 25–29 Oct 2026). Conference website (accessed 19 Feb 2026). Available at: https://iswc2026.semanticweb.org/.

Prerequisites:
  • Solid foundation in Machine Learning / Deep Learning and NLP
  • Good programming skills in Python
  • Experience with PyTorch and modern LLM tooling (Hugging Face Transformers, fine-tuning workflows, vLLM)
  • Basic familiarity with Knowledge Graphs is beneficial [14]