We are seeking a motivated master’s student to explore the application of Large
Language Models (LLMs) and Small Language Models (SLMs) for automated semantic mapping in data integration scenarios involving sensitive information. This thesis addresses a critical challenge in modern data management: domain experts often possess the knowledge needed to align local data sources with global schemas but lack the technical expertise to implement these mappings, while traditional automated approaches struggle with the semantic complexity of the task. This research investigates how language models can bridge this gap by enabling more intuitive, knowledge-driven data integration while maintaining strict data privacy and security requirements.
Thesis Type |
Status | Open
Presentation room | Seminar room I5 6202
Supervisor(s) | Sandra Geisler, Stefan Decker
Advisor(s) | Laurenz Neumann, Soo-Yon Kim
Contact | laurenz.neumann@dbis.rwth-aachen.de, kim@dbis.rwth-aachen.de
Research Questions
This thesis will investigate several key aspects of LLM-based semantic mapping:
- What is the optimal balance between information richness and privacy preservation in the input design (e.g. schema only vs. sample data)?
- Are smaller, locally deployable language models (SLMs) sufficient for semantic mapping tasks, or are larger models needed to reach sufficient mapping quality and inference speed?
- How can we incorporate the specialised domain knowledge of users via human-in-the-loop approaches?
Methodology
The research will involve developing and evaluating different approaches to LLM-based semantic mapping, including comparative studies of input strategies (schema-only vs. schema-with-examples) and model architectures (cloud LLMs vs. local SLMs). You will design experiments using benchmark datasets and potentially collaborate with industry partners handling sensitive data.
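As a rough illustration of the two input strategies, the following minimal sketch builds a schema-only prompt and a schema-with-examples prompt from the same inputs. The function name `build_mapping_prompt`, the prompt wording, and the column and property names are hypothetical and chosen only for illustration; the actual prompt design is part of the thesis work.

```python
def build_mapping_prompt(local_columns, global_properties, samples=None):
    """Build a prompt asking a language model to map local columns to
    global schema properties.

    samples: optional dict mapping a column name to a list of example
    values. If None, the prompt is schema-only and exposes no data values.
    """
    lines = [
        "Map each local column to the best matching global schema property.",
        "Answer with one 'column -> property' pair per line.",
        "",
        "Local columns:",
    ]
    for col in local_columns:
        if samples and col in samples:
            # Schema-with-examples: attach sample values to the column.
            values = ", ".join(str(v) for v in samples[col])
            lines.append(f"- {col} (examples: {values})")
        else:
            # Schema-only: the column name is the only information given.
            lines.append(f"- {col}")
    lines.append("")
    lines.append("Global schema properties:")
    lines.extend(f"- {prop}" for prop in global_properties)
    return "\n".join(lines)


# Hypothetical column and property names; the sample value is synthetic.
columns = ["pat_dob", "dx_code"]
properties = ["birthDate", "diagnosisCode"]

schema_only = build_mapping_prompt(columns, properties)
with_examples = build_mapping_prompt(
    columns, properties, samples={"pat_dob": ["1980-01-01"]}
)
```

Keeping both variants behind one function makes it easy to run the same experiment with and without sample values, which is the comparison the privacy question depends on.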
Tasks
- Comprehensive literature review on semantic mapping and LLM applications
- Implementation of a proof-of-concept tool demonstrating different approaches
- Experimental evaluation with quantitative and qualitative analysis
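For the quantitative part of the evaluation, one common option (an assumption here, not a prescribed metric) is to compare predicted mappings against a gold-standard mapping using precision, recall, and F1. The sketch below treats a mapping as a set of (column, property) pairs; the function name `mapping_scores` and the example pairs are hypothetical.

```python
def mapping_scores(predicted, gold):
    """Return (precision, recall, f1) of predicted (column, property)
    pairs against a gold-standard set of pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold standard and model output.
gold = {("pat_dob", "birthDate"), ("dx_code", "diagnosisCode")}
predicted = {("pat_dob", "birthDate"), ("dx_code", "procedureCode")}

precision, recall, f1 = mapping_scores(predicted, gold)
# One of two predictions is correct and one of two gold pairs is found,
# so precision, recall, and F1 are all 0.5.
```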
Initial Literature
- Towards self-configuring Knowledge Graph Construction Pipelines using LLMs – A Case Study with RML, Hofer et al.
- Interactive Data Harmonization with LLM Agents, Santos et al.
- KONDA: An LLM-based Tool for Semantic Annotation and Knowledge Graph Creation Using Ontologies for Research Data, Kim et al.
Requirements
- Knowledge about databases and information systems
- Experience or strong interest in LLM applications
- Familiarity with semantic web technologies such as RDF
- Preferred: experience in software development, ideally Python and/or Java