
DBIS

Incremental Knowledge Graph Ingestion with Change Detection and Provenance Tracking

March 26th, 2026

Keeping a knowledge graph up to date as its source data evolves is harder than building one from scratch. New records appear, existing records are corrected, and metadata is enriched over time. Each type of change (a corrected DOI, an added co-author, a retracted publication) carries different semantic implications and may require a different update strategy. Detecting these changes efficiently and propagating them without introducing duplicates, losing provenance, or overwriting valid data remains an open challenge, particularly when the goal is to avoid heavyweight versioning infrastructure.

Thesis Type
  • Bachelor
Student
Moritz Ahlrichs
Status
Running
Proposal on
29/04/2026 12:00 am
Proposal room
Seminar room I5 6202
Presentation room
Seminar room I5 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Tim Holzheim
Contact
holzheim@dbis.rwth-aachen.de

The Dagstuhl Research Online Publication Server (DROPS) provides a concrete setting for this problem. DROPS hosts high-quality, semantically enriched metadata for a wide range of scholarly entity types such as publications and authors. Despite the richness of this metadata, it currently resides in fragmented JSON-LD documents with no unified access layer. This creates significant barriers for programmatic consumers, as the data cannot be queried in a unified way. Loading this metadata into an RDF knowledge graph with Named-Graph-per-document isolation is straightforward. The research challenge lies in managing what happens when source documents change. This research will design, implement, and evaluate an incremental ingestion pipeline that converts DROPS JSON-LD to Turtle, detects and categorizes changes between document versions, and records lightweight provenance metadata.
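The change-detection step at the heart of such a pipeline can be illustrated with a minimal sketch. Assuming each document version has already been converted to a set of RDF triples (here plain Python tuples; the function name `diff_triples` and the example data are illustrative, not part of the thesis), a triple-level diff reduces to set operations:

```python
# Minimal sketch of triple-level change detection between two versions
# of the same source document. Triples are (subject, predicate, object)
# tuples; in a real pipeline they would come from an RDF parser.

def diff_triples(old, new):
    """Return (added, removed) triple sets between two document versions."""
    old_set, new_set = set(old), set(new)
    return new_set - old_set, old_set - new_set

# Version 1: publication with a typo in the DOI (example data only).
v1 = {
    ("ex:pub1", "dct:title", "Incremental KG Ingestion"),
    ("ex:pub1", "ex:doi", "10.4230/LIPIcs.2026.XX"),
}
# Version 2: corrected DOI and an added co-author.
v2 = {
    ("ex:pub1", "dct:title", "Incremental KG Ingestion"),
    ("ex:pub1", "ex:doi", "10.4230/LIPIcs.2026.42"),
    ("ex:pub1", "dct:creator", "ex:author2"),
}

added, removed = diff_triples(v1, v2)
```

A pair with the same subject and predicate appearing in both `added` and `removed` can be classified as a *correction*, while a triple only in `added` is an *enrichment*; this is one possible basis for the change taxonomy the thesis investigates.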


Research Questions

  1. What taxonomy of change types arises in scholarly metadata and how do these types differ in their propagation effects on a Named-Graph-based knowledge graph?
  2. What merge strategies allow safe updates and corrections of previously ingested documents while avoiding duplication and semantic drift?
  3. How can changes between successive versions of a source document be detected efficiently at the triple level without storing complete previous document versions, and what is the trade-off between detection granularity and computational cost?
  4. What idempotent update strategy (full Named Graph replacement, selective triple-level patching, or hybrid) provides the best trade-off between correctness, performance, and change-tracking fidelity for different change types?
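Research questions 3 and 4 hint at a fingerprint-based design: instead of storing full previous document versions, the pipeline could keep only a hash per canonicalized triple, and apply updates as an idempotent full named-graph replacement. The following is a hedged sketch of that idea; the canonicalization scheme, function names, and graph URI are assumptions for illustration, not the thesis design:

```python
import hashlib

def triple_hash(s, p, o):
    """Stable fingerprint of one canonicalized triple."""
    return hashlib.sha256(f"{s}\x1f{p}\x1f{o}".encode()).hexdigest()

def fingerprint(triples):
    """Set of per-triple hashes, stored instead of the full document."""
    return {triple_hash(*t) for t in triples}

def has_changed(stored_fp, new_triples):
    """Detect any change without access to the previous triples."""
    return fingerprint(new_triples) != stored_fp

def replace_graph_update(graph_uri, triples):
    """Full named-graph replacement as a SPARQL Update string.
    Re-running the same update always yields the same graph state,
    which makes the ingestion step idempotent."""
    inserts = " .\n  ".join(f"{s} {p} {o}" for s, p, o in triples)
    return (
        f"DROP SILENT GRAPH <{graph_uri}> ;\n"
        f"INSERT DATA {{ GRAPH <{graph_uri}> {{\n  {inserts} .\n}} }}"
    )

# Example: a stored fingerprint flags a corrected DOI as a change.
v1 = [("ex:pub1", "ex:doi", '"10.4230/LIPIcs.2026.XX"')]
v2 = [("ex:pub1", "ex:doi", '"10.4230/LIPIcs.2026.42"')]
fp = fingerprint(v1)
update = ""
if has_changed(fp, v2):  # only changed documents trigger re-ingestion
    update = replace_graph_update("https://example.org/graph/pub1", v2)
```

Full replacement trades change-tracking fidelity for simplicity; the selective-patching and hybrid strategies named in question 4 would instead translate the triple-level diff into `DELETE DATA`/`INSERT DATA` operations.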

Prerequisites:
  • Understanding of Semantic Web technologies, including RDF, SPARQL, JSON-LD, and Linked Data principles.
  • Familiarity with ontology and schema languages, in particular OWL, SHACL, and PROV.
  • Basic programming skills (Python, JavaScript, or Java).
  • Basic knowledge of data pipeline concepts, including incremental processing, idempotency, and ETL/ELT workflows.