
DBIS

Ontology-Grounded Extraction of Research Software Mentions from Scientific Publications

March 27th, 2026

Research software is among the least discoverable scholarly outputs. While standards like CodeMeta and CFF enable structured software metadata at the repository level, they require active curation by maintainers and see inconsistent adoption. On the publication side, only select publishers such as Schloss Dagstuhl’s DROPS platform provide citable software artifacts, again contingent on explicit author action. As a result, most research software is mentioned only in unstructured publication text, invisible to metadata-driven search and agentic systems, and non-compliant with FAIR principles. This problem is compounded in applied domains like visualization research, where publications frequently describe custom tools and prototypes that are never publicly released, a case not covered by current metadata schemas. Automating ontology-grounded extraction of software mentions from publications would close this gap and enrich the metadata foundation for downstream services such as the research copilots developed within NFDI4DS.
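To make the target concrete, a minimal CodeMeta record might look as follows. This is an illustrative sketch only: the tool name, description, and repository URL are placeholders, not taken from any actual publication.

```python
import json

# Illustrative minimal CodeMeta record; all field values are hypothetical.
codemeta = {
    "@context": "https://w3id.org/codemeta/3.0",
    "@type": "SoftwareSourceCode",
    "name": "ExampleVisTool",                       # placeholder tool name
    "description": "A prototype visualization tool described in a paper.",
    "programmingLanguage": "Python",
    "codeRepository": "https://example.org/example-vis-tool",  # placeholder URL
    "license": "https://spdx.org/licenses/MIT",
}

print(json.dumps(codemeta, indent=2))
```

Records like this are what repository-level curation produces today; the extraction task is to recover the same fields from unstructured publication text, including for tools that were never publicly released.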

Thesis Type
  • Bachelor
Student
Diego López Benito
Status
Running
Proposal on
29/04/2026 12:20 am
Proposal room
Seminar room I5 6202
Presentation room
Seminar room I5 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Tim Holzheim
Contact
holzheim@dbis.rwth-aachen.de

The goal of this thesis is to systematically compare ontology-grounded, LLM-based knowledge extraction approaches for identifying and structuring research software mentions in scientific publications, using the CodeMeta ontology as the target schema. At least two representative extraction pipelines will be implemented and adapted, benchmarked against DROPS ground-truth metadata, and stress-tested on further publications such as those from the EuroVis proceedings, with an additional analysis of how extraction context size affects result quality.
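Benchmarking extracted records against ground-truth metadata can be framed as field-level precision and recall. The sketch below assumes exact string matching per field, which is a simplification: a real evaluation would need value normalization (URL canonicalization, name aliases, version formats).

```python
def field_scores(predicted: dict, gold: dict):
    """Field-level precision/recall/F1 for one extracted CodeMeta record.

    A (field, value) pair counts as correct only on exact match; this is
    an illustrative baseline, not a finished evaluation protocol.
    """
    pred_items = {(k, str(v)) for k, v in predicted.items()}
    gold_items = {(k, str(v)) for k, v in gold.items()}
    tp = len(pred_items & gold_items)          # correctly extracted fields
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(gold_items) if gold_items else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example: one correct field, one spurious, one missed.
p, r, f = field_scores(
    {"name": "ExampleTool", "license": "MIT"},
    {"name": "ExampleTool", "version": "1.0"},
)
```

Averaging such per-record scores over a corpus of DROPS publications would yield corpus-level metrics for comparing pipelines.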

 

Research Questions


  1. How do ontology-grounded extraction approaches (SPIRES/OntoGPT, ODKE+, OneKE) compare when extracting CodeMeta-structured software mentions from publications?
  2. What architectural and methodological differences most strongly influence extraction quality in this domain?
  3. How does extraction context size (abstract, sections, full text) affect completeness and accuracy?
  4. To what extent can these approaches be adapted to a specific target ontology, and what are the practical trade-offs?
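For the context-size question, the three input granularities can be built from a parsed publication as sketched below. The section names and the simple concatenation order are illustrative assumptions, not a prescribed preprocessing pipeline.

```python
def context_variants(abstract: str, sections: dict[str, str]) -> dict:
    """Build the three extraction contexts compared in the thesis:
    abstract only, individual sections, and the full text.

    `sections` maps section titles to their text; concatenation order
    follows insertion order, an illustrative choice.
    """
    full_text = "\n\n".join([abstract, *sections.values()])
    return {
        "abstract": abstract,        # smallest context
        "sections": sections,        # per-section contexts
        "full_text": full_text,      # largest context
    }


# Hypothetical toy publication.
variants = context_variants(
    "We present ExampleTool.",
    {"Introduction": "Motivation ...", "Method": "ExampleTool uses ..."},
)
```

Running the same extraction pipeline on each variant and scoring the results against ground truth would isolate the effect of context size on completeness and accuracy.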

Prerequisites:
  • Solid Python skills
  • Familiarity with LLMs and NLP concepts
  • Basic understanding of ontologies and linked data (RDF, JSON-LD)
  • Willingness to engage with the research software metadata landscape (CodeMeta, CFF, FAIR)
  • Prior knowledge extraction experience is beneficial but not required

Related Projects:
NFDI4DS