Research software is among the least discoverable scholarly outputs. While standards like CodeMeta and CFF enable structured software metadata at the repository level, they require active curation by maintainers and see inconsistent adoption. On the publication side, only select publishers such as Schloss Dagstuhl’s DROPS platform provide citable software artifacts, again contingent on explicit author action. As a result, most research software is mentioned only in unstructured publication text, invisible to metadata-driven search and agentic systems, and non-compliant with FAIR principles. This problem is compounded in applied domains like visualization research, where publications frequently describe custom tools and prototypes that are never publicly released, a case not covered by current metadata schemas. Automating ontology-grounded extraction of software mentions from publications would close this gap and enrich the metadata foundation for downstream services such as the research copilots developed within NFDI4DS.
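To make the target representation concrete, the following minimal sketch emits a CodeMeta record as JSON-LD; the software name, repository URL, and other field values are hypothetical placeholders, not metadata from any real publication:

```python
import json

# A minimal CodeMeta record in JSON-LD. All values below are invented
# placeholders that only illustrate the schema's core properties.
record = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "ExampleVisTool",
    "description": "Prototype visualization tool described in a paper.",
    "programmingLanguage": "Python",
    "codeRepository": "https://example.org/example-vis-tool",
    "license": "https://spdx.org/licenses/MIT",
}
print(json.dumps(record, indent=2))
```

Records of this shape are what an extraction pipeline would need to produce for software that is only mentioned in prose.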
| Thesis Type | |
|---|---|
| Student | Diego López Benito |
| Status | Running |
| Proposal on | 29/04/2026 12:20 am |
| Proposal room | Seminar room I5 6202 |
| Presentation room | Seminar room I5 6202 |
| Supervisor(s) | Stefan Decker |
| Advisor(s) | Tim Holzheim |
| Contact | holzheim@dbis.rwth-aachen.de |
The goal of this thesis is to systematically compare ontology-grounded, LLM-based knowledge extraction approaches for identifying and structuring research software mentions in scientific publications against the CodeMeta ontology. The student will implement and adapt at least two representative extraction pipelines, benchmark them against DROPS ground-truth metadata, and stress-test them on further publications, such as those from the EuroVis proceedings, with additional analysis of how extraction context size affects result quality.
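The core idea shared by the candidate approaches can be sketched as a schema-grounded prompt: the target ontology's properties are rendered into the prompt so the model returns one value per slot. This is only an illustrative skeleton, not the actual SPIRES/OntoGPT, ODKE+, or OneKE implementation; the slot list, prompt wording, and line-based response format are assumptions, and the LLM call itself is left out:

```python
# Hypothetical subset of CodeMeta properties used as extraction slots.
CODEMETA_SLOTS = ["name", "description", "programmingLanguage",
                  "codeRepository", "license"]

def build_prompt(text: str) -> str:
    """Render the slot list into an extraction prompt for an LLM."""
    lines = ["Extract the research software mentioned in the text below.",
             "Return one line per field as `field: value`,",
             "or `field: NONE` if the text does not state it.",
             "Fields:"]
    lines += [f"- {slot}" for slot in CODEMETA_SLOTS]
    lines += ["", "Text:", text]
    return "\n".join(lines)

def parse_response(response: str) -> dict:
    """Parse `field: value` lines back into a partial CodeMeta dict."""
    out = {}
    for line in response.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")  # split at first colon only
        key, value = key.strip(), value.strip()
        if key in CODEMETA_SLOTS and value and value != "NONE":
            out[key] = value
    return out
```

Splitting at the first colon only keeps values such as repository URLs intact; comparing how each pipeline constrains and validates this slot-filling step is exactly where the architectural differences show up.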
Research Questions
- How do ontology-grounded extraction approaches (SPIRES/OntoGPT, ODKE+, OneKE) compare when extracting CodeMeta-structured software mentions from publications?
- What architectural and methodological differences most strongly influence extraction quality in this domain?
- How does extraction context size (abstract, sections, full text) affect completeness and accuracy?
- To what extent can these approaches be adapted to a specific target ontology, and what are the practical trade-offs?
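Answering the comparison questions above requires a common scoring scheme. One plausible choice, sketched here as an assumption rather than a prescribed method, is slot-level precision and recall against the DROPS ground truth, where a predicted field counts as correct only on an exact value match:

```python
def field_scores(pred: dict, gold: dict) -> tuple[float, float]:
    """Slot-level precision/recall of a predicted record vs. ground truth."""
    correct = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: one of two predicted fields matches the
# three-field gold record.
p, r = field_scores(
    {"name": "ExampleVisTool", "license": "MIT"},
    {"name": "ExampleVisTool", "license": "BSD-3",
     "codeRepository": "https://example.org/r"},
)
```

Running the same scoring over abstract-only, section-level, and full-text inputs would then directly address the context-size question.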
Prerequisites
- Solid Python skills
- Familiarity with LLMs and NLP concepts
- Basic understanding of ontologies and linked data (RDF, JSON-LD)
- Willingness to engage with the research software metadata landscape (CodeMeta, CFF, FAIR)
- Prior knowledge extraction experience is beneficial but not required