Research software is among the least discoverable scholarly outputs. While standards like CodeMeta and CFF enable structured software metadata at the repository level, they require active curation by maintainers and see inconsistent adoption. On the publication side, only select publishers such as Schloss Dagstuhl’s DROPS platform provide citable software artifacts, again contingent on explicit author action. As a result, most research software is mentioned only in unstructured publication text, invisible to metadata-driven search and agentic systems, and non-compliant with FAIR principles. This problem is compounded in applied domains like visualization research, where publications frequently describe custom tools and prototypes that are never publicly released, a case not covered by current metadata schemas. Automating ontology-grounded extraction of software mentions from publications would close this gap and enrich the metadata foundation for downstream services such as the research copilots developed within NFDI4DS.
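To make the target representation concrete, the following minimal sketch emits a CodeMeta record as JSON-LD; the software name, repository URL, and other field values are hypothetical placeholders, not metadata from any real publication:

```python
import json

# A minimal CodeMeta record in JSON-LD. All values below are invented
# placeholders that only illustrate the schema's core properties.
record = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "ExampleVisTool",
    "description": "Prototype visualization tool described in a paper.",
    "programmingLanguage": "Python",
    "codeRepository": "https://example.org/example-vis-tool",
    "license": "https://spdx.org/licenses/MIT",
}
print(json.dumps(record, indent=2))
```

Records of this shape are what an extraction pipeline would need to produce for software that is only mentioned in prose.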
| Thesis Type | |
|---|---|
| Student | Diego López Benito |
| Status | Running |
| Proposal on | 29/04/2026 12:20 am |
| Proposal room | Seminar room I5 6202 |
| Presentation room | Seminar room I5 6202 |
| Supervisor(s) | Stefan Decker |
| Advisor(s) | Tim Holzheim |
| Contact | holzheim@dbis.rwth-aachen.de |
The goal of this thesis is to systematically compare ontology-grounded, LLM-based knowledge extraction approaches for identifying and structuring research software mentions in scientific publications against the CodeMeta ontology. The student will implement and adapt at least two representative extraction pipelines, benchmark them against DROPS ground-truth metadata, and stress-test them on further publications, such as those from the EuroVis proceedings, with additional analysis of how extraction context size affects result quality.
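The core idea shared by the candidate approaches can be sketched as a schema-grounded prompt: the target ontology's properties are rendered into the prompt so the model returns one value per slot. This is only an illustrative skeleton, not the actual SPIRES/OntoGPT, ODKE+, or OneKE implementation; the slot list, prompt wording, and line-based response format are assumptions, and the LLM call itself is left out:

```python
# Hypothetical subset of CodeMeta properties used as extraction slots.
CODEMETA_SLOTS = ["name", "description", "programmingLanguage",
                  "codeRepository", "license"]

def build_prompt(text: str) -> str:
    """Render the slot list into an extraction prompt for an LLM."""
    lines = ["Extract the research software mentioned in the text below.",
             "Return one line per field as `field: value`,",
             "or `field: NONE` if the text does not state it.",
             "Fields:"]
    lines += [f"- {slot}" for slot in CODEMETA_SLOTS]
    lines += ["", "Text:", text]
    return "\n".join(lines)

def parse_response(response: str) -> dict:
    """Parse `field: value` lines back into a partial CodeMeta dict."""
    out = {}
    for line in response.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")  # split at first colon only
        key, value = key.strip(), value.strip()
        if key in CODEMETA_SLOTS and value and value != "NONE":
            out[key] = value
    return out
```

Splitting at the first colon only keeps values such as repository URLs intact; comparing how each pipeline constrains and validates this slot-filling step is exactly where the architectural differences show up.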
Research Questions
- How do ontology-grounded extraction approaches (SPIRES/OntoGPT, ODKE+, OneKE) compare when extracting CodeMeta-structured software mentions from publications?
- What architectural and methodological differences most strongly influence extraction quality in this domain?
- How does extraction context size (abstract, sections, full text) affect completeness and accuracy?
- To what extent can these approaches be adapted to a specific target ontology, and what are the practical trade-offs?
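Answering the comparison questions above requires a common scoring scheme. One plausible choice, sketched here as an assumption rather than a prescribed method, is slot-level precision and recall against the DROPS ground truth, where a predicted field counts as correct only on an exact value match:

```python
def field_scores(pred: dict, gold: dict) -> tuple[float, float]:
    """Slot-level precision/recall of a predicted record vs. ground truth."""
    correct = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: one of two predicted fields matches the
# three-field gold record.
p, r = field_scores(
    {"name": "ExampleVisTool", "license": "MIT"},
    {"name": "ExampleVisTool", "license": "BSD-3",
     "codeRepository": "https://example.org/r"},
)
```

Running the same scoring over abstract-only, section-level, and full-text inputs would then directly address the context-size question.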
Prerequisites
- Solid Python skills
- Familiarity with LLMs and NLP concepts
- Basic understanding of ontologies and linked data (RDF, JSON-LD)
- Willingness to engage with the research software metadata landscape (CodeMeta, CFF, FAIR)
- Prior knowledge extraction experience is beneficial but not required