

Prof. Dr. S. Decker
RWTH Aachen
Informatik 5
Ahornstr. 55
D-52056 Aachen
Tel +49/241/8021501
Fax +49/241/8022321






Metadata-Based Fact Extraction from Wikipedia

Thesis type
  • Bachelor
Student Johannes Karoff
Status Finished
Submitted in 2013
Proposal on 04. Sep 2012 15:00
Proposal room Seminarraum I5
Presentation on 29. Jan 2013 14:15
Presentation room Seminarraum I5

Information extraction is of high interest for question answering systems that reason over large amounts of data. Unlike information retrieval, which takes a document-centric approach and returns documents matching the queried keywords, information extraction aims to extract structured data from documents, allowing exact answers to queries. Furthermore, information extraction is useful for building ontologies and knowledge bases, which enable semantic search over structured data. On the World Wide Web, people create information every day and share their knowledge with the community. This produces a huge amount of (mainly) unstructured information, e.g. newspapers, encyclopedias, and newsgroups. Wikipedia serves as a daily updated encyclopedia backed by a large community. Furthermore, important articles are protected, so that only administrators or approved users are allowed to edit them. This reduces the risk of corruption or vandalism. As a result, Wikipedia achieves a certain measure of quality assurance.

This thesis targets information extraction from the English version of Wikipedia. The main goal is to extract information from natural language text in the form of triples (subject, predicate, object). Such a triple can be interpreted as a relation(ship) between two entities, the subject and the object. Triples of this form enable reasoning over structured data derived from previously unstructured text.
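
To make the triple representation concrete, the following is a minimal Python sketch of how (subject, predicate, object) triples might be stored, queried, and used for a simple reasoning step. It is not taken from the thesis: the names `Triple`, `match`, `located_in`, the predicate labels `locatedIn` and `instanceOf`, and the three sample facts are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """A (subject, predicate, object) fact. Field names are illustrative."""
    subject: str
    predicate: str
    obj: str


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return all triples matching the given fields; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def located_in(triples: List[Triple], entity: str) -> List[str]:
    """Follow hypothetical 'locatedIn' relations transitively from an entity."""
    found: List[str] = []
    seen = {entity}
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for t in match(triples, subject=current, predicate="locatedIn"):
            if t.obj not in seen:
                seen.add(t.obj)
                found.append(t.obj)
                frontier.append(t.obj)
    return found


# Hand-written sample facts (assumed for illustration, not extracted output).
FACTS = [
    Triple("RWTH Aachen", "locatedIn", "Aachen"),
    Triple("Aachen", "locatedIn", "Germany"),
    Triple("Wikipedia", "instanceOf", "encyclopedia"),
]

if __name__ == "__main__":
    # Exact lookup: which facts use the predicate 'instanceOf'?
    print(match(FACTS, predicate="instanceOf"))
    # Simple reasoning: chain 'locatedIn' facts to derive an indirect location.
    print(located_in(FACTS, "RWTH Aachen"))
```

The `located_in` helper shows the kind of inference that structured triples allow: no single fact states that RWTH Aachen is in Germany, but chaining two triples yields it, which a keyword search over raw text could not do directly.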
