Dynamic Topic Mining for Visual Analytics on Large Document Collections
Thesis type | |
---|---|
Student | Nikou Günnemann-Gholizadeh |
Status | Finished |
Submitted in | 2013 |
Proposal on | 04. Sep 2012 15:30 |
Proposal room | Seminarraum I5 |
Presentation on | 21. Mar 2013 10:30 |
Presentation room | Bibliothek I5 |
Supervisor(s) | |
Advisor(s) |
This thesis will conceive, implement and deploy a dynamic topic modelling approach to expose topic dynamics within existing large community mediabases. The objective is to identify topics, as well as their bursts and shifts over time, across the different kinds of media in a community, e.g. blogs, wikis, research projects, and published papers.
A Community Mediabase is a set of databases which comprise different media artifacts relevant to a specific community, as well as tools to access that data. The artifacts typically include blogs, wikis, newslists, and similar social software artifacts; they may also include other relevant information like the community's collaboration networks, shared projects, and publications. In the scope of the TEL-Map EU project, a Community Mediabase for Technology Enhanced Learning (TEL) was created including databases for TEL projects, publications and blogs. Also, social network analysis (SNA) was performed on these data sets to identify the most relevant authors, projects, organizations, etc. in TEL [1].
The aim of this thesis is to complement the SNA approach with a semantic view by conceiving, implementing and deploying a probabilistic topic modelling approach to expose topic dynamics within the Community Mediabase data. Topic modelling is an emerging field of unsupervised machine learning (see [2]) for which library implementations already exist. It extracts topics from a text source under a "bag of words" paradigm: each topic is defined by a probability distribution over linguistic terms, and each document is deemed to be about several topics, each with a different weighting. A number of algorithms exist to "reverse engineer" these distributions from the actual document content.
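The model described above can be illustrated with a minimal sketch. It assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling as one example of such an algorithm; the announcement does not say which algorithm or library the thesis uses. The function names `lda_gibbs` and `top_terms`, the hyperparameters `alpha` and `beta`, and the toy documents are all hypothetical and chosen only for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenised (bag-of-words) documents.

    Returns the vocabulary, the per-document topic weights (theta) and
    the per-topic term distributions (phi).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_terms = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_terms for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Assign every token to a random topic to start with.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic: P(topic | document) * P(term | topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_terms * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the two families of distributions.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + n_terms * beta)
         for v in range(n_terms)]
        for t in range(n_topics)
    ]
    return vocab, theta, phi


def top_terms(vocab, phi_row, n=3):
    """Return the n most probable terms of one topic."""
    ranked = sorted(range(len(vocab)), key=lambda v: phi_row[v], reverse=True)
    return [vocab[v] for v in ranked[:n]]


# Hypothetical toy corpus; real input would be tokenised Mediabase texts.
docs = [
    "learning course student learning teacher".split(),
    "student course teacher classroom learning".split(),
    "network graph author network centrality".split(),
    "graph author centrality network project".split(),
]
vocab, theta, phi = lda_gibbs(docs, n_topics=2)
for t, row in enumerate(phi):
    print("topic", t, top_terms(vocab, row))
```

Each row of `theta` gives the topic weighting of one document, and each row of `phi` gives the term distribution of one topic. Averaging the `theta` rows of all documents published in the same time slice would be one simple way to follow a topic's share over time.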
If you are interested in this thesis, please contact Dr. Michael Derntl.
References:
- [1] M. Derntl et al.: Mediabase ready and first analysis report. TEL-Map deliverable D4.3.
- [2] D.M. Blei: Introduction to Probabilistic Topic Models. Princeton University.
Prerequisites
We expect the student to have existing skills in SQL and OO programming, and to have an interest in and/or experience with data analysis, text mining, topic modelling, information visualisation and R (the language and environment for statistical computing).