Informatik 5
Information Systems
Prof. Dr. M. Jarke
Contact

Prof. Dr. M. Jarke
RWTH Aachen
Informatik 5
Ahornstr. 55
D-52056 Aachen
Tel +49/241/8021501
Fax +49/241/8022321

Intelligent, Self-adapting Crawler for Internet Forums

Thesis type
  • Master
Student Jinhui Li
Status Finished
Submitted in 2012
Proposal on 15. May 2012 16:00
Proposal room Seminarraum I5
Supervisor(s)
Advisor(s)
Social media monitoring is becoming increasingly important for companies and organizations. They can use the overwhelming amount of information available in social media to analyze how customers receive their products, whether there are complaints about them, or even what their own employees think about the company, its executives, and its strategies.
IBM has built several products to isolate interesting postings from the rest and to analyze them further. However, those products are only semi-automated and require considerable effort, for example to add new data sources.
In this master thesis, a new crawler for internet forums is to be created that is intelligent enough to extract the contents of any internet forum, regardless of the framework and language used. Current solutions are not flexible enough and have to be modified for each new internet forum. The new solution should make manual modifications unnecessary and also be smart enough to extract as much context information about an author and a posting as possible. This information is useful in later steps to analyze and classify the reliability of a data source, its importance, and its urgency. The results of the new crawler then have to be imported automatically into an existing IBM product for further analysis.
The student is free to choose the technology, but the amount of information and the necessary integration with existing IBM products suggest the use of a massively parallel execution framework such as Hadoop or InfoSphere Streams.
