Project Group: Web Data Mining

Tutors: Dr. Marcus Handte, Arman Arzani

An important goal of the University of Duisburg-Essen is to increase the number of startups that turn the university's innovative research results into sustainable businesses. To reach this goal, researchers who have produced promising results need to be connected with the university's business advisors and innovation coaches, who help scientists launch new businesses successfully.

The goal of this project group is to design and implement a web-based tool that supports the university's advisors and coaches in identifying research groups and researchers who are working on innovative topics. As its primary input, the tool shall process the university's web pages and automatically extract relevant information. Examples include:

  • The organization and structure of the university (names of the faculties, research groups and researchers, etc.)
  • The research projects of the different research groups (project name and topic, project duration, funding scheme and budget, etc.)
  • The publications of the different researchers (authors, title, type of publication, etc.)
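
The kinds of information listed above could be captured in a small data model. The following is only a minimal sketch in Python: the class and field names (ResearchGroup, Project, Publication, budget_eur, and so on) are assumptions made for illustration and are not prescribed by the project description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Publication:
    """One publication; all field names are illustrative assumptions."""
    title: str
    authors: List[str]
    kind: str  # type of publication, e.g. "journal article"


@dataclass
class Project:
    """One research project; optional fields may be missing on a web page."""
    name: str
    topic: str
    funding_scheme: Optional[str] = None
    budget_eur: Optional[float] = None
    start_year: Optional[int] = None
    end_year: Optional[int] = None


@dataclass
class ResearchGroup:
    """A research group within a faculty, with its people and output."""
    name: str
    faculty: str
    researchers: List[str] = field(default_factory=list)
    projects: List[Project] = field(default_factory=list)
    publications: List[Publication] = field(default_factory=list)
```

Fields that are often absent from a web page (budget, duration) are optional here so that partially extracted records can still be stored.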

In addition, the project group shall develop a simple web-based application that enables the business advisors and innovation coaches to browse and search the extracted information.

From a technical perspective, the project will encompass the development and integration of a web crawler, a search index, a data mining framework with the associated templates to extract the desired information, and a web application to access the data. For the web crawler and search index, we currently plan to use Apache Nutch and Elasticsearch. Students are free to choose the technologies used for the actual data mining.
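
To illustrate what an extraction template might look like, the following minimal sketch uses only the Python standard library. It assumes, purely for illustration, that a template is a mapping from CSS class names to field names; the class names, the sample HTML and the helper extract_record are hypothetical and do not describe the framework the project group will actually build.

```python
from html.parser import HTMLParser


class TemplateExtractor(HTMLParser):
    """Collect the text of elements whose CSS class appears in a template."""

    def __init__(self, template):
        super().__init__()
        self.template = template  # maps CSS class name -> field name
        self.record = {}          # extracted field name -> text
        self._stack = []          # (tag, field name or None) per open element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matched = next((self.template[c] for c in classes if c in self.template), None)
        self._stack.append((tag, matched))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags such as <br>.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Assign text to the innermost enclosing element that matched the template.
        matched = next((f for _, f in reversed(self._stack) if f), None)
        text = data.strip()
        if matched and text:
            self.record[matched] = (self.record.get(matched, "") + " " + text).strip()


def extract_record(html, template):
    """Apply a template to one HTML page and return the extracted fields."""
    extractor = TemplateExtractor(template)
    extractor.feed(html)
    return extractor.record


# Hypothetical template and page fragment (invented for this example).
PROJECT_TEMPLATE = {"project-title": "name", "project-funding": "funding_scheme"}
SAMPLE_HTML = (
    '<div class="project"><h2 class="project-title">Example Project</h2>'
    '<span class="project-funding">Example Fund</span></div>'
)
record = extract_record(SAMPLE_HTML, PROJECT_TEMPLATE)
```

In practice, each page layout on the university's site would likely need its own template, and the extracted records would then be written to the search index.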

From a theoretical perspective, the project group covers fundamental concepts of web search and data mining, including web crawling and search as well as data extraction and information integration. In addition, the participants will prepare individual seminar talks and papers on selected research topics related to web search and data mining.

Admission to this course is managed centrally. If you have any questions, please contact marcus.handte@uni-due.de.