Doctoral defence of Himat Shah, MSc, 15 Nov 2022: Automatic keyword extraction for webpages

The doctoral dissertation in the field of Computer Science will be examined at the Faculty of Science and Forestry, Joensuu Campus.

What is the topic of your doctoral research? Why is it important to study the topic?

A keyword is a single word or a sequence of words (a keyphrase) in a text that provides a concise, high-level description of the content to readers. The extraction of keywords is a fundamental step in text summarization, information retrieval, topic model construction, clustering, and advertising systems. The terms keyword and keyphrase are often used interchangeably, but researchers typically define a keyword as a single word and a keyphrase as a group of words.

What are the key findings or observations of your doctoral research?

In this research work, we have developed three language-independent methods and one language-dependent method to extract keywords from webpages. Most existing methods rely on Natural Language Processing (NLP) techniques, including Part-of-Speech (POS) tagging, stemming, and lemmatization, which are language-dependent and make it difficult to generalize a method to other languages. This research aims to find a method that can be applied to webpages regardless of their language, by extracting only language-independent features. Extracting keywords from web documents is challenging for two reasons. The first is the presence of navigation bars, menus, comments, and advertisements. The second is the presence of multiple topics and multiple languages on the same page. It is therefore important to have a general keyword extraction method that does not rely on a particular language.

How can the results of your doctoral research be utilised in practice?

Our research proposes four new automatic keyword extraction methods: Hrank, D-rank, WebRank, and ACI-rank. All methods are based on statistical, structural, and language-related features. Frequent words are more likely to be good keywords, but simple counts can be misleading. In Finnish, for example, the most common words in any document are words such as “kuin” and “minä”, but they hardly serve as keywords. Instead, a good keyword is common on the particular web page but not on other pages. Good keywords are also scattered all over the page instead of being concentrated in one part. In addition, keywords more often have visual emphasis, such as a bigger or boldface font, and are more often used in section titles.
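The statistical intuition above can be illustrated with a minimal sketch. It is not an implementation of Hrank, D-rank, WebRank, or ACI-rank, whose details are not given here. It assumes a generic TF-IDF-style weight (frequent on this page, rare on other pages) multiplied by a simple spread measure (the fraction of equal-sized page segments in which the word appears). The function names `tokenize` and `score_keywords` and the `segments` parameter are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased runs of letters; no stemming, lemmatization, or POS tagging.
    return re.findall(r"[^\W\d_]+", text.lower())


def score_keywords(page, other_pages, segments=4):
    """Score words by in-page frequency, rarity on other pages, and spread.

    Illustrative assumption only: score = tf * idf * spread.
    """
    tokens = tokenize(page)
    if not tokens:
        return {}
    counts = Counter(tokens)
    other_sets = [set(tokenize(p)) for p in other_pages]
    n_docs = len(other_sets) + 1  # the other pages plus this page

    # Record which equal-sized segments of the page each word occurs in.
    seg_len = max(1, math.ceil(len(tokens) / segments))
    seen_in = {}
    for i, tok in enumerate(tokens):
        seen_in.setdefault(tok, set()).add(i // seg_len)

    scores = {}
    for word, tf in counts.items():
        df = 1 + sum(word in s for s in other_sets)  # pages containing the word
        idf = math.log(n_docs / df)  # 0 when the word is on every page
        spread = len(seen_in[word]) / segments
        scores[word] = tf * idf * spread
    return scores


page = ("kuin extraction web kuin extraction page "
        "kuin extraction text kuin extraction end")
others = ["kuin olen", "kuin web"]
scores = score_keywords(page, others)
best = max(scores, key=scores.get)
```

In this toy input, "kuin" is as frequent as "extraction" but occurs on every page, so its weight drops to zero, while "extraction" is both specific to the page and spread across all of its segments. Visual emphasis and section titles, which are also mentioned above, would need access to the HTML structure and are left out of this sketch.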

The doctoral dissertation of Himat Shah, MSc, entitled Automatic keyword extraction for webpages, will be examined at the Faculty of Science and Forestry. The Opponent will be Professor Jyrki Nummenmaa, Tampere University, and the Custos will be Professor Pasi Fränti, University of Eastern Finland. The language of the public defence is English.

For more information, please contact:

Himat Shah, himats@uef.fi, tel. +358 465 840 099