Improving clustering methods for better data analysis

1.12.2020

Technology and innovations

Increase in the amount, variety and complexity of data makes it more difficult to understand or process information. Data clustering can help in these challenges by summarizing the data or finding patterns in it. It works by forming groups of the objects of a dataset so that the objects in the same group are more similar to each other than to objects in other groups. Data clustering has a major role in research and data analysis. For example, in 2019, according to Google Scholar, the word "clustering" appeared in 128 000 research articles.

K-means is one of the most popular clustering algorithms. It is used widely in many types of data analysis applications and is included in a large number of statistical, numerical and data-analysis software such as SPSS, NCSS, SAS, STATA, R, Matlab and CrimeStat. Still, despite its popularity, much remains unknown about it. For example, it is known that k-means frequently produces less than optimal results. But it has been unknown exactly when and why it fails.

The doctoral thesis of Sami Sieranoja investigates the situations when k-means fails. It provides important information about the limitations of k-means and how to tune it to get better quality results. It is often assumed that clustering algorithms work better when clusters are more separated. The thesis shows that the opposite is true in case of k-means. It works better when clusters have some overlap. The separation of clusters is the most significant factor causing errors in results. Knowing this deficiency, it can be mostly overcome by using a better initialization technique and multiple repeats.

Many clustering algorithms are limited to small datasets. The thesis proposes a new methodology that makes a known Density Peaks clustering 99% faster for large datasets. This helps to scale the method to large datasets of over 1 million in size.

A key part in achieving this speed up is to use a data structure called k nearest neighbour graph. It connects each data object to the most similar objects in the data. The thesis proposes two fast methods for constructing this graph. The kNN graph has also applications beyond clustering such as in recommendation systems. For example, in Spotify it is used to recommend other artists which are similar as some artist the user has listened to.

The doctoral dissertation of Master of Science Sami Sieranoja, entitled Clustering with kNN graph and k-means will be examined at the Faculty of Science and Forestry on the 3rd of December online. The opponent in the public examination will be Professor Julius Žilinskas of Vilnius University, and the custos will be Professor Pasi Fränti of the University of Eastern Finland. The public examination will be held in English.

Photo available for download

Link to the event

Link to the dissertation