Ivan Kukanov, MSc, Doctoral defence 10 Dec 2021: Polyphonic sound event detection

The doctoral dissertation in the field of Computer Science will be examined at the Faculty of Science and Forestry online.

What is the topic of your doctoral research?

In everyday life, we experience polyphonic acoustic environments, in which sounds from multiple sources overlap. For example, a domestic environment contains human speech, video games, percussive sounds, broadband noise from household appliances, and more. Even speech is polyphonic, because several organs of the vocal tract are involved in producing it at the same time. The study of how the vocal tract produces these sounds is called articulatory phonetics. Our hearing system is able to separate multiple sound events and focus only on the events of interest. In addition, our brain can interpret what caused those sound events and which sounds belong together.

This dissertation focuses on developing and optimizing novel automatic methods for detecting polyphonic events in two settings: the domestic environment and phonetic articulation. The ability to detect acoustic environmental events is in high demand in voice assistants and data analytics.

What are the key findings or observations of your doctoral research?

In practice, polyphonic datasets for training automatic detection methods are frequently imbalanced. That is, there are many sounds for one class of events and few for another. Therefore, careful decision rules are needed to detect frequent and rare sounds equally well. To optimize polyphonic detection methods, cost-sensitive criteria are also needed; these criteria measure the cost of decisions and minimize the error at each iteration. The polyphonic decision rules and cost-sensitive criteria are proposed within the maximal figure-of-merit (MFoM) mathematical framework. The MFoM achieves state-of-the-art results, with an equal error rate of 12.4% in acoustic event detection tasks. The MFoM is also effective for detecting articulatory phonetic features, where it helps to separate multiple phonetic events. The dissertation shows that articulatory phonetic features benefit practical applications such as accent and spoken language recognition. Finally, we provide a mathematical analysis of the convergence of the MFoM. Specifically, analytical error bounds of the stochastic gradient are derived.
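For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of how it can be estimated for a single event class. The function name, the toy scores, and the labels are illustrative assumptions, and the dissertation's own evaluation procedure may differ.

```python
def equal_error_rate(scores, labels):
    """Estimate the equal error rate for one event class.

    scores: detector outputs, higher means "event present" (toy values here).
    labels: ground truth, 1 = event present, 0 = event absent.
    Returns (eer, threshold) at the threshold where the false acceptance
    rate and the false rejection rate are closest to each other.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    best_gap, best_eer, best_threshold = None, None, None
    for threshold in sorted(set(scores)):
        # Decide "event present" whenever the score reaches the threshold.
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if s >= threshold and y == 0
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if s < threshold and y == 1
        )
        far = false_accepts / negatives
        frr = false_rejects / positives
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, threshold
    return best_eer, best_threshold


# Hypothetical scores for eight clips and whether the event really occurred.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
```

In a polyphonic task, one common choice is to compute this value for each event class separately and then average, so that rare classes count as much as frequent ones.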

How can the results of your doctoral research be utilised in practice?

The ability to detect acoustic environmental events and articulatory phonetic features is in demand in voice assistants and data analytics. Many voice assistants use automatic speech recognition (ASR) to understand and interact with customers. Speech and audio technology systems can go beyond speech recognition and understand many other sounds in the environment around us. In smart homes, sound event detection can be applied to security, for example by recognizing glass breaking, gunshots, and fire alarms. In self-driving cars, acoustic event detection acts like a hearing system that helps to detect safety-critical sounds, such as sirens and honking horns. One application tackled in this dissertation is domestic sound classification, where a single audio recording can contain one or more acoustic events and a recognizer labels all of them. The other applications considered in this dissertation are accent and spoken language recognition with universal articulatory phonetic features.
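The idea that a recognizer labels every event in a recording, rather than choosing a single class, can be illustrated with a short sketch. It uses plain per-class thresholding, which is a simplification and not the MFoM decision rule proposed in the dissertation. The class names, scores, and thresholds are hypothetical.

```python
def detect_events(class_scores, thresholds, default_threshold=0.5):
    """Return every event label whose score reaches its class threshold.

    Unlike single-label classification, which keeps only the top-scoring
    class, each class is decided independently, so a clip can receive
    zero, one, or several labels.
    """
    return sorted(
        label
        for label, score in class_scores.items()
        if score >= thresholds.get(label, default_threshold)
    )


# Hypothetical detector scores for one domestic audio clip.
clip_scores = {"speech": 0.91, "dishes": 0.35, "vacuum_cleaner": 0.08, "alarm": 0.62}

# Hypothetical per-class thresholds; a rare class may be given a lower
# threshold so that it is not systematically missed.
class_thresholds = {"speech": 0.5, "dishes": 0.3, "vacuum_cleaner": 0.5, "alarm": 0.7}

detected = detect_events(clip_scores, class_thresholds)
```

Here the clip is labelled with both "dishes" and "speech", while "alarm" and "vacuum_cleaner" fall below their thresholds.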

What are the key research methods and materials used in your doctoral research?

For both applications, environmental sound classification and articulatory phonetic feature detection, deep neural networks (DNNs) are used to model the sound classes. Deep learning is a machine learning method that learns multi-layer representations of data. The articulatory phonetics model was trained on 7 hours of multi-language speech (Spanish, Mandarin, Japanese, Hindi, German, English). Throughout the series of papers presented in this dissertation, we explored multiple deep learning architectures, namely a deep multi-layer feed-forward network, 1D and 2D convolutional neural networks (1D-CNN and 2D-CNN), and, finally, a convolutional recurrent neural network (CRNN). The models are optimized using the proposed MFoM criteria with the polyphonic decision rules.

The doctoral dissertation of Ivan Kukanov, MSc, entitled Polyphonic Sound Event Detection: Phonetic Features and Environmental Sounds, will be examined at the Faculty of Science and Forestry, online, on 10 December 2021 at 10 am. The opponent will be Staff Engineer Cheung-Chi Leung, Alibaba Group, and the custos will be Senior Researcher Ville Hautamäki, University of Eastern Finland. The language of the public defence is English.

For further information, please contact:

Ivan Kukanov, ivan (a),

Public examination

Photo available for download (to be updated)

Dissertation book online