Tomi Kinnunen.

Speech deepfakes continue to challenge researchers

Creating speech deepfakes is becoming increasingly easy. Not so long ago, the Finnish language still posed an obstacle, but not anymore.

  • Text: Marianne Mustonen
  • Photos: Niko Jouhkimainen

“Today, anyone can create a speech deepfake. In the past, it took greater technical dedication, but nowadays, numerous voice cloning services are available to virtually anyone,” says Professor Tomi Kinnunen of the School of Computing at the University of Eastern Finland.

Speech synthesis could, in principle, be used to deceive biometric authentication systems, to make scam calls or to spread disinformation on social media. Therefore, it is essential to understand when automatic systems and humans can be deceived, and to develop countermeasures accordingly.

“Such countermeasures include, for instance, speech deepfake detection and deepfake source tracing, that is, identifying the voice cloning or synthesis software used to create the deepfake. In the case of biometric authentication, the aim is to improve the robustness of systems against various attacks,” Kinnunen notes.

“Neural networks and artificial intelligence are widely used in research in this field. Personally, however, I’ve felt it important to move on to more interpretable methods in which the detection method can ‘justify’ its decisions.”

Developing automated deepfake detection

Speech as a field of research is rapidly evolving, and there is plenty to investigate. Speech research has an interdisciplinary focus, drawing on machine learning, data collection, speech sciences and explainable AI.

According to Kinnunen, deepfake research is like playing cat and mouse. Recent years have seen significant advances in the accuracy of detection methods and countermeasures, but model generalisation remains a major challenge.

“Machine learning is based on fitting models to large sets of training data, and models can easily overfit to the training data used. As a result, the detection of speech deepfakes created with previously unseen synthesis techniques becomes difficult,” he explains.

“An additional challenge arises from the fact that real-world deepfakes often contain encoded or compressed speech, which masks the artefacts produced by speech synthesis. This makes detection more difficult.”

Speech technology research relies on signal processing and machine learning, which in practice means deep neural network models trained on large datasets.
