
Children use computers.

Doctoral defence of Vishwanath Pratap Singh B. Tech, 21.5.2026: Advancing automatic speech recognition and speaker verification for children: Causality, data augmentation, and guided fine-tuning

The doctoral dissertation in the field of Computer Science will be examined at the Faculty of Science, Forestry and Technology, Joensuu campus

What is the topic of your doctoral research? Why is it important to study the topic?

My doctoral research focuses on developing robust automatic speech recognition (ASR) and automatic speaker verification (ASV) systems for children. Although modern speech technologies perform remarkably well for adults, they struggle to accurately recognize and verify children’s speech. This performance gap limits the reliability of voice-enabled educational tools, digital assistants, language-learning platforms, and child-oriented interactive technologies.

Studying this topic is important because children increasingly interact with voice-controlled devices in education, entertainment, and online services. However, current systems are primarily trained on adult speech and fail to generalize effectively to children due to physiological, cognitive, and environmental differences. 

In addition, high-quality children’s speech data is scarce, making model development particularly challenging. This research addresses these limitations and develops data-efficient and explainable methods to build more robust and inclusive speech technologies for children.

What are the key findings or observations of your doctoral research?

The dissertation presents the following methodological advances for improving speech technologies for children. First, it introduces physiology-guided data augmentation methods that simulate children’s speech from adult speech recordings in zero-resource settings. This helps address the scarcity of child-specific speech data without requiring large new datasets.
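One common building block behind such augmentation is speed perturbation: resampling an adult recording by a factor above 1 raises pitch and formant frequencies together, crudely approximating a child’s shorter vocal tract. The sketch below illustrates only this general idea; the function name, interface, and factor values are hypothetical and not the dissertation’s actual algorithm.

```python
def speed_perturb(samples, factor):
    """Resample a waveform by linear interpolation.

    factor > 1 shortens the signal, shifting pitch and formants upward
    together -- a crude, illustrative proxy for child-like speech.
    """
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor          # fractional read position in the input
        j = int(pos)
        frac = pos - j
        if j + 1 < len(samples):  # interpolate between neighbouring samples
            out.append(samples[j] * (1 - frac) + samples[j + 1] * frac)
        else:
            out.append(samples[j])
    return out
```

In practice, production toolkits apply such perturbation with proper anti-aliasing resamplers rather than raw linear interpolation; the point here is only that a single resampling factor jointly scales the time axis and the spectral envelope.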

Second, the research develops a reinforcement learning–based framework that automatically determines the optimal balance between original and augmented data during training. The results show that excessive augmentation can degrade performance, while dynamically controlled augmentation improves robustness and generalization.
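The idea of dynamically controlling how much augmented data enters training can be pictured as a multi-armed bandit, where each "arm" is a candidate mixing ratio and the reward is validation improvement. This epsilon-greedy sketch is a generic stand-in under that assumption, not the dissertation’s reinforcement learning framework.

```python
import random

def choose_arm(q_values, eps):
    """Epsilon-greedy choice over augmentation-ratio 'arms':
    explore a random ratio with probability eps, else exploit the
    ratio with the best value estimate so far."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def update_arm(q_values, counts, arm, reward):
    """Incremental-mean update of the chosen arm's value estimate,
    e.g. with reward = validation-accuracy gain after a training step."""
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]
```

A controller like this naturally backs off from heavy augmentation when its measured reward drops, which mirrors the reported finding that excessive augmentation degrades performance.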

Third, the dissertation proposes a causality-based explainability framework to analyze why speech systems fail on children’s speech. Unlike earlier studies that examined factors independently, this work models the relationships among physiological, cognitive, and environmental factors to identify their individual and combined impact on recognition errors. The findings reveal that commonly observed age-related performance degradation is strongly influenced by other interacting factors.
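Conceptually, such causal analysis contrasts a naive comparison of error rates across groups with a confounder-adjusted one via the backdoor formula, E[Y | do(X=x)] = Σ_z P(z) · E[Y | X=x, Z=z]. The toy sketch below (all names and data hypothetical, with a single observed confounder such as recording noise) shows only this adjustment step, not the dissertation’s framework.

```python
from collections import defaultdict

def adjusted_error_rate(records, x_key, x_value, z_key, y_key):
    """Backdoor-adjusted mean outcome under do(X = x_value),
    averaging stratum means E[Y | X, Z=z] weighted by P(z)."""
    z_counts = defaultdict(int)
    for r in records:
        z_counts[r[z_key]] += 1
    total = len(records)
    result = 0.0
    for z, cz in z_counts.items():
        stratum = [r[y_key] for r in records
                   if r[x_key] == x_value and r[z_key] == z]
        if stratum:  # skip empty strata in this toy version
            result += (cz / total) * (sum(stratum) / len(stratum))
    return result
```

Comparing this adjusted rate across age groups separates the effect of age itself from co-varying factors, which is the spirit of the finding that age-related degradation is strongly shaped by interacting factors.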

Finally, the research introduces a data-efficient guided fine-tuning strategy for large speech foundation models. Using attribution-based analysis, the method identifies the most important layers for adaptation to children’s speech, reducing computational cost and data requirements.
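The layer-selection step can be sketched as ranking layers by an attribution score (for instance, a gradient-norm proxy) and unfreezing only the top-k while the rest stay frozen. The function and layer names below are illustrative assumptions, not the dissertation’s exact procedure.

```python
def select_layers_to_tune(layer_scores, k):
    """Given per-layer attribution scores, return the k layers to keep
    trainable during fine-tuning; all other layers remain frozen,
    cutting compute and the amount of child data needed."""
    ranked = sorted(layer_scores, key=layer_scores.get, reverse=True)
    return set(ranked[:k])
```

In a typical workflow, the scores would come from one attribution pass over a small child-speech set, after which only the selected layers receive gradient updates.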

How can the results of your doctoral research be utilised in practice?

The results of this research can be applied to improve the reliability and inclusiveness of speech-enabled technologies used by children. The proposed methods can help developers build more accurate voice assistants, educational applications, reading tutors, pronunciation assessment systems, and child-oriented interactive devices.

The data augmentation and guided fine-tuning methods are especially valuable in low-resource settings where only limited children’s speech data is available. This can reduce development costs and improve accessibility for languages and regions with scarce speech resources. The causal explainability framework can also help researchers and engineers better understand why speech systems fail, enabling the design of fairer and more inclusive AI for children.

What are the key research methods and materials used in your doctoral research?

This doctoral research combined methods from deep learning, speech signal processing, and causal analysis. The work focused on three interconnected themes: data augmentation, causal explainability, and guided fine-tuning of speech foundation models for children’s speech. Publicly available children’s speech corpora, including CSLU Kids and MyST, were used together with adult speech datasets for augmentation and adaptation experiments. 

The study developed physiology-guided algorithms to simulate children’s speech from adult speech and applied reinforcement learning to optimize the use of augmented data during training. To investigate recognition errors, causal graphical models and explainability frameworks were designed to analyze physiological, cognitive, and environmental factors. Attribution-based methods were also used to identify the most important layers for efficient model adaptation.

The doctoral dissertation of Vishwanath Pratap Singh, B. Tech, entitled Advancing automatic speech recognition and speaker verification for children: Causality, data augmentation, and guided fine-tuning will be examined at the Faculty of Science, Forestry and Technology, Joensuu campus. The opponent will be Professor Mikko Kurimo, Aalto University, and the custos will be Professor Tomi H. Kinnunen, University of Eastern Finland. The language of the public defence is English.

For further information, please contact: 

Vishwanath Pratap Singh, [email protected], tel. +91 700 240 7942