NDA

AI-Powered Pronunciation Correction for Children’s Speech Therapy

Key results

    • Achieved 86% accuracy in phoneme recognition
    • Reduced the manual workload for speech therapists
SERVICES
Audio processing
Speech recognition

INDUSTRY
Speech therapy
Healthcare

TECHNOLOGIES USED
Python
AWS
Terraform
FastAPI
Gradio
Transformers
PyTorch
Accelerate
Wav2Vec2

TEAM
2-4 AI engineers

LOCATION
United Kingdom

PROJECT DURATION
1 year
About the Client

The client provides online counseling and speech and language therapy for children and young people, with a mission to make therapeutic support more accessible and engaging through innovative technology.

Their platform hosts therapy sessions enriched with interactive activities, creative tools, and games. This gamified approach helps reduce anxiety, encourages participation, and supports a wide range of therapeutic methods.

To expand the platform’s capabilities, the client partnered with It-Jim to explore how AI could automate parts of speech therapy – starting with a system that detects and corrects children’s pronunciation mistakes using speech recognition.

The Challenges & Goals

The project set out to solve a key challenge in paediatric speech therapy: automatically identifying and tracking pronunciation mistakes without constant therapist involvement. The goal was to develop an AI system that detects mispronounced phonemes in real time, handles non-linear speech patterns, and delivers feedback as accurately as – or better than – a trained therapist. But children’s speech brings a unique set of challenges:

Unpredictable articulation and speech variability

Children’s pronunciation changes as they grow, making phoneme boundaries inconsistent and difficult for traditional models to interpret.

Diverse accents and speech patterns

Regional and individual pronunciation differences required the model to handle a wide range of phonetic variations.

Noisy real-world environments

Sessions often take place outside ideal recording conditions, with background noise, interruptions, or low-quality microphones.

Limited training data for children’s speech

Few existing datasets capture the variety of children’s voices and speech errors needed to train accurate recognition models.

Solution Overview

We brought the vision to life through a multi-stage approach, from research and model selection to dataset creation and deployment. Each step addressed specific technical and data challenges while aligning with therapy needs. Using advanced speech recognition, phoneme-level training, and a custom child speech dataset, we built an AI system that delivers reliable, actionable feedback even in noisy, real-world environments.

Stage 1: Framing the Problem and Model Selection

The client’s vision was clear: automate pronunciation correction for children. But traditional speech recognition models weren’t designed for the nuances of kids’ speech, especially with accents, articulation variability, and noise.

What we did:

  • Identified phoneme recognition as the most effective approach for tracking mispronunciations.
  • Selected Wav2Vec 2.0, a self-supervised model known for handling raw audio and noisy data.
  • Proposed a fine-tuning strategy tailored to children’s speech patterns.
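As a rough illustration of this approach, phoneme-level transcription with Wav2Vec 2.0 can be sketched as below. The checkpoint name is an assumption — a public multilingual phoneme model standing in for the project's fine-tuned one — and the heavy dependencies are imported lazily inside the function:

```python
# Sketch only: phoneme transcription with a public multilingual Wav2Vec 2.0
# checkpoint as a stand-in for the production model fine-tuned on child speech.
SAMPLE_RATE = 16_000  # Wav2Vec 2.0 expects 16 kHz mono audio
MODEL_ID = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"  # assumed stand-in checkpoint

def transcribe_phonemes(waveform):
    """Map a 16 kHz mono waveform (array of floats) to a phoneme string."""
    # Imported lazily so the sketch can be read without the heavy dependencies.
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    inputs = processor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)[0]
```

Because the model outputs phonemes rather than words, each mispronounced sound shows up directly in the transcription instead of being silently "corrected" by a word-level language model.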

This gave the project a clear path forward and the confidence that the goal was technically achievable.


Stage 2: Adapting Wav2Vec 2.0 for Children

Wav2Vec 2.0 had strong potential, but it needed careful fine-tuning. Children’s speech is less structured and more variable than adult audio, which makes direct transcription difficult.

Key steps:

  • Fine-tuned the model using Connectionist Temporal Classification (CTC) loss, aligning unsegmented audio with phoneme sequences.
  • Integrated a multilingual variant of Wav2Vec 2.0 to handle diverse phonetic inputs.
  • Added logic to handle non-linear, irregular speech patterns, making the model more robust.
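The alignment idea behind CTC can be shown with a toy greedy-decoding sketch: per-frame predictions are collapsed (repeats merged, blanks dropped) into a phoneme sequence, which is why no frame-by-frame labels are needed even when a child stretches or rushes a sound. The phoneme notation below is illustrative:

```python
# Toy illustration of the CTC decoding rule used after a Wav2Vec 2.0 + CTC model:
# merge repeated frame labels and drop the blank symbol to recover the phonemes.
BLANK = "_"

def ctc_collapse(frame_labels):
    """Collapse per-frame CTC outputs into an unsegmented phoneme sequence."""
    phonemes = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            phonemes.append(label)
        prev = label
    return phonemes

# A child stretching "cat" -> /k ae t/: each phoneme spans a variable number of frames.
frames = ["_", "k", "k", "_", "ae", "ae", "ae", "_", "_", "t", "t"]
print(ctc_collapse(frames))  # -> ['k', 'ae', 't']
```

The blank symbol also lets the model keep genuinely repeated phonemes apart (`a _ a` decodes to two `a`s), which matters for irregular child speech.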

The result was a fine-tuned model capable of understanding real-world, imperfect children's speech.


Stage 3: Building a Custom Dataset

Data availability was a critical roadblock. Existing datasets lacked both the quantity and diversity needed for children’s phoneme-level speech.

How we addressed it:

  • Built a custom dataset with recorded child speech across varied conditions.
  • Partnered with speech specialists for accurate phoneme-level annotation.
  • Created a validation pipeline to automate checks and reduce expert workload.
  • Applied feature engineering to capture markers of disordered or unclear speech.
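One automated check from such a validation pipeline might look like the following hypothetical sketch: flag clips whose duration cannot plausibly match the annotated phoneme count, so specialists only review suspicious items. The function name and thresholds are illustrative, not the project's actual implementation:

```python
# Hypothetical sanity check from an annotation validation pipeline:
# flag clips whose length is implausible for the number of annotated phonemes.
def flag_suspicious(clip_seconds, phoneme_labels,
                    min_phoneme_s=0.03, max_phoneme_s=0.6):
    """Return human-readable reasons why an annotation looks wrong (empty = OK)."""
    reasons = []
    if not phoneme_labels:
        reasons.append("no phoneme annotation")
        return reasons
    per_phoneme = clip_seconds / len(phoneme_labels)
    if per_phoneme < min_phoneme_s:
        reasons.append("too many phonemes for clip length")
    if per_phoneme > max_phoneme_s:
        reasons.append("clip much longer than annotation suggests")
    return reasons

print(flag_suspicious(0.1, ["k", "ae", "t", "s", "p", "r"]))  # flagged: too dense
print(flag_suspicious(0.8, ["k", "ae", "t"]))                 # passes: []
```

Cheap rule-based checks like this cut the volume of items experts must inspect by hand, which is the workload reduction the pipeline was built for.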

The dataset became the foundation that enabled real-world accuracy rather than just lab success.


Stage 4: Delivering the Solution

With the model trained and validated, we moved into deployment, focusing on usability, monitoring, and production-readiness.

Key results:

  • Achieved 86% accuracy, outperforming the 70% baseline of manual therapist assessment.
  • Delivered actionable phoneme-level feedback for children’s pronunciation.
  • Created a production-ready inference pipeline with built-in analytics.
  • Deployed a CI/CD system to streamline future updates and improvements.
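The phoneme-level feedback step can be illustrated with a small sketch: align the recognized phonemes against the target pronunciation and report each mismatch. The alignment here uses Python's standard-library `difflib.SequenceMatcher` as a stand-in; the phoneme notation and feedback wording are examples, not the product's actual output:

```python
# Illustrative sketch of phoneme-level feedback: align recognized phonemes
# against the target pronunciation and describe each mismatch.
from difflib import SequenceMatcher

def pronunciation_feedback(target, spoken):
    """Compare target vs recognized phoneme lists; return feedback strings."""
    feedback = []
    matcher = SequenceMatcher(a=target, b=spoken, autojunk=False)
    for op, t1, t2, s1, s2 in matcher.get_opcodes():
        if op == "replace":
            feedback.append(
                f"said {' '.join(spoken[s1:s2])} instead of {' '.join(target[t1:t2])}"
            )
        elif op == "delete":
            feedback.append(f"missed {' '.join(target[t1:t2])}")
        elif op == "insert":
            feedback.append(f"added extra {' '.join(spoken[s1:s2])}")
    return feedback

# "rabbit" /r ae b ih t/ spoken with a common /r/ -> /w/ substitution:
print(pronunciation_feedback(["r", "ae", "b", "ih", "t"],
                             ["w", "ae", "b", "ih", "t"]))
# -> ['said w instead of r']
```

Aggregating such mismatches per phoneme over many sessions is what turns raw recognition output into the progress analytics mentioned above.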

This collaboration demonstrated how AI can automate key parts of speech therapy, delivering accurate phoneme-level feedback to children while freeing therapists from manual assessment work.


Let’s Talk About Building AI That Actually Helps People

Whether you’re working on health tech, education, or human development – we’re here to bring AI into the real world with you. No buzzwords. No overpromising. Just a thoughtful conversation with our technical team to explore your idea.


