NDA

AI-Powered Pronunciation Correction for Children’s Speech Therapy

Key results

    • Achieved 86% accuracy in phoneme recognition
    • Reduced the manual workload for speech therapists
SERVICES
Audio processing
Speech recognition

INDUSTRY
Speech therapy
Healthcare

TECHNOLOGIES USED
Python
AWS
Terraform
FastAPI
Gradio
Transformers
PyTorch
Accelerate
Wav2Vec2

TEAM
2-4 AI engineers

LOCATION
United Kingdom

PROJECT DURATION
1 year
About the Client

The client provides online counseling and speech and language therapy for children and young people, with a mission to make therapeutic support more accessible and engaging through innovative technology.

Their platform hosts therapy sessions enriched with interactive activities, creative tools, and games. This gamified approach helps reduce anxiety, encourages participation, and supports a wide range of therapeutic methods.

To expand the platform’s capabilities, the client partnered with It-Jim to explore how AI could automate parts of speech therapy – starting with a system that detects and corrects children’s pronunciation mistakes using speech recognition.

The Challenges & Goals

The project set out to solve a key challenge in paediatric speech therapy: automatically identifying and tracking pronunciation mistakes without constant therapist involvement. The goal was to develop an AI system that detects mispronounced phonemes in real time, handles non-linear speech patterns, and delivers feedback as accurately as – or better than – a trained therapist. But children’s speech brings a unique set of challenges:

Unpredictable articulation and speech variability

Children’s pronunciation changes as they grow, making phoneme boundaries inconsistent and difficult for traditional models to interpret.

Diverse accents and speech patterns

Regional and individual pronunciation differences required the model to handle a wide range of phonetic variations.

Noisy real-world environments

Sessions often take place outside ideal recording conditions, with background noise, interruptions, or low-quality microphones.

Limited training data for children’s speech

Few existing datasets capture the variety of children’s voices and speech errors needed to train accurate recognition models.

Solution Overview

We brought the vision to life through a multi-stage approach, from research and model selection to dataset creation and deployment. Each step addressed specific technical and data challenges while aligning with therapy needs. Using advanced speech recognition, phoneme-level training, and a custom child speech dataset, we built an AI system that delivers reliable, actionable feedback even in noisy, real-world environments.

Stage 1: Framing the Problem and Model Selection

The client’s vision was clear: automate pronunciation correction for children. But traditional speech recognition models weren’t designed for the nuances of kids’ speech, especially with accents, articulation variability, and noise.

What we did:

  • Identified phoneme recognition as the most effective approach for tracking mispronunciations.
  • Selected Wav2Vec 2.0, a self-supervised model known for handling raw audio and noisy data.
  • Proposed a fine-tuning strategy tailored to children’s speech patterns.
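As a rough illustration of this approach, phoneme-level transcription with Wav2Vec 2.0 can be sketched as below. The checkpoint name is an assumption — a public multilingual phoneme model standing in for the project's fine-tuned one — and the heavy dependencies are imported lazily inside the function:

```python
# Sketch only: phoneme transcription with a public multilingual Wav2Vec 2.0
# checkpoint as a stand-in for the production model fine-tuned on child speech.
SAMPLE_RATE = 16_000  # Wav2Vec 2.0 expects 16 kHz mono audio
MODEL_ID = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"  # assumed stand-in checkpoint

def transcribe_phonemes(waveform):
    """Map a 16 kHz mono waveform (array of floats) to a phoneme string."""
    # Imported lazily so the sketch can be read without the heavy dependencies.
    import torch
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    inputs = processor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)[0]
```

Because the model outputs phonemes rather than words, each mispronounced sound shows up directly in the transcription instead of being silently "corrected" by a word-level language model.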

This gave the project a clear path forward and the confidence that the goal was technically achievable.


Stage 2: Adapting Wav2Vec 2.0 for Children

Wav2Vec 2.0 had strong potential, but it needed careful fine-tuning. Children’s speech is less structured and more variable than adult audio, which makes direct transcription difficult.

Key steps:

  • Fine-tuned the model using Connectionist Temporal Classification (CTC) loss, aligning unsegmented audio with phoneme sequences.
  • Integrated a multilingual variant of Wav2Vec 2.0 to handle diverse phonetic inputs.
  • Added logic to handle non-linear, irregular speech patterns, making the model more robust.
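The alignment idea behind CTC can be shown with a toy greedy-decoding sketch: per-frame predictions are collapsed (repeats merged, blanks dropped) into a phoneme sequence, which is why no frame-by-frame labels are needed even when a child stretches or rushes a sound. The phoneme notation below is illustrative:

```python
# Toy illustration of the CTC decoding rule used after a Wav2Vec 2.0 + CTC model:
# merge repeated frame labels and drop the blank symbol to recover the phonemes.
BLANK = "_"

def ctc_collapse(frame_labels):
    """Collapse per-frame CTC outputs into an unsegmented phoneme sequence."""
    phonemes = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            phonemes.append(label)
        prev = label
    return phonemes

# A child stretching "cat" -> /k ae t/: each phoneme spans a variable number of frames.
frames = ["_", "k", "k", "_", "ae", "ae", "ae", "_", "_", "t", "t"]
print(ctc_collapse(frames))  # -> ['k', 'ae', 't']
```

The blank symbol also lets the model keep genuinely repeated phonemes apart (`a _ a` decodes to two `a`s), which matters for irregular child speech.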

The result was a fine-tuned model capable of understanding real-world, imperfect children's speech.


Stage 3: Building a Custom Dataset

Data availability was a critical roadblock. Existing datasets lacked both the quantity and diversity needed for children’s phoneme-level speech.

How we addressed it:

  • Built a custom dataset with recorded child speech across varied conditions.
  • Partnered with speech specialists for accurate phoneme-level annotation.
  • Created a validation pipeline to automate checks and reduce expert workload.
  • Applied feature engineering to capture markers of disordered or unclear speech.
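One automated check from such a validation pipeline might look like the following hypothetical sketch: flag clips whose duration cannot plausibly match the annotated phoneme count, so specialists only review suspicious items. The function name and thresholds are illustrative, not the project's actual implementation:

```python
# Hypothetical sanity check from an annotation validation pipeline:
# flag clips whose length is implausible for the number of annotated phonemes.
def flag_suspicious(clip_seconds, phoneme_labels,
                    min_phoneme_s=0.03, max_phoneme_s=0.6):
    """Return human-readable reasons why an annotation looks wrong (empty = OK)."""
    reasons = []
    if not phoneme_labels:
        reasons.append("no phoneme annotation")
        return reasons
    per_phoneme = clip_seconds / len(phoneme_labels)
    if per_phoneme < min_phoneme_s:
        reasons.append("too many phonemes for clip length")
    if per_phoneme > max_phoneme_s:
        reasons.append("clip much longer than annotation suggests")
    return reasons

print(flag_suspicious(0.1, ["k", "ae", "t", "s", "p", "r"]))  # flagged: too dense
print(flag_suspicious(0.8, ["k", "ae", "t"]))                 # passes: []
```

Cheap rule-based checks like this cut the volume of items experts must inspect by hand, which is the workload reduction the pipeline was built for.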

The dataset became the foundation that enabled real-world accuracy rather than just lab success.


Stage 4: Delivering the Solution

With the model trained and validated, we moved into deployment, focusing on usability, monitoring, and production-readiness.

Key results:

  • Achieved 86% accuracy, outperforming the 70% baseline of manual therapist assessment.
  • Delivered actionable phoneme-level feedback for children’s pronunciation.
  • Created a production-ready inference pipeline with built-in analytics.
  • Deployed a CI/CD system to streamline future updates and improvements.
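The phoneme-level feedback step can be illustrated with a small sketch: align the recognized phonemes against the target pronunciation and report each mismatch. The alignment here uses Python's standard-library `difflib.SequenceMatcher` as a stand-in; the phoneme notation and feedback wording are examples, not the product's actual output:

```python
# Illustrative sketch of phoneme-level feedback: align recognized phonemes
# against the target pronunciation and describe each mismatch.
from difflib import SequenceMatcher

def pronunciation_feedback(target, spoken):
    """Compare target vs recognized phoneme lists; return feedback strings."""
    feedback = []
    matcher = SequenceMatcher(a=target, b=spoken, autojunk=False)
    for op, t1, t2, s1, s2 in matcher.get_opcodes():
        if op == "replace":
            feedback.append(
                f"said {' '.join(spoken[s1:s2])} instead of {' '.join(target[t1:t2])}"
            )
        elif op == "delete":
            feedback.append(f"missed {' '.join(target[t1:t2])}")
        elif op == "insert":
            feedback.append(f"added extra {' '.join(spoken[s1:s2])}")
    return feedback

# "rabbit" /r ae b ih t/ spoken with a common /r/ -> /w/ substitution:
print(pronunciation_feedback(["r", "ae", "b", "ih", "t"],
                             ["w", "ae", "b", "ih", "t"]))
# -> ['said w instead of r']
```

Aggregating such mismatches per phoneme over many sessions is what turns raw recognition output into the progress analytics mentioned above.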

This collaboration demonstrated how AI can automate key parts of speech therapy, delivering accurate phoneme-level feedback to children while freeing therapists from manual assessment work.


Let’s Talk About Building AI That Actually Helps People

Whether you’re working on health tech, education, or human development – we’re here to bring AI into the real world with you. No buzzwords. No overpromising. Just a thoughtful conversation with our technical team to explore your idea.


