By Ustym Khaburskyi, Junior CV/DL engineer

Text-to-speech (TTS) has been a popular topic for some time, and its development shows no signs of slowing down. There are a plethora of deep learning models, software programs, and companies offering this service. It’s no surprise, given the broad range of applications, from voice assistants and answering machines to creating audio versions of articles, books, and even automatic voiceovers for videos. Consequently, developers strive to improve the quality of these systems by creating more natural-sounding voices that are indistinguishable from human speech.

This blog aims to describe and compare the best open-source text-to-speech models and Python libraries that are easy to use. If you’re curious about how these models work, we’ll briefly explain the theory behind them and their structure. If you’re here to quickly learn how to use them in Python, feel free to skip ahead to the practical guide for the VITS model. This model provides fast, real-time text-to-speech synthesis with over 100 pre-trained voices. Alternatively, if you’re interested in good-quality voice cloning, the section about Tortoise TTS may pique your interest. Please note, however, that it is a slow model.

 

Text-to-speech explained

To begin with, let’s clarify what a text-to-speech model is. TTS, or speech synthesis, is a system that takes text as input and generates an audio signal from it. The primary goal of modern TTS is to make synthesized speech sound not only comprehensible but also natural. Naturalness, however, is inherently subjective, so it is evaluated with metrics such as the Mean Opinion Score (MOS). MOS is a subjective rating obtained from human listeners who evaluate speech samples generated by the TTS system. Listeners rate the samples on a scale of 1 to 5, where 1 represents poor quality and 5 represents excellent quality. The scores from multiple listeners are then averaged to determine the overall MOS for the TTS system. It’s worth noting that MOS is usually reported for models trained on LJSpeech, the most popular open and free-to-use dataset. Notably, even the original recordings from this dataset do not receive a perfect score of five.
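As a quick illustration of how a MOS value is obtained, here is a toy calculation with made-up ratings from five listeners:

# Toy MOS calculation: average the listeners' 1-5 ratings (the ratings are made up)
ratings = [4, 5, 4, 3, 5]
mos = sum(ratings) / len(ratings)
print(f'MOS = {mos:.2f}')  # MOS = 4.20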

Let’s take a deep dive into the theory of TTS models. Typically, modern deep learning TTS models consist of three essential components:

  • a text analysis module (also known as the frontend),
  • an acoustic model, and
  • a vocoder.

The text analysis module converts a text sequence into linguistic features. The acoustic model generates acoustic features from those linguistic features. Finally, the vocoder synthesizes a waveform from those acoustic features. Of course, there are many different types of each of these components, and if you want to dive deeper into the various acoustic and linguistic features and their representations, I highly recommend checking out this survey about TTS models. To give you a better idea of how the system works, look at the basic visualization of such a system in Fig. 1. Here, the text analysis module has phonemes as output, and the acoustic model produces a mel-spectrogram.

Fig 1. Basic components of a TTS model.

While the abstract diagram of the TTS model may appear straightforward, each of its components has a much more intricate structure. The field of TTS boasts numerous acoustic models and vocoders, including Tacotron2, FastSpeech2, Glow-TTS, VITS, HiFi-GAN, MelGAN, and WaveGlow, among others. It is worth noting that when two systems share the same acoustic features, such as the mel-spectrogram, it is sometimes possible to pair the acoustic model from one with the vocoder from another. Figure 2 illustrates the data flows of various TTS models available as of July 2021.

 

Fig. 2 The data flow from text to waveform and different models from the survey on TTS solutions.
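As a concrete example of such mixing, the CoquiTTS command-line tool used in the practical guide below (installed with pip install TTS) can pair an acoustic model with a different vocoder, provided they were trained on compatible acoustic features. The model identifiers below come from Coqui’s model zoo and may change between releases:

tts --text "Mixing an acoustic model with a different vocoder." --model_name "tts_models/en/ljspeech/glow-tts" --vocoder_name "vocoder_models/en/ljspeech/hifigan_v2" --out_path mixed_output.wav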

One example of an end-to-end TTS model that leverages this approach is VITS, which combines the Glow-TTS encoder with the HiFi-GAN vocoder. While the underlying theory of TTS still applies, VITS has demonstrated superior performance compared to other models: its MOS on the multi-speaker VCTK dataset reaches the level of the ground-truth recordings. If you’re interested in trying out this fast TTS model, read on for a practical guide on how to use it.

 

Voice Cloning

The field of voice cloning is a challenging subfield of TTS. Its goal is to generate speech that closely matches an audio reference voice. If you have a significant amount of audio data of the voice you want to clone, you can train any TTS model on it. However, a different approach is needed if you have only a few samples. Fortunately, there is one – Tortoise TTS, a model specifically designed for high-quality voice cloning.

Tortoise TTS consists of five separately trained neural networks that are pipelined together to produce the final output. Its components draw inspiration from image generation models such as DALL-E and denoising diffusion probabilistic models. The production of the final waveform has the following steps:

  • Text input and reference clips are fed to the autoregressive decoder that outputs latents and corresponding token codes representing highly-compressed audio data. This step is repeated several times to produce multiple “candidate” latents.
  • The CLVP (Contrastive Language-Voice Pretraining) and CVVP (Contrastive Voice-Voice Pretraining) models select the best candidate. The CLVP model produces a similarity score between the input text and each candidate code sequence, while the CVVP model produces a similarity score between the reference clips and each candidate. These two similarity scores are combined with a weighting provided by the user, and the candidate with the highest total similarity proceeds to the next step (see the short sketch after this list).
  • The diffusion decoder then consumes the autoregressive latents and reference clips to produce a mel-spectrogram representing the speech output.
  • Finally, a UnivNet vocoder is used to transform the mel-spectrogram into actual waveform data.
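To make the candidate-selection step more concrete, here is a minimal sketch of the weighting idea; it is not Tortoise’s actual code, and the scores and weight below are made up:

# Illustrative only: combine CLVP (text-candidate) and CVVP (voice-candidate) similarity
# scores with a user-chosen weight and keep the best candidate
clvp_scores = [0.81, 0.74, 0.90]  # made-up similarity of each candidate to the input text
cvvp_scores = [0.65, 0.88, 0.70]  # made-up similarity of each candidate to the reference clips
cvvp_weight = 0.3                 # hypothetical user-provided weighting

combined = [(1 - cvvp_weight) * clvp + cvvp_weight * cvvp
            for clvp, cvvp in zip(clvp_scores, cvvp_scores)]
best_candidate = combined.index(max(combined))
print(best_candidate, round(combined[best_candidate], 3))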

Tortoise TTS got its name for a good reason – the model is very slow, and running it on a GPU is highly recommended. However, its ability to produce high-quality voice cloning results from only a few reference audio samples makes it a valuable tool for many applications.

 

Practical Guide: VITS

In this practical guide, we will show you how to use the VITS model to synthesize speech with the CoquiTTS library in Python. This library contains:

  • acoustic models: Tacotron, Tacotron2, Glow-TTS, Speedy-Speech, Align-TTS, FastPitch, FastSpeech, SC-GlowTTS, Capacitron, OverFlow,
  • vocoders: MelGAN, MultiBandMelGAN, ParallelWaveGAN, GAN-TTS discriminators, WaveRNN, WaveGrad, HiFiGAN, UnivNet,
  • and two end-to-end TTS models: VITS, YourTTS (based on VITS).

In our experiments, VITS produces the most natural-sounding results and runs with a low real-time factor (RTF), even on a CPU.

If you are not planning to train the model yourself, getting started is as simple as installing the CoquiTTS library with pip:

pip install TTS

Once you have installed the library, synthesizing speech is straightforward. Here is an example code snippet that generates speech from a text input using the VITS model:

 
from TTS.api import TTS 

text = 'Hello, dear readers of the blog about TTS models!' 
model_name = 'tts_models/en/vctk/vits' 
speaker_id = 77 

tts = TTS(model_name) 
tts.tts_to_file(text=text, speaker=tts.speakers[speaker_id], file_path='output.wav') 
# tts.tts(text=text, speaker=tts.speakers[speaker_id]) 

 

In the code above, we first define the input text, model name, and speaker ID. We then create a TTS object using the VITS model and use the tts_to_file method to synthesize speech and save it to a file. If you want to work with the output as a signal instead of saving it to a file, you can use the tts method instead of tts_to_file; it returns the raw waveform. The sampling rate of the output is 22,050 Hz.
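If you prefer to handle the waveform yourself, here is a minimal sketch that reuses the tts object and variables from the snippet above and writes the file manually (it assumes the soundfile package is installed, which CoquiTTS itself does not require):

import numpy as np
import soundfile as sf  # assumption: installed separately with pip install soundfile

# tts() returns the raw waveform (a list of float samples) instead of writing a file
wav = tts.tts(text=text, speaker=tts.speakers[speaker_id])
sf.write('output_manual.wav', np.array(wav), 22050)  # VITS/VCTK output is 22,050 Hz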

Note that the first time you run this code, the model will be downloaded to your machine and then reused automatically on subsequent runs. You can try more than 100 voices from the VCTK dataset by choosing the corresponding speaker_id; there are both female and male voices with different accents.
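To see which speakers are available, you can inspect the speakers attribute of the loaded model:

# Print the number of available speakers and peek at the first few IDs
print(len(tts.speakers))
print(tts.speakers[:5])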

CoquiTTS repository: link

Listen to the result examples:

Male voice:

 

Female voice:

 

 

Practical Guide: Tortoise TTS

You can use Tortoise TTS in the official Google Colab. However, this practical guide will help if you want to run it locally. We will start by preparing the environment and the reference audio, then split the reference audio into small clips, and finally use Tortoise TTS to clone the voice. Let’s get started!

First, you need reference audio to clone the voice, so prepare an audio or video recording in which only one person is speaking. It can be 10 seconds or longer; we will split it into clips in the code.

Now, let’s create a separate conda environment and activate it; the environment name and Python version below are only a suggestion:
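conda create -n tortoise python=3.9
conda activate tortoise

Then clone and install the repository: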

git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python -m pip install -r ./requirements.txt
python setup.py install

In addition, install pydub, which will help us create audio clips from a video or audio file. Note that pydub relies on ffmpeg being installed to read formats other than WAV.

pip install pydub

Let’s import functions and libraries that we will use:

import torchaudio
import os
import math

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice
from pydub import AudioSegment

 

It is recommended to split the reference audio into small clips of 6-10 seconds. Here is a simple function that splits any audio longer than 12 seconds into roughly 6-second clips and saves them to the voices folder.

def prepare_voice(voice_name, path_to_original_voice):
    # Folder where Tortoise looks for custom voices
    custom_voice_folder = f"tortoise/voices/{voice_name}"
    os.makedirs(custom_voice_folder, exist_ok=True)

    full_audio = AudioSegment.from_file(path_to_original_voice)
    duration = len(full_audio) / 1000  # length in seconds

    # Short recordings are kept as a single clip
    if duration < 12:
        full_audio.export(os.path.join(custom_voice_folder, '0.wav'), format='wav')
        return

    # Split longer recordings into roughly 6-second clips
    amount_of_parts = math.floor(duration / 6)
    part_duration = len(full_audio) // amount_of_parts  # clip length in milliseconds

    for i in range(amount_of_parts):
        part = full_audio[i * part_duration:(i + 1) * part_duration]
        part.export(os.path.join(custom_voice_folder, f'{i}.wav'), format='wav')

 

Run this function with your parameters, where voice_name is the name of the new voice and path_to_original_voice is the path to the video or audio file with the voice you want to clone.
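For example (the voice name and file path below are placeholders; replace them with your own):

prepare_voice('amy_voice', 'reference/amy_interview.mp3')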

Finally, we can create the speech from text with voice cloning (don’t forget to change the voice_name):

voice_name = 'custom'  # the name you passed to prepare_voice
text = 'Hello, dear readers of the blog about TTS models!'
output_path = 'output.wav'

tts = TextToSpeech()
preset = "high_quality"  # other presets ('ultra_fast', 'fast', 'standard') trade quality for speed

voice_samples, conditioning_latents = load_voice(voice_name)
audio_signal = tts.tts_with_preset(text, voice_samples=voice_samples, 
                                   conditioning_latents=conditioning_latents, 
                                   preset=preset)

torchaudio.save(output_path, audio_signal.squeeze(0).cpu(), 24000)

 

And that’s it! With these simple steps, you can use TortoiseTTS to clone a voice from reference audio.

Official repository: link

Listen to the result examples (each took approximately 30 minutes to run on a local GPU):

Voice of Amy Winehouse, cloned from a 30-second clip of this interview.

 

Voice of Jordan Peterson, cloned from a 30-second cut of this video.

 

Conclusions

This article highlights the accessibility and ease of use of modern TTS systems. Whether you’re a developer looking to integrate TTS into your projects or a content creator seeking to produce audio content, there are a variety of TTS tools available to suit your needs. From the VITS model in the CoquiTTS library to the slower but higher-quality Tortoise TTS, TTS technology has come a long way and continues to evolve, promising exciting advancements in the future.

 
