Text-to-speech (TTS) has been a popular topic for some time, and its development shows no signs of slowing down. There is a plethora of deep learning models, software programs, and companies offering this service. It’s no surprise, given the broad range of applications: voice assistants, answering machines, audio versions of articles and books, and even automatic voiceovers for videos.
In the previous article, we learned what GStreamer is and looked at its most common use cases. Now it’s time to start coding in C++. This tutorial complements the official GStreamer tutorials rather than replacing them. Here we focus on using appsrc and appsink for custom video (or audio) processing in C++ code. In such situations, GStreamer is used mainly for encoding and decoding various audio and video formats.
You might have heard of something called “GStreamer”. I know what you’re thinking: this is some old and boring geek-and-nerd stuff from Linux, right? But what is it? What is GStreamer used for? If we want to do computer vision or audio (speech, music) processing, can GStreamer help us? In this article, I’ll try to answer these questions. This article is beginner-level and assumes little or no previous experience with GStreamer.
If you’ve ever come close to anything related to audio or other signal processing, you likely already know about spectrograms. Those fancy-looking and usually colorful plots are commonly used to show how a signal’s frequency spectrum changes over time.
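To make the idea of "a spectrum changing over time" concrete, here is a minimal sketch of a magnitude spectrogram in plain Python. It is an illustration under stated assumptions, not the method from the article: the function name `spectrogram`, the frame size, the hop size, the Hann window, and the two-tone test signal are all choices made for this example. It uses a naive DFT so that it runs without dependencies; real code would more likely use an FFT routine such as `numpy.fft.rfft`.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame holds frame_size // 2 + 1 magnitudes."""
    # Hann window (an assumed choice) to reduce leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        # Naive DFT over the non-negative frequency bins only.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Hypothetical test signal: 1000 Hz for the first half, 2000 Hz for the second.
rate = 8000
n_total = 1024
signal = [math.sin(2 * math.pi * (1000 if i < n_total // 2 else 2000) * i / rate)
          for i in range(n_total)]

spec = spectrogram(signal)

# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is the spectrum of one short slice of the signal, so reading the rows in order shows the dominant frequency moving from one bin to another. With these settings each bin is `rate / frame_size` = 125 Hz wide. A plotted spectrogram is this same table drawn as an image, with time on one axis, frequency on the other, and magnitude as color.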
If you want to try some sound processing in Python (with neural networks or otherwise) and don’t know where to start, then this article is for you. This post is for absolute beginners. What do we want? Basically, three tasks: read and write audio files in different formats (WAV, MP3, WMA, etc.), play the sound on your computer, and represent the sound as a waveform and process it (filter, resample, build spectrograms, etc.).
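As a taste of the first task, here is a minimal sketch that writes a sine tone to a WAV file and reads it back using only Python’s standard library. It is an assumption-laden illustration rather than the approach the article itself takes: the helper names `make_tone`, `write_wav`, and `read_wav`, the 8000 Hz sample rate, and the file name are all made up for this example, and it handles only mono 16-bit PCM. Compressed formats such as MP3 or WMA need third-party libraries.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 8000  # samples per second (assumed value for this sketch)


def make_tone(freq_hz, duration_s, rate=SAMPLE_RATE, amplitude=0.5):
    """Return a sine tone as a list of floats in [-1, 1]."""
    n = int(rate * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(n)]


def write_wav(path, samples, rate=SAMPLE_RATE):
    """Write mono float samples as 16-bit PCM."""
    # Clip to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16 bits per sample
        f.setframerate(rate)
        f.writeframes(struct.pack("<%dh" % len(pcm), *pcm))


def read_wav(path):
    """Read a mono 16-bit PCM WAV file; return (float samples, sample rate)."""
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1 or f.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        rate = f.getframerate()
        raw = f.readframes(f.getnframes())
    pcm = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [v / 32767 for v in pcm], rate


path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
tone = make_tone(440.0, 0.25)
write_wav(path, tone)
loaded, rate = read_wav(path)
```

After the round trip, `loaded` is the waveform as a plain list of floats, which is the form that filtering, resampling, or building a spectrogram would start from. The values differ from `tone` only by 16-bit quantization error.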