If you want to try sound processing in Python (with neural networks or otherwise) and don’t know where to start, this article is for you. It is aimed at absolute beginners.
What do we want? Basically, three tasks:
- Read and write audio files in different formats (WAV, MP3, WMA etc.).
- Play the sound on your computer.
- Represent the sound as a waveform, and process it: filter, resample, build spectrograms etc.
Sound is typically represented as a waveform: an array of floats or (quantized) integers holding the sound signal A(t) sampled at discrete times t. It can have multiple channels, e.g. for stereo or 5.1.
Waveform, a typical representation of sound.
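To make the idea concrete, here is a minimal sketch that builds a waveform by hand with NumPy. The tone frequency, duration and amplitudes are arbitrary values chosen for illustration, and the (samples, channels) layout for stereo is just one common convention (some libraries use channels first).

```python
import numpy as np

fs = 16000        # sampling rate in Hz (samples per second), illustrative value
duration = 0.5    # length of the signal in seconds
freq = 440.0      # tone frequency in Hz

# Discrete time axis: t[n] = n / fs
t = np.arange(int(fs * duration)) / fs

# Float waveform A(t) with values inside [-1, 1]
mono = (0.5 * np.sin(2 * np.pi * freq * t)).astype(np.float32)

# The same signal quantized to 16-bit integers
mono_int16 = np.round(mono * 32767).astype(np.int16)

# Two-channel (stereo) waveform, shape (samples, channels);
# the second channel is simply a quieter copy of the first
stereo = np.stack([mono, 0.5 * mono], axis=1)

print(mono.shape, mono.dtype)
print(mono_int16.dtype, mono_int16.min(), mono_int16.max())
print(stereo.shape)
```

Note that the array alone does not say how fast to play it back: the same samples played at a different rate would give a different pitch and duration.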
In Python, the waveform can be a numpy.ndarray or a similar type, e.g. torch.Tensor. Some libraries have their own waveform formats, which are usually easy to convert to numpy.ndarray if needed. A waveform has a sampling rate fs, the number of samples per second, e.g. 8 kHz, 16 kHz, 22.05 kHz, 44.1 kHz or 48 kHz. The highest frequency a waveform can represent is fs/2 (the Nyquist frequency). A waveform is useless if you don’t know its fs, so fs must always accompany it. Sound-processing algorithms often require a fixed fs, so if your input waveform has a different fs, you must resample it first, i.e. interpolate the signal A(t) to the new sample rate. Resampling can be done externally (using the ffmpeg command-line tool or some other software) or internally in your code.
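As a rough illustration of what "interpolate the signal to a different sample rate" means, the sketch below resamples a mono waveform with plain linear interpolation. The helper name resample_linear is hypothetical, and this is deliberately naive: real resamplers use band-limited filtering to avoid aliasing, so treat it as a teaching aid rather than something to use on real audio.

```python
import numpy as np

def resample_linear(y, fs_in, fs_out):
    """Resample a mono waveform by linear interpolation (illustrative only).

    No anti-aliasing filter is applied, so content above fs_out / 2
    would alias when downsampling.
    """
    n_out = int(round(len(y) * fs_out / fs_in))
    t_in = np.arange(len(y)) / fs_in     # time stamps of the input samples
    t_out = np.arange(n_out) / fs_out    # time stamps of the output samples
    return np.interp(t_out, t_in, y)

fs_in, fs_out = 44100, 16000
t = np.arange(fs_in) / fs_in             # one second of signal
y = np.sin(2 * np.pi * 440.0 * t)        # 440 Hz tone, well below both Nyquist limits

y16 = resample_linear(y, fs_in, fs_out)
print(len(y), len(y16))
print("Highest representable frequency before/after:", fs_in / 2, fs_out / 2)
```

The waveform gets shorter in samples but still lasts one second, because fs changed along with it.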
Most sound-processing libraries in Python (like almost everything in Python) are wrappers around C/C++ libraries. Sometimes installing a library with pip (or conda) is not enough: it also requires installing additional packages system-wide, e.g. “sudo apt install libsndfile1” on Ubuntu. If something does not work, you can usually google an answer for your OS.
There are lots and lots of audio file formats. One must understand the difference between a container, a file format that holds one or more audio (or video) tracks, e.g. OGG, and the codec of each track, e.g. Vorbis, a codec often used in OGG files. Very few libraries strive to support all (or nearly all) existing codecs and file formats. The prominent cross-platform examples are FFmpeg and GStreamer (and to some extent libSoX), which rely on multiple codec-specific libraries and plugins. Other libraries that work with sound typically support only a very limited choice of formats, such as uncompressed WAV, or sometimes OGG. Because of that, uncompressed WAV is often used in sound-processing applications, especially with neural networks. Upside: it loads faster, and no resources are spent on decoding. Downside: it takes much more disk space than MP3, OGG or WMA.
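The disk-space downside is easy to estimate: uncompressed PCM needs fs × channels × bytes per sample for every second of audio. The sketch below checks this with Python’s built-in wave module by writing ten seconds of silence to an in-memory WAV. The 128 kbit/s MP3 bitrate used for comparison is only an assumed, typical value, not a measurement.

```python
import io
import wave

fs = 44100          # samples per second
channels = 2        # stereo
sample_width = 2    # bytes per sample (16-bit PCM)
seconds = 10

# Expected size of the raw audio data in bytes
payload = fs * channels * sample_width * seconds

# Write that much silence into an in-memory WAV file
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(channels)
    w.setsampwidth(sample_width)
    w.setframerate(fs)
    w.writeframes(bytes(payload))   # bytes(n) creates n zero bytes

wav_size = len(buf.getvalue())

# Rough size of the same duration as a constant-bitrate MP3
# (128 kbit/s is an assumption for illustration)
mp3_estimate = 128_000 // 8 * seconds

print("WAV bytes:", wav_size, "of which header:", wav_size - payload)
print("MP3 estimate:", mp3_estimate, "ratio:", round(wav_size / mp3_estimate, 1))
```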
Now let’s have a look at some particular Python libraries we tried.
SoundFile is a minimal library (based on the libsndfile C library, “sudo apt install libsndfile1”) for reading and writing uncompressed WAV files as numpy.ndarray plus fs waveforms. Code example:
import soundfile as sf

y, sr = sf.read('stella.wav')
print(y.shape, y.dtype, sr)
sf.write('out.wav', y, sr)
Librosa is a rather popular Python library with lots of sound processing, spectrograms and such. It can also read audio files using soundfile and audioread. WAV and maybe OGG are supported; in our tests MP3 was not (it tries to load the file but fails). A waveform is represented as numpy.ndarray plus fs. Librosa cannot play sound. The saving function has been removed in recent versions (if you see it in old code, replace it with sf.write()). File loading examples:
import librosa

# Keep the fs of the file
y, sr = librosa.load('stella.wav', sr=None)
# Automatically resample to a desired fs
y, sr = librosa.load('stella.wav', sr=44100)
# Load the Nutcracker example
filename = librosa.example('nutcracker')
y, sr = librosa.load(filename, sr=None)
Visualize the waveform with matplotlib:
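import librosa.display
import matplotlib.pyplot as plt

# sr must be passed by keyword in recent librosa versions
librosa.display.waveshow(y, sr=sr)
plt.show()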
Or an STFT spectrogram in dB:
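import numpy as np

# Complex STFT, then magnitude in dB relative to the peak
d = librosa.stft(y)
s_db = librosa.amplitude_to_db(np.abs(d), ref=np.max)
# Passing sr and axis types gives labelled time/frequency axes
librosa.display.specshow(s_db, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()
plt.show()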
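For readers curious what such a spectrogram actually contains, here is a simplified NumPy-only sketch of an STFT in dB. It is not Librosa’s implementation: the function name stft_db, the frame size and the hop size are choices made for this example, and Librosa additionally pads and centers the frames and limits the dynamic range by default, so its output shapes and values will differ.

```python
import numpy as np

def stft_db(y, n_fft=512, hop=128):
    """Magnitude STFT in dB relative to the peak, using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    # Cut the signal into overlapping windowed frames: (frames, n_fft)
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # FFT of each frame, then transpose to (freq_bins, frames)
    mag = np.abs(np.fft.rfft(frames, axis=1)).T
    mag = np.maximum(mag, 1e-10)          # avoid log of zero
    return 20.0 * np.log10(mag / mag.max())

fs = 16000
t = np.arange(fs) / fs                    # one second
y = np.sin(2 * np.pi * 1000.0 * t)        # pure 1 kHz test tone

s_db = stft_db(y)
# Frequency bin with the most energy, converted back to Hz
peak_bin = int(s_db.mean(axis=1).argmax())
print(s_db.shape, peak_bin * fs / 512)
```

Each column is the spectrum of one short frame, and each row is a frequency band of width fs / n_fft, so a steady tone shows up as a horizontal line.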
But how can we play the sound? The simplest option is SoundDevice, based on PortAudio. Note: this is for desktop Python; for Jupyter in a web browser there is the Jupyter-specific IPython.display.Audio() function.
import librosa
import sounddevice as sd

y, sr = librosa.load('stella.wav', sr=None)
# This is mono playback, stereo is a bit trickier
sd.play(y, sr)
sd.wait()
But what if we want to read or write MP3 or WMA? Then we have no choice but to move to heavyweight stuff. The most user-friendly option is probably PyDub, based on ffmpeg (‘sudo apt install ffmpeg’). PyDub has its own format for waveforms, called AudioSegment, which contains raw waveform, fs and other metadata. It can also play the sound (including stereo).
import pydub
import pydub.playback

a = pydub.AudioSegment.from_mp3('song.mp3')
pydub.playback.play(a)
AudioSegment a can be easily converted to numpy if needed. Let’s play this with SoundDevice:
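import numpy as np
import sounddevice as sd

# get_array_of_samples() returns array.array with interleaved channels
y = np.array(a.get_array_of_samples())
sr = a.frame_rate
# Keep only the first channel (step 1 for mono, 2 for stereo)
y = y[::a.channels]
print(type(y), y.shape, y.dtype, sr)
# Convert integer samples to float32 in [-1, 1) (divides by 32768 for 16-bit audio)
y = y.astype('float32') / (1 << (8 * a.sample_width - 1))
# Remove the DC offset
y -= y.mean()
# Play with SoundDevice
sd.play(y, sr)
sd.wait()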
If you are using PyTorch in your code, you might prefer to use TorchAudio for everything. It uses SoX (good) or SoundFile (uncompressed WAV only) backends. It keeps waveforms in torch.Tensor. Loading and saving files:
import torchaudio

y, sr = torchaudio.load('song.mp3')
print(type(y), y.shape, y.dtype, y.device)
print(sr)
torchaudio.save('out.wav', y, sr)
Play one of the two channels with SoundDevice:
# TorchAudio tensors have shape (channels, samples); take the first channel
sd.play(y[0].numpy(), sr)
sd.wait()
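To play both channels, keep in mind that TorchAudio returns tensors shaped (channels, samples), while SoundDevice expects arrays shaped (samples, channels). A small conversion helper could look like the sketch below. The name to_frames_first is hypothetical, and a zero-filled NumPy array stands in for y.numpy() so that the example runs without any audio file or sound card.

```python
import numpy as np

def to_frames_first(y):
    """Convert (channels, samples) to (samples, channels).

    One-dimensional (mono) input is returned unchanged.
    """
    y = np.asarray(y)
    if y.ndim == 1:
        return y
    return np.ascontiguousarray(y.T)

# Stand-in for y.numpy() after loading a one-second stereo file at 48 kHz
fake_stereo = np.zeros((2, 48000), dtype=np.float32)

out = to_frames_first(fake_stereo)
print(out.shape)

# With SoundDevice installed, playback would then be:
#   sd.play(out, sr)
#   sd.wait()
```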
TorchAudio also has many things like spectrograms, implemented via PyTorch (gradients and GPUs are supported) and pre-trained neural networks in torchaudio.models.
There are many other audio libraries for Python, including Python wrappers of the heavyweight C libraries FFmpeg, GStreamer and libSoX.
Use the following libraries for the tasks:
- Read and write uncompressed WAVs: SoundFile, Librosa, TorchAudio
- Read and write other formats: PyDub, TorchAudio
- Play sound on desktop: SoundDevice, PyDub
- Classical audio processing: Librosa
- Neural networks: TorchAudio