The Bizarre Google World: Bazel, ProtoBuf, and More

It was not easy at all to master MediaPipe. We thought little in C++ could surprise us. MP did. They say Google libraries do not work outside of Google. We can confirm this is the truth. The ways Google uses the C++ language are highly unusual from our point of view.

How is C++ normally used?

Normally (at least where we come from) people use CMake, a nice cross-platform build system, for C++ projects. Other somewhat common build systems for C++ include Autotools (aka configure+make, mostly Linux/Unix), qmake, and Visual Studio projects (Windows+Visual Studio only). These build systems are similar in the way they handle dependencies. Libraries needed by your projects are typically downloaded and installed system-wide, and are not attached to any particular project (as they are in the Java or JavaScript worlds). On Linux, macOS and MSYS2 you typically use the system package manager (e.g. ‘sudo apt install libopencv-dev’). For Windows+Visual Studio, you can use vcpkg. If a library is not in the package manager repo, you can download it by hand (as a binary), or, in the worst case, build it from source. By the way, in the latter case, we always install it in the user’s home directory on Linux (e.g. “/home/mickeymouse/opencv-cuda”); we never do “sudo make install”.

What is an installed C/C++ library (by ‘sudo apt install’ or otherwise)? It is a bunch of headers (.h or .hpp files) and one or more static (.a/.lib) or, more often, dynamic (.so/.dll) library files. In any case, an “installed library” is compiled once, then used as a binary, which is a good idea, since building a large library like OpenCV, FFMpeg or Boost from source takes significant time even on modern PCs. As a C++ developer, you rarely (if ever) have to build standard libraries from source.

But how do you use installed libraries in your C++ project? First, your project must find the libraries. CMake has a find_package() command for CMake packages, and pkg-config packages can be found by both CMake and Autotools projects on Linux. Things are a bit worse in Windows, but CMake find_package() still mostly works, if used properly.

How does MediaPipe use C++? Part 1.

MP logic is very different. MP does not use CMake. It uses a different build system called Bazel. We’ll tell you in a moment what it is. MP also has tons of dependencies. Namely:

  • Source downloaded from github (Non-google): Bazel-skylib, EasyExif, pybind11, Ceres
  • Source downloaded from github (Google): Abseil, GoogleTest, Benchmark, GLog, GFlags, Protobuf, libyuv, AudioTools, TensorFlow
  • The choice between building from source or using system libraries: OpenCV, ffmpeg

Below we will explain the “downloading and building from source” part. It is practically impossible to build MP in any other way (e.g. with CMake). Maybe a C++ professional could solve this, given time, but the sheer number of dependencies would make it very hard. Definitely not a project for beginners.

What is Bazel?

Bazel is a multi-language build system, which Google uses for many C++ projects, MP included. Probably there are production-related reasons for this, but for us (we are not Google professionals) our experience with Bazel was predominantly negative.

A Bazel project root directory has a file named WORKSPACE, which can be empty. What is a minimal Bazel project? It has an empty WORKSPACE file and a subdirectory fun1. This subdirectory contains a source file hello.cpp and a project file called BUILD:

load("@rules_cc//cc:defs.bzl", "cc_binary")

cc_binary(
    name = "hello",
    srcs = ["hello.cpp"],
)

Note that a project has only one WORKSPACE file, but it can have multiple BUILD files, usually as a hierarchical subdirectory structure. To build the target hello, type (in the project root):

bazel build //fun1:hello

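It builds the target and creates 4 directories in the project root, which are actually symbolic links into Bazel’s output tree outside the project (by default under ${HOME}/.cache/bazel on Linux; tricky!): bazel-bin, bazel-out, bazel-testlogs and bazel-hello (the last one is named after the project root directory, here hello). Or, if you want to build and run, type: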
bazel run //fun1:hello

How does Bazel treat dependencies? First, there are internal dependencies, i.e. other targets of the same project; these are not interesting. Second, there are external dependencies, both Bazel and non-Bazel. Bazel dependencies must be Bazel projects built from source. Non-Bazel dependencies can, in theory, be binary libraries, i.e. combinations of .h and .so files. All external dependencies must be listed in the WORKSPACE file.

Here the trouble starts. First, Bazel cannot look for CMake packages. It cannot even find pkg-config packages (we saw a library on GitHub which is supposed to do this, but it did not work for us, at least with OpenCV). We don’t think Bazel can even use the standard system paths for libraries and include files (on Linux): you must specify in WORKSPACE an exact path to each and every library and its headers. And even this is nontrivial. Just look at the third_party directory of the MediaPipe repo to see how ugly things can get.

The preferred way in the Bazel world (or at least for Google projects like MP) is to download each and every dependency as source code (and as a Bazel project), and include it as an external Bazel dependency. Bazel has a repository rule called http_archive() for downloading, but you still must supply a URL. No, there is no “Bazel code repo”; it’s not like Gradle for Java or pip for Python. Bazel does not manage any “packages”, it can only download stuff from the internet, which even CMake can do (with probably less boilerplate code).

And even such a model does not work properly, as Bazel does not understand a “dependency of a dependency”. Suppose your project P depends on library A, which in turn depends on B, C, D, E and F. Do you add only A as an external dependency of P? No, you must add A, B, C, D, E and F, or otherwise P will not build. And don’t forget that building all your dependencies from source takes time, to say the least, especially if your dependencies are large libraries like OpenCV.
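
What a package manager would do for you here is compute a transitive closure. The following Python sketch shows that computation on the hypothetical P/A/B..F graph from the paragraph above; it is not Bazel code, and the names DIRECT_DEPS and transitive_deps are ours, made up for illustration.

```python
# Hypothetical dependency graph from the example above: P -> A -> B..F.
DIRECT_DEPS = {
    "P": ["A"],
    "A": ["B", "C", "D", "E", "F"],
    "B": [], "C": [], "D": [], "E": [], "F": [],
}


def transitive_deps(target, graph):
    """Return every direct and indirect dependency of `target`, in discovery order."""
    seen = []
    stack = list(reversed(graph[target]))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.append(dep)
        # Visit the dependencies of this dependency as well.
        stack.extend(reversed(graph[dep]))
    return seen


if __name__ == "__main__":
    # Everything that has to be listed for P, although P only names A directly.
    print(transitive_deps("P", DIRECT_DEPS))
```

With a WORKSPACE file, this list is something you have to work out and write down by hand.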

Is there any reason for using Bazel in C++ projects? We did not see any. However, in production, it might be good to download all dependencies from the internet and not rely on the Linux version and APT package versions, for example. 

Another odd thing: suppose executable target A depends on a library target B. Then, if you build target A, Bazel compiles all source files (including the ones belonging to B) to .o, and links the executable A, but never actually links library B (as an .a or .so file). Only if you build target B explicitly, will the library be built.

Finally, how well is Bazel supported by IDEs? Our answer: not at all. A CLion plugin was announced, but it is incompatible with recent CLion versions. The VS Code plugin did not work either, giving very weird error messages (something about Android) while running on a Linux desktop. We don’t know enough Bazel or VS Code to fix it.

To summarize, while Bazel documentation says how great Bazel is, our impression is quite the opposite.

How does MediaPipe use C++? Part 2.

Disclaimer: When we say “impossible” in this chapter, it actually means “impossible, unless you are a highly skillful C++ professional ready to devote a lot of effort to the task”.

Google MediaPipe is a Bazel project. What does it mean? It means it cannot be installed with “sudo apt install libmediapipe-dev”. And it cannot be installed as a pre-built binary library (.h and .so files). Can you build it from the source? Again, the answer is no, at least if you want .h and .so files you can use in your project. So, for all practical purposes (see the disclaimer above), MP can be only used in Bazel C++ projects. Moreover, MP itself has to be built from the source.

How does MP handle dependencies? As we explained above, it downloads more than 10 dependencies from the internet as source Bazel projects. An exception is made only for OpenCV and FFMpeg, where you can choose between source and system libraries (in the latter case you must specify full paths). Can you use MP as an external Bazel dependency of your project? Basically no, or at least it is very hard (we saw an example on GitHub though). The reason is the “dependency of a dependency” issue: you will need to specify basically all MP dependencies in your project, and not only MP itself.

So the only way (at least for beginners) to use MP is to make your projects not only Bazel projects but parts of the MP project, located inside the mediapipe/ directory, just like MP examples. From our point of view, this is extremely ugly. And not using any IDE does not make coding in C++ any easier.

If this is not enough for you, there are many other ways MP complicates things unnecessarily. For example:

  1. You cannot build anything without the --define MEDIAPIPE_DISABLE_GPU=1 flag. The default is a GPU build that fails for rather obscure reasons.
  2. MP examples use the GLog logger a lot instead of cout, and will not work without the environment variable GLOG_logtostderr=1.
  3. The same examples require command-line arguments with paths to the graph, and will not work if called from a different directory.
  4. MP creates its own wrappers for OpenCV headers and other dependencies, instead of using these libraries as they are.

We promised the final verdict by the end of the series of articles, but actually, we can put it here: MediaPipe would be very nice, if not for Bazel. Bazel (and all related issues) makes you think twice before deciding to use MediaPipe in your C++ project. In particular, if something like GStreamer is suitable for you, it is a much better choice, as it does not require Bazel.

What about using non-C++ wrappers? As we explained before, writing custom calculators requires rebuilding MP from C++ sources. Once again, you will have to deal with Bazel, and also an additional complication of integrating Bazel with Python or Android or whatever.

Google Libraries

MP uses a lot of Google libraries and some non-google ones, which it builds from sources as Bazel projects. What are those libraries? A few Google examples:

  • TensorFlow: If you are reading this, you should know what it is 😉
  • GLog: A pretty standard logger, and probably the worst logger we have seen. By default, it logs to files in some obscure locations (instead of console), and it’s hard to override. 
  • GFlags: Google library for parsing command line arguments, and another reason why MP examples are so hard to read.
  • GTest: A well-known unit test library for C++.
  • Abseil: Google’s answer to Boost, and a “thousand useful things for C++” type of library. It can actually be installed with apt and used in CMake projects (but not the latest version). It can be pretty nice, but as far as we know, MP uses only the error codes from Abseil.
  • Protobuf: The only library we genuinely liked. We devote a whole section to it.

Google Protocol Buffers (Protobuf)

What is Protobuf? It is a cross-language and cross-platform library from Google for class definition and serialization. Where is it used? TensorFlow and MediaPipe and probably many other things. 

What does it all mean? Let’s do a simple example. Suppose we want to define a data type (or “message” in the Protobuf lingo) Hero in hero.proto:

syntax = "proto3"; // Language version: proto2, proto3
package goblin;  // Becomes C++ namespace
message Hero{
	string name = 1;
	int32 age = 2;
}

“Package” corresponds to a Python or Java package, or a C++ namespace. “proto3” is the language version; there are versions 2 and 3, and a .proto file written for one is not necessarily valid for the other. “=1” and “=2” are NOT default values but unique field numbers, and they are compulsory.

Next, we must compile the .proto file to the class definition of your language of choice. For C++, it is:

protoc --cpp_out=. hero.proto

It generates C++ files hero.pb.h and hero.pb.cc containing a C++ class Hero. It’s very important that Hero is not a “simple C++ data class of 2 fields”, but a monster class with lots of obscure methods that requires the Protobuf C++ library. However, it’s not a big problem, as Protobuf can be installed by APT and included in CMake projects easily. Then you can use this class in your own code, with getters and setters and such:

// Create a goblin::Hero object and set fields
goblin::Hero h1;
h1.set_name("Brianna");
h1.set_age(18);
// Can be copied by value (clone aka deep copy, expensive !)
goblin::Hero h2 = h1;
// Print it
cout << "h1: name=" << h1.name() << ", age=" << h1.age() << endl;
// Or like this
cout << h1.DebugString() << endl;

Classes like Hero (but not non-Protobuf classes) can be serialized in both binary and text formats. Such serialization is efficient, cross-language, cross-platform and immune to little/big-endian and 32/64-bit issues.

// Serialize to binary, then deserialize
string buf; // Here std::string is used for BINARY data !
bool ret = h1.SerializeToString(&buf);
goblin::Hero h2;
ret = h2.ParseFromString(buf);

// Serialize to text, then deserialize
// Requires #include <google/protobuf/text_format.h>
string buf;
bool ret = google::protobuf::TextFormat::PrintToString(h1, &buf);
goblin::Hero h2;
ret = google::protobuf::TextFormat::ParseFromString(buf, &h2);
// Text format looks like this:
//   name: "Brianna"
//   age: 18

The binary serialization is, well, binary, even if it is contained in an std::string. Why use Protobuf? We think its potential is enormous. TensorFlow uses it to serialize models (.pb files). MediaPipe uses text format to define graphs. And you can use it in your own projects. Every time you see JSON, XML, YAML, TOML and such, Protobuf would probably be better. Binary serialization is efficient, while text serialization is human-readable, and good for e.g. config files.
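
To make the binary format less mysterious, here is a Python sketch that hand-encodes our Hero message according to the documented Protobuf wire format: each field is a tag, (field number << 3) | wire type, followed by the payload, and integers are stored as base-128 varints. This is an illustration only and covers just these two field types; the function names are ours, and real code should use the classes generated by protoc. It also always writes both fields, while we would expect the real proto3 serializer to skip fields that hold default values.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_hero(name: str, age: int) -> bytes:
    """Hand-encode the Hero message: name is field 1, age is field 2."""
    data = name.encode("utf-8")
    out = bytearray()
    # Field 1, wire type 2 (length-delimited): tag, length, UTF-8 bytes.
    out += encode_varint((1 << 3) | 2) + encode_varint(len(data)) + data
    # Field 2, wire type 0 (varint): tag, value.
    out += encode_varint((2 << 3) | 0) + encode_varint(age)
    return bytes(out)


if __name__ == "__main__":
    # Print the encoded message as space-separated hex bytes.
    print(encode_hero("Brianna", 18).hex(" "))
```

The whole message takes 11 bytes. Note that the field numbers from hero.proto end up in the tags, which is why they are compulsory, and that a varint is written byte by byte in a fixed order, so the result does not depend on the endianness of the machine.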

Let’s now move to our next article and see how MediaPipe works in practice!

Down the Rabbit Hole: Our Journey to the Land of MediaPipe and Other Google Technologies

What is Google MediaPipe (MP) for Dummies?

In the ML/DL community you can often hear ”Nowadays you must know Google MediaPipe”, “It’s a cool framework”, and sometimes “It’s internally used by YouTube!” Videos with various computer vision tasks like this hand tracking often appear on LinkedIn and forums with the comment “This is MediaPipe”! At this point, we decided we could not ignore it anymore. So we packed our backpacks, said our goodbyes, and embarked on the journey to the Magical Land of MediaPipe and Google Technologies.

We quickly discovered that most people who praise MediaPipe on social media have no idea what it really is. “For Dummies” version: MediaPipe is a bunch of “solutions”, such as “Hand”, or “Face Mesh”. The table of all available solutions can be found here. As we can see, not all solutions are available for all platforms, although things are improving: this table nowadays has a few more checkmarks than it did half a year ago. But MediaPipe is not “solutions”. What is it really?

  • Fact #1: Google MediaPipe is a C++ library, other languages are wrappers around C++, with very limited functionality. If you want MediaPipe for real, you must use C++.
  • Fact #2: Google MediaPipe is a pipeline library. Look at the Wikipedia articles for Pipeline and related concepts of Dataflow- and Flow-Based Programming. Our previous blog post stressed the importance of pipelines for computer vision.

But what exactly is a pipeline? It is a number of Nodes organized as a Flow Graph. Data Packets (a data packet is a video frame, audio segment or some other data) run through the graph and are processed at the Nodes. Different nodes usually run on different CPU threads, so that they can utilize the available resources to the maximum. There are typically Buffers between nodes. For Real-Time Pipelines the buffers should have a limited capacity, and frames are lost if a buffer overflows. On the other hand, we want non-real-time pipelines (e.g. converting a VP9-encoded video file to H265) to be Deterministic: i.e. not-random, and with no frame loss.
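
The idea can be sketched in a few lines of plain Python. This is not the MediaPipe API; node, run_pipeline and offer are hypothetical names of ours. Each node runs in its own thread and nodes are connected by bounded buffers. run_pipeline is the deterministic variant (a full buffer blocks the producer, so nothing is lost), while offer shows the real-time policy of dropping a packet when the buffer is full. Python threads will not give you real CPU parallelism, so treat this purely as a picture of the structure.

```python
import queue
import threading

STOP = object()  # sentinel packet that tells a node to shut down


def node(func, inbox, outbox):
    """Run one pipeline node: read packets, process them, pass them on."""
    while True:
        packet = inbox.get()
        if packet is STOP:
            outbox.put(STOP)
            return
        outbox.put(func(packet))


def run_pipeline(packets, funcs, capacity=4):
    """Deterministic mode: a full buffer blocks the producer, no packet is lost."""
    # Bounded buffers in front of each node; the final sink is unbounded
    # so that the feeding thread can never deadlock.
    queues = [queue.Queue(maxsize=capacity) for _ in funcs] + [queue.Queue()]
    threads = [
        threading.Thread(target=node, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for packet in packets:
        queues[0].put(packet)  # blocks while the first buffer is full
    queues[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)


def offer(buffer, packet):
    """Real-time mode: drop the packet instead of waiting for free space."""
    try:
        buffer.put_nowait(packet)
        return True
    except queue.Full:
        return False


if __name__ == "__main__":
    # Two nodes: add one, then double.
    print(run_pipeline(range(10), [lambda x: x + 1, lambda x: x * 2]))
```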

  • Fact #3: MP can process arbitrary data types in pipelines, although it has special types for image and audio data.

But what about MP Solutions? What do they have to do with pipelines? MP Solutions are basically just pre-trained TensorFlow Lite (TF Lite) models under the hood. MP graphs add a few minor extra blocks to the raw inference, such as Non-Maximum Suppression and results visualization, sometimes also detection+tracking logic. But basically very little is added to TF Lite. So when you hear “MediaPipe is amazing, both fast and accurate” people are actually talking about TF Lite and particular pre-trained models. MP Solutions are rather trivial to use, and well-documented. We will not discuss them anymore.  

  • Fact #4: MP uses TFLite or TF models for deep learning (DL), but it is in no way limited to DL. MP solutions are pre-trained TFLite models with some rather elementary pre- or post-processing. For the sake of DL, “MediaPipe” and “TFLite” are basically the same thing.

Can you do something similar with your own pre-trained TF Lite (or TF) networks? In theory, yes. In practice, the choice of standard pipeline building blocks (called Calculators in MP) is rather limited. Basically, any TFLite model can be plugged into the standard TfLiteInferenceCalculator, but MP might lack building blocks for pre/post-processing if your task is different from the tasks in the solutions. It is possible to write your own calculators, but only in C++.

What is Our Interest in MediaPipe?

We were interested mostly in MP as a universal pipeline C++ framework, and not in “solutions”. We wanted to see if MP was suitable for writing custom computer vision (CV) pipelines in C++ (see the end of this article series for the final verdict). In the process, we experimented with core MP C++ API a lot and wrote a tutorial: https://github.com/agrechnev/first_steps_mediapipe.

Can you use MP in languages other than C++ and platforms other than desktop? For solutions, yes. Python, JavaScript, Android (Kotlin/Java) and iOS (Swift). But once again, all these things are just wrappers around the C++ library. Presumably, they can be also used for a custom graph composed of standard MP calculators. However, any custom calculator must be written in C++. Moreover, if you use any custom calculators, you must (as far as we know) rebuild MP from the source, including the respective wrapper (Python, JavaScript, etc.). You must be a fluent MP C++ user in order to do that! So, for all practical purposes, MP is a C++ library, the wrappers are a joke. With this explained, we are not going to discuss any languages other than C++ in MP.

How does MP compare to another well-known pipeline library, GStreamer? Let’s have a look:

| Feature | GStreamer | MediaPipe |
|---|---|---|
| Part of, year of birth | GNOME universe, 2001 | Google universe, ~2019 |
| Language | C (GObject) + wrappers | C++ + wrappers |
| Main purpose | Audio/Video conversion, filtering, resampling | Audio/Video processing, usually with Deep Learning |
| Standard A/V codecs | All you can think of: uses many plugins | Limited: OpenCV for video, FFMpeg for audio |
| Buffering, flow control | No buffering by default. Enable buffers by hand | Unlimited buffering by default. Enable flow control by hand |
| GPU, neural nets | Yes, with DeepStream + TensorRT, NVidia GPUs only | Yes, TensorFlow + TF Lite |
| Desktop use, docs | Easy, good | Hard, bad |
| Graph definition | C code (hard) or text string (limited) | ProtoBuf text string (easy) |

In the following sections, we present our experience of designing pipelines with MediaPipe C++.

It-Jim’s 2021 Summer Internship on Computer Vision: an Overview

Another summer, another edition of our internship on computer vision to be proud of! This time we received well over 100 applications from more than 20 cities including Kyiv, Kharkiv, Lviv, Dnipro, Odesa, Mykolaiv, Vinnytsia, Uzhhorod, Poltava, Kremenchuk, Sumy, Zaporizhzhia, Kryvyi Pih, and Mariupol. What an impressive geography! Only three of the applicants made it to the ‘finals’. Curious what projects they worked on under the mentorship of It-Jim’s engineers? Let’s find out!

The Fifth Edition of It-Jim’s Internships

But first, let’s look at some more numbers. After pre-screening the list of candidates, we reached out to 75 of them and asked them to complete a couple of test assignments. Although only 25 participants sent in their solutions, 15 did so well that they made it to the next step: a technical interview with our engineers. This stage is always a little harsh: we’ve come such a long way together, yet the number of places is always limited and the majority of the candidates, unfortunately, will not receive positive answers. Three of the participants have eventually become our summer interns, and another one even became our trainee (and later a junior CV engineer, but that is a whole different story).

So what were the computer vision tasks our interns were working on for 4 weeks?

Project Zoo

One of the reasons for interviewing prospective interns is to understand their strengths and weaknesses and subsequently provide them with a project that is doable, a little challenging, but certainly educational and broadening their skills. 

This summer’s list of projects included:

  • Soccer video analytics: creating an app to help assess a soccer player’s agility during practice by tracking the player and the ball and counting the number of kicks the player takes during drills. 
  • Liveness detection system: creating a solution that can detect when a facial verification system is being cheated by showing a photo of a person instead of a live face.
  • Traffic statistics estimation: creating an algorithm that counts the number of cars and pedestrians crossing a certain line on the road.

Interns’ Solutions

Soccer video analytics

Demo of automatic kick counting

  • Tools and technologies:  OpenCV, deep learning, TensorFlow Lite, Kotlin

Liveness detection system

Demo of a liveness detection system

  • Tools and technologies: feature crafting, deep learning

Traffic statistics estimation

Demo of traffic statistics estimation

  • Tools and technologies: OpenCV, image processing, object tracking

Summary

Our interns say this program is one of the best ways to get commercial experience and try your hand at being a computer vision engineer. If you’re still wondering if this is the right path for you, remember that you can always try it first. For example, by joining us next winter 2022 for a computer vision internship. We are looking forward to receiving your application!

Computer Vision in Healthcare

Want to know what stands behind remote photoplethysmography (rPPG) and how to non-invasively monitor vital parameters such as heart rate and respiration, oxygen saturation, and blood pressure using just a phone camera?

During the event, our CEO Ievgen Gorovyi will dive into the details of developing a computer vision-based solution for such a healthcare application.

📅 Join us on September 18 at 11:00 in Zoom meeting!

🎯 Participation is free by pre-registration 👉🏻 https://cutt.ly/mWT8uv0.

AI Ukraine Online Conference 2021

On October 30, the AI Ukraine Online Conference will take place. Since 2014, it has been gathering experts immersed in Data Science and Machine Learning.
Every year AI Ukraine brings together more than 900 participants from all over the world to build up expertise, share experience, and take another step toward the future of emerging technologies.
The conference will be held online. In addition to three thematic streams, it will consist of Q&A sessions, interactives, and networking.

Applied Computer Vision Course

After several very successful editions of internships and schools on computer vision and lots of interviews for CV/ML/DL engineers’ positions at our company, we are super excited to announce that we are launching our course in October 2021! 🚀

10 weeks, 20 lessons, each one being a mixture of theory, enhanced with mathematics essentials for computer vision, and practical workshops showcasing the methods learned. Unlike many other courses, we are going to focus not only on DL methods but also on classical CV algorithms.

Our #ACV course will be ideal for:

✅ experienced software developers who want to switch to CV/ML/DL domain,

✅ fourth- or fifth-year students of technical specialties with a passion for computer vision,

✅ data scientists with little background in computer vision aiming to change that,

✅ anyone looking for an extensive and solid base in computer vision.

More details 👉 edu.it-jim.com/. Contact Daryna Pesina should you have any questions.

Audio Processing Basics in Python

If you want to try some sound processing in Python (with neural network or otherwise) and don’t know where to start, then this article is for you. This post is for absolute beginners. 

What do we want? Basically 3 tasks.

  • Read and write audio files in different formats (WAV, MP3, WMA etc.).
  • Play the sound on your computer.
  • Represent the sound as a waveform, and process it: filter, resample, build spectrograms etc.

Intro

The sound is typically represented as a waveform: a float or integer (quantized) array representing sound signal A(t) over the discrete time variable t. It can have multiple channels for stereo, 5.1, etc.

Waveform, a typical representation of sound.

In Python, the waveform can be a numpy.ndarray or a similar format, e.g. torch.Tensor. Some libraries have their own waveform formats, which are usually easy to convert to numpy.ndarray if needed. The waveform has a sampling rate fs, the number of samples per second, e.g. 8, 16, 22.05, 44.1 or 48 kHz. The highest frequency the waveform can represent is fs/2 (the Nyquist frequency). A waveform is useless if you don’t know fs, thus fs must always accompany a waveform. Sound-processing algorithms often require a fixed fs, so if your input waveform has a different fs, you must resample it first, i.e. interpolate the signal A(t) to a different sample rate. Resampling can be done externally (using the ffmpeg command line tool or some other software), or internally in your code.
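
To show what “interpolate the signal A(t) to a different sample rate” means, here is a deliberately naive NumPy sketch. The function resample_linear is our own illustration, not part of any library mentioned here. It uses plain linear interpolation and no low-pass filter, so when downsampling, anything above the new fs/2 will alias; for real work use a proper resampler.

```python
import numpy as np


def resample_linear(y, sr_in, sr_out):
    """Resample a mono waveform by linear interpolation (no anti-aliasing filter)."""
    duration = len(y) / sr_in
    n_out = int(round(duration * sr_out))
    t_in = np.arange(len(y)) / sr_in    # timestamps of the input samples
    t_out = np.arange(n_out) / sr_out   # timestamps of the output samples
    return np.interp(t_out, t_in, y).astype(y.dtype)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                                 # one second
    y = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)   # 440 Hz tone
    y8 = resample_linear(y, sr, 8000)
    # Same duration, half the number of samples.
    print(y.shape, y8.shape)
```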

Most sound-processing libraries in Python (like almost everything in Python) are wrappers around C/C++ libraries. Sometimes installing a library with pip (or conda) is not enough: it requires installing additional stuff system-wide, like “sudo apt install libsndfile1” on Ubuntu. If something does not work, you can usually google an answer for your OS.

There are lots and lots of audio file formats. One must understand the difference between container, a file format that contains one or more audio (or video) tracks, e.g. OGG, and the codec of each track, e.g. Vorbis, a codec often used in OGG files. Very few libraries strive to support all (or nearly all) existing codecs and file formats. The prominent cross-platform examples are FFMpeg and GStreamer (and to some extent libSoX), which rely on multiple codec-specific libraries and plugins. Other libraries which work with sound typically have a very limited choice of supported formats, such as uncompressed WAV, or sometimes OGG. Because of that, uncompressed WAV is often used in sound-processing applications, especially neural networks. Upside: it loads faster, and no resources are wasted on decoding a codec. Downside: it takes much more hard disk space compared to MP3, OGG or WMA.
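
Uncompressed WAV is simple enough that Python’s standard library can handle it without any third-party package. The sketch below writes a short sine tone with the wave module and reads it back; write_sine_wav and read_wav are helper names of ours, and the code assumes mono 16-bit PCM throughout.

```python
import math
import os
import struct
import tempfile
import wave


def write_sine_wav(path, freq=440.0, sr=16000, seconds=0.5):
    """Write a mono 16-bit PCM WAV file containing a sine tone."""
    n = int(sr * seconds)
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sr))
        for i in range(n)
    ]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)    # mono
        f.setsampwidth(2)    # 2 bytes = 16 bits per sample
        f.setframerate(sr)   # fs is stored in the file header
        f.writeframes(struct.pack(f"<{n}h", *samples))
    return n


def read_wav(path):
    """Return (samples, sr) for a mono 16-bit PCM WAV file."""
    with wave.open(path, "rb") as f:
        sr = f.getframerate()
        n = f.getnframes()
        raw = f.readframes(n)
    return list(struct.unpack(f"<{n}h", raw)), sr


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "tone.wav")
    write_sine_wav(path)
    y, sr = read_wav(path)
    # Number of samples, sampling rate, file size in bytes.
    print(len(y), sr, os.path.getsize(path))
```

The file is little more than a header followed by the raw samples, which is why loading it is fast and why it takes so much disk space.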

Python Libraries for Audio Processing

Now let’s have a look at some particular Python libraries we tried.

Soundfile

A minimal library (based on sndfile C library, “sudo apt install libsndfile1”) for reading and writing uncompressed WAV files as numpy.ndarray plus fs waveforms. Code example:

import soundfile as sf
y, sr = sf.read('stella.wav')
print(y.shape, y.dtype, sr)
sf.write('out.wav', y, sr)

Librosa

This rather popular Python library has lots of sound processing functions, spectrograms and such. It can also read audio files, using soundfile and audioread under the hood. WAV and maybe OGG are supported, but not MP3 (in our tests it tried to load the file but failed). A waveform is represented as numpy.ndarray plus fs. Librosa cannot play the sound. The saving function has been removed in recent versions (if you see it in old code, replace it with sf.write()). File loading examples:


import librosa

# Keep the original fs of the file
y, sr = librosa.load('stella.wav', sr=None)
# Automatically resample to a desired fs
y, sr = librosa.load('stella.wav', sr=44100)
# Load the Nutcracker example
filename = librosa.example('nutcracker')
y, sr = librosa.load(filename, sr=None)

Visualize the waveform with matplotlib:

import matplotlib.pyplot as plt
import librosa.display
librosa.display.waveshow(y, sr=sr)  # sr is keyword-only in recent librosa
plt.show()

Or an STFT spectrogram in dB:

import numpy as np

d = librosa.stft(y)
s_db = librosa.amplitude_to_db(np.abs(d), ref=np.max)
librosa.display.specshow(s_db, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()
plt.show()
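
If you wonder what happens inside, here is a stripped-down NumPy sketch of an STFT spectrogram in dB. It is our own simplified illustration (the name stft_db is made up) and does not reproduce librosa’s defaults such as padding, centering or window choice. The signal is cut into overlapping frames, each frame is multiplied by a Hann window and transformed with an FFT, and the magnitudes are converted to dB relative to the loudest bin.

```python
import numpy as np


def stft_db(y, n_fft=256, hop=64):
    """Magnitude spectrogram in dB relative to the loudest bin."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)
    mag = np.maximum(mag, 1e-10)                 # avoid log(0)
    return 20.0 * np.log10(mag / mag.max())


if __name__ == "__main__":
    sr = 8000
    t = np.arange(sr) / sr
    y = np.sin(2 * np.pi * 1000.0 * t)           # 1 kHz test tone
    s_db = stft_db(y)
    peak_bin = int(s_db.mean(axis=1).argmax())
    # Frequency bin k corresponds to k * sr / n_fft Hz.
    print(s_db.shape, peak_bin * sr / 256, "Hz")
```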

SoundDevice

But how can we play the sound? The simplest option is SoundDevice, based on PortAudio. Note: this is for desktop Python; for Jupyter in a web browser there is the Jupyter-specific IPython.display.Audio class.

import librosa
import sounddevice as sd
y, sr = librosa.load('stella.wav', sr=None)
# This is mono playback, stereo is a bit trickier
sd.play(y, sr)
sd.wait()

PyDub

But what if we want to read or write MP3 or WMA? Then we have no choice but to move to heavyweight stuff. The most user-friendly option is probably PyDub, based on ffmpeg (‘sudo apt install ffmpeg’). PyDub has its own format for waveforms, called AudioSegment, which contains raw waveform, fs and other metadata. It can also play the sound (including stereo).

import pydub
import pydub.playback
a = pydub.AudioSegment.from_mp3('song.mp3')
pydub.playback.play(a)

AudioSegment a can be easily converted to numpy if needed. Let’s play this with SoundDevice:

import numpy as np
import sounddevice as sd

y = a.get_array_of_samples()
sr = a.frame_rate
# Returns array.array with interleaved left-right channels
# Convert to numpy and extract the left channel (assumes a stereo file)
y = np.array(y)[::2]
print(type(y), y.shape, y.dtype, sr)
# Convert int16 to float32 and normalize to [-1, 1)
y = y.astype('float32') / 32768
y -= y.mean()
# Play with SoundDevice
sd.play(y, sr)
sd.wait()
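
The snippet above keeps only one channel. To keep all of them, the interleaved samples can be reshaped into a (frames, channels) array. The sketch below is self-contained and uses a tiny hand-made array.array as a stand-in for the output of get_array_of_samples(); interleaved_to_float is a helper name of ours, and it assumes 16-bit samples.

```python
from array import array

import numpy as np


def interleaved_to_float(samples, channels):
    """Convert interleaved int16 samples to float32 of shape (frames, channels)."""
    y = np.asarray(samples, dtype=np.int16).reshape(-1, channels)
    return y.astype(np.float32) / 32768.0  # int16 range maps to [-1.0, 1.0)


if __name__ == "__main__":
    # Stand-in for AudioSegment.get_array_of_samples(): L0, R0, L1, R1
    raw = array("h", [0, 16384, -32768, 32767])
    y = interleaved_to_float(raw, channels=2)
    print(y.shape, y.dtype)
    print(y)
```

With a real AudioSegment you would pass a.channels as the second argument. As far as we know, sd.play() treats the columns of a two-dimensional array as channels, so the result should be playable as stereo directly.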

TorchAudio

If you are using PyTorch in your code, you might prefer to use TorchAudio for everything. It uses SoX (good) or SoundFile (uncompressed WAV only) backends. It keeps waveforms in torch.Tensor. Loading and saving files:

import torchaudio
y, sr = torchaudio.load('song.mp3')
print(type(y), y.shape, y.dtype, y.device)
print(sr)
torchaudio.save('out.wav', y, sr)

Play this with sd (one of the 2 channels):

sd.play(y.numpy()[0], sr)
sd.wait()

TorchAudio also has many things like spectrograms, implemented via PyTorch (gradients and GPUs are supported) and pre-trained neural networks in torchaudio.models.

Other libraries

There are many other audio libraries for Python, including Python wrappers of heavyweight C libraries FFMpeg, GStreamer and LibSoX.

Summary

Use the following libraries for the tasks:

  • Read and write uncompressed WAVs: Soundfile, Librosa, TorchAudio
  • Read and write other formats: PyDub, TorchAudio
  • Play sound on desktop: SoundDevice, PyDub
  • Classical audio processing: Librosa
  • Neural networks: TorchAudio