What is Google MediaPipe (MP) for Dummies?
In the ML/DL community you often hear “Nowadays you must know Google MediaPipe”, “It’s a cool framework”, and sometimes “It’s internally used by YouTube!” Videos of various computer vision tasks, like this hand tracking, often appear on LinkedIn and forums with the comment “This is MediaPipe!” At this point, we decided we could not ignore it anymore. So we packed our backpacks, said our goodbyes, and embarked on the journey to the Magical Land of MediaPipe and Google Technologies.
We quickly discovered that most people who praise MediaPipe on social media have no idea what it really is. “For Dummies” version: MediaPipe is a bunch of “solutions”, such as “Hand”, or “Face Mesh”. The table of all available solutions can be found here. As we can see, not all solutions are available for all platforms, although things are improving: this table nowadays has a few more checkmarks than it did half a year ago. But MediaPipe is not “solutions”. What is it really?
- Fact #1: Google MediaPipe is a C++ library. The bindings for other languages are wrappers around the C++ core, with very limited functionality. If you want MediaPipe for real, you must use C++.
- Fact #2: Google MediaPipe is a pipeline library. Look at the Wikipedia articles for Pipeline and related concepts of Dataflow- and Flow-Based Programming. Our previous blog post stressed the importance of pipelines for computer vision.
But what exactly is a pipeline? It is a number of Nodes organized as a Flow Graph. Data Packets (a data packet is a video frame, an audio segment or some other data) run through the graph and are processed at the Nodes. Different nodes usually run on different CPU threads, so that they can utilize the available resources to the maximum. There are typically Buffers between nodes. For Real-Time Pipelines the buffers should have a limited capacity, and frames are lost if a buffer overflows. On the other hand, we want non-real-time pipelines (e.g. converting a VP9-encoded video file to H.265) to be Deterministic, i.e. non-random and with no frame loss.
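To make the real-time buffering idea concrete, here is a minimal sketch in plain standard C++ of a bounded buffer between two nodes that drops the oldest packet on overflow. The `BoundedBuffer` class and its method names are our own hypothetical illustration, not part of MediaPipe or any other library.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical bounded buffer between two pipeline nodes (not a MediaPipe class).
// Real-time policy: when the buffer is full, the oldest packet is dropped.
template <typename T>
class BoundedBuffer {
 public:
  // A capacity of 0 makes no sense here, so it is clamped to 1.
  explicit BoundedBuffer(std::size_t capacity)
      : capacity_(capacity == 0 ? 1 : capacity) {}

  // Adds a packet. Returns false if an older packet was dropped to make room.
  bool Push(T packet) {
    std::lock_guard<std::mutex> lock(mutex_);
    bool dropped = false;
    if (queue_.size() >= capacity_) {
      queue_.pop_front();  // Overflow: lose the oldest frame.
      ++dropped_count_;
      dropped = true;
    }
    queue_.push_back(std::move(packet));
    return !dropped;
  }

  // Removes and returns the oldest packet, or std::nullopt if the buffer is empty.
  std::optional<T> Pop() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) return std::nullopt;
    T packet = std::move(queue_.front());
    queue_.pop_front();
    return packet;
  }

  // Number of packets lost to overflow so far.
  std::size_t dropped_count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return dropped_count_;
  }

 private:
  const std::size_t capacity_;
  mutable std::mutex mutex_;
  std::deque<T> queue_;
  std::size_t dropped_count_ = 0;
};
```

A producer node would call `Push` from its own thread and a consumer node would call `Pop` from another. A deterministic, non-real-time pipeline would instead make `Push` wait until there is room, so that no frame is ever lost.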
- Fact #3: MP can process arbitrary data types in pipelines, although it has special types for Image and Audio data.
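The idea of a packet that can carry any data type can be sketched with `std::any` from standard C++17. This is only an analogy under our own assumptions: the `Packet` class below, its `Make`, `Get` and `Holds` methods and the microsecond timestamp are hypothetical and are not MediaPipe's actual implementation.

```cpp
#include <any>
#include <cassert>
#include <cstdint>
#include <string>
#include <typeinfo>
#include <utility>

// Hypothetical type-erased packet (illustration only, not mediapipe::Packet).
// It stores a payload of any copyable type together with a timestamp.
class Packet {
 public:
  // Creates a packet holding `value`, stamped with `timestamp_us` microseconds.
  template <typename T>
  static Packet Make(T value, std::int64_t timestamp_us) {
    Packet p;
    p.payload_ = std::move(value);
    p.timestamp_us_ = timestamp_us;
    return p;
  }

  // Returns the payload. Throws std::bad_any_cast if T is not the stored type.
  template <typename T>
  const T& Get() const {
    return std::any_cast<const T&>(payload_);
  }

  // Checks whether the payload is exactly of type T.
  template <typename T>
  bool Holds() const {
    return payload_.type() == typeid(T);
  }

  std::int64_t timestamp_us() const { return timestamp_us_; }

 private:
  Packet() = default;
  std::any payload_;
  std::int64_t timestamp_us_ = 0;
};
```

A node that receives such a packet must know (or check with `Holds`) which type to expect, which is the price of letting arbitrary data flow through the same graph.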
But what about MP Solutions? What do they have to do with pipelines? MP Solutions are basically just pre-trained TensorFlow Lite (TF Lite) models under the hood. MP graphs add a few minor extra blocks to the raw inference, such as Non-Maximum Suppression and results visualization, and sometimes detection + tracking logic. But basically very little is added to TF Lite. So when you hear “MediaPipe is amazing, both fast and accurate”, people are actually talking about TF Lite and particular pre-trained models. MP Solutions are rather trivial to use and well documented. We will not discuss them any further.
- Fact #4: MP uses TF Lite or TF models for deep learning (DL), but it is in no way limited to DL. MP solutions are pre-trained TF Lite models with some rather elementary pre- or post-processing. As far as DL is concerned, “MediaPipe” and “TF Lite” are basically the same thing.
Can you do something similar with your own pre-trained TF Lite (or TF) networks? In theory, yes. In practice, the choice of standard pipeline building blocks (called Calculators in MP) is rather limited. Basically, any TF Lite model can be plugged into the standard TfLiteInferenceCalculator, but MP might lack building blocks for pre/post-processing if your task is different from the tasks in the solutions. It is possible to write your own calculators, but only in C++.
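As a rough sketch of what plugging a model into TfLiteInferenceCalculator looks like, the graph below is written in ProtoBuf text format and kept in a C++ string. The stream names and the model path are placeholders we made up, and the pre-processing that turns images into tensors is deliberately left out. In a real MediaPipe program such a string would typically be parsed with `mediapipe::ParseTextProtoOrDie<mediapipe::CalculatorGraphConfig>` and handed to a `mediapipe::CalculatorGraph`. The snippet itself does not depend on MediaPipe.

```cpp
#include <cassert>
#include <string>

// Text-format graph config with a single inference node.
// Assumptions: "input_tensors", "output_tensors" and the model path are
// placeholders chosen for this example only.
const std::string kGraphConfig = R"pb(
  # Graph-level input and output streams.
  input_stream: "input_tensors"
  output_stream: "output_tensors"

  # Runs a TF Lite model on incoming tensors.
  node {
    calculator: "TfLiteInferenceCalculator"
    input_stream: "TENSORS:input_tensors"
    output_stream: "TENSORS:output_tensors"
    node_options: {
      [type.googleapis.com/mediapipe.TfLiteInferenceCalculatorOptions] {
        model_path: "path/to/your_model.tflite"
      }
    }
  }
)pb";
```

Everything before and after this node (converting frames to tensors, decoding the output tensors into detections, landmarks and so on) is where the missing building blocks show up if your task differs from the ones in the solutions.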
What is Our Interest in MediaPipe?
We were interested mostly in MP as a universal C++ pipeline framework, and not in “solutions”. We wanted to see if MP was suitable for writing custom computer vision (CV) pipelines in C++ (see the end of this article series for the final verdict). In the process, we experimented a lot with the core MP C++ API and wrote a tutorial: https://github.com/agrechnev/first_steps_mediapipe.
How does MP compare to another well-known pipeline library, GStreamer? Let’s have a look:
|Feature|GStreamer|MediaPipe|
|---|---|---|
|Part of, year of birth|GNOME universe, 2001|Google universe, ~2019|
|Language|C (GObject) + wrappers|C++ + wrappers|
|Main purpose|Audio/video conversion, filtering, resampling|Audio/video processing, usually with Deep Learning|
|Standard A/V codecs|All you can think of: uses many plugins|Limited: OpenCV for video, FFmpeg for audio|
|Buffering, flow control|No buffering by default; enable buffers by hand|Unlimited buffering by default; enable flow control by hand|
|GPU, neural nets|Yes, with DeepStream + TensorRT, NVIDIA GPUs only|Yes, TensorFlow + TF Lite|
|Desktop use, docs|Easy, good|Hard, bad|
|Graph definition|C code (hard) or text string (limited)|ProtoBuf text string (easy)|
In the following sections, we present our experience of designing pipelines with MediaPipe C++.