By Oleksiy Grechnyev, CV/ML engineer @It-Jim

What is Google MediaPipe (MP) for Dummies?

In the ML/DL community you can often hear “Nowadays you must know Google MediaPipe”, “It’s a cool framework”, and sometimes “It’s internally used by YouTube!” Videos of various computer vision tasks, such as hand tracking, often appear on LinkedIn and forums with the comment “This is MediaPipe!” At this point, we decided we could not ignore it anymore. So we packed our backpacks, said our goodbyes, and embarked on the journey to the Magical Land of MediaPipe and Google Technologies.

We quickly discovered that most people who praise MediaPipe on social media have no idea what it really is. The “For Dummies” version: MediaPipe is a bunch of “solutions”, such as “Hands” or “Face Mesh”. The MediaPipe documentation has a table of all available solutions. Not all solutions are available for all platforms, although things are improving: this table nowadays has a few more checkmarks than it did half a year ago. But MediaPipe is not just “solutions”. What is it really?

  • Fact #1: Google MediaPipe is a C++ library; the other languages are wrappers around C++ with very limited functionality. If you want MediaPipe for real, you must use C++.
  • Fact #2: Google MediaPipe is a pipeline library. Look at the Wikipedia articles for Pipeline and related concepts of Dataflow- and Flow-Based Programming. Our previous blog post stressed the importance of pipelines for computer vision.

But what exactly is a pipeline? It is a number of Nodes organized as a Flow Graph. Data Packets (a data packet is a video frame, an audio segment, or some other piece of data) run through the graph and are processed at the Nodes. Different nodes usually run on different CPU threads, so that they can make maximum use of the available resources. There are typically Buffers between nodes. For Real-Time Pipelines the buffers should have a limited capacity, and frames are lost if a buffer overflows. On the other hand, we want non-real-time pipelines (e.g. converting a VP9-encoded video file to H.265) to be Deterministic, i.e. non-random and with no frame loss.
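The difference between a real-time and a deterministic pipeline can be sketched with plain standard C++. This is a toy illustration, not MediaPipe code: `BoundedBuffer`, `Overflow`, `RunStats` and `RunPipeline` are hypothetical names invented for this sketch. A source node and a sink node run on separate threads with a bounded buffer between them; the buffer either drops packets when full (real-time) or makes the source wait (deterministic).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// What happens when a node pushes into a full buffer.
enum class Overflow { kDrop, kBlock };

// A bounded, thread-safe queue sitting between two pipeline nodes.
// Capacity must be at least 1.
template <typename T>
class BoundedBuffer {
 public:
  BoundedBuffer(std::size_t capacity, Overflow policy)
      : capacity_(capacity), policy_(policy) {}

  // Returns false if the packet was dropped (only possible with kDrop).
  bool Push(T packet) {
    std::unique_lock<std::mutex> lock(mu_);
    if (queue_.size() >= capacity_) {
      if (policy_ == Overflow::kDrop) return false;  // real-time: lose it
      not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
    }
    queue_.push_back(std::move(packet));
    not_empty_.notify_one();
    return true;
  }

  // Signals that no more packets will arrive.
  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    not_empty_.notify_all();
  }

  // Waits for a packet; returns false once the buffer is closed and empty.
  bool Pop(T& out) {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [this] { return !queue_.empty() || closed_; });
    if (queue_.empty()) return false;
    out = std::move(queue_.front());
    queue_.pop_front();
    not_full_.notify_one();
    return true;
  }

 private:
  const std::size_t capacity_;
  const Overflow policy_;
  std::mutex mu_;
  std::condition_variable not_empty_, not_full_;
  std::deque<T> queue_;
  bool closed_ = false;
};

struct RunStats {
  std::vector<int> received;  // packets that reached the sink, in order
  int dropped = 0;            // packets lost to buffer overflow
};

// Source node -> buffer -> sink node, each node on its own thread.
RunStats RunPipeline(int num_frames, std::size_t capacity, Overflow policy) {
  BoundedBuffer<int> buffer(capacity, policy);
  RunStats stats;
  std::thread source([&] {
    for (int frame = 0; frame < num_frames; ++frame) {
      if (!buffer.Push(frame)) ++stats.dropped;
    }
    buffer.Close();
  });
  std::thread sink([&] {
    int frame = 0;
    while (buffer.Pop(frame)) {
      stats.received.push_back(frame * 2);  // stand-in for real processing
    }
  });
  source.join();
  sink.join();
  return stats;
}
```

Calling `RunPipeline(100, 4, Overflow::kBlock)` from `main` delivers every frame in order, while `Overflow::kDrop` may lose some: how many depends on thread timing, which is exactly the non-determinism a real-time pipeline accepts.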

  • Fact #3: MP can process arbitrary data types in pipelines, although it has special types for Image and Audio data.
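To give a rough idea of how a single packet type can carry arbitrary data, here is a minimal type-erasure sketch in standard C++. It is only an illustration of the concept: `ToyPacket`, `Make`, `Get` and `timestamp_us` are hypothetical names for this sketch and are not the MediaPipe API.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// A toy packet: an immutable payload of any type plus a timestamp.
class ToyPacket {
 public:
  // Wraps a value of any copyable or movable type T.
  template <typename T>
  static ToyPacket Make(T value, std::int64_t timestamp_us) {
    ToyPacket p;
    p.data_ = std::make_shared<T>(std::move(value));
    p.type_ = std::type_index(typeid(T));
    p.timestamp_us_ = timestamp_us;
    return p;
  }

  // Returns the payload; throws if T is not the stored type.
  template <typename T>
  const T& Get() const {
    if (type_ != std::type_index(typeid(T))) {
      throw std::runtime_error("ToyPacket: wrong payload type");
    }
    return *static_cast<const T*>(data_.get());
  }

  std::int64_t timestamp_us() const { return timestamp_us_; }

 private:
  ToyPacket() : type_(typeid(void)) {}

  std::shared_ptr<const void> data_;  // type-erased, shared payload
  std::type_index type_;              // remembers the real payload type
  std::int64_t timestamp_us_ = 0;
};
```

For example, `ToyPacket::Make(std::string("frame-0"), 0)` and `ToyPacket::Make(3.5f, 33333)` have the same C++ type, so both can travel through the same buffers; the receiving node asks for the payload type it expects with `Get<std::string>()` or `Get<float>()`.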

But what about MP Solutions? What do they have to do with pipelines? Under the hood, MP Solutions are basically just pre-trained TensorFlow Lite (TF Lite) models. MP graphs add a few minor extra blocks to the raw inference, such as Non-Maximum Suppression and results visualization, and sometimes also detection+tracking logic. But basically very little is added to TF Lite. So when you hear “MediaPipe is amazing, both fast and accurate”, people are actually talking about TF Lite and particular pre-trained models. MP Solutions are rather trivial to use and well documented. We will not discuss them any further.

  • Fact #4: MP uses TF Lite or TF models for deep learning (DL), but it is in no way limited to DL. MP solutions are pre-trained TF Lite models with some rather elementary pre- or post-processing. As far as DL is concerned, “MediaPipe” and “TF Lite” are basically the same thing.

Can you do something similar with your own pre-trained TF Lite (or TF) networks? In theory, yes. In practice, the choice of standard pipeline building blocks (called Calculators in MP) is rather limited. Basically, any TFLite model can be plugged into the standard TfLiteInferenceCalculator, but MP might lack building blocks for pre/post-processing if your task is different from the tasks in the solutions. It is possible to write your own calculators, but only in C++.

What is Our Interest in MediaPipe?

We were interested mostly in MP as a universal C++ pipeline framework, and not in “solutions”. We wanted to see if MP was suitable for writing custom computer vision (CV) pipelines in C++ (see the end of this article series for the final verdict). In the process, we experimented a lot with the core MP C++ API and wrote a tutorial: https://github.com/agrechnev/first_steps_mediapipe.

Can you use MP in languages other than C++ and on platforms other than desktop? For solutions, yes: there are versions for Python, JavaScript, Android (Kotlin/Java) and iOS (Swift). But once again, all these things are just wrappers around the C++ library. Presumably, they can also be used for a custom graph composed of standard MP calculators. However, any custom calculator must be written in C++. Moreover, if you use any custom calculators, you must (as far as we know) rebuild MP from source, including the respective wrapper (Python, JavaScript, etc.). You must be a fluent MP C++ user in order to do that! So, for all practical purposes, MP is a C++ library; the wrappers are a joke. With this explained, we are not going to discuss any languages other than C++ in MP.

How does MP compare to another well-known pipeline library, GStreamer? Let’s have a look:

| Feature | GStreamer | MediaPipe |
| Part of, year of birth | GNOME universe, 2001 | Google universe, ~2019 |
| Language | C (GObject) + wrappers | C++ + wrappers |
| Main purpose | Audio/video conversion, filtering, resampling | Audio/video processing, usually with deep learning |
| Standard A/V codecs | All you can think of: uses many plugins | Limited: OpenCV for video, FFmpeg for audio |
| Buffering, flow control | No buffering by default; enable buffers by hand | Unlimited buffering by default; enable flow control by hand |
| GPU, neural nets | Yes, with DeepStream + TensorRT, NVIDIA GPUs only | Yes, TensorFlow + TF Lite |
| Desktop use, docs | Easy, good | Hard, bad |
| Graph definition | C code (hard) or text string (limited) | ProtoBuf text string (easy) |

In the following sections, we present our experience of designing pipelines with MediaPipe C++.

Down the Rabbit Hole: Our Journey to the Land of MediaPipe and Other Google Technologies