Video is an extremely popular way to represent information. Indeed, sometimes it is enough to watch a short clip instead of listening to or reading a long explanation of a complicated technical concept. From a user’s point of view, a video is just a sequence of images shown one after another with a very short inter-frame interval, typically around 30 frames per second (FPS). However, a lot is hidden under the hood. In this article, we focus on how to build an efficient real-time video streaming pipeline.
Building Efficient Real-Time Video Pipelines
Have you ever written your own video player? What about a media server? Or a real-time video processing pipeline?
For most people in the world, the answer is “no”, probably even among the readers of this blog. Our experience says that many people underestimate the difficulties involved and are in for nasty surprises when they try to implement some computer vision (CV) in real time. “Real-time” happens when you receive your frames from a camera or a network stream (as opposed to a pre-recorded video file).
Novice computer vision engineers typically learn their craft on individual images. In the rare cases when the time dimension is required (tracking, optical flow), they usually work on pre-recorded videos. Then they think: what can go wrong with real-time? I just get frames from the camera and apply the fancy computer vision I usually do, right? The schematic C++ code they imagine looks like this:
cv::VideoCapture cap(cv::CAP_ANY);   // open the default camera
while (true) {
    cv::Mat frame;
    cap.read(frame);                 // blocking call: wait for the next frame
    process_somehow(frame);          // the usual computer vision magic
    send_somewhere(frame);           // show, write or stream the result
}
Is this how good real-time CV works? No! Let’s dive deeper.
Where is the Frame Loss?
When junior CV engineers try to do something with a camera, our first question is “where is the frame loss here?”. This question usually surprises people: “No, I do not want to lose any frames”. This is wrong. If your camera produces 30 FPS, few CV algorithms (and hardly any neural networks) can process a frame within 33 milliseconds. And then you typically want to stream, show on screen, or write the result somewhere, which also takes time. Even if your computer is fast enough, there are always slower devices (such as embedded ones) or computers overloaded with background tasks. So frame loss is inevitable. And because of the frame loss, you can never rely on a steady FPS from a camera.
POP QUIZ: Where is the frame loss in the piece of code above? Think before reading the answer.
Now here is the answer: the frame loss happens at the line “cap.read(frame);”. This is a synchronous, blocking call: “give me the next frame when it is ready”. If you take too long processing frame 1, the subsequent frames will be lost until you reach the read() call again. Luckily for us, OpenCV VideoCapture does not try to keep multiple frames in a buffer. You can try to guess what would happen if it did. Hint: nothing good.
Again, frame loss means that there is no reliable fixed FPS and that the interval between the timestamps of consecutive frames varies. This does not matter if your CV algorithm processes each frame individually, and it is usually not critical for optical flow either. However, if you do signal processing in the time domain, or fit parameterized motion models, then frame loss becomes critical. What is the solution? Record the (original) timestamp of each frame, and if you need a signal with a regular FPS, resample the original signal onto the desired regular timestamp grid.
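Here is a minimal sketch of that idea in C++; the Sample struct, now_seconds() helper and the linear interpolation are ours, purely for illustration. Stamp each frame the moment read() returns, keep the irregular measurements, and resample them onto a regular grid afterwards:

#include <chrono>
#include <vector>

struct Sample {        // hypothetical per-frame measurement
    double t;          // capture timestamp in seconds
    double value;      // e.g. an object coordinate produced by your CV algorithm
};

// Stamp a frame the moment it arrives (call right after cap.read(frame)).
double now_seconds() {
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}

// Linearly resample irregularly spaced samples onto a regular grid with the given FPS.
std::vector<Sample> resample(const std::vector<Sample>& in, double fps) {
    std::vector<Sample> out;
    if (in.size() < 2) return out;
    size_t i = 0;
    for (double t = in.front().t; t <= in.back().t; t += 1.0 / fps) {
        while (i + 1 < in.size() && in[i + 1].t < t) ++i;    // find the surrounding pair
        double w = (t - in[i].t) / (in[i + 1].t - in[i].t);  // interpolation weight
        out.push_back({t, in[i].value + w * (in[i + 1].value - in[i].value)});
    }
    return out;
}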
Threads and Buffers to the Rescue
Does the code above work correctly? Yes. Is it efficient? Definitely not. Note how it does all operations strictly sequentially. This includes input (getting a frame from the camera) and output (visualizing the result on the screen, writing it to a file, or sending it over the network). Such a sequential pipeline does not utilize (or at least does not utilize efficiently) multithreading on multiple CPU cores (and possibly the GPU as well).
But the sequential pipeline has an even more striking defect. Imagine for a moment that you want to process your frame in the cloud and then send the result back to the edge device. Processing on the server can be very fast, but the internet connection has a lag, sometimes up to half a second or more (round trip). With a sequential pipeline, you will wait half a second for the answer from the server before processing the next frame. The result is 2 FPS or less, while with a proper pipeline you can have 30 FPS with in-cloud processing! This illustrates the difference between throughput (FPS) and latency (lag or delay, the time it takes to fully process one input frame). For the sequential pipeline there is a hard limit, throughput <= 1 / latency, but for better pipelines this is not so.
Fig. 1. Car assembly line. (Image: www.freepik.com)
This is similar to a car assembly line (Fig. 1). The sequential pipeline (Fig. 2) means that only one car is being assembled at any given time. Imagine an almost empty factory building, with one very lonely car traveling along the assembly line. Only when this car is finished can the next car start. That would not be a terribly efficient assembly line! But we all know that is not how car factories work in real life. In reality, multiple cars move along the assembly line, one after another. The same principle applies to serious real-time computer vision (Fig. 3). Different stages of the pipeline (“actions”) take place in different threads (running on different CPU cores, or possibly on the GPU). Frames travel along the pipeline like cars on the assembly line, from thread to thread. While thread 3 processes frame 7 (for example), thread 2 can process frame 8 at the same time, and thread 1 can process frame 9. The “actions” include different computer vision operations that are executed sequentially, for example object detection and rendering of some graphics. They also include video encoding and decoding, BGR<->YuV conversions, CPU<->GPU data transfers, etc.
There is, however, a subtle difference. On the assembly line, cars travel at a fixed speed (throughput) and each assembly operation takes a standard “one step” time (or less). In video pipelines, it is extremely difficult to maintain such a standard FPS, and computer vision operations can take a different amount of time on different frames. So, what is usually done? The threads in the pipeline are connected by buffers (queues). The buffers have a maximum size they are not allowed to exceed. If a buffer is full, a frame is dropped (something we would not want on a car assembly line!). Thus, if we have a bottleneck in the pipeline (a thread with a long processing time), frames are automatically dropped in the buffer just before this thread.
Fig. 2. A sequential pipeline. A frame from the camera travels through the pipeline (action 1, action 2, action 3) and finally goes to “Output” (e.g. visualization on screen). Only when processing this frame is finished, we can receive the next frame from the camera.
This basic pipeline architecture (threads connected with buffers) is found under the hood in every media player or server, in YouTube, Zoom, Skype, and Netflix. And if you want your video pipeline to work properly, you had better implement this architecture too, or use a ready-made tool (see below). Note that buffers introduce latency. On the other hand, they ensure smooth playback. There is always a trade-off between smoothness and latency; you cannot have both. If you want a low-latency real-time pipeline, keep your buffers as small as possible.
Fig. 3. A multithreaded buffered pipeline. Threads are connected via buffers.
One thing is crucial: never, ever build an unlimited buffer without a size limit. It will grow indefinitely (while generating a rapidly increasing lag), fill all RAM, and eventually crash your computer. This is not a purely theoretical possibility. Frame loss is the safety valve in your pipeline, preventing it from exploding like an overheated steam engine! When using higher-level libraries and frameworks, watch out: they might implement their own buffers. Always understand how library functions work and read the documentation carefully. For example, the read() method of cv::VideoCapture provides the next camera frame, but what exactly does that mean? It is a combination of grab() and retrieve(): grab() grabs the latest camera frame (or waits for the next one), while retrieve() decodes it to the BGR format if needed. There is no buffer anywhere, so luckily you cannot shoot yourself in the foot with OpenCV here. But suppose some other hypothetical camera library implemented its own unlimited buffer, what then? Then we would crash the computer by consuming frames too slowly.
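To make the architecture of Fig. 3 concrete, here is a minimal sketch in C++ with the standard library. The class name BoundedFrameQueue and the buffer sizes are our own choices, not any library’s API, and process_somehow / send_somewhere are the same placeholders as in the code above:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <opencv2/opencv.hpp>

// A tiny thread-safe queue with a hard size limit: when it is full,
// the oldest frame is dropped. This is the deliberate, controlled frame loss.
class BoundedFrameQueue {
public:
    explicit BoundedFrameQueue(size_t max_size) : max_size_(max_size) {}

    void push(cv::Mat frame) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.size() >= max_size_)
            queue_.pop_front();                    // drop the oldest frame
        queue_.push_back(std::move(frame));
        cv_.notify_one();
    }

    cv::Mat pop() {                                // blocks until a frame is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        cv::Mat frame = std::move(queue_.front());
        queue_.pop_front();
        return frame;
    }

private:
    std::deque<cv::Mat> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    size_t max_size_;
};

// Wiring the pipeline of Fig. 3: one thread per stage, small buffers in between.
int main() {
    BoundedFrameQueue to_process(2), to_output(2);  // small buffers = low latency

    std::thread capture([&] {
        cv::VideoCapture cap(cv::CAP_ANY);
        cv::Mat frame;
        while (cap.read(frame)) to_process.push(frame.clone());
    });
    std::thread process([&] {
        while (true) {
            cv::Mat frame = to_process.pop();
            process_somehow(frame);                 // the slow CV stage
            to_output.push(frame);
        }
    });
    std::thread output([&] {
        while (true) send_somewhere(to_output.pop());
    });

    capture.join(); process.join(); output.join();
}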
Note: the often-used logic “send the frame to the engine if the engine is available; if the engine is busy, drop the frame” can be viewed as a very rudimentary buffer with a maximum size of 0. Proper buffers are more flexible than that.
A Note on Asynchronous Programming
Asynchronous programming is a hot topic nowadays, especially in web programming, but also in mobile and desktop GUIs. What does it mean? A synchronous operation means that you request some action and wait for it to finish. For example, the above-mentioned read() method of cv::VideoCapture waits for the next video frame to arrive. An asynchronous operation means that you request something and provide a callback function which will be called when the operation is finished. This is like your boss telling you “do something, then text me when you are ready”. Of course, your boss will not wait for you to finish; (s)he will do some other work. In particular, in web and mobile, cameras and video streams typically work this way: you have to provide a callback cb() which is called when the next frame arrives. What are the implications?
Attentive readers might notice that this logic is not well defined. What happens if frame 2 arrives while frame 1 is still being processed (the callback has not returned)? Different libraries behave differently; always understand how yours does. The library might simply drop a frame, which is good. Or it can implement its own buffer. Or, pretty often, the callback for frame 2 will be called anyway in another CPU thread, while frame 1 is still being processed. This last option is interesting. Many CV algorithms (optical flow, tracking, etc.) require frames to arrive strictly sequentially, one after another. The algorithm will go crazy if you try to run it on two frames simultaneously in different threads: crashing, throwing an exception, or, worse, behaving erratically. Even single-frame algorithms (like an object detection neural network) will eventually crash your device if you run many frames simultaneously in different threads! This situation happens all the time in real life when a web or mobile developer simply puts your algorithm in a callback without thinking about pipelines or buffers at all. The correct solution is to put a buffer between the callback and the algorithm. The callback should simply put the frame into the buffer, a fast operation (in general, callbacks should NOT contain any heavy operations), while the CV algorithm in another thread reads frames from the buffer. This ensures the proper ordering of frames, and of course the buffer should have a maximum size and drop frames as usual. You typically have to implement this buffer yourself.
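Continuing with the hypothetical BoundedFrameQueue from the sketch above, the callback only enqueues the frame, while a separate worker thread feeds the sequential CV algorithm (on_frame and run_cv_algorithm are illustrative names, not any real library’s API):

BoundedFrameQueue frames(2);                  // small bounded buffer, frames may be dropped

// Called by the camera / streaming library, possibly from different threads.
void on_frame(const cv::Mat& frame) {
    frames.push(frame.clone());               // cheap: no heavy work inside the callback
}

// Runs in its own thread and sees the frames strictly one after another.
void cv_worker() {
    while (true) {
        cv::Mat frame = frames.pop();         // blocks until the next frame
        run_cv_algorithm(frame);              // your sequential CV algorithm
    }
}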
Decoding, Encoding and YuV
How do you decode and encode videos? At least on Linux, there are different libraries for different audio and video codecs (libx264, libvpx, etc.) with different obscure APIs. Is there a unified approach for all codecs? Yes, there are a few options. OpenCV (which uses FFmpeg under the hood) can handle simple cases, but it is vastly insufficient for serious projects. FFmpeg and GStreamer are the two principal choices, at least on Linux, and both are cross-platform: they also exist on Windows, macOS and even on mobile. You should definitely master these two libraries if you work on video pipelines.
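As a quick taste of what a GStreamer pipeline looks like, even OpenCV’s VideoCapture can accept a pipeline description string, assuming your OpenCV build has GStreamer support (the device path here is just an example):

// Webcam -> colorspace conversion -> BGR frames handed to OpenCV via appsink.
cv::VideoCapture cap(
    "v4l2src device=/dev/video0 ! videoconvert ! video/x-raw,format=BGR ! appsink",
    cv::CAP_GSTREAMER);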
Most video codecs do not work with BGR or RGB images; instead they use various versions of YuV, including YuV420p, NV12 and NV21. If you want RGB, you will normally have to convert it yourself. OpenCV can handle a few versions of YuV, and libswscale (a part of FFmpeg) can handle them all. Note that YuV<->RGB conversions are pretty expensive, especially on 4K images, so you should avoid them if possible. For example, if your CV algorithm processes grayscale images, you do not need RGB and can work on YuV directly (the grayscale image is simply the luma plane of the YuV frame).
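For instance, in a YuV420p (I420) frame the first width x height bytes are the Y (luma) plane, which is exactly the grayscale image. A sketch with OpenCV, where the buffer pointer and dimensions are assumed to come from your decoder:

// yuv_data points to a contiguous I420 buffer of size width * height * 3 / 2.
cv::Mat yuv(height * 3 / 2, width, CV_8UC1, yuv_data);
cv::Mat gray = yuv.rowRange(0, height);          // Y plane: grayscale, no copy, no conversion

// Only if you really need BGR (expensive, especially on 4K frames):
cv::Mat bgr;
cv::cvtColor(yuv, bgr, cv::COLOR_YUV2BGR_I420);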
What about hardware-accelerated encoders/decoders, such as the ones available on Nvidia GPUs (including Jetson Xavier, but NOT the ones in laptops) and Raspberry Pi? FFmpeg and GStreamer can generally handle those, but sometimes this requires building the library from source (FFmpeg on Raspberry Pi, which is a big pain). There are also native APIs for Nvidia (NVENC/NVDEC) and for Raspberry Pi (MMAL, OpenMAX). You might run into issues with hardware encoders/decoders. For example, in one project we figured out that the Nvidia H264 decoder produces only NV12 (and not the regular YuV420). Also, some hardware encoders do not repeat the PPS/SPS packets (headers) of H264, which causes serious issues with streaming.
Let us cheat!
Despite your best efforts, you may find that your pipeline’s output looks ugly in terms of throughput (FPS), latency (lag), and stability. For example, if some neural network takes 0.5 seconds per inference, you will get a 2 FPS video with over 0.5 s of lag. Ouch. Then how come all commercial products, including mobile, browser and embedded apps, look so beautiful? First, they optimize everything that can be optimized. Second (and here we are revealing the biggest secret in the industry!), they cheat. Everybody does. By “cheating” we mean an optimization that radically changes the entire pipeline logic to produce a visually pleasing output. A few examples:
- Show every frame (keep full FPS), process only some of them. In the example above, the 2 FPS output video is very ugly. So, send only 2 frames per second to the slow neural network (which does, e.g., object detection), but send every single input frame (30 FPS) to the output video. This is essentially a pipeline with branches, as opposed to the sequential one. A massive frame loss happens on the detection branch (as the detector is slow), but not on the visualization branch. But how do we visualize detected objects in every frame, when detection happens only twice per second? You can use the last detected position. Or, better, see cheat #2.
- If detection is slow, interpolate, smooth or track. When doing cheat #1, you can interpolate/extrapolate object locations between the “detection” frames, or apply some kind of smooth motion model with velocity parameters (see the sketch after this list). Even when detection is fast, a good motion model gives much smoother visual motion of objects. True tracking involves optical flow and similar approaches to follow each object once it is detected.
- Prefer zero lag and compensate. A visible lag makes things ugly. If the camera feed on your smartphone’s screen is delayed by half a second, people tend to notice. Thus, when using cheat #1, it is better to visualize the frame immediately, without waiting for the results of object detection. Most real-time apps (especially mobile) work like this. Of course, this kills the synchronization between the frame and the detection result. You might notice that detection results lag half a second behind the frame, since they were detected on an earlier frame. Bad. But this is a necessary evil: if the entire camera video lags, it is visually much worse than if a little bounding box lags. What you can try is to compensate for the lag. If you have a motion model for the object, you can simply go 0.5 s back in time. This works reasonably well, but only when the object moves predictably, and not when a new object has just been detected.
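To illustrate cheats #2 and #3, here is a minimal constant-velocity sketch (the Box struct and extrapolate function are ours, purely for illustration): given the two most recent detections of an object, it predicts where to draw the box on the current, later frame.

struct Box { double x, y, w, h, t; };    // box center, size and detection timestamp

// Predict the box position at time t_now from the two latest detections.
Box extrapolate(const Box& prev, const Box& last, double t_now) {
    double dt = last.t - prev.t;
    double vx = (last.x - prev.x) / dt;  // estimated velocity of the box center
    double vy = (last.y - prev.y) / dt;
    double lag = t_now - last.t;         // how far the detections lag behind the video
    return {last.x + vx * lag, last.y + vy * lag, last.w, last.h, t_now};
}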
When you think that your app looks poor compared to existing ones, remember that true CV professionals are masters of cheating 🙂
Efficient Pipelines Summary
- Do not use sequential single-thread pipelines
- Frame loss is inevitable
- There is no stable, predictable FPS; if you need one, resample
- Build a pipeline with threads and buffers
- Never do an unlimited buffer/queue
- Do not put heavy operations into asynchronous callbacks
- Asynchronous callbacks are not guaranteed to run sequentially
- Use FFmpeg, GStreamer or other software for encoding/decoding
- Codecs almost always use YuV, which comes in many versions
- Avoid costly YuV<->RGB conversions if possible
- Don’t reinvent the wheel, use GStreamer!
- Or Nvidia DeepStream for a GPU-only pipeline … which can run out of GPU RAM
- Cheat #1: Show every frame (keep full FPS), process only some of them
- Cheat #2: If detection is slow, interpolate, smooth or track
- Cheat #3: Prefer zero lag and compensate
Conclusions
We have addressed some practical aspects of video processing pipelines and hopefully shed some light for people who might think that this is a trivial process. We will publish a blog post on video streaming shortly. Thanks for reading!