Deep learning (DL) and neural networks are extremely widespread in computer vision (CV) applications. Indeed, many typical problems (like object recognition or semantic segmentation) are effectively solved by convolutional neural networks (CNNs). In this article, we are going to discuss how to utilize CNNs on embedded devices.
Neural Networks, Training and Inference
Neural networks today are ubiquitous. In particular, it is hard to imagine computer vision without them. The networks used in CV are typically convolutional (which means they rely heavily on convolutional layers) and deep (meaning the total number of layers is large, often in the hundreds), thus “deep learning”. The architectures of modern state-of-the-art (SOTA) neural networks are getting more and more sophisticated, which often means they are slow and require a lot of resources to operate, although networks specially designed to be lightweight (like the MobileNet series) are not uncommon.
Before we go ahead, let’s highlight the differences between CNN training and inference. The former is typically done on powerful hardware, from your desktop up to a GPU cluster in the cloud (AWS, MS Azure, etc.). Technically, training is a process of optimizing the network parameters (weights), of which there are typically millions. Inference, on the other hand, is porting and running the pre-trained network. Sounds like an easy task, right? In practice, it is not.
Deploying Neural Networks in Production
So, suppose you trained a neural network in PyTorch or Tensorflow and you are happy with its performance. How do you deploy it in production (in your own C++ or Python code)? This is not as trivial as it sounds, and beginner-level DL tutorials and manuals usually avoid the issue. For example, PyTorch documentation hardly touches this at all, while Tensorflow documentation advertises some exotic commercial services on Google Cloud. So, the question is “how can I infer a neural network in my own C++ or Python code?”. There are many possibilities, for example:
- Frugally-deep: C++ only, CPU only, lightweight, header-only, Eigen-based
- Google Tensorflow (Lite): C++/Python, CPU/GPU
- Facebook libtorch/PyTorch: C++/Python, CPU/GPU
- Microsoft ONNX Runtime: C++/Python, CPU/GPU
- Nvidia TensorRT: C++/Python (3.6 only), Nvidia GPU only
Let us discuss them in turn:
State Of The Art Curse and Model Zoos
Not all neural networks can be deployed outside of the training framework. The ones using only the most standard building blocks, such as convolutions or max pooling, are usually fine; however, many networks on GitHub do something rather exotic, or even include custom modules written in C++/CUDA.
This can be called the “State Of The Art Curse”. The networks on the cutting edge of deep learning (recent arrivals on GitHub and top scorers on PapersWithCode) are the most likely to use such exotic components and thus be undeployable. However, customers often fail to understand this and make demands like “take network X, which is SOTA, and deploy it on a Jetson Nano using TensorRT”, which is usually impossible. Roughly speaking, there are three levels of “undeployable”:
- Fully undeployable. The model cannot even be exported to ONNX, UFF or TorchScript. This usually happens when the model contains custom CUDA or Python code. Note that this can often be circumvented with a lot of effort (for example, by including the custom CUDA modules in the TensorRT deployment as plugins).
- TensorRT incompatible. Some common ONNX operations are not yet supported in TensorRT, like image resize (torch.nn.functional.interpolate). This also depends on the TensorRT version (e.g. transformers first appeared in TensorRT 7).
- Accelerator incompatible. Deep Learning Accelerators (Nvidia DLA, Google TPU, Intel VPU) are extremely restrictive in what they can do. Nearly every modern neural network fails to deploy without modifications. More on this below.
A related concept is the “Model Zoo”. If you see “Model Zoo” somewhere, it means you have been cheated. Why? A “Model Zoo” means that the company that created a DL accelerator or framework kindly provides you with a few models you are invited to use. This usually means that any other model (like a fancy SOTA network from GitHub) is not going to work. As DL engineers, we want to deploy any network, especially newer and better ones, and we do not want to be restricted to a very limited selection of simple and usually outdated models like YOLOv2.
Deep Learning on Embedded/Single Board (and other) Devices
On embedded/single-board devices, you have a choice between Nvidia Jetson devices (Fig. 1) and other devices; the latter category includes devices with DL accelerators (Fig. 2).
Fig. 1. Nvidia devices: Jetson Nano, Jetson Xavier (single-board and embedded versions).
Nvidia Jetson Devices: CUDA-Based Deep Learning and TensorRT
The Nvidia Jetson series are the only single-board/embedded devices with an Nvidia GPU. An Nvidia GPU means that you have CUDA, and you can do deep learning in more or less the same way as on a cloud instance or a desktop Linux PC: you have Tensorflow, PyTorch, TensorRT, etc. The preferred framework for inference is TensorRT, as it is normally much faster than the alternatives. But if some network turns out to be TensorRT-incompatible, you can always fall back to Tensorflow (Lite), PyTorch/Libtorch or ONNX Runtime. This is the major advantage of having an Nvidia GPU: with DL accelerators (discussed below), you are extremely limited in what software you can use and what networks you can run. All things considered, if you have to run a network “on the edge” in production and the budget allows it, we would suggest an Nvidia device (e.g. a Xavier) over the devices with accelerators shown in Fig. 2. Note that the Jetson Nano, while equipped with an Nvidia GPU, has very little computing power, so it is better suited for academic purposes.
In comparison, the more expensive Xavier devices have a pretty decent GPU. Once again: never ever try to train neural networks on these devices; they are only for inference.
Fig. 2. Devices with DL accelerators: Google Coral, Google Coral Stick, Intel Neural Compute Stick 2 (aka Movidius).
Now it is time to say a few more words about TensorRT. TensorRT is Nvidia’s highly optimized neural network inference framework, which works on Nvidia GPUs only. It is a C++ library (a Python wrapper is available, but only for Python 3.6, not 3.8), so you will have to know C++ well, and you will have to write quite a bit of boilerplate C++ code to get your network up and running. That’s right: the command-line tool trtexec is good for testing only; for deployment you will have to write your own C++ code, probably even two pieces of it: one for engine creation and one for inference. Inference takes place entirely on the GPU and uses GPU RAM, so you will need to know basic CUDA programming as well.
The TensorRT workflow looks like this (a minimal code sketch follows the list):
- First, you need to create a network description, a graph consisting of standard TensorRT layers (this is not unlike the Tensorflow graph). You can create it by hand in your C++ code, but most often people import (“parse”) an existing network in ONNX, UFF or Caffe format using the respective parser, a library separate from the main TensorRT.
- Second, you build a TensorRT runtime engine (also known as “plan”), which optimizes your network for a particular GPU. This process can take some time (sometimes over 10 minutes). The engine can be serialized to disk (saved as a *.plan file) for later inference, but you need to keep in mind that it will not work on a different GPU model.
- Once you have created an engine or loaded it from a *.plan file, you create a TensorRT execution context (a rather thin wrapper around the engine). More than one context can be created from one engine. If any dimensions, including the batch size, were left unspecified (“dynamic”) in the engine, they must be fixed when the context is created.
- Finally, you use the execution context to run the network (inference) as many times as you like. The input and output data is stored in the GPU RAM, and the inference itself is typically enqueued in a CUDA stream.
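Below is a minimal sketch of these four steps for TensorRT 7 with the ONNX parser and an explicit batch dimension. It is only a hedged illustration, not the full implementation: error handling and cleanup are omitted, the model path and the 8x3x480x640 input / 8x1000 output shapes are assumptions (the ONNX model is assumed to have a fixed batch size of 8), and the exact API differs between TensorRT versions.

```cpp
// A minimal TensorRT 7 workflow sketch (ONNX path, explicit batch dimension).
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>

// Step 0: a logger is mandatory for all TensorRT objects.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

int main() {
    using namespace nvinfer1;

    // 1. Network description: parse an existing ONNX model.
    auto builder = createInferBuilder(gLogger);
    const auto flags =
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = builder->createNetworkV2(flags);
    auto parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model.onnx", static_cast<int>(ILogger::Severity::kWARNING));

    // 2. Build the engine ("plan") optimized for this particular GPU.
    auto config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 28);   // 256 MB of scratch space
    auto engine = builder->buildEngineWithConfig(*network, *config);

    // Optionally serialize the engine to a *.plan file for later use
    // (the file is tied to this GPU model and TensorRT version).
    auto serialized = engine->serialize();
    std::ofstream("model.plan", std::ios::binary)
        .write(static_cast<const char*>(serialized->data()), serialized->size());

    // 3. Create an execution context (a thin wrapper around the engine).
    auto context = engine->createExecutionContext();

    // 4. Inference: input/output buffers live in GPU RAM,
    //    and the inference call is enqueued in a CUDA stream.
    void* bindings[2];                                    // [0] = input, [1] = output
    cudaMalloc(&bindings[0], 8 * 3 * 480 * 640 * sizeof(float));
    cudaMalloc(&bindings[1], 8 * 1000 * sizeof(float));   // hypothetical output size
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // ... cudaMemcpyAsync() the preprocessed input into bindings[0] here ...
    context->enqueueV2(bindings, stream, nullptr);
    // ... cudaMemcpyAsync() the result out of bindings[1], then ...
    cudaStreamSynchronize(stream);
    return 0;
}
```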
One of the best things about TensorRT is that it can speed up your network by using FP16 and INT8 precision instead of the default FP32, provided that your GPU supports these operations (the better ones do). FP16 and INT8 use the tensor cores of the GPU, which are different from the regular CUDA cores. INT8 inference requires calibration, i.e. running the network on a bunch of input images to determine the numerical scale of each layer. TensorRT even supports the Nvidia Deep Learning Accelerator (DLA) on Xavier devices, more on that below.
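As an illustration, enabling reduced precision amounts to a couple of flags on the builder config from the workflow sketch above. This is a hedged fragment, not a complete program; myCalibrator is a hypothetical instance of your own calibrator class, discussed in the INT8 section below.

```cpp
// Hedged fragment: request FP16 and INT8 kernels where available.
// The flags have no effect if the GPU lacks the corresponding tensor-core support.
config->setFlag(nvinfer1::BuilderFlag::kFP16);
config->setFlag(nvinfer1::BuilderFlag::kINT8);
config->setInt8Calibrator(&myCalibrator);   // your own IInt8Calibrator, see below
```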
TensorRT Issues
Of course, nothing is ideal, and we’ve faced a lot of challenges while using the TensorRT engine.
Boilerplate Code:
As mentioned already, you need a lot of boilerplate C++ code. You must create a logger and manually perform the four steps outlined above. You also have to do a lot of image pre-/post-processing yourself (see the code sketch below):
- Load images, or receive them from a camera/video stream, and convert them from BGR to RGB (if using the BGR-loving OpenCV).
- Convert pixels from UINT8 to FP32 and normalize the image (so that the mean and standard deviation of the pixel intensities over a large dataset are approximately zero and one respectively). Most neural networks only operate on normalized images.
- Convert from the “channels last” format (where the color channel is the last dimension) to the “channels first” format of TensorRT. If our image has dimensions 480x640x3, it should be converted to 3x480x640.
- Combine multiple images into a batch (if batch size > 1). Using a batch size larger than 1 can speed up inference if you need to process many images. With a batch size of 8, the final dimensions of our input tensor will be 8x3x480x640.
- Finally, copy the data from CPU to GPU RAM with cudaMemcpyAsync() or a similar function before inference.
And if the output of your neural network (inference result) is also an image, you have to perform all these steps in the reverse order on the network output.
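Here is a hedged sketch of such preprocessing with OpenCV for a single image; the 640x480 input size matches the example above, while the mean/std values are assumed ImageNet statistics, not something mandated by TensorRT.

```cpp
// A minimal preprocessing sketch: BGR->RGB, UINT8->FP32, normalization, HWC->CHW.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<float> preprocess(const cv::Mat& bgr) {
    cv::Mat rgb, f32;
    cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);      // BGR -> RGB
    rgb.convertTo(f32, CV_32FC3, 1.0 / 255.0);      // UINT8 -> FP32 in [0, 1]

    const float mean[3]  = {0.485f, 0.456f, 0.406f};   // assumed ImageNet statistics
    const float stdev[3] = {0.229f, 0.224f, 0.225f};

    // HWC ("channels last") -> CHW ("channels first") with normalization.
    std::vector<float> chw(3 * f32.rows * f32.cols);
    for (int c = 0; c < 3; ++c)
        for (int y = 0; y < f32.rows; ++y)
            for (int x = 0; x < f32.cols; ++x)
                chw[(c * f32.rows + y) * f32.cols + x] =
                    (f32.at<cv::Vec3f>(y, x)[c] - mean[c]) / stdev[c];
    return chw;
}

// Later, after batching, copy the tensor to GPU RAM before inference, e.g.:
// cudaMemcpyAsync(bindings[0], batch.data(), batch.size() * sizeof(float),
//                 cudaMemcpyHostToDevice, stream);
```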
Network Compatibility:
Many network operations are incompatible with TensorRT. The most striking example is the image resize operation. For integer upsampling factors (e.g. 2x), this issue can be circumvented by clever tricks (see the Nvidia RetinaNet code for details). And of course, if your model uses custom CUDA code, things become more complicated. It is possible to use custom plugins in TensorRT, but then there is no way such a network can be represented as ONNX, at least not the complete network. In the same Nvidia RetinaNet, only part of the network is exported as ONNX, and custom layers are added to it in the C++ code when the TensorRT network is constructed. Nvidia RetinaNet, by the way, is a pretty good (though rather advanced) TensorRT example.
ONNX Parsing and TensorRT Version Incompatibility:
You might have heard that different versions of TensorRT, especially 6.x and 7.x, are seriously incompatible. Is that so? Yes and no. The main difference is actually not in TensorRT itself but in the open-source ONNX parser, which is developed separately and is not part of the TensorRT core. The ONNX parser is the most widely used way to convert PyTorch networks to TensorRT.
The issue arises from the way TensorRT handles the batch dimension. The recommended modern way is the “explicit batch dimension”, where the batch dimension is the first dimension of all network tensors, including inputs and outputs. It can be dynamic (unspecified when constructing the engine, and fixed only when creating the execution context). However, there is also a legacy way, the “implicit batch dimension”, where the batch dimension is not part of the input tensor dimensions and can be set to any value at inference time; only the maximum batch size must be specified at engine creation time. For example, if your network processes RGB 640×480 images, the input tensor dimensions will be 8x3x480x640 with an explicit batch dimension (and batch size 8), and 3x480x640 without.
Another problem might appear during ONNX parsing. For some reason, the parser supports only the implicit batch dimension for TensorRT 6 and only the explicit batch dimension for TensorRT 7. This means the C++ code for TensorRT 6 and 7 will always be incompatible (although experienced people can get around that with C++ preprocessor directives and/or if statements)! Even exporting ONNX from PyTorch is different: for TensorRT 7, you must specify a fixed batch size explicitly or declare it as “dynamic”, while for TensorRT 6, exporting with a batch size of 1 will suffice. Sounds complicated? It is. If you are a TensorRT newbie just making your first baby steps, expect to spend many hours frustrated and confused by the explicit/implicit batch dimension issue.
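For illustration, here is a hedged C++ fragment in the TensorRT 7 style, continuing the workflow sketch above (so builder and engine come from there): the network is created with an explicit batch dimension, and a dynamic batch dimension is fixed on the execution context. The binding index and the 8x3x480x640 shape are assumptions.

```cpp
// Hedged fragment (TensorRT 7): explicit batch dimension, left dynamic in the engine.
// The ONNX file and the engine's optimization profile are assumed to allow a
// dynamic batch size.
const auto explicitBatch = 1U << static_cast<uint32_t>(
    nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
auto network = builder->createNetworkV2(explicitBatch);
// ... parse ONNX and build the engine as in the workflow sketch above ...

// Any dynamic dimension must be fixed on the execution context before inference,
// e.g. batch size 8 for 3x480x640 RGB inputs (binding 0 assumed to be the input).
auto context = engine->createExecutionContext();
context->setBindingDimensions(0, nvinfer1::Dims4{8, 3, 480, 640});
```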
There are many other parsing issues as well. For example, TensorRT 6 usually cannot parse ONNX files created by recent PyTorch versions, so you will have to export with an ancient PyTorch 1.2 (if your network works there). We used a separate Docker container with TensorRT 6 and PyTorch 1.2. While PyTorch networks can only be converted to TensorRT via ONNX, for Tensorflow the UFF format is more popular. Moreover, Tensorflow advertises using TensorRT directly from TensorFlow, but this requires a particular (and usually outdated) TensorRT version and is less flexible, so we suggest using UFF or ONNX instead.
FP16 and INT8 Issues:
For quite some time we could not make INT8 work at all, even on the simplest possible example. We set all the required flags in the code, and the TensorRT optimizer just produced an engine with FP32 layers. We tried everything, and it just did not work. It took us ages to realize that the problem was exactly that we used the simplest possible example. It turns out TensorRT optimizes the entire network and will switch INT8 on only for the layers where it is available and only if it actually speeds things up. In practice, this means you need at least two consecutive convolutional layers with ReLU activations to see INT8 in action.
There are many other issues with INT8 and FP16. They require tensor cores, and not all GPUs have those (Xavier devices and newer GeForces do). Network accuracy can degrade significantly compared to FP32, especially for INT8. INT8 requires calibration, and you must write your own C++ calibrator class. The good news is that you can calibrate once, save the calibration table, and then reuse it every time you build an INT8 engine. Building an engine with FP16 and especially INT8 takes much longer than with FP32. Still, the roughly 3x acceleration of network inference is worth it.
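For illustration, here is a hedged sketch of such a calibrator for TensorRT 7, built on the IInt8EntropyCalibrator2 base class. loadNextBatch() is a hypothetical helper that returns pre-processed calibration data, the cache file name is an assumption, and a single network input is assumed.

```cpp
// A minimal INT8 calibrator sketch (TensorRT 7). It feeds pre-processed
// calibration batches from CPU memory and caches the calibration table on
// disk so calibration runs only once.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iterator>
#include <vector>

class MyCalibrator : public nvinfer1::IInt8EntropyCalibrator2 {
public:
    MyCalibrator(int batchSize, size_t inputVolume)
        : mBatchSize(batchSize), mInputVolume(inputVolume) {
        cudaMalloc(&mDeviceInput, batchSize * inputVolume * sizeof(float));
    }
    ~MyCalibrator() override { cudaFree(mDeviceInput); }

    int getBatchSize() const override { return mBatchSize; }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) override {
        // loadNextBatch() is a hypothetical helper returning pre-processed CHW
        // float data for the next calibration batch, or an empty vector when done.
        std::vector<float> host = loadNextBatch();
        if (host.empty()) return false;                 // no more calibration data
        cudaMemcpy(mDeviceInput, host.data(), host.size() * sizeof(float),
                   cudaMemcpyHostToDevice);
        bindings[0] = mDeviceInput;                     // single input assumed
        return true;
    }

    // Cache the calibration table so future engine builds can skip calibration.
    const void* readCalibrationCache(size_t& length) override {
        std::ifstream f("calibration.cache", std::ios::binary);
        mCache.assign(std::istreambuf_iterator<char>(f),
                      std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }
    void writeCalibrationCache(const void* cache, size_t length) override {
        std::ofstream("calibration.cache", std::ios::binary)
            .write(static_cast<const char*>(cache), length);
    }

private:
    std::vector<float> loadNextBatch();    // hypothetical, defined elsewhere
    int mBatchSize;
    size_t mInputVolume;
    void* mDeviceInput{nullptr};
    std::vector<char> mCache;
};
```

An instance of such a class is what gets passed to setInt8Calibrator() in the builder-config fragment shown earlier.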
A Note on Docker:
On a Linux PC, you can use Docker to keep different incompatible versions of CUDA, TensorFlow, TensorRT, PyTorch, etc. on the same computer. In particular, TensorFlow 1.x and 2.x are seriously incompatible, just as TensorRT 6.x and 7.x are, and every version of TensorFlow and TensorRT requires a very particular version of CUDA and cuDNN. On Jetson devices, you are unfortunately stuck with the versions provided with your JetPack (the Ubuntu-like Linux distribution for Jetson). Different versions of JetPack ship different TensorRT versions, but reinstalling JetPack is not easy (as discussed in the blog post).
We strongly recommend that you export ONNX or UFF on your PC and then use it to build the TensorRT engine on the Jetson. You should also write your C++ TensorRT code on a PC before trying it on a Jetson, optionally using a Docker container with the same TensorRT version you have on the Jetson.
A Note on Parallelization:
TensorRT optimizes an engine to infer as fast as possible. This means it loads all GPU cores to the maximum, so you cannot gain anything by inferring two or more networks in parallel. You also cannot limit the number of CUDA cores used by TensorRT; it always uses all of them. You might think that if one network uses FP32 (CUDA cores) while another uses FP16/INT8 (tensor cores), they might run in parallel. We tested this extensively; it does not work either.
To summarize: Do not try to infer two or more TensorRT neural networks simultaneously.
Nvidia Deep Learning Accelerator (DLA)
You might have heard of the Nvidia Deep Learning Accelerator (DLA), Nvidia’s open-source deep learning accelerator architecture. You can use it either via the native DLA API or via TensorRT; the latter option is preferred.
Sounds quite promising, doesn’t it? However, in practice there are even more problems than with the plain TensorRT engine discussed above. Here are some of the bottlenecks:
- First of all, DLA is supported on Jetson Xavier devices only. Also, please note that tensor cores and DLA are two different things!
- DLA supports only FP16 and INT8 (not FP32), and the accuracy of INT8 inference drastically drops for all networks we tried (compared to INT8 inference on the GPU).
- The word “accelerator” is misleading: DLA is in fact very slow compared to the Xavier GPU; FP16 on DLA is about as slow as FP32 on the GPU.
- As with all DL accelerators, the selection of supported layers is very limited. For example, DLA does not have Constant layers. The deconvolution layer exists in theory, but without any padding, which is unsuitable for encoder-decoder networks (semantic segmentation, optical flow, etc.), which always use deconvolutions with padding.
- When some layer is not available on DLA, TensorRT will run it on the GPU instead (“GPU fallback”). Expect a lot of those.
- You might think that you can run a network on the DLA (or even two, on the two DLA cores of a Xavier) while a third one runs on the GPU. This does not work either. The DLA networks run fine (and you can run two on the two DLA cores), but for some reason the GPU is almost completely blocked by them. We tested the configuration of 2 DLA networks + 1 GPU network: both DLA networks ran at full speed, but the GPU network slowed down by a factor of 3 or more.
In other words, like most DL accelerators (see below), Nvidia DLA suffers from the tragic State Of The Art Curse in its utmost severity. Most existing networks, and especially the fancy 2020 SOTA networks from GitHub, cannot run on DLA alone.
Deep Learning on Non-Nvidia Devices
For non-Nvidia devices, you have 3 options for neural network inference:
- CPU inference
- Non-Nvidia GPU inference
- Deep learning accelerators
CPU inference is rather trivial and only suitable for very lightweight neural networks. If you do not want to write the code yourself, you can always use CPU versions of Tensorflow Lite, Libtorch or ONNX Runtime, or perhaps something like Frugally-deep.
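As an example, here is a hedged sketch of CPU inference with the ONNX Runtime C++ API. The model path, tensor names and shapes are assumptions; in real code you would query the input/output names from the session, and the exact API may differ between ONNX Runtime versions.

```cpp
// A minimal CPU-inference sketch with the ONNX Runtime C++ API.
#include <onnxruntime_cxx_api.h>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "demo");
    Ort::SessionOptions options;                        // defaults to the CPU provider
    Ort::Session session(env, "model.onnx", options);   // assumed model path

    // Assumed input: one 3x480x640 image, already preprocessed (CHW, normalized).
    std::vector<int64_t> shape{1, 3, 480, 640};
    std::vector<float> input(1 * 3 * 480 * 640, 0.0f);
    auto memInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        memInfo, input.data(), input.size(), shape.data(), shape.size());

    // Assumed tensor names; query them from the session in real code.
    const char* inputNames[]  = {"input"};
    const char* outputNames[] = {"output"};
    auto outputs = session.Run(Ort::RunOptions{nullptr}, inputNames, &tensor, 1,
                               outputNames, 1);
    float* result = outputs.front().GetTensorMutableData<float>();
    (void)result;   // post-process the result here
    return 0;
}
```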
While there have been some attempts to implement neural networks on non-Nvidia GPUs (including the ones in the Raspberry Pi, as well as AMD and Intel GPUs), they are experimental, often unfinished, and lack the standardization and sophistication of the Nvidia-GPU frameworks like PyTorch or TensorRT.
The third option is more interesting. There are a number of devices on the market specially designed for deep learning and equipped with DL accelerators. These include the Google Coral series (available as a single board and a USB stick) and the Intel Neural Compute Stick series (previously known as Movidius) (Fig. 2). They are often combined with other devices; for example, a Google or Intel USB stick is plugged into a Raspberry Pi. Google Coral devices have a DL accelerator called the Edge TPU, a smaller cousin of the Google TPU used in the Google cloud. It is used via a special version of Tensorflow Lite, and your network needs conversion before it can run on the TPU. Intel’s accelerator is called the Intel VPU, and it is accessible through software named the OpenVINO Toolkit. Again, network conversion is needed. Technically, the two Nvidia DLA cores in the Jetson Xavier are also a DL accelerator similar to the TPU and VPU, and even the tensor cores in modern Nvidia GPUs function in a similar way.
All DL accelerators work in FP16 or INT8 precision; they do not support FP32, and the older models typically support only INT8. They are highly optimized for a few standard operations, but their instruction set is very limited. DL accelerators typically come with a “Model Zoo”, a small number of outdated neural networks. You can be pretty sure that no network from GitHub will work, at least not without serious modification.
It is often discussed in the deep learning community whether DL accelerators are a good or a bad thing. DL is a very active field of computer science at the moment, with new network architectures appearing every day. Every year or so the whole field changes beyond recognition. New building blocks (network layers) appear all the time and gradually become popular, for example, Transformers. New ideas are usually tested with custom CUDA plugins first, as they are not yet available in any framework. Frameworks like PyTorch try their best to incorporate new ideas quickly. However, it is much harder to design a new piece of hardware, so DL accelerators always fall years behind. For this reason, it is often asked whether they are of any use to the DL community at all, at least until the field stops evolving so quickly.
My answer is the following: if you want to run something very simple from the zoo (like YOLO) on the edge and you don’t care at all about the state of the art, DL accelerators are for you. However, if you like to fool around with new and fancy neural networks, then you should not stray from the Nvidia camp, and when you get ready for deployment, choose a Jetson Xavier, TX2, or something similar.
Summary
Deep neural nets are very exciting, but you need to know how to cook them right, especially when your target hardware has limited capabilities and the performance requirements are strict. We’ve shared some practical insights on the topic. So, what’s next?
Part 3 of the series is going to deal with video streaming and efficient video pipelines. Stay tuned!