This blog post covers some important aspects of deploying and running classical computer vision algorithms, as well as convolutional neural networks, in a web front-end. Please make sure you have read the first part of this blog post; it will make the technical details much easier to follow.
Emscripten for Computer Vision
How do you pass an image or a video frame from JS to C++ and back? Here is a minimal example. Suppose you have an image in an <img> tag. First, you have to copy it to RGBA pixels (only the RGBA format is supported, not RGB!) via a <canvas> tag:
const img = document.getElementById('myImg');
const canvas = document.getElementById('myCanvas');
const ctx = canvas.getContext('2d');
const w = img.width;
const h = img.height;
canvas.width = w;
canvas.height = h;
ctx.drawImage(img, 0, 0);
const data = ctx.getImageData(0, 0, w, h).data; // Uint8ClampedArray
const nBytes = data.byteLength; // == 4*w*h
Next, you have to send the Uint8ClampedArray object data to C++. However, C++ cannot access JS objects directly (at least, not efficiently). They are not part of the C++ memory, which itself is only a part of the JS memory. Some copying is unavoidable. Let’s copy data to the C++ heap:
const dataPtr = Module._malloc(nBytes); // malloc on the C++ heap
const dataHeap = new Uint8ClampedArray(Module.HEAP8.buffer, dataPtr, nBytes);
dataHeap.set(data); // copy data -> dataHeap
Here dataHeap is a view object for the C++ data.
Now we can finally call the C++ code to do something to the image. We pass dataPtr, a pointer to data on the C++ heap. No result is returned here, but the image can be modified in-place:
Module._process(dataPtr, nBytes, w, h);
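The post does not show the C++ side; here is a minimal sketch of what _process could look like (the invert operation is just a placeholder of ours). extern "C" disables name mangling, and EMSCRIPTEN_KEEPALIVE prevents the symbol from being stripped; emscripten then exposes the function to JS with an underscore prefix, as Module._process:

```cpp
#include <cstdint>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
#define EMSCRIPTEN_KEEPALIVE  // allow native compilation for testing
#endif

// Called from JS as Module._process(dataPtr, nBytes, w, h).
// The buffer is modified in place; here we simply invert RGB, keeping alpha.
extern "C" EMSCRIPTEN_KEEPALIVE
void process(uint8_t* rgba, int nBytes, int w, int h) {
  (void)w; (void)h;  // unused in this toy example
  for (int i = 0; i + 3 < nBytes; i += 4) {
    rgba[i]     = 255 - rgba[i];      // R
    rgba[i + 1] = 255 - rgba[i + 1];  // G
    rgba[i + 2] = 255 - rgba[i + 2];  // B
    // rgba[i + 3] (alpha) left untouched
  }
}
```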
Finally, let’s show the result on the canvas and free the C++ buffer:
ctx.putImageData(new ImageData(dataHeap, w, h), 0, 0);
Module._free(dataPtr);
It is very important to remember that C++ has no garbage collection: if you use malloc(), you must call free() afterwards, otherwise you have a memory leak! Memory leaks are extremely evil. You might not notice them in a minimal demo, but they will kill a real project.
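One practical way to never forget the free() is to wrap the whole malloc-copy-call-free sequence in a single helper with try/finally. A sketch (withHeapCopy is our own name, not an Emscripten API):

```javascript
// Copies `data` to the Emscripten heap, runs `fn` on it, and ALWAYS frees
// the buffer, even if `fn` throws. Returns a JS-owned copy of the result.
function withHeapCopy(Module, data, fn) {
  const nBytes = data.byteLength;
  const dataPtr = Module._malloc(nBytes);
  try {
    const dataHeap = new Uint8ClampedArray(Module.HEAP8.buffer, dataPtr, nBytes);
    dataHeap.set(data);            // JS -> C++ heap copy
    fn(dataPtr, nBytes, dataHeap); // call into C++, image modified in place
    return dataHeap.slice();       // copy the result out of the C++ heap
  } finally {
    Module._free(dataPtr);         // no leak, whatever happens above
  }
}
```

Our example above then becomes a one-liner: const result = withHeapCopy(Module, data, (ptr, n) => Module._process(ptr, n, w, h));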
Emscripten and OpenCV
Traditional CV algorithms in C++ typically use OpenCV. Can we build it with emscripten and use it in our custom C++ projects? Yes, but with a few caveats.
First, the emscripten build of OpenCV uses a custom build script, build_js.py. Unfortunately, it is made for an ancient emscripten version (2.0.10) and does not work with modern ones. You have two choices: either use version 2.0.10 and miss the features and optimizations of modern emscripten, or hack the build script to make it compatible with modern versions, which is not easy.
Second, this build script builds asm.js by default; you will have to specify the --build_wasm option to get a WASM build, which is important.
Third, we are not sure this build is optimal. In particular, it is probably single-threaded. You can dig into this if you want, but it is not going to be easy.
Once the build process is finished, you will have a build directory with a lot of useful stuff, like lib and include directories, and also .cmake files. Ignore bin/opencv.js; we are not going to use that. You use OpenCV in your C++ code just like you would on a desktop platform. In particular, cmake is able to find OpenCV with find_package(), provided that you specify the option -DOpenCV_DIR=<path>, where <path> is the full path to the OpenCV emscripten build directory (the one with .cmake files). You can pass RGBA images to C++ as explained above and convert them to a cv::Mat inside your C++ code.
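For concreteness, a minimal CMakeLists.txt for such a project could look like the sketch below (project and file names are made up):

```cmake
cmake_minimum_required(VERSION 3.13)
project(my_cv_wasm)

# Configure with the emscripten toolchain, pointing at the OpenCV build:
#   emcmake cmake -DOpenCV_DIR=<path-to-opencv-emscripten-build> .
find_package(OpenCV REQUIRED)

add_executable(my_cv_wasm main.cpp)
target_include_directories(my_cv_wasm PRIVATE ${OpenCV_INCLUDE_DIRS})
target_link_libraries(my_cv_wasm PRIVATE ${OpenCV_LIBS})
```

Running cmake through the emcmake wrapper makes it pick up the emscripten toolchain, so the resulting "executable" is a .js + .wasm pair.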
But what can you do with opencv.js? First, it cannot be used in any way from your custom C++ code, so it is pretty useless from where we stand. Second, the file opencv.js is the project of the same name (OpenCV.js), which exports a number of OpenCV functions and classes to be used directly from JS (probably via embind or something similar). As it happens, the emscripten C++ build of OpenCV (the lib directory), the thing that we want, is merely a byproduct of the OpenCV.js build process from the point of view of the OpenCV team. The official OpenCV documentation does not even mention C++ emscripten usage; it presents OpenCV.js only. Calling OpenCV functions from JS is not very interesting from our point of view; besides, OpenCV.js is inconvenient and poorly documented compared to the OpenCV C++ or Python API. It is much more interesting to build CV C++ algorithms with emscripten. Such C++ algorithms, if written well, are cross-platform and can be developed on desktop, then ported to mobile, front-end or embedded.
Is it possible to have both? Can we write custom C++ code which also exports some OpenCV classes like cv::Mat to JS? Probably yes, with some effort, but for beginners it is much simpler to call OpenCV from C++ only, and pass images from JS to C++ and back as explained in the previous chapter.
How slow is the OpenCV emscripten build, compared to desktop OpenCV on the same computer? It depends on the OpenCV function, but here is an example. We ran Lucas-Kanade sparse optical flow, cv::calcOpticalFlowPyrLK(), for 400 points with the same parameters, on the same laptop, both natively and in a web browser. Our results:
| Native C++ (desktop) | WASM, Chrome | WASM, Firefox |
|---|---|---|
| ~1 ms | ~24 ms | ~90 ms |
A 24-90x slowdown is not a small difference! That is what we meant earlier about "custom algorithms being slow"!
Disclaimer: this applies to the default OpenCV WASM build with emscripten 2.0.10, which is probably single-threaded. Better optimization is likely possible if you really dig into the problem, but it is far from trivial. As a result, as far as CV algorithms are concerned, the web browser on your modern computer is slow "like a Raspberry Pi 1", so only the most lightweight algorithms can be successfully deployed in a web browser.
Deep Learning in a Web Browser
Nowadays, CV is mostly about neural networks, at least if you get your information from blogs and youtube channels. Can you deploy neural nets in a web browser? And how efficient is it? Short answers are: “yes”, and “very inefficient”.
All serious neural nets use GPU (or sometimes TPU). Can a web browser use GPU? Yes, but only in the form of WebGL (web OpenGL) and not CUDA. You probably have never heard of neural networks using OpenGL on desktop, only CUDA, right? Do you wonder why? The answer is obvious: OpenGL is made for 3D rendering, not numerical calculations, and is very inefficient for neural networks compared to CUDA (on the same GPU). You’ll see some examples below. Likewise, CPU inference (in WASM) is slower than the machine-native CPU code.
Which DL frameworks are available for the web browser? We know two: TensorFlow.JS (Google) and ONNX Runtime Web (Microsoft). Both frameworks support WebGL (the default) and CPU inference.
TensorFlow.JS is "TensorFlow for the web", with a JS API similar (but not identical!) to Python TF+Keras. It has its own model format (BIN+JSON), different from TF and Keras models. It is a relatively heavyweight library with lots of utilities. Apparently, you can even train networks in a browser. Needless to say, when we first looked at TensorFlow.JS, we were somewhat surprised. We expected a minimalistic TFLite (like on mobile platforms) but instead found something heavyweight and completely original. A TFLite API also exists for the web, but if we are not mistaken, it requires the full TF.js anyway. Supposedly, TF and Keras models can be converted to the TF.js format, but it does not always work in practice; moreover, we had to edit the JSON file by hand to make anything work.
The good thing about TF.js is that it has a lot of auxiliary stuff. For example, you can create tensors from HTML <img> and <canvas> elements (automatically converting RGBA to RGB!). You also have a numpy-like tensor algebra, which you can use for operations like normalization, image resizing, or data type conversion. The problematic thing is that in TF.js (when using WebGL), you have to release all tensors by hand (tf.dispose()) or with the special wrapper tf.tidy(), otherwise you'll get a catastrophic GPU RAM leak!
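To illustrate the tf.tidy() pattern, here is a sketch of a predict call (predictPixels is our own name; we pass tf and model in explicitly just to keep the sketch self-contained):

```javascript
// All intermediate tensors created inside the tf.tidy() callback are
// disposed automatically when it returns; only the returned tensor survives.
function predictPixels(tf, model, imgElement) {
  return tf.tidy(() => {
    const input = tf.browser.fromPixels(imgElement) // RGBA element -> RGB tensor
      .toFloat()
      .div(255.0)     // normalize to [0, 1]
      .expandDims(0); // add the batch dimension
    return model.predict(input); // kept alive past the tidy() scope
  });
}
```

The returned prediction tensor itself is not covered by tf.tidy(), so once you have read its values you still have to release it yourself with dispose().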
The other framework, ONNX Runtime Web, is pretty much the opposite. It is small, compact, minimalistic, and supports only the ONNX format. It is good for deploying PyTorch networks (and nowadays, almost all modern neural nets are in PyTorch): most PyTorch networks can be converted to ONNX, while not every ONNX model can be further converted to TF. ONNX Runtime Web has no tensor algebra, so you will have to implement all auxiliary operations (normalization, type conversion, RGBA->RGB) yourself, in pixel-wise JS loops, or use some other library.
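For example, typical torchvision-style preprocessing (interleaved RGBA bytes from a canvas to a normalized planar float32 CHW buffer) boils down to a loop like this; the default mean/std values below are the usual ImageNet ones, which is an assumption of ours, so check what your model expects:

```javascript
// Converts interleaved RGBA (4 bytes per pixel, as produced by getImageData)
// into a planar CHW Float32Array with per-channel normalization.
function rgbaToCHWFloat32(rgba, w, h,
                          mean = [0.485, 0.456, 0.406],
                          std = [0.229, 0.224, 0.225]) {
  const plane = w * h;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {  // alpha (c == 3) is dropped
      out[c * plane + i] = (rgba[4 * i + c] / 255 - mean[c]) / std[c];
    }
  }
  return out;
}
```

The resulting buffer can then be wrapped as new ort.Tensor('float32', out, [1, 3, h, w]) and passed to session.run().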
The worst thing about ONNX Runtime Web is that it does not work. Or rather, the original version 1.8.0 does (as did the older ONNX.js), but all subsequent versions do not. The bugs are somewhere in the WebGL shaders, since WASM inference works correctly. For some networks the result is OK (e.g., torchvision ResNet 50), but for others (ResNet 18) it is completely crazy! What is the big difference between ResNet 50 and ResNet 18? Unfortunately, we did not have time to investigate deeper.
The most amazing thing is that several ONNX Runtime Web versions were released after 1.8.0, and they are all broken. Did nobody notice it?
For both frameworks, there is a common WebGL issue: compiling WebGL shaders takes a long time. Thus, the very first "warm-up" inference can take a few seconds. The following ones are fast, but only as long as the input tensor size does not change: if the network has a dynamic-sized input and the input size changes, the shaders are recompiled. This issue is unavoidable, but a clever web developer can mask the WebGL warm-up with web page loading or something like that.
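The masking itself can be as simple as the sketch below (warmUp and runInference are our own names; runInference stands for whatever wraps your framework's session.run() or model.predict()):

```javascript
// One throwaway inference on a zero-filled input of the same size as the
// real frames: the result is discarded, but the WebGL shaders get compiled
// now, during page load, instead of on the first user-visible frame.
async function warmUp(runInference, w, h) {
  const dummy = new Float32Array(3 * w * h); // all zeros, CHW layout assumed
  await runInference(dummy, w, h);           // ignore the output
}
// e.g.: window.addEventListener('load', () => warmUp(myRunInference, 224, 224));
```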
Finally, the speed. While we did not perform any formal test, here is, very roughly, what we got on a laptop with a GeForce GTX 1660 GPU, all on ResNet 50 from either torchvision or Keras (note: unlike CPU and CUDA timings, WebGL inference times fluctuate wildly, even after the warm-up):
| | CUDA | Native CPU | WebGL (browser) | WASM (browser) |
|---|---|---|---|---|
| PyTorch | 5.5 ms | 60-70 ms | | |
| ONNX Runtime | | 15 ms | 50-350 ms | 1000 ms |
*Timing per one inference.
From what we see, WebGL (GPU) inference in a browser is about 15 times slower than native CPU, and about 50 times slower than CUDA. Speaking of the native CPU, ONNX runtime is way faster than PyTorch or Keras; we did not previously know that. These numbers mean that only relatively lightweight neural networks can be successfully executed in a browser unless you want inference time of many seconds.
Besides, there is the question of neural network size (the total size of the parameters in, e.g., a PTH or ONNX file). Modern neural networks are typically hundreds of megabytes or even gigabytes in size. The largest size practical for the front-end is perhaps about 20 MB if you don't want your webpage to load forever. Such super-small models are not easy to find. Please don't expect to deploy a model from some 2022 state-of-the-art paper in a web browser!
Other Technologies in the Web Browser
We’ll mention very briefly a few other web technologies which can be relevant to CV.
WebGL is available for 3D graphics, and it is one of the "fast" technologies. Few people, however, would want to use WebGL directly. There are several convenient 3D graphics libraries built on WebGL, the most popular being Three.js. Even the Unity engine is available for the web (as an official Unity platform), based on WASM + WebGL.
WebXR is available for VR and AR (the previous specification, WebVR, has been deprecated and removed). But you cannot try it on your PC: WebXR requires an actual VR device, like the Oculus Quest 2. On smartphones, it can do VR by showing two images on the screen, which can be viewed in 3D if you have a VR headset for your phone. Finally, it can do AR on your phone (no additional headset required), but only if your phone has ARKit/ARCore, and not all phones do. Maybe in a year or two it will become widely available.
To Web or Not To Web?
Finally, we are ready to give the final answer to the question, "Should you put CV algorithms on the front end?". It depends. If your CV algorithm is really lightweight, you can run it in a web browser. Otherwise, do not expect to run your favorite neural net or heavy custom pipeline there. It is much more efficient (10-50 times) to run stuff on a native platform (Intel, ARM) than in the browser. Thus, for heavyweight CV algorithms, you should always consider writing mobile applications, or at least client-server web ones, to control the distribution of the computational resources.
Can things get better in the future? Will the “native” and “web” worlds somehow converge?
On the one hand, there are big challenges. Flashy demos of some new "fast" technologies can look very cool, but, as explained above, if we want original CV stuff, we need to write custom algorithms, which fall into the "slow" category. And it is likely that WASM will always be slower than native CPU code. Neural network inference in the web browser is currently very slow compared to the native platform, but this could be fixed by creating new web technologies (think, e.g., a native-CPU ONNX runtime built into all browsers). On the other hand, the browser platform is very important for the rather popular Metaverse and extended reality (XR) concepts, so there is strong motivation for improvement.
The world of the web tends to develop slowly (adoption of new web specifications takes years), but it is likely that the web browser will become a mature platform in the long term (10-20 years). We are cautiously optimistic about it.