By Oleksiy Grechnyev, CV/ML engineer @It-Jim

Are you interested in Computer Vision (CV)? Probably yes, if you are reading this. If you read CV tutorials, you might have noticed that most of them are in Python. This applies to both traditional CV (without neural networks) and, even more, to deep learning (neural networks). Occasionally, CV tutorials use C++ instead of Python, but other programming languages are very rare. The fact is, Python is known as THE language for research and development, math, CV, ML, DL, education, and quick prototyping.

But what if we want to deploy our CV algorithm somewhere, i.e., to use it in real life? Then we will very often find Python impossible or very difficult to use. C++ is better: it is available almost everywhere. The Android and iOS platforms have their own languages: Kotlin/Java and Swift/Objective-C, respectively, while web browsers have JavaScript (JS). But all three platforms (Android, iOS and Web) can integrate C++ code as well.

This article will introduce Computer Vision in a web browser for dummies. Note: we are talking specifically about web browsers (front end), not web servers (back end)! First, we want to run C++ code in a browser, in particular C++ code that uses the OpenCV library, the most popular tool in the computer vision community. Second, we want to run some neural networks in a web browser.

Disclaimer: Our background is in CV and ML, not web development. We might still be missing some technical details on the web side of things. And we deliberately skip many topics which belong to the “pure front end” and not CV, such as JS modules, bundlers, frameworks, the DOM, etc. However, minimal knowledge of JS and front-end development will definitely help you understand this article better.

CV in the Front End or the Back End?

Before we start, there is an important question. Should we run CV algorithms in the front end (browser) or the back end (server)? The back end can give you much higher computational power. But you’ll pay for it, and if a million users use your website, the cost adds up quickly. If you want to take the back-end approach, there are other things to consider.  

The first is latency (delay). It takes time for data to travel through the internet, often up to half a second or even more. There are physical reasons for the latency, like the finite speed of light (especially relevant for satellite connections), and other reasons, like the large number of switches and routers that relay your signal (never instantly). This is usually not a problem when you want to process a single image. Video is another matter. The latency doubles if the signal has to go both ways (from browser to server and back). Add to this the latency of the algorithms themselves (on the server), and you can get a pretty noticeable delay (e.g., on the order of a second), which will make your beautiful web application not so beautiful anymore.

The second problem is video streaming. Many people think it’s trivial, but it’s not. Web browsers are not made for streaming. Especially not for real-time streaming. 

But wait, what about YouTube or Netflix? It is not the same at all. They are not real-time (in fact, they use quite a lot of buffering) and go one way, from server to browser. Basically, the only “good” streaming option for the browser is WebRTC, but it’s made for browser-to-browser P2P connections and is extremely painful to implement on the server side. It also thinks in terms of “streams” and is not very friendly to CV algorithms that process video frame by frame. Other options like WebSockets are much more server-friendly and programmer-friendly. Still, with them you end up using very inefficient video encodings like MJPEG (since your code cannot access the built-in video codecs of the browser). And finally, two-way video streaming can simply be problematic on slow networks due to the network load.

CV in the front end has its downsides too. As we’ll see below, a web browser is simply not a powerful enough platform for heavyweight algorithms. This is especially true for neural networks. Typical modern neural networks are both too large (hundreds of megabytes) and too slow to be deployed in a browser.

To summarize this chapter:

  • For heavyweight algorithms, there is no choice: back end.
  • For processing a single image (e.g., applying effects to a photo), back end is OK.
  • For real-time video processing, it’s very hard to create a working solution with a back end; use the front end if possible.
  • Desktop or mobile apps utilize your hardware much more efficiently than the web browser, giving you 10-100 times more power (see examples in the following chapters). 

Web Browser as a Virtual Platform

While you probably have not thought about it this way, a web browser is a platform, much like Android, iOS or Raspberry Pi. But it runs on your PC or phone, so it is a virtual device on your host system, which can be compared to an Android emulator or perhaps some PlayStation or Nintendo emulator. A web browser will NEVER run third-party machine code of your host machine architecture (Intel or ARM), except in browser plugins, which seem to be dying out nowadays. Instead, it has a virtual engine, or actually, more like two virtual engines (for modern browsers): a JS engine and a WASM engine.

The JS engine runs JS source code directly, without a separate compilation step (actually, there is JIT compilation and many optimizations under the hood). The other engine is for WASM (WebAssembly), a kind of machine code for a virtual machine. C++ and other languages can be compiled into WASM and executed in a web browser. However, WASM is not the machine language of your host. Thus, the same C++ code runs many times slower in a web browser than when it is built for the host, and it probably always will. By the way, if you ever hear of something called “asm.js,” just ignore it; it has been deprecated since WASM was introduced.

As a virtual platform, the web browser has a lot of artificial limitations, motivated mainly by web security. Compared to other platforms (desktop, mobile, Raspberry Pi, etc.), it is probably the most painful platform to develop for.

  • You cannot access your host file system. Files can be accessed only via file chooser dialogs.
  • Many things, including the camera, require HTTPS. HTTPS is only possible if you have a certificate.
  • You cannot autoplay videos with sound.
  • Mobile browsers have no console, and a localhost server is typically not available; thus, developing and testing for mobile browsers is much harder than for desktop ones.
  • You cannot freely access different remote servers due to CORS restrictions.

Browsing Fast and Slow

One can say that a web browser turns your latest expensive PC or phone into a Raspberry Pi 1. However, it’s only half true. Some things are pretty fast in a browser. These are the things implemented as a part of the browser code itself, written in C++ and well-optimized. In contrast, any custom code of yours, whether in JS or WASM, will run pretty slowly.

Fast (or at least reasonably fast) operations, built into the browser:

  • Image operations using <img> and <canvas> tags, including JPEG and PNG codecs.
  • Video and audio operations using <video>, <audio> tags, including a number of codecs.
  • 3D graphics with WebGL.
  • Audio/Video streaming with WebRTC.
  • Presumably WebXR (but I have not tested it myself).

Slow: any custom algorithms, whether written by you or part of third-party JS or C++ libraries:

  • Any custom code in JS.
  • Any custom code in WASM (usually compiled from C++).
  • This includes OpenCV operations.
  • Neural Network inference.
  • Any JS library, installed by NPM or otherwise.
  • Any cross-platform C/C++ libraries compiled to WASM, like ffmpeg.js.

The fast operations are regularly advertised and showcased in beautiful demos. However, in real life, if you want to create something original, fast operations are often not enough. At some point, you will have to write your own custom algorithms in either JS or C++, and they will be very slow.
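
One way to see how slow your own code is on a particular browser is to time the same kernel twice: built natively and built with emscripten. Here is a minimal sketch, assuming a hypothetical 3x3 box blur boxBlur3x3() and a hypothetical timing helper timeBlurMs(); both names are illustrative and are not part of OpenCV or of any benchmark used for this article.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical kernel for illustration: 3x3 box blur on a grayscale image
// stored row by row. Border pixels are copied unchanged to keep it short.
std::vector<std::uint8_t> boxBlur3x3(const std::vector<std::uint8_t>& src,
                                     int width, int height) {
    std::vector<std::uint8_t> dst(src);
    for (int y = 1; y + 1 < height; ++y) {
        for (int x = 1; x + 1 < width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    sum += src[static_cast<std::size_t>(y + dy) * width + (x + dx)];
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint8_t>(sum / 9);
        }
    }
    return dst;
}

// Hypothetical helper: average wall-clock milliseconds per blur call.
// Assumes width, height and repeats are all at least 1.
double timeBlurMs(int width, int height, int repeats) {
    std::vector<std::uint8_t> img(static_cast<std::size_t>(width) * height, 128);
    unsigned checksum = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        checksum += boxBlur3x3(img, width, height)[0];
    }
    const auto t1 = std::chrono::steady_clock::now();
    volatile unsigned sink = checksum;  // keeps the optimizer from dropping the work
    (void)sink;
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / repeats;
}
```

Calling timeBlurMs(640, 480, 100) in a native build and in a WASM build of the same file gives two numbers to compare. The ratio will depend on the browser, the hardware, and the optimization level (e.g., -O3), so treat any single measurement with caution.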

The fast operations tend to be extremely restrictive, often surprisingly so. Take for example the <video> tag. It is probably something like the combination of VideoCapture and VideoWriter of OpenCV, right? Nope! In fact, the <video> tag is almost useless for CV applications for the following reasons:

  • It cannot break the video into individual frames (no “next frame” callback).
  • Seeking to a given position is supposed to work but does not seem to in Chrome.
  • It is strictly real-time and cannot wait while you process a frame (you cannot process a video file slowly).
  • It cannot encode a sequence of frames into a video stream.

Basically, the <video> tag is good at only one thing: showing a video in your browser. Not for CV. WebRTC, by the way, has similar limitations and can hardly be connected to any front-end CV. What to do then if you want to process a video? None of the options are perfect, but still:

  • You can try to use the <video> tag anyway, but you might lose frames and have an irregular FPS with possible non-deterministic outputs.
  • The ffmpeg.js library can be used, but it’s a part of the “slow world”: heavyweight and complicated.
  • The new WebCodecs API is an option; however, it’s still not widely supported (nor is it easy to use).

How to Compile C++ to WASM Using Emscripten

WASM is one of the official targets of the LLVM (aka clang) toolchain. Does it mean that you can compile LLVM-supported languages, such as C and C++, into WASM? In theory, yes. In practice, some additional code is needed for user-friendly porting of C++ code into a web browser. Such functionality is provided by emscripten, which is basically the LLVM WASM toolchain with a good port of the C++ standard library, plus some extras, including some web-only functions and macros usable from C/C++ via emscripten.h. The OpenGL C++ API is also available (implemented via WebGL).
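
As an illustration of the web-only macros from emscripten.h, here is a minimal sketch using EM_ASM, which embeds a JS snippet directly into C++ code. The helper names platformName() and logValue() are hypothetical; the #ifdef __EMSCRIPTEN__ guard is only there so that the same file also builds with a regular native compiler.

```cpp
#include <iostream>
#include <string>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#endif

// Hypothetical helper: tells us which kind of build we are running.
std::string platformName() {
#ifdef __EMSCRIPTEN__
    return "web (emscripten)";
#else
    return "native";
#endif
}

// Hypothetical helper: logs an integer. In the browser it goes through
// JS console.log(); in a native build it falls back to std::cout.
void logValue(int value) {
#ifdef __EMSCRIPTEN__
    // $0 refers to the first argument passed after the JS snippet.
    EM_ASM({ console.log('value from C++: ' + $0); }, value);
#else
    std::cout << "value from C++: " << value << std::endl;
#endif
}
```

Guarding the web-only parts like this is a design choice, not a requirement: it lets you debug the algorithmic part of the code natively and switch to em++ (described next) only for the browser build.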

To compile C++ code with emscripten, use em++ instead of g++, or, for a real project, use wrappers like emcmake, emconfigure, emmake. For example, to build a CMake project:

mkdir build
cd build
emcmake cmake ..
emmake make

In theory, you can port (almost) any C++ code to a web browser using emscripten. In practice, there are many technical details. Instead of an executable, emscripten generates a pair of JS+WASM files (e.g., hello.js, hello.wasm). From the point of view of cmake or make, this pair is an ‘executable’, not a ‘library’; the word ‘library’ applies to static C++ *.a libraries, which can also be built by emscripten. However, our ‘executable’ might not even have a main() function, and it can contain other functions callable from JS, so from the point of view of JS, the JS+WASM pair works like a library: a JS script with functions. Emscripten builds a JS script by default; if you want to build a JS module, use the “-s EXPORT_ES6=1 -s MODULARIZE=1” flags.

It’s not very convenient to access your browser window from WASM (with a few exceptions, like WebGL). So most often, WASM code contains some algorithms called from a GUI written in JS. We assume such an architecture in the following. Here is a minimal emscripten C++ example. Let’s call it mymul.cpp:

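#include <iostream>
#include <emscripten.h>

// extern "C" disables C++ name mangling for the exported function
extern "C" {
    // EMSCRIPTEN_KEEPALIVE keeps mymul() from being removed during linking
    EMSCRIPTEN_KEEPALIVE double mymul(double x, double y) {
        double z = x * y;
        // In a browser, std::cout prints to the browser console
        std::cout << "C++:" << x << "*" << y << "=" << z << std::endl;
        return z;
    }
}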

It’s pretty standard C++ code, but a couple of things need explaining. First, the EMSCRIPTEN_KEEPALIVE macro ensures that the function mymul() is not removed during linking, and thus it can be used from JS. Second, extern "C" ensures that mymul() is exported without C++ name mangling, so that it is visible from JS under the name _mymul. By the way, where does cout print to? It’s the browser console. Let’s compile this code:

em++ -o mymul.js mymul.cpp

We get two output files: mymul.js and mymul.wasm. Now you can import mymul.js via the <script> tag in your webpage and use the function Module._mymul(), or simply _mymul(), in your JS code, for example, _mymul(7.1, 3.0).

This is an example of calling a C++ function directly from JS. It is the simplest and most reliable way, but the argument types are very limited: a primitive or a C++ pointer (treated by JS as a number). There are other options. The emscripten functions cwrap()/ccall() give some support for JS arrays and strings, and embind (a cousin of nbind and pybind) is a powerful framework that wraps various C++ types, including classes, into JS. Such higher-level options can seem attractive, but if you do not understand their mechanics fully, you can easily get a C++ memory leak (which is very bad) or unnecessary copying of a large array (which is slightly bad).
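
To make the pointer option concrete, here is a minimal sketch of two functions that take a raw pointer to a grayscale buffer. The names invert_gray() and mean_gray() are hypothetical. On the JS side, one would typically allocate the buffer on the WASM heap with Module._malloc(), copy the pixels into Module.HEAPU8, pass the resulting number as the pointer, and release it with Module._free(); whether these are exposed depends on your emscripten version and export settings, so treat that part as an assumption to verify.

```cpp
#include <cstdint>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
#define EMSCRIPTEN_KEEPALIVE  // no-op so that the file also builds natively
#endif

extern "C" {

// Hypothetical function: inverts a grayscale image in place.
// From the JS point of view, `data` is just a number (an offset into the WASM heap).
EMSCRIPTEN_KEEPALIVE void invert_gray(std::uint8_t* data, int size) {
    for (int i = 0; i < size; ++i) {
        data[i] = static_cast<std::uint8_t>(255 - data[i]);
    }
}

// Hypothetical function: mean intensity of a grayscale buffer (0 for empty input).
EMSCRIPTEN_KEEPALIVE double mean_gray(const std::uint8_t* data, int size) {
    if (data == nullptr || size <= 0) {
        return 0.0;
    }
    double sum = 0.0;
    for (int i = 0; i < size; ++i) {
        sum += data[i];
    }
    return sum / size;
}

}  // extern "C"
```

Note that the buffer is owned by whoever allocated it: nothing here copies or frees the array, which avoids both of the pitfalls mentioned above, but also means the caller must free it exactly once.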

There are some more technical details you should know about emscripten:

  • C++ heap: The default C++ heap is very small. You will likely have to increase it or, alternatively, allow automatic growth.
  • C++ exceptions: Disabled by default for the sake of performance. You can allow C++ exceptions explicitly via either JS exceptions (slow) or WASM native exceptions (new, experimental).
  • Files and console: Remember that the browser cannot access your host filesystem. Standard streams like cout and cerr use the browser console.
  • C++ threads: Originally, WASM was strictly single-threaded. Nowadays, you can use multiple threads via Web Workers, but you have to enable this option explicitly.
  • Emscripten runtime loading: C++ functions are not available immediately on webpage load; you’ll have to wait until the Emscripten runtime initializes. JS modules can do this in a more controlled way.
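
As an example of dealing with the exception settings above, one common pattern is to catch all C++ exceptions at the boundary and return an error code, so that nothing propagates into JS. This is a minimal sketch, not the only way to do it; the function names averageOf() and safe_average() are hypothetical, and the build line in the comment is an assumption to check against your emscripten version. Note that try/catch only works as intended if exception support is enabled at build time.

```cpp
// Possible build line (an assumption, verify the flags for your emscripten version):
//   em++ -O2 -fexceptions -sALLOW_MEMORY_GROWTH=1 -o avg.js avg.cpp
// -fexceptions enables JS-based exceptions; -fwasm-exceptions is the native WASM variant.
#include <stdexcept>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
#define EMSCRIPTEN_KEEPALIVE  // no-op so that the file also builds natively
#endif

// Hypothetical internal routine that reports bad input by throwing.
static double averageOf(const double* values, int count) {
    if (values == nullptr || count <= 0) {
        throw std::invalid_argument("empty input");
    }
    double sum = 0.0;
    for (int i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum / count;
}

extern "C" {

// Hypothetical exported wrapper: returns 0 on success, 1 for invalid input,
// 2 for any other failure. The result is written to *out only on success.
EMSCRIPTEN_KEEPALIVE int safe_average(const double* values, int count, double* out) {
    if (out == nullptr) {
        return 2;
    }
    try {
        *out = averageOf(values, count);
        return 0;
    } catch (const std::invalid_argument&) {
        return 1;
    } catch (...) {
        return 2;
    }
}

}  // extern "C"
```

With this design, the JS side only ever sees plain numbers, which fits the direct-call approach used for mymul() earlier.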

In the follow-up article, we are going to dive deeper into practical aspects of running classical computer vision algorithms as well as convolutional neural networks in a web front-end.

Computer Vision in a Web Browser: Basics