
Writings on the Wall: Recognizing Speech on Spectrograms

Posted on June 29, 2022 by admin

If you’ve ever come close to anything related to audio or other signal processing, you likely already know about spectrograms. Those fancy-looking and usually colorful plots are commonly used to represent a spectrum’s change over time. But can they provide us with some higher-level information about, let’s say, human speech? What if I told you that one could effectively get a transcript of a speech recording just from its spectrogram? Well, if you think that this is rather an exaggeration, you’re absolutely right. Yet, recognizing certain phonemes and even making educated guesses about specific words based only on their spectrograms is perfectly possible. Thus, let’s dive deeper into this topic and learn a thing or two about human speech on our way.

Power-Source-Filter Model

A common way to represent human speech is the so-called Power-Source-Filter model. The Power refers to the lungs, where the airflow originates; the vocal cords are the Source of vibrations; and everything above them (the vocal tract) serves as the Filter for those vibrations.

We can ignore the Power component for our current goal and focus only on the Source-Filter part. Using more accurate terms than just “vibrations,” the Source produces harmonic waves with a fundamental frequency depending on the voice pitch. The Filter then either amplifies or suppresses specific harmonics. Peaks on the filter’s frequency response are called formants and are denoted as F1, F2, etc. (from lower to higher frequency).

The Filter is modeled as a linear (all-pole) filter, i.e. the current sample is approximated as a weighted sum of the n previous samples. Given a speech recording, one can estimate the coefficients of the Filter using the Linear Predictive Coding (LPC) technique and then use them to find the frequency response curve. We need this curve (specifically its formants) to help us recognize certain phonemes.
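As a rough illustration of the LPC step, the sketch below estimates the predictor coefficients of one speech frame and evaluates the gain of the resulting filter at a given frequency. It is a minimal JavaScript sketch, not the code used for the plots in this post: the function names (autocorrelation, lpc, lpcResponseDb) and the choice of the autocorrelation method with the Levinson-Durbin recursion are assumptions made for the example.

```javascript
// Autocorrelation r[0..maxLag] of one (already windowed) speech frame.
function autocorrelation(frame, maxLag) {
  const r = new Float64Array(maxLag + 1);
  for (let lag = 0; lag <= maxLag; lag++) {
    let sum = 0;
    for (let n = lag; n < frame.length; n++) sum += frame[n] * frame[n - lag];
    r[lag] = sum;
  }
  return r;
}

// Levinson-Durbin recursion. Returns a[0..order-1] such that
// x[n] is approximated by a[0]*x[n-1] + a[1]*x[n-2] + ...
function lpc(frame, order) {
  const r = autocorrelation(frame, order);
  let a = new Float64Array(order + 1); // index 0 is unused
  let err = r[0];                      // prediction error energy
  for (let i = 1; i <= order && err > 0; i++) {
    let acc = r[i];
    for (let j = 1; j < i; j++) acc -= a[j] * r[i - j];
    const k = acc / err;               // reflection coefficient
    const next = a.slice();
    next[i] = k;
    for (let j = 1; j < i; j++) next[j] = a[j] - k * a[i - j];
    a = next;
    err *= 1 - k * k;
  }
  return Array.from(a.subarray(1));
}

// Gain (in dB) of the all-pole filter 1 / (1 - sum a[k] z^-(k+1)) at freqHz.
function lpcResponseDb(a, freqHz, sampleRate) {
  let re = 1;
  let im = 0;
  for (let k = 0; k < a.length; k++) {
    const phase = (-2 * Math.PI * freqHz * (k + 1)) / sampleRate;
    re -= a[k] * Math.cos(phase);
    im -= a[k] * Math.sin(phase);
  }
  return -10 * Math.log10(re * re + im * im);
}
```

Evaluating lpcResponseDb() on a grid of frequencies for each short frame of a recording gives one column of a frequency response plot; the filter order is a tuning parameter that has to be chosen for the sample rate at hand.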


Vowels

Phoneticians distinguish a set of 8 “cardinal vowels”, each defined by a specific position of the tongue’s highest point during pronunciation:

If we plot the highest point positions for each cardinal vowel together, they’ll form a specific figure:

If we make the same plot for frequencies of the first two formants (F1 and F2), it will look remarkably similar:

The match isn’t perfect, of course (just as my pronunciation of the cardinal vowels, from which the formants were obtained), but it is still close enough. It leads to a couple of conclusions. First, even though the model with just the linear filter might look over-simplified, it bears direct correspondence with movements of the vocal tract. Second, the frequencies of the formants (usually two or three) are unique for each vowel and can be used to distinguish them.
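To read formant frequencies off a sampled frequency response curve, one simple option is to look for its local maxima and take the lowest ones as F1, F2, and so on. The sketch below does just that; the function name findFormantCandidates and the plain local-maximum rule are assumptions for illustration, and real recordings usually need some smoothing or thresholding on top.

```javascript
// responseDb[i] is the filter gain at frequency freqsHz[i] (ascending frequencies).
// Returns the frequencies of the first maxCount local maxima, i.e. candidates for F1, F2, ...
function findFormantCandidates(responseDb, freqsHz, maxCount = 3) {
  const peaks = [];
  for (let i = 1; i < responseDb.length - 1 && peaks.length < maxCount; i++) {
    // A sample counts as a peak if it rises above its left neighbor
    // and does not fall below its right neighbor.
    if (responseDb[i] > responseDb[i - 1] && responseDb[i] >= responseDb[i + 1]) {
      peaks.push(freqsHz[i]);
    }
  }
  return peaks;
}
```

A pair of such values (F1, F2) gives one point on a formant plot like the one discussed above.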

To observe this, we can create a plot of a speech recording that is similar to a spectrogram but with the Filter’s frequency responses used as its columns instead of spectra. Formants on this kind of plot are seen as bright horizontal lines. If we build it for a recording of several different vowels, it is evident that the formants are indeed uniquely positioned for each of them:

Let us remember this plot for future reference and move on to consonants.

Consonants

Unfortunately, there is no unique descriptor for each consonant, unlike formants for vowels. Instead, we can categorize consonants and use this classification to narrow down a list of possible options when trying to recognize a particular phoneme.

To analyze consonants, we need to pronounce them between two vowels, which makes them better defined on spectrograms. So, all examples were pronounced with two [a] sounds, like [apa], [ada], etc.

Arguably the most important category split is between voiced and voiceless consonants. While pronouncing voiced ones, the vocal cords still vibrate; thus, we can observe some harmonics. During voiceless ones, the vibration is absent, and the harmonics are entirely interrupted. As evident from the following plot, while all consonants do look like “gaps” between vowels, voiced ones ([b] and [d]) still leave some harmonics uninterrupted:

Fricatives can be recognized by a characteristic noise. Furthermore, the distribution of the noise along the spectrum can help to distinguish them from each other:

The frequency response can be helpful for consonants too. For instance, nasal consonants have a specific noise that is better observed on this kind of plot:

Trilled consonants ([r] in this case) can be easily spotted too by a very characteristic vertical pattern:

Some other features can help recognize consonants; however, they are more advanced and often harder to spot, so we’ll leave them out of scope for now.

Reading Words

Now that we’ve learned to recognize different phonemes, why not try something more remarkable, like reading an actual word from a spectrogram? Here is one, with its spectrogram and corresponding frequency response plots:

We can immediately identify three separate vowels. Just by looking at the reference of different vowels that we’ve prepared earlier, we can pick the ones that look the most similar:

The second noticeable thing is three fricatives that can be identified by their noise using another reference from earlier:

Now we have just three missing phonemes. The first one can be easily recognized on the frequency response plot as a trilled consonant, with [r] being the only possible option in English. The second one is somewhat hard to identify, so we’ll skip it. Finally, the last missing one can also be identified on the frequency response plot as a nasal consonant (either [n] or [m]). So, here are our final predictions:

We still have one unknown consonant and ambiguity regarding another one, yet what we’ve discovered is enough to “brute force” the word, which is obviously “frequency”.

Conclusions

So, we’ve learned to recognize some phonemes on spectrograms. That is something you could brag about to a very limited number of people who would actually consider it cool, but are there any practical applications to all this knowledge?

First, if you’re building any kind of speech processing pipeline with spectrograms as its inputs, you now know about features to look for and can tune spectrogram parameters to highlight them better. Or you can even use frequency responses for additional features. Also, if you have a speech-generating model (especially a black box one, like a neural network) and its output sounds wrong, you could compare its spectrogram to that of actual speech and try to find the source of your troubles. And finally, what we’ve discussed in this post is present in many classic speech processing methods. Linear Predictive Coding, for example, is used for voice compression (as in earlier versions of GSM), speech synthesis, speech encryption, audio codecs, etc. And it is always good to know the basics, even when working with much more advanced stuff.

Computer Vision in a Web Browser: Practical Examples

Posted on June 21, 2022 by admin

This blog post covers some important aspects of deploying and running classical computer vision algorithms as well as convolutional neural networks in a web front end. Please make sure you have read the first part of the blog post; it will help you follow the technical aspects much more easily.

Emscripten for Computer Vision

How can you pass an image or a video frame from JS to C++ and back? We’ll give a minimal example. Suppose you have an image in an <img> tag. First, you have to copy it to RGBA pixels (only the RGBA format is supported, not RGB!) via a <canvas> tag:

const img = document.getElementById('myImg');
const canvas = document.getElementById('myCanvas');
const ctx = canvas.getContext('2d');
const w = img.width; const h = img.height;
canvas.width = w; canvas.height = h;
ctx.drawImage(img, 0, 0);
const data = ctx.getImageData(0, 0, w, h).data; // Uint8ClampedArray
const nBytes = data.byteLength; // == 4*w*h

Next, you have to send the Uint8ClampedArray object data to C++. However, C++ cannot access JS objects directly (at least, not efficiently). They are not part of the C++ memory, which itself is only a part of the JS memory. Some copying is unavoidable. Let’s copy data to the C++ heap:

const dataPtr = Module._malloc(nBytes); // C++ malloc
const dataHeap = new Uint8ClampedArray(Module.HEAP8.buffer, dataPtr, nBytes);  
dataHeap.set(data);  // Copy data -> dataHeap

Here dataHeap is a view object for the C++ data.

Now we can finally call the C++ code to do something to the image. We pass dataPtr, a pointer to data on the C++ heap. No result is returned here, but the image can be modified in-place:

Module._process(dataPtr, nBytes, w, h);

Finally, let’s show the result on the canvas and free the C++ buffer:

ctx.putImageData(new ImageData(dataHeap, w, h), 0, 0);
Module._free(dataPtr);

It’s very important to remember that C++ has no garbage collection, so if you use malloc(), you must free() afterwards; otherwise there is a memory leak! Memory leaks are extremely evil. You might not notice them in a minimal demo, but they will kill a real project.
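One way to make the malloc()/free() pairing harder to forget is to wrap it in a small helper that frees the buffer in a finally block, so the memory is released even if the processing code throws. This is only a sketch: the helper name withHeapBuffer is hypothetical, and it relies on Module._malloc, Module._free and Module.HEAP8 being available exactly as in the snippets above.

```javascript
// Run `fn` with a temporary buffer on the C++ heap and always free it afterwards.
// `Module` is the emscripten module object; `fn` receives the pointer and a byte view.
function withHeapBuffer(Module, nBytes, fn) {
  const ptr = Module._malloc(nBytes);
  if (ptr === 0) throw new Error('malloc failed');
  try {
    const view = new Uint8ClampedArray(Module.HEAP8.buffer, ptr, nBytes);
    return fn(ptr, view);
  } finally {
    Module._free(ptr); // runs even if fn throws
  }
}

// Possible usage with the names from the example above:
// withHeapBuffer(Module, nBytes, (ptr, view) => {
//   view.set(data);
//   Module._process(ptr, nBytes, w, h);
//   ctx.putImageData(new ImageData(view, w, h), 0, 0);
// });
```

The view must not be used after the helper returns, since the memory behind it has already been freed.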

Emscripten and OpenCV

Traditional CV algorithms in C++ typically use OpenCV. Can we build it with emscripten and use it in our custom C++ projects? Yes, but with a few caveats.

First, the emscripten build of OpenCV uses a custom build script build_js.py. Unfortunately, it’s made for an ancient emscripten version (2.0.10) and doesn’t work with modern ones. You have two choices. You either use version 2.0.10 and miss the features and optimizations of modern emscripten versions; or hack the build script to make it modern-version compatible, which is not easy.

Second, this build script builds Asm.JS by default; you will have to specify the --build_wasm option for a WASM build. This is important.

Third, we are not sure this build is optimal. In particular, it is probably single-threaded. You can dig into this stuff if you want, but it is not going to be easy.

Once the build process is finished, you will have a build directory with a lot of useful stuff, like lib and include directories, and also .cmake files. Ignore bin/opencv.js; we are not going to use that. You use OpenCV in your C++ code just like you would on a desktop platform. In particular, cmake is able to find OpenCV with find_package(), provided that you specify the option -DOpenCV_DIR=<path>, where <path> is the full path to the OpenCV emscripten build directory (the one with .cmake files). You can pass RGBA images to C++ as explained above and convert them to a cv::Mat inside your C++ code.

But what can you do with opencv.js? First, it cannot be used in any way from your custom C++ code, so it is pretty useless from where we stand. Second, the file opencv.js belongs to the project of the same name (OpenCV.js), which exports a number of OpenCV functions and classes to be used directly from JS (probably via embind or something similar). As it happens, the emscripten C++ build of OpenCV (the lib directory), the thing that we want, is merely a byproduct of the OpenCV.js build process from the point of view of the OpenCV team. The official OpenCV documentation does not even mention C++ emscripten usage; it presents OpenCV.js only. Calling OpenCV functions from JS is not very interesting from our point of view, plus OpenCV.js is inconvenient and poorly documented compared to the OpenCV C++ or Python API. It’s much more interesting to build C++ CV algorithms with emscripten. Such C++ algorithms, if written well, are cross-platform, and can be developed on desktop and later ported to mobile, front end or embedded.

Is it possible to have both? Can we create custom C++ code that also exports some OpenCV stuff like cv::Mat to JS? Probably yes, with some effort, but for beginners it is much simpler to call OpenCV stuff from C++ only, and pass images from JS to C++ and back as explained in the previous chapter.

How slow is OpenCV emscripten, compared to desktop OpenCV on the same computer? It depends on the OpenCV function, but here is an example. We run Lucas-Kanade sparse optical flow cv::calcOpticalFlowPyrLK() for 400 points, and the same parameters, on the same laptop both on desktop and web browser. Our results:

Native C++ (desktop)	WASM, Chrome	WASM, Firefox
~ 1 ms	~ 24 ms	~ 90 ms

24-90 times, not a small difference! That is what we meant before about “custom algorithms being slow”!

Disclaimer: This applies to the default OpenCV WASM build with emscripten 2.0.10. It is probably single-threaded. Better optimization is likely possible if you really dig into the problem, but it’s far from trivial. As a result, the web browser on your modern computer is slow “like a Raspberry Pi 1” as far as CV algorithms are concerned, thus only the most lightweight ones can be successfully deployed in a web browser.

Deep Learning in a Web Browser

Nowadays, CV is mostly about neural networks, at least if you get your information from blogs and youtube channels. Can you deploy neural nets in a web browser? And how efficient is it? Short answers are: “yes”, and “very inefficient”.

All serious neural nets use GPU (or sometimes TPU). Can a web browser use GPU? Yes, but only in the form of WebGL (web OpenGL) and not CUDA. You probably have never heard of neural networks using OpenGL on desktop, only CUDA, right? Do you wonder why? The answer is obvious: OpenGL is made for 3D rendering, not numerical calculations, and is very inefficient for neural networks compared to CUDA (on the same GPU). You’ll see some examples below. Likewise, CPU inference (in WASM) is slower than the machine-native CPU code.

Which DL frameworks are available for the web browser? We know two: TensorFlow.JS (Google) and ONNX Runtime Web (Microsoft). Both frameworks support webgl (default) and CPU inference.

TensorFlow.JS is “TensorFlow for the web”, with a JS API similar (but not identical!) to Python TF+Keras. It has its own model format (BIN+JSON), different from TF and Keras models. It is a relatively heavyweight library with lots of utilities. Apparently, you can even train networks in a browser. Needless to say, when we first looked at TensorFlow.JS, we were somewhat surprised. We expected a minimalistic TFLite (like on mobile platforms) but instead found something heavyweight and completely original. A TFLite API also exists for the web, but if we are not mistaken, it requires full TF.js anyway. Supposedly TF and Keras models can be converted to TF.js format, but it does not always work in practice; plus we had to edit the JSON file by hand to make anything work.

The good thing about TF.js is that it has a lot of auxiliary stuff. For example, you can create tensors from HTML <img> and <canvas> elements (automatically converting RGBA to RGB!). You also have a numpy-like tensor algebra which you can use for operations like normalization, image resize, or data type conversion. The problematic thing is that in TF.js (when using WebGL), you have to release all tensors by hand (tf.dispose()) or with the special wrapper tf.tidy(); otherwise you’ll get a catastrophic GPU RAM leak!
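A minimal sketch of how tf.tidy() can keep intermediate tensors from leaking during preprocessing is shown below. The function name imageToInputTensor and the scaling to the [0, 1] range are assumptions for the example, and the tf object is passed in as a parameter so that the sketch does not depend on how the library is loaded.

```javascript
// Convert an <img> or <canvas> element into a float32 batch tensor.
// Every intermediate tensor created inside tf.tidy() is released automatically;
// only the returned tensor survives.
function imageToInputTensor(tf, source) {
  return tf.tidy(() => {
    const pixels = tf.browser.fromPixels(source); // RGB, shape [h, w, 3]
    const scaled = pixels.toFloat().div(255);     // float32 values in [0, 1]
    return scaled.expandDims(0);                  // shape [1, h, w, 3]
  });
}
```

The caller is still responsible for calling dispose() on the returned tensor (and on the model outputs) once the inference is done.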

The other framework ONNX Runtime Web is pretty much the opposite. It is small, compact, minimalistic, and only supports ONNX format. It is good for deploying PyTorch networks (and nowadays, almost all modern neural nets are in PyTorch), as most PyTorch networks can be converted to ONNX, but not every ONNX can be further converted to TF. ONNX Runtime Web does not have tensor algebra, so you will have to implement all auxiliary operations (normalization, type conversion, RGBA->RGB) yourself (in pixel-wise JS loops) or use some other libraries.
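The kind of pixel-wise loop meant here can be sketched as follows: it drops the alpha channel, scales the bytes to [0, 1], applies an optional per-channel normalization, and writes the result in the planar NCHW layout that PyTorch-exported models commonly expect. The function name rgbaToPlanarRgb and the mean/std parameters are hypothetical; the right normalization depends on how the network was trained.

```javascript
// Convert canvas RGBA bytes (interleaved, 4 bytes per pixel) into a planar
// Float32Array: all R values first, then all G values, then all B values.
function rgbaToPlanarRgb(rgba, width, height, mean = [0, 0, 0], std = [1, 1, 1]) {
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {           // channel 3 (alpha) is skipped
      const value = rgba[4 * i + c] / 255;  // scale to [0, 1]
      out[c * plane + i] = (value - mean[c]) / std[c];
    }
  }
  return out;
}
```

With ONNX Runtime Web, the resulting array could then be wrapped as new ort.Tensor('float32', out, [1, 3, height, width]) before being passed to the session.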

The worst thing about ONNX Runtime Web is that it does not work. Or, rather, the original version 1.8.0 does (and the older ONNX.js), but all subsequent versions do not. The bugs are somewhere in WebGL shaders, since WASM inference works correctly. For some networks, the result is OK (e.g., torchvision ResNet 50), but for others (ResNet 18), it is completely crazy! What is the big difference between ResNet 50 and 18? Unfortunately, we didn’t have time to investigate deeper.

The most amazing thing is that several ONNX Runtime Web versions were released after 1.8.0, and they are all broken. Did nobody notice it?

For both frameworks, there is a common WebGL issue. It takes a long time to compile WebGL shaders. Thus, the very first “warm up” inference can take a few seconds. The following ones are fast, but only if the input tensor size does not change. If the network has a dynamic-sized input and the input size changes, the shaders will be recompiled. This issue is unavoidable, but a clever web developer can mask the webgl warmup with web page loading or something like that.

Finally, the speed. While we did not perform any formal test, here is what we got very roughly on the laptop with GeForce 1660 GPU (Note: unlike CPU and CUDA, WebGL inference times fluctuate wildly, even after the warmup), all on ResNet 50 from either torchvision or Keras.

Framework	PC, CUDA	PC, CPU	Browser (Firefox), WebGL	Browser (Firefox), WASM
Keras	–	90 ms	–	–
PyTorch	5.5 ms	60-70 ms	–	–
ONNX Runtime	–	15 ms	50-350 ms	1000 ms
TFJS	–	–	260 ms	–

*Timing per one inference.

From what we see, WebGL (GPU) inference in a browser is about 15 times slower than native CPU, and about 50 times slower than CUDA. Speaking of the native CPU, ONNX runtime is way faster than PyTorch or Keras; we did not previously know that. These numbers mean that only relatively lightweight neural networks can be successfully executed in a browser unless you want inference time of many seconds.

Besides, there is a question of neural network size (the total size of their parameters in e.g. a PTH or ONNX file). Modern neural networks are typically hundreds of megabytes or even gigabytes in size. The largest size practical for the front end is perhaps about 20 MB if you don’t want your webpage to load forever. Such super-small models are not easy to find. Please don’t expect to deploy a model from some 2022 state-of-the-art paper in a web browser!
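To get a feeling for why model size matters, here is a back-of-the-envelope estimate of the download time. The helper name downloadSeconds is hypothetical, and the link speed is whatever you assume for your users; protocol overhead and caching are ignored.

```javascript
// Seconds needed to download a model of sizeMB megabytes
// over a connection of linkMbps megabits per second.
function downloadSeconds(sizeMB, linkMbps) {
  return (sizeMB * 8) / linkMbps;
}

// Assuming a 10 Mbit/s connection:
console.log(downloadSeconds(20, 10));  // a model of about 20 megabytes
console.log(downloadSeconds(200, 10)); // a model of a few hundred megabytes
```

Under that assumption the small model takes seconds to fetch, while the large one takes minutes.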

Other Technologies in the Web Browser

We’ll mention very briefly a few other web technologies which can be relevant to CV.

WebGL is available for 3D graphics, and it’s one of the “fast” technologies. Few people, however, would want to use WebGL directly. There are several convenient 3D graphics libraries that use WebGL, the most popular is Three.JS. Even Unity engine is available for the web (as an official Unity platform), based on WASM + WebGL.

WebXR is available for VR and AR (the previous specification WebVR has been deprecated and removed). But you cannot try it on your PC. WebXR requires an actual VR device, like Oculus Quest 2. On smartphones, it can do VR by showing two images on the screen, which can be viewed in 3D if you have a VR headset for your phone. Finally, it can do AR on your phone (no additional headsets required), but only if you have ARKit/ARCore, and still not all phones in existence have those. Maybe in a year or two, it will become widely available.

To Web or Not To Web?

Finally, we are ready to answer the question, “Should you put the CV algorithms on the front end?”. The answer is: it depends. If your CV algorithm is really lightweight, you can run it in a web browser. Otherwise, be prepared to tinker a lot with your favorite neural net or heavy custom pipeline: it is much more efficient (10-50 times) to run stuff on a native platform (Intel, ARM) than in the browser. Thus, for heavyweight CV algorithms you should always consider writing mobile applications, or at least client-server web applications, to control how the computational load is distributed.

Can things get better in the future? Will the “native” and “web” worlds somehow converge?

On one hand, there are big challenges. Flashy demos for some new “fast” technologies can look very cool, but, as explained above, if we want original CV stuff, we need to write custom algorithms, which are in the “slow” category. And it is likely that WASM will always be slower than the native CPU. Neural network inference in the web browser is currently very slow compared to the native platform, but this can be easily fixed by creating new web technologies (think e.g. native-CPU ONNX runtime built into all browsers). On the other hand, browser platform is very important for rather popular Metaverse and the extended reality (XR) concepts, thus there is a strong motivation for improvement.

The world of the web tends to develop slowly (adoption of new web specifications takes years), but it is likely that the web browser will become a mature platform in the long term (10-20 years). We are cautiously optimistic about it.

Computer Vision in a Web Browser: Basics

Posted on June 15, 2022 by admin

Are you interested in Computer Vision (CV)? Probably yes, if you are reading this. If you read CV tutorials, you might have noticed that most of them are in Python. This applies to both traditional CV (without neural networks) and, even more, to deep learning (neural networks). Occasionally, CV tutorials use C++ instead of Python, but other programming languages are very rare. The fact is, Python is known as THE language for research and development, math, CV, ML, DL, education, and quick prototyping.

But what if we want to deploy our CV algorithm somewhere, i.e., to use it in real life? Then very often we will find python impossible or very difficult to use. C++ is better: it is available almost everywhere. Android and iOS platforms have their own languages: Kotlin/Java and Swift/Objective C, respectively, while Web Browsers have JavaScript (JS). But all three platforms (Android, iOS and Web) can integrate C++ code as well.

This article is an introduction to Computer Vision in a web browser for dummies. Note: we are talking specifically of web browsers (front end), not web servers (back end)! First, we want to run C++ code in a browser, in particular C++ code that uses the OpenCV library, the most popular tool in the computer vision community. Second, we want to run some neural networks in a web browser.

Disclaimer: Our background is in CV and ML, not web development. We might still be missing some technical details on the web side of things. And we deliberately skip many topics which belong to the “pure front end” and not CV, such as JS modules, bundlers, frameworks, DOM model, etc. However, minimal knowledge of JS and front-end development will definitely help you understand this article better.

CV in Front-End or Back End?

Before we start, there is an important question. Should we run CV algorithms in the front end (browser) or the back end (server)? The back end can give you much higher computational power. But you’ll pay for it, and if a million users use your website, the cost adds up quickly. If you want to take the back-end approach, there are other things to consider.

The first is latency (delay). It takes time for the data to travel through the internet, often up to half a second or even more. There are physical reasons for the latency, like the finite speed of light (especially relevant for satellite connections), and other reasons, like the large number of switches and routers that relay your signal (never instantly). This is usually not a problem when you want to process a single image. Video is another matter. The latency doubles if the signal has to go two ways (from browser to server and back). Add to this the latency of the algorithms themselves (on the server), and you can get a pretty noticeable delay (e.g., of the order of a second), which will make your beautiful web application not so beautiful anymore.
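The arithmetic behind this can be sketched in a few lines. The helper name roundTripDelay and the numbers in the usage line are hypothetical, chosen only to show how a network delay plus server-side processing translates into a number of video frames.

```javascript
// Total delay of a browser -> server -> browser round trip and the number
// of video frames that pass before the processed result comes back.
function roundTripDelay(oneWayMs, serverProcessingMs, fps) {
  const totalMs = 2 * oneWayMs + serverProcessingMs;
  return { totalMs, framesBehind: Math.ceil((totalMs * fps) / 1000) };
}

// Hypothetical example: 250 ms each way, 100 ms of processing, 30 FPS video.
console.log(roundTripDelay(250, 100, 30));
```

With these assumed values the result arrives more than half a second late, which at 30 FPS is well over a dozen frames behind the live picture.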

The second problem is video streaming. Many people think it’s trivial, but it’s not. Web browsers are not made for streaming. Especially not for real-time streaming.

But wait, what about YouTube or Netflix? It is not the same at all. They are not real-time (in fact, with quite a lot of buffering) and go one way from server to browser. Basically, the only “good” streaming option for the browser is WebRTC, but it’s made for browser-to-browser P2P connections and is extremely painful to implement on the server side. It also thinks in terms of “streams” and is not very friendly to any CV algorithms processing video frame-wise. Other options like WebSockets are much more server-friendly and programmer-friendly. Still, they use very inefficient video codecs like MJPEG (since they cannot access the built-in video codecs of the browser). And finally, two-directional video streaming can be simply problematic on slow networks due to the network load.

CV in the front end has its downsides. As we’ll see below, a web browser is simply not a powerful enough platform for heavyweight algorithms. It is especially true for neural networks. Typical modern neural networks are both too large (hundreds of megabytes) and too slow to be deployed in a browser.

To summarize this chapter:

For heavyweight algorithms, there is no choice: back end.
For processing a single image (e.g., applying effects to a photo), back end is OK.
For real-time video processing, it’s very hard to create a working solution with a back end; use the front end if possible.
Desktop or mobile apps utilize your hardware much more efficiently than the web browser, giving you 10-100 times more power (see examples in the following chapters).

Web Browser as a Virtual Platform

While you probably did not think about it, a web browser is a platform, much like Android, iOS or Raspberry Pi. But it runs on your PC or phone, so it is a virtual device on your host system, which can be compared to an Android emulator or perhaps some PlayStation or Nintendo emulator. A web browser will NEVER run third-party machine code of your host machine architecture (Intel or ARM), except in browser plugins, which seem to be dying out nowadays. Instead, it has a virtual engine, or actually, more like two virtual engines (for modern browsers): a JS engine and a WASM engine.

The JS engine runs JS code directly, without a separate compile step (actually, there is JIT compilation and many optimizations under the hood). The other engine is for WASM (WebAssembly), a kind of machine code for a virtual machine. C++ and other languages can be compiled into WASM and executed in a web browser. However, WASM is not the machine language of your host. Thus, the same C++ code runs many times slower in a web browser than when built for the host, and it most likely always will. By the way, if you ever hear of something called “Asm.JS,” just ignore it; it has been deprecated since WASM was introduced.

As a virtual platform, the Web Browser has a lot of artificial limitations, motivated mainly by web security. Compared to other platforms (desktop, mobile, Raspberry Pi, etc.) it is probably the most painful platform to develop for.

You cannot access your host file system. Files can be accessed only via file chooser dialogs.
Many things, including the camera, require HTTPS. HTTPS is only possible if you have a certificate.
Cannot autoplay videos with sound.
Mobile browsers have no console, and a localhost server is typically not available; thus, developing and testing for mobile browsers is much harder compared to desktop ones.
Cannot access different remote servers freely due to CORS issues.

Browsing Fast and Slow

One can say that a web browser turns your latest expensive PC or phone into a Raspberry Pi 1. However, it’s only half true. Some things are pretty fast in a browser. These are the things implemented as a part of the browser code itself, written in C++ and well-optimized. In contrast, any custom code of yours, whether in JS or WASM, will run pretty slowly.

Fast (or at least reasonably fast) operations, built-in into the browser:

Image operations using <img> and <canvas> tags, including JPEG and PNG codecs.
Video and audio operations using <video>, <audio> tags, including a number of codecs.
3D graphics with WebGL.
Audio/Video streaming with WebRTC.
Presumably (but I didn’t test myself) WebXR.

Slow: Any custom algorithms written by you, or part of third-party JS or C++ libraries:

Any custom code in JS.
Any custom code in WASM (usually compiled from C++).
It includes OpenCV operations.
Neural Network inference.
Any JS library, installed by NPM or otherwise.
Any cross-platform C/C++ libraries compiled to WASM, like ffmpeg.js.

The fast operations are regularly advertised and showcased in beautiful demos. However, in real life, if you want to create something original, fast operations are often not enough. At some point, you will have to write your own custom algorithms in either JS or C++, and they will be very slow.

The fast operations tend to be extremely restrictive, often surprisingly so. Take for example the <video> tag. It is probably something like the combination of VideoCapture and VideoWriter of OpenCV, right? Nope! In fact, the <video> tag is almost useless for CV applications for the following reasons:

It cannot break the video into individual frames (no “next frame” callback).
Seeking a given position is supposed to exist but does not seem to work on Chrome.
It is strictly real-time and cannot wait while you process a frame (you cannot process a video file slowly).
It cannot encode a sequence of frames into a video stream.

Basically, the <video> tag is good at only one thing: showing a video in your browser. Not for CV. WebRTC, by the way, has similar limitations and can hardly be connected to any front-end CV. What to do then if you want to process a video? None of the options are perfect, but still:

You can try to use the <video> tag anyway, but you might lose frames and have an irregular FPS with possible non-deterministic outputs.
A library ffmpeg.js can be used, but it’s a part of the “slow world”, heavyweight and complicated.
The new WebCodecs API is an option; however, it is still not widely supported (nor is it easy to use).

How to Compile C++ to WASM Using Emscripten

WebAssembly is one of the official LLVM targets (LLVM is the compiler infrastructure behind clang). Does it mean that you can compile LLVM-supported languages, such as C and C++, into WASM? In theory, yes. In practice, some additional code is needed for a user-friendly porting of C++ code into a web browser. Such functionality is provided by emscripten, which is basically the LLVM wasm toolchain with a good port of the C++ standard library, plus some extras, including some web-only functions and macros usable from C/C++ via emscripten.h. The OpenGL C++ API is also available (implemented via WebGL).

To compile C++ code with emscripten, use em++ instead of g++, or, for a real project, use wrappers like emcmake, emconfigure, emmake. For example, to build a cmake project:

mkdir build
cd build
emcmake cmake ..
emmake make

In theory, you can port (almost) any C++ code to a web browser using emscripten. In practice, there are many technical details. Instead of an executable, emscripten generates a pair of JS+WASM files (e.g., hello.js, hello.wasm). From the point of view of cmake or make, they are an ‘executable’, not a ‘library’; the latter word applies to static C++ *.a libraries, which can also be built by emscripten. However, our ‘executable’ might not even have a main() function, and it can contain other functions callable from JS, so from the point of view of JS, the JS+WASM pair works like a library, as a JS script with functions. Emscripten builds a JS script by default; if you want to build a JS module, use the “-s EXPORT_ES6=1 -s MODULARIZE=1” flags. It’s not very convenient to access your browser window from WASM (with a few exceptions, like WebGL). So most often, WASM code contains some algorithms called from a GUI written in JS. We assume such an architecture in the following. Here is a minimal emscripten C++ example. Let’s call it mymul.cpp:

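#include <iostream>
#include <emscripten.h>

// extern "C" disables C++ name mangling, so JS sees the function as _mymul
extern "C" {
    // EMSCRIPTEN_KEEPALIVE keeps the function from being removed at link time
    EMSCRIPTEN_KEEPALIVE double mymul(double x, double y) {
        double z = x * y;
        std::cout << "C++:" << x << "*" << y << "=" << z << std::endl;  // goes to the browser console
        return z;
    }
}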

It’s pretty standard C++ code, but a couple of things need explaining. First, the EMSCRIPTEN_KEEPALIVE macro ensures that the function mymul() is not removed during linking, so it can be used from JS. Second, extern “C” ensures that the function is exported without C++ name mangling, so JS sees it as _mymul. By the way, where does cout print to? It’s the browser console. Let’s compile this code:

em++ -o mymul.js mymul.cpp

We get two output files: mymul.js and mymul.wasm. Now you can import mymul.js via the <script> tag in your webpage and use the function Module._mymul(), or simply _mymul(), in your JS code, for example, _mymul(7.1, 3.0).

This is an example of calling a C++ function directly from JS. It is the simplest and most reliable way, but the argument types are very limited: a primitive or a C++ pointer (treated by JS as a number). There are other options. The emscripten functions cwrap() and ccall() give some support for JS arrays and strings, and embind (a cousin of nbind and pybind) is a powerful framework that wraps various C++ types, including classes, into JS. Such higher-level options can seem attractive, but if you do not fully understand their mechanics, you can easily get a C++ memory leak (which is very bad) or unnecessary copying of a large array (which is slightly bad).
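To make the “pointer as a number” point concrete, here is a minimal sketch of a function that takes an array through a raw pointer. The name mysum and the __EMSCRIPTEN__ guard are our illustrative choices, not part of the original example; the guard only lets the same file build with a regular compiler for native testing.

```cpp
#include <cstddef>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
// Assumption: outside emscripten we define the macro away for native builds.
#define EMSCRIPTEN_KEEPALIVE
#endif

extern "C" {
    // Hypothetical helper: sums n doubles starting at data.
    // From the JS side, data is just a number (an address in the WASM heap).
    EMSCRIPTEN_KEEPALIVE double mysum(const double* data, int n) {
        double total = 0.0;
        for (int i = 0; i < n; ++i) {
            total += data[i];
        }
        return total;
    }
}
```

Note that the function neither allocates nor copies anything: the caller owns the buffer. On the JS side, this means the data has to be placed into the WASM heap before the call and released afterwards, which is exactly where leaks and extra copies tend to come from.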

There are some more technical details you should know about emscripten:

C++ heap: The default C++ heap is very small. You will likely have to increase it, or alternatively allow automatic growth.
C++ exceptions: Disabled by default for the sake of performance. You can enable C++ exceptions explicitly via either JS exceptions (slow) or WASM native exceptions (new, experimental).
Files and console: Remember that the browser cannot access your host filesystem. Standard streams like cout and cerr use the browser console.
C++ threads: Originally, WASM was strictly single-threaded. Nowadays, you can use multiple threads via WebWorkers, but you have to enable such options explicitly.
Emscripten runtime loading: C++ functions are not available immediately on webpage load; you’ll have to wait until the Emscripten runtime initializes. JS modules can do this in a more controlled way.
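Since C++ exceptions are disabled by default, one possible way to live with that default is to report errors from exported functions through return codes instead of throwing. The sketch below illustrates this idea only; the function mysqrt and its status codes are hypothetical, and the __EMSCRIPTEN__ guard is there just so the file also builds natively.

```cpp
#include <cmath>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
// Assumption: outside emscripten we define the macro away for native builds.
#define EMSCRIPTEN_KEEPALIVE
#endif

extern "C" {
    // Hypothetical status codes shared with the JS caller.
    enum MyStatus { MY_OK = 0, MY_BAD_INPUT = 1 };

    // Writes sqrt(x) to *out and returns MY_OK, or returns MY_BAD_INPUT
    // and leaves *out untouched. No exception ever leaves this function.
    EMSCRIPTEN_KEEPALIVE int mysqrt(double x, double* out) {
        // !(x >= 0.0) also rejects NaN
        if (out == nullptr || !(x >= 0.0)) {
            return MY_BAD_INPUT;
        }
        *out = std::sqrt(x);
        return MY_OK;
    }
}
```

The trade-off is a clumsier signature (the result comes back through a pointer), but the JS caller only ever sees plain numbers and nothing depends on how exceptions were configured at build time.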

In the follow-up article, we are going to dive deeper into practical aspects of running classical computer vision algorithms as well as convolutional neural networks in a web front-end.

Computer Vision in the Food Domain

Posted on May 24, 2022 by admin

Surprising but true: according to market research, customers prefer apples with a maximum diameter of 75 to 80 mm 🍏 Now you know 🙂 People would obviously struggle to accurately evaluate fruits’ size with their naked eyes. In contrast, computer vision (CV) systems can measure the precise diameter of an apple in the blink of an eye, literally.

CV systems can collect and process a variety of parameters, including size, weight, shape, texture, color, and much more. So how exactly are these systems used in the food domain today? Let’s find out.

AI-based apple sorting machine – demo source

Where and How Vision Can Help: Use-Cases and Advantages

When it comes to the food and beverage segment, it is more common to hear the term “machine vision” (MV) than computer vision. What is the difference?

Though the essential components of vision-based systems are generally the same (digital cameras and image processing software), CV and MV are different terms for overlapping technologies. MV systems traditionally work in manufacturing and practical applications for quality control, inspection, and guidance. At the same time, CV systems are self-contained and do not require the use of a larger machine system, as they go way beyond image processing. In CV terms, an image doesn’t even have to be a photo or a video; it could be an ‘image’ from a thermal or infrared sensor, motion detector, or other sources.

The current trends and benefits of using vision systems in the food industry can be summarized as follows:

As you can see, there is a lot to do. While it may appear that most active development is reserved for industry, smart food technology is becoming increasingly accessible to end users. Let’s now focus on the most popular examples.

How to Cook This Dish, or a Few Words about Cross-Modal Recipe Retrieval

Recommending recipes for a photographed dish might be the next “Shazam” for food, but, unfortunately, it still seems technically challenging. The difficulty of recipe retrieval has two sources. First, current food recognition technology only scales up to a few hundred categories, which makes it impractical for recognizing tens of thousands of food categories. Second, even within a single food category, recipe variants may differ in ingredient composition. Finding the best-match recipe thus requires ingredient knowledge, which is a fine-grained recognition problem.

A good real-world example is the Vivino app, a label scanner that can bring up all the information you need about a wine from a simple photo of the bottle. If you’re trying to make a snap decision in a bottle shop or supermarket, you can find out if the bottle you’re holding is a good deal or if it has the type of smoothness or dryness you’re looking for in a wine. Another plus is that it enables price comparison.

Vivino app – source

Creating New Recipes Based on Consumers’ Trends and Preferences

Today, consumers are increasingly looking for a variety of tasty options for healthy eating. To meet these expectations, entire menus must be reinvented, making it challenging to create new recipes constantly. Fortunately, this problem is now solvable.

The Foodpairing application lets you analyze the compatibility of various food ingredients, discover your own flavor preferences, and create new recipes. It has emerged from multi-disciplinary knowledge spanning flavor science, food science, AI/ML, and consumer research. Even if you are far from the art of cooking, try playing with a variety of interesting and tasty combinations for fun 😉

Food Tracking

Food image recognition apps may help improve your diet by using AI to tell you the nutritional value of what is on your plate. Simply take a picture of your meal, and a food recognition platform will tell you what it contains, including the main ingredient, side dishes, and even sauces.

Such programs can estimate portion sizes, nutrition, and calories, which is ideal for those who care about their health and keep their bodies in good shape. For example:

Real-time detection mode (left) and nutrition analysis from the local gallery (right) on the FoodTracker app – source

To Sum Up

As in many other industries, AI is making huge waves in the food and beverage field. More and more companies recognize the potential of vision-based systems to improve efficiency and profitability, reduce losses, and protect against supply chain disruptions. This has resulted in the increased adoption of smart technologies in food production. And while AI is having a significant impact in the industry, we are still in the early stages of its application for end users. Due to the costs associated with their implementation, such technologies are currently used primarily by large manufacturers. However, it is unavoidable that AI will one day become ubiquitous throughout the industry and more accessible to everyone.

Computer Vision for Beginners

Posted on April 13, 2022 by admin

We introduce a FREE 4-week Computer Vision Course. Based on our previous practices and taught by our leading engineers, it will give you a sound knowledge base both in classical computer vision algorithms and deep learning approaches.

Lecturers: Pavlo Vyplavin, CTO, and Yuriy Chirka, Head of ML.

Registration lasts until April 22, 2022: https://bit.ly/3O8hpnu

Who would benefit from this course?

Students who have had their studies cut short, teachers, programmers, or anyone who is interested in computer vision, machine learning, and deep learning but has not yet given it a try.

When, where and how?

The course will begin on April 26, 2022 (Tuesday) with an introductory lecture. Technical lectures will be held every Thursday at 16:00 and will last 1.5-2 hours each. The last lecture will take place on May 19, 2022.

Computer Vision for Beginners

Posted on April 13, 2022 by admin

After settling at new places and putting everything on the working track, we understood something was missing: we hadn’t done any educational activities for a long time…

While a new edition of our trainee program is paused, we decided to offer everyone a free 4-week course on computer vision. Based on our previous practices and taught by our leading engineers, it will give you a sound knowledge base both in classical computer vision algorithms and deep learning approaches.

Who would benefit from this course?

Students who have had their studies cut short, teachers, programmers, or anyone who is interested in computer vision, machine learning, and deep learning but has not yet given it a try.

What, when, where, and how?

The course will start on April 26, 2022, with a 1-hour opening lecture followed by four 2-hour technical lectures on Thursdays until May 19, 2022. All the events will start at 4 PM UTC+3.

After registration and no later than April 24, you will receive a link to join the event. The course will be in Ukrainian and Russian.

This time, there won’t be any selection process; the course is open to everyone. We will assign a couple of home tasks during the course and invite participants who submit promising solutions to an interview for a trainee position.

Support people in need

While the course is entirely free, we kindly ask you to support Ukrainians who have suffered from war or fearless volunteers helping civilians all across our country. We will later provide more information and gladly double your donations.

Apply before April 22

Please fill in the form below before April 22, 2022. There, we ask about your technical background to tailor the course to most listeners.

It-Jim Receives Platinum Award from Kharkiv IT Cluster 🏆

Posted on January 21, 2022 by admin

Our team has been recognized with a Platinum Award from Kharkiv IT Cluster for its successful contribution to the development of the Kharkiv IT ecosystem 🏆

The award ceremony took place on December 20, 2021, as part of the Kharkiv IT Cluster’s general meeting for members and partners. The event was dedicated to the presentation of the Kharkiv IT community’s achievements in 2021, and it brought together leaders and partners from approximately 100 IT companies.

The award depends on how actively a company participates in the cluster’s various activities. Participants and partners received awards from Kharkiv IT Cluster in the following categories:

🔷 Participation in charitable initiatives within the IT4Life project,
🔷 Activity in educational projects,
🔷 Awards for personal contribution to the improvement of the region.

We are proud to be a part of the local IT community and are encouraged to continue the trend of working on new projects that improve our world 💪

Computer Vision Trainee Program

Posted on December 14, 2021 by admin

The program is a perfect match for:
🎓 engineering students
💻 software developers who want to switch to the CV / ML / DL domain

It lasts up to 2 months and gives you a chance to work on a real CV project under the personal guidance of an @It-Jim expert.

At the end of the program, successful candidates will have the opportunity to continue working at the company 🤝

🔥 Apply with the form by December 27, 2021