A student who is about to graduate from university and has not yet made up his mind about the career he wants to pursue. A freelancer who has already tried many technologies, knows how he wants to develop further, and would like to gain experience working in a company. An experienced IT engineer, totally burnt out and looking for a new passion. A professional architect (not a software one) switching careers entirely.

Do they have anything in common? Yes. All of them have successfully finished It-Jim’s winter internship on computer vision and are now one step closer to achieving their career goals in the computer vision and machine learning domains.

The Second Edition of It-Jim’s Winternship

Have you ever noticed that the words “winter” and “internship” are made for each other? Or is it just us? The second edition of what we now call the “winternship” on computer vision turned out to be a huge success. We started the campaign in the middle of December 2020, and in three and a half weeks we received a truly overwhelming number of applications – 164, a 5x increase compared to last year! Could we be any happier? Unlike the first edition, this time we had to switch to an online-only format due to the ongoing pandemic. Yet we even benefited from it: the geography of applicants this time was quite impressive, with Kharkiv, Kyiv, Lviv, Dnipro, Odessa, and many other cities in and outside of Ukraine.

Was it a fun ride selecting the best candidates out of 164 applications? Most definitely, yes, especially with only 4 spots available. After a first filtering of the list, we sent test tasks to 120+ applicants to narrow down the circle of candidates further. Out of the 53 participants who tried to solve those tasks, we chose 13 for interviews. Finally, here they were: a student, a freelancer, an IT switcher, and an architect… 4 (w)interns. 4 projects to work on. 4 mentors to guide them. 4 weeks to go.

Winternship 2021 Stats

The Rainbow of Projects

Each of the interns worked on their own project under the guidance of It-Jim’s mentors. We wanted projects that both solved unusual computer vision tasks and were challenging enough for the interns, so we opted for the following ones:

  • creating a Shazam-like application: a program that is able to recognize audio or a sound within a few seconds,
  • real-time background replacement in images: a module that segments the foreground from the background in a webcam stream,
  • extraction of heart rate from the mobile phone selfie camera: a solution that performs a sort of magic and estimates the heart rate from a video only,
  • floor segmentation application: a module that automatically detects floors of arbitrary shape and segments them based on camera images and raw data from the iPhone’s LiDAR.

Let’s now dive into their realization.

Music Recognition

In this project, we wanted to compare two approaches, namely spectrogram analysis and deep learning, for the task of music recognition. 

The Shazam-like algorithm had the following pipeline:

  • getting a spectrogram of a song using the Fourier transform,
  • creating fingerprints for the recorded sample: finding the frequency peaks in the spectrogram, grouping them into target zones, and pairing them with anchor points,
  • matching the fingerprints from the unknown sample against a set of fingerprints derived from the music database.
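
To illustrate the fingerprinting idea, here is a deliberately minimal Python sketch, not the intern’s actual code: it keeps a single “peak” per spectrogram frame and hashes (anchor frequency, peak frequency, time offset) triples, whereas a production system keeps several peaks per frame and matches against a large indexed database.

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    """Magnitude spectrogram via a windowed short-time Fourier transform."""
    frames = [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames) * np.hanning(win), axis=1)).T

def fingerprint(samples, fan_out=5):
    """Hash (anchor_bin, peak_bin, time_delta) triples, Shazam-style."""
    sxx = stft_mag(samples)
    # One crude "peak" per time frame: the strongest frequency bin
    peaks = [(int(np.argmax(sxx[:, i])), i) for i in range(sxx.shape[1])]
    hashes = set()
    for i, (f1, t1) in enumerate(peaks):              # anchor point
        for f2, t2 in peaks[i + 1:i + 1 + fan_out]:   # its target zone
            hashes.add((f1, f2, t2 - t1))
    return hashes

# Toy check: a clip of the "song" shares hashes with it, while noise does not
fs = 8000
t = np.arange(fs) / fs
song = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
noise = np.random.default_rng(0).normal(size=fs)
db = fingerprint(song)
match = len(db & fingerprint(song[:fs // 2]))
miss = len(db & fingerprint(noise))
```

Counting how many hashes a query shares with each song in the database (and checking that the time offsets are consistent) is what turns this into recognition.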

Song spectrogram and its target zone example

The deep learning approach to song similarity estimation had the following steps:

  • We extracted compact 128-dimensional embeddings for each song using a Siamese neural network.
  • Triplet loss was used within the training pipeline (by analogy with face recognition).
  • Random crops from spectrograms were used as the input dataset.
  • 10+ different classifiers were applied for the actual song recognition.
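
The triplet loss at the heart of this setup is easy to write down: it pushes an anchor embedding closer to a positive (a crop of the same song) than to a negative (a crop of a different song) by at least a margin. A minimal numpy version for illustration (the actual training used PyTorch):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: pull anchor to positive, push from negative."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Two crops of the same song (a, p) vs. a crop of a different song (n)
a = np.array([[1.0, 0.0]])
p = np.array([[0.9, 0.1]])
n = np.array([[0.0, 1.0]])
good = triplet_loss(a, p, n)  # positive is close, negative far: zero loss
bad = triplet_loss(a, n, p)   # roles swapped: large loss
```

Minimizing this loss over many such triplets is what makes crops of the same song cluster together in the embedding space.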

As a result, our intern demonstrated real-time music recognition on a moderately sized song dataset, which goes to show that audio processing is also about computer vision.

  • Tools and technologies: signal processing, classical computer vision, machine learning, deep learning.

Monodepth Background Replacement

The goal of the project was to develop software that would replace the background behind a person in a video. The core idea was to use a monocular depth estimation model to calculate a depth map that can later be used to split the object from the background.

  • Initially, our intern started with the DenseDepth model trained on the NYU dataset. Applying DenseDepth frame by frame gave only around 16 FPS, which was not enough for real-time video processing.
  • We easily achieved 30+ FPS by incorporating optical flow between adjacent frames to track movement and keep the depth map consistent.
  • To get a binary mask, our intern added two options: select a threshold manually or apply Otsu’s method, which does the job automatically.
  • Finally, some post-processing was applied to filter out the false contours from the background.
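
To make the thresholding step concrete, here is a self-contained sketch of Otsu’s method applied to a synthetic depth map. The function and data are illustrative; the real pipeline thresholded actual DenseDepth predictions:

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the threshold that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    w_bg = sum_bg = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]          # pixels at or below t
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # pixels above t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic depth map: a near "person" (~60) against a far background (~200)
depth = np.full((64, 64), 200, dtype=np.uint8)
depth[16:48, 16:48] = 60
t = otsu_threshold(depth)
mask = depth <= t  # True where the near object (the person) is
```

Given such a mask, background replacement is just compositing: `out = mask * frame + (1 - mask) * new_background`.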

An alternative solution was based on the U2-Net model, which is usually used for salient object detection. Trained on the Supervisely Person dataset, it gave more accurate background replacement than the depth estimation approach (see figure below). Also, unlike the depth estimation models, it correctly handled cases when a person walked out of the camera view, leaving just an empty background.

After a couple of experiments, our intern provided a real-time demo for background replacement.

Depth maps predicted by AdaBins trained on NYU (left) and background replacement based on predictions of U2-Net trained on Supervisely Person dataset (right)

  • Tools and technologies: Python, OpenCV, PyTorch, classical computer vision, deep learning.

Extraction of Heart Rate from the Frontal Camera

The goal of this project was to calculate the heart rate from a video of a person’s face using remote photoplethysmography (rPPG). The latter is based on blood volume changes in tissue due to cardiac activity, which affect the optical characteristics of the reflected light. Proper heart rate measurement is possible only if the changes in the red, green, and blue color components are captured efficiently. Generally, the rPPG framework consisted of the following steps:

  • Face and ROI (region of interest) detection.
  • Facial landmarks extraction and tracking.
  • PPG signal extraction and processing. After locating the ROI, a single RGB signal was extracted by averaging pixel values over the region. Additional filtering and processing were applied to extract photoplethysmographic information from the raw PPG signal. 
  • Dynamic heart rate estimation from the signal spectrogram.
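
The last step can be illustrated with a short numpy sketch: once the ROI pixels have been averaged into a 1-D signal, take the dominant spectral peak within the plausible heart-rate band. This toy version estimates a single rate from the whole window, while the actual project tracked the rate dynamically over a spectrogram:

```python
import numpy as np

def estimate_heart_rate(ppg, fs, lo=0.7, hi=4.0):
    """Dominant spectral peak within the plausible band (42-240 bpm)."""
    ppg = ppg - ppg.mean()                 # drop the DC component
    spectrum = np.abs(np.fft.rfft(ppg))
    freqs = np.fft.rfftfreq(len(ppg), d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    peak = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak                     # Hz -> beats per minute

# Synthetic PPG: a 72 bpm pulse (1.2 Hz) buried in noise, 30 FPS webcam, 10 s
fs, secs = 30, 10
t = np.arange(fs * secs) / fs
rng = np.random.default_rng(1)
ppg = np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.normal(size=t.size)
bpm = estimate_heart_rate(ppg, fs)
```

The frequency resolution here is fs/N = 0.1 Hz, i.e. 6 bpm; longer windows sharpen the estimate at the cost of responsiveness.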

As a result, our intern provided a real-time demo performing heart rate estimation from a webcam.

Example of ROI detection (left) and heart rate extraction from the PPG signal (right)

  • Tools and technologies: OpenCV, Python, signal processing, classical computer vision.

Floor Segmentation

The goal of the final, fourth project was to create a pipeline for automatic floor segmentation and replacement in indoor images. We provided both conventional images from the mobile camera and raw data from the iPhone 12 Pro’s LiDAR.

Initially, our intern tried to cluster the images using K-means based on pixel coordinates and color components in the HSV, RGB, and Lab color spaces.
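
A minimal k-means sketch in the same spirit, simplified for illustration to two features per pixel (grayscale intensity plus a normalized row coordinate) instead of the multi-space color features actually used:

```python
import numpy as np

def kmeans(features, k=2, iters=10):
    """Minimal k-means over per-pixel feature vectors (k=2 init for brevity)."""
    centers = features[[0, len(features) - 1]].astype(float)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Assign every pixel to its nearest cluster center
        dists = np.linalg.norm(features[:, None, :] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned pixels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Toy indoor scene: a bright "floor" in the bottom half, a dark wall on top
img = np.zeros((20, 20))
img[10:] = 1.0
ys, _ = np.mgrid[0:20, 0:20]
feats = np.stack([img.ravel(), ys.ravel() / 20.0], axis=1)
labels = kmeans(feats).reshape(20, 20)
```

Including pixel coordinates among the features, as above, encourages spatially compact clusters rather than purely color-based ones.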

As for the LiDAR data, we first recalculated the raw data into a proper 3D point cloud using the camera intrinsics. Then, the RANSAC algorithm was used for plane fitting.
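
The plane-fitting step can be sketched as follows. This is a hedged, self-contained RANSAC example on a synthetic cloud, whereas the actual code ran on the cloud reconstructed from LiDAR:

```python
import numpy as np

def ransac_plane(points, iters=200, tol=0.02, seed=0):
    """Sample 3-point plane hypotheses; keep the one with the most inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)    # plane normal from the triple
        norm = np.linalg.norm(n)
        if norm < 1e-9:                   # degenerate (collinear) sample
            continue
        n /= norm
        dist = np.abs((points - p0) @ n)  # point-to-plane distances
        inliers = dist < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# Synthetic cloud: a flat floor at z = 0 plus scattered clutter above it
rng = np.random.default_rng(42)
floor = np.column_stack([rng.uniform(-1, 1, (300, 2)), np.zeros(300)])
clutter = rng.uniform(-1, 1, (100, 3)) * [1, 1, 0.5] + [0, 0, 0.5]
cloud = np.vstack([floor, clutter])
inliers = ransac_plane(cloud)  # boolean mask marking the floor points
```

With three quarters of the points on the floor, a pure-floor triple is sampled with near certainty within 200 iterations, so the dominant plane wins.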

Finally, a basic fusion scheme was applied to combine the inliers from the RANSAC output with the merged image clusters.

Depth maps and floor segmentation results

  • Tools and technologies: OpenCV, Python, classical computer vision, sensor fusion 


Why do people search for internships? Because it is a perfect way to work on real industrial projects and gain first-hand experience and mentorship from experts. It is also a good try-out of a specific field and helps one make up one’s mind about a future profession.

Why do we run our internship program? Because we aim to make more people fall in love with computer vision and to share knowledge with future generations of engineers.

We organize internships at least once a year. If you missed the last edition, don’t worry: a new type of internship, a trainee program, is coming very soon 😉


It-Jim’s 2021 Winter Internship on Computer Vision: an Overview