GStreamer C++ Tutorial

Posted on November 14, 2022 by admin

In the previous article, we’ve learned what GStreamer is and its most common use cases. Now, it’s time to start coding in C++. This tutorial does not replace but rather complements the official GStreamer tutorials. Here we focus on using appsrc and appsink for custom video (or audio) processing in the C++ code. In such situations, GStreamer is used mainly for encoding and decoding of various audio and video formats.

GStreamer C++ Basics

The GStreamer API is introduced rather well in the official tutorial, so I’ll give only a very brief introduction before focusing on appsrc and appsink, the topic of most interest to us. Our tutorial can be found here. In our code, we use C++, not C. Also, unlike the official tutorial, we are not too eager to use GLib functions like g_print().

Let’s get going. Our first example, fun1, is an (almost) minimal C++ GStreamer example. Before doing anything with GStreamer, we have to initialize it:

gst_init(&argc, &argv);

It loads the whole infrastructure, such as the plugin registry. But why does it need pointers to argc and argv? You can pass nullptr, nullptr if you really want to. However, passing your command-line arguments allows gst_init() to parse GStreamer-specific flags. For example, I always add --gst-debug-level=2 to the command line in order to log warnings and errors to the console (there’s no logging by default). Interestingly, GStreamer removes all its flags from argc, argv, so that you can later parse the remaining arguments.
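To make the flag-stripping behaviour concrete, here is a small stand-alone sketch of the idea. It is not GStreamer’s implementation: the helper name stripLibraryFlags and the simple prefix test are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (NOT part of GStreamer): removes every argument that
// starts with `prefix` from argv, shifts the remaining ones down and updates
// argc. argv[0] (the program name) is always kept.
void stripLibraryFlags(int *argc, char **argv, const char *prefix) {
    const std::size_t n = std::strlen(prefix);
    int kept = 1;
    for (int i = 1; i < *argc; ++i) {
        if (std::strncmp(argv[i], prefix, n) != 0)
            argv[kept++] = argv[i];   // not a library flag: keep it
    }
    if (kept < *argc)
        argv[kept] = nullptr;         // keep the array null-terminated
    *argc = kept;
}
```

After a call such as stripLibraryFlags(&argc, argv, "--gst-"), the application sees only its own arguments, which is the effect gst_init() has on argc and argv.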

Next, we create a pipeline from a string:

string pipelineStr = "videotestsrc pattern=0 ! videoconvert ! autovideosink";

GError *err = nullptr;

GstElement *pipeline = gst_parse_launch(pipelineStr.c_str(), &err);

checkErr(err);

MY_ASSERT(pipeline);

Here MY_ASSERT is my assertion macro (similar to OpenCV’s CV_Assert; never ever use the standard C++ assert macro, which is compiled out when NDEBUG is defined, together with any function call placed inside it!), and checkErr is my function that checks a GError object for errors; see the code for details. Checking for errors is important in order to catch typos in the pipeline string, linking failures, etc. GStreamer is heavily based on GLib, especially on the GObject framework (a part of GLib), a pure C object-oriented framework. Most GStreamer entities (elements, pads, buses) are GObject objects, and they are handled as raw pointers. This may seem ugly compared to modern C++, but there is nothing I can do about it (as gstreamermm is now dead).

Now that we have created the pipeline, we should play it:

MY_ASSERT(gst_element_set_state(pipeline, GST_STATE_PLAYING));

Is this all? Not yet. If we try to run the code at this point, it will simply run until the end of the main() function and shut down together with GStreamer, which didn’t even have time to start the pipeline properly. We must wait for the pipeline to finish. The simplest code for this is:

GstBus *bus = gst_element_get_bus (pipeline);

GstMessage *msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,

GstMessageType(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));

gst_message_unref(msg);

gst_object_unref(bus);

The GStreamer bus is the pipeline’s messaging system: elements post messages to it, and the application reads them. Here we wait indefinitely for an error or end of stream (EOS), ignoring all other messages. Our further examples like fun2 demonstrate processing all messages in a loop, and eventually in a separate thread.

You might have asked: If our main() function is not blocked when the pipeline is running, then where does it run? In other threads, of course! GStreamer is multi-threaded and reasonably thread-safe (you can call GStreamer functions from different threads). There is NO such thing as a GStreamer main loop. This can sound confusing, as many examples from the official tutorial use a GLib main loop. You absolutely don’t have to. The only point of this “main loop” is to block while watching the bus. As we watch the bus ourselves, we don’t need it. And it’s perfectly fine to use C++ threads with GStreamer, even though they didn’t exist when GStreamer was created (as they map onto the same OS threads). GStreamer can also run several pipelines simultaneously if your PC is powerful enough for it.

Side note: The multi-threaded GStreamer philosophy is the opposite of that of typical GUI libraries like Gtk+ or Qt, which run the GUI strictly in a single thread with an event-processing main loop. GStreamer can be successfully combined with these libraries (see e.g. a Gtk+ example in the GStreamer tutorials), but this definitely goes beyond the scope of this article.

We are almost done with fun1. Now let’s exit the program cleanly by stopping and releasing the pipeline:

gst_element_set_state(pipeline, GST_STATE_NULL);

gst_object_unref(pipeline);

I remind you that C and C++ do not have proper garbage collection, thus memory leaks are always a big danger, often underestimated by people with backgrounds in other languages. And being a C library, GStreamer does not use nicer C++ features like shared_ptr, but has its own version of reference counting, thus “unref”. GStreamer memory management is confusing, and leaks are a persistent risk. The general rule is like this: if you don’t need myBanana anymore, try:

gst_banana_unref(myBanana);

If there is no such function, try:

gst_object_unref(myBanana);

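If the code does not work, then you shouldn’t unref myBanana for some reason, typically because you never owned a reference to it: for example, gst_sample_get_buffer() (used in video1 below) returns a buffer that still belongs to the sample.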
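If you prefer not to remember every unref by hand, the rule above can be automated with std::unique_ptr and a custom deleter. The following is only a stand-alone sketch: Banana, banana_new() and banana_unref() are hypothetical stand-ins for a reference-counted GStreamer object and its unref function; in real code the deleter would call e.g. gst_object_unref() instead.

```cpp
#include <memory>

// Hypothetical stand-ins for a reference-counted C object and its API.
struct Banana { int refCount = 1; };
static int bananasFreed = 0;              // only here to observe the cleanup

Banana *banana_new() { return new Banana; }

void banana_unref(Banana *b) {
    if (--b->refCount == 0) {             // last reference gone: destroy
        ++bananasFreed;
        delete b;
    }
}

// Deleter that drops exactly one reference when the smart pointer dies.
struct BananaUnref {
    void operator()(Banana *b) const { banana_unref(b); }
};

using BananaPtr = std::unique_ptr<Banana, BananaUnref>;
```

A BananaPtr created from banana_new() releases its reference automatically at the end of the scope, including on early returns and exceptions. It should only wrap pointers you actually own; wrapping a borrowed pointer would unref it once too often.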

This is it for the minimal example. It wasn’t very hard, was it? If you want to know more about GStreamer in C++, read the official tutorial and our other examples like fun2 and capinfo. There are tons of other things, like creating a pipeline programmatically (not from a string), dynamic and on-request pads, working with caps and pads, etc.

GStreamer C++ appsink and OpenCV Example (Video 1)

But what if we want to process each video frame in our own C++ code, not in some standard GStreamer elements? There are two ways to do this:

You can write your own element. This is hard for beginners, and I will not teach you this.
Use appsrc and appsink to move data back and forth between the pipeline and our C++ code. This is what we will do.

We start with an appsink video example, video1. We want to decode a video file with GStreamer into raw data, and then visualize each frame with OpenCV’s imshow(). We’ll walk through the code briefly (see video1.cpp in our repo for details). The pipeline is given by the string:
filesrc location=<…> ! decodebin ! videoconvert ! appsink name=mysink max-buffers=2 sync=1 caps=video/x-raw,format=BGR
Wow, appsink has a lot of options! Let’s examine them all:

name=mysink : We have given our element a name so that we can find it.
caps=video/x-raw,format=BGR : Caps are vital. Here we specify that we want a BGR raw video signal.
sync=1 : We synchronize the data to play at the 1x speed. Try sync=0 for fun! Note: true==1, false==0.
max-buffers=2 : Unlike most GStreamer elements, appsrc and appsink have their own queues. They can take a lot of RAM. This is an example of reducing the queue size. Only two frames are to be kept in memory, after that appsink basically tells the pipeline to wait, and it waits. Don’t try to reduce queues that much for branched pipelines!
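The “tells the pipeline to wait” behaviour of max-buffers is ordinary backpressure from a bounded queue. The sketch below illustrates the principle only; it is not how appsink is implemented, and the class name BoundedQueue is my own.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// A queue that holds at most maxItems elements. A producer calling push()
// is blocked while the queue is full, which is the backpressure idea.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t maxItems) : maxItems_(maxItems) {}

    // Blocks while the queue is full.
    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return items_.size() < maxItems_; });
        items_.push_back(std::move(item));
        notEmpty_.notify_one();
    }

    // Non-blocking variant: returns false instead of waiting.
    bool tryPush(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= maxItems_) return false;
        items_.push_back(std::move(item));
        notEmpty_.notify_one();
        return true;
    }

    // Blocks while the queue is empty.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop_front();
        notFull_.notify_one();
        return item;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    const std::size_t maxItems_;
    mutable std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    std::deque<T> items_;
};
```

With maxItems set to 2, a third push() waits until the consumer has popped a frame, so memory use stays bounded at the price of stalling the producer.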

If you need “global data” for a GStreamer pipeline, it’s a good idea to create a structure for it, so that you can pass the data (as a pointer) to callbacks if needed. In our case, all we need is the pipeline and the appsink element.

struct GoblinData {

GstElement *pipeline = nullptr;

GstElement *sinkVideo = nullptr;

};

We create an instance of this structure in main(), create the pipeline, and find the appsink by its name (“mysink”):

GoblinData data;
string pipeStr = "filesrc location=" + fileName + " ! decodebin ! videoconvert ! appsink name=mysink max-buffers=2 sync=1 caps=video/x-raw,format=BGR";

GError *err = nullptr;

data.pipeline = gst_parse_launch(pipeStr.c_str(), &err);

checkErr(err);

MY_ASSERT(data.pipeline);

data.sinkVideo = gst_bin_get_by_name(GST_BIN(data.pipeline), "mysink");

MY_ASSERT(data.sinkVideo);

Next, we play the pipeline:

MY_ASSERT(gst_element_set_state(data.pipeline, GST_STATE_PLAYING));

Now we have to watch the bus, which we now do in a separate thread (see the code for details):

thread threadBus([&data]() -> void {

codeThreadBus(data.pipeline, data, "GOBLIN");

});

You can extract data from appsink by using either signals or the direct C API; we chose the latter. We process data in a separate thread, which we now start:
thread threadProcess([&data]() -> void {

codeThreadProcessV(data);

});

Finally, we wait for the threads to finish and stop the pipeline:

threadBus.join();

threadProcess.join();

gst_element_set_state(data.pipeline, GST_STATE_NULL);

gst_object_unref(data.pipeline);

Everything interesting happens in the function codeThreadProcessV(). It has an endless loop for (;;) { … } , which we will eventually break out of. What’s in the loop?

First, we check for EOS:

if (gst_app_sink_is_eos(GST_APP_SINK(data.sinkVideo))) {

cout << "EOS !" << endl;

break;

}

Next we pull the sample (a kind of data packet) synchronously, waiting if needed. For raw video, a sample is one video frame:

GstSample *sample = gst_app_sink_pull_sample(GST_APP_SINK(data.sinkVideo));

if (sample == nullptr) {

cout << "NO sample !" << endl;

break;

}

Now, we want to know the frame size. It turns out that the sample actually has caps (don’t confuse them with the pad caps), and we can find the frame size in there:

GstCaps *caps = gst_sample_get_caps(sample);

MY_ASSERT(caps != nullptr);

GstStructure *s = gst_caps_get_structure(caps, 0);

int imW, imH;

MY_ASSERT(gst_structure_get_int(s, "width", &imW));

MY_ASSERT(gst_structure_get_int(s, "height", &imH));

cout << "Sample: W = " << imW << ", H = " << imH << endl;

Next, we extract a buffer (a lower-level data packet) from the sample. Note: in GStreamer slang, a “buffer” always means a “data packet”, and never ever a “queue”!

GstBuffer *buffer = gst_sample_get_buffer(sample);

Still, we don’t have a pointer to the raw data. For that, we need to map the buffer:

GstMapInfo m;

MY_ASSERT(gst_buffer_map(buffer, &m, GST_MAP_READ));

MY_ASSERT(m.size == imW * imH * 3);

Now we can finally read the raw data (BGR pixels) via the pointer m.data. But we want to process the frame in OpenCV, so we wrap it in a cv::Mat.

cv::Mat frame(imH, imW, CV_8UC3, (void *) m.data);

Warning! Such a cv::Mat object does not copy the data, so if you want cv::Mat to persist when the GStreamer data packet is no more, or if you want to modify it, then clone it. Here we don’t have to (but we DO clone in video3). Now we can do anything we want with the cv::Mat image, but in this example, we just display it on the screen:

cv::imshow("frame", frame);

int key = cv::waitKey(1);

Now, we release the sample, and check if the ESC key was pressed:

gst_buffer_unmap(buffer, &m);

gst_sample_unref(sample);

if (27 == key)

exit(0);

We’re done with this frame, ready for the next one. In this example, we saw how to receive GStreamer video frames from appsink, and convert them into OpenCV images via the sample -> buffer -> map -> raw pointer -> Mat route.
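One detail that the size check m.size == imW * imH * 3 glosses over is row padding. To the best of my knowledge, GStreamer’s default layout for packed BGR pads every row to a multiple of 4 bytes, so the check only holds when imW * 3 is already a multiple of 4 (true for common widths such as 720 or 1920). The sketch below computes the expected stride and frame size under that assumption; the function names are mine, and in real code the stride reported by GstVideoInfo (filled by gst_video_info_from_caps()) would be the authoritative value.

```cpp
#include <cstddef>

// Row stride in bytes for packed 24-bit BGR, ASSUMING each row is padded to
// a multiple of 4 bytes.
constexpr std::size_t bgrStride(int width) {
    return (static_cast<std::size_t>(width) * 3 + 3) / 4 * 4;
}

// Total size in bytes of one frame with padded rows.
constexpr std::size_t bgrFrameSize(int width, int height) {
    return bgrStride(width) * static_cast<std::size_t>(height);
}
```

For a width of 721 the stride is 2164 bytes rather than 2163. In that case the stride can be passed as the fifth (step) argument of the cv::Mat constructor, so that OpenCV skips the padding at the end of each row.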

GStreamer C++ appsrc and OpenCV Example (Video 2)

Now, the appsrc example, video2. Here we want to do the opposite of video1: read frames from a video file with OpenCV’s VideoCapture and send them to the GStreamer pipeline to display on the screen with autovideosink. The pipeline is:

appsrc name=mysrc format=time caps=video/x-raw,format=BGR ! videoconvert ! autovideosink sync=1

The option format=time refers to the timestamp format, NOT the image format from the caps! It is not required for video, but for some reason, it is required for an audio appsrc, which will fail otherwise with rather obscure error messages (it once took me a long time to figure this out).

This pipeline looks nice, but unfortunately, it will not work. If we try to play it, GStreamer will complain about the frame size. Indeed, we did not specify the frame size (width+height) in the appsrc caps, and it does not have a default one, so there is no way it can negotiate a frame size with the downstream pipeline. But we don’t know the frame size until we open the input file with OpenCV! How to solve this predicament? One could in principle defer creating the pipeline until we know the frame size, but it turns out that it is enough to defer playing it. This is exactly what we do in the function codeThreadSrcV(). In this function, we first open the input file with OpenCV and get the frame size and FPS:

VideoCapture video(data.fileName);

MY_ASSERT(video.isOpened());

int imW = (int) video.get(CAP_PROP_FRAME_WIDTH);

int imH = (int) video.get(CAP_PROP_FRAME_HEIGHT);

double fps = video.get(CAP_PROP_FPS);

MY_ASSERT(imW > 0 && imH > 0 && fps > 0);

Next, we create proper caps for our appsrc and set them with g_object_set():

ostringstream oss;

oss << "video/x-raw,format=BGR,width=" << imW << ",height=" << imH <<
",framerate=" << int(lround(fps)) << "/1";

cout << "CAPS=" << oss.str() << endl;

GstCaps *capsVideo = gst_caps_from_string(oss.str().c_str());

g_object_set(data.srcVideo, "caps", capsVideo, nullptr);

gst_caps_unref(capsVideo);
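Note that the snippet above rounds the frame rate to an integer, so a 29.97 fps file is announced as 30/1 while the timestamps are later computed from the unrounded fps. If this matters to you, the caps string can carry the frame rate as a fraction instead. Here is a small sketch; the helper name makeRawVideoCaps and its parameters are my own, and choosing the fraction (e.g. 30000/1001) is left to the caller.

```cpp
#include <sstream>
#include <string>

// Builds a raw-video caps string with the frame rate given as a fraction
// fpsNum/fpsDen, so that rates such as 30000/1001 are not rounded to 30/1.
std::string makeRawVideoCaps(int width, int height, int fpsNum, int fpsDen,
                             const std::string &format = "BGR") {
    std::ostringstream oss;
    oss << "video/x-raw,format=" << format
        << ",width=" << width
        << ",height=" << height
        << ",framerate=" << fpsNum << "/" << fpsDen;
    return oss.str();
}
```

The resulting string can be passed to gst_caps_from_string() exactly as in the code above.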

Now we can finally play the pipeline and start the infinite loop over frames:

MY_ASSERT(gst_element_set_state(data.pipeline, GST_STATE_PLAYING));
int frameCount = 0;

Mat frame;

for (;;) {
…
}

Inside the loop, we wait for the next frame from VideoCapture:

video.read(frame);

if (frame.empty())

break;

We create a GStreamer buffer and copy the data there, again using the raw pointers frame.data and m.data:

int bufferSize = frame.cols * frame.rows * 3;

GstBuffer *buffer = gst_buffer_new_and_alloc(bufferSize);

GstMapInfo m;

gst_buffer_map(buffer, &m, GST_MAP_WRITE);

memcpy(m.data, frame.data, bufferSize);

gst_buffer_unmap(buffer, &m);

Now we have to set up the timestamp. This is important because otherwise GStreamer would not be able to play this video at the 1x speed:

buffer->pts = uint64_t(frameCount / fps * GST_SECOND);
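The expression above goes through floating point and then truncates, so individual timestamps can end up one nanosecond short. That is harmless in practice, but if you want exact values, the same computation can be done in integers. This is a sketch under the assumption that the frame rate is known as a fraction fpsNum/fpsDen (hypothetical names); kNsPerSecond mirrors the value of GST_SECOND so that the snippet compiles without the GStreamer headers.

```cpp
#include <cstdint>

// GStreamer timestamps are expressed in nanoseconds.
constexpr std::uint64_t kNsPerSecond = 1000000000ULL;

// Presentation timestamp of frame `frameIndex` (0-based) at fpsNum/fpsDen.
// The intermediate product fits into 64 bits for millions of frames, which
// is enough for typical files but not for unbounded live streams.
constexpr std::uint64_t framePts(std::uint64_t frameIndex,
                                 std::uint64_t fpsNum, std::uint64_t fpsDen) {
    return frameIndex * kNsPerSecond * fpsDen / fpsNum;
}

// Duration of one frame, computed so that consecutive frames never overlap.
constexpr std::uint64_t frameDuration(std::uint64_t frameIndex,
                                      std::uint64_t fpsNum, std::uint64_t fpsDen) {
    return framePts(frameIndex + 1, fpsNum, fpsDen) -
           framePts(frameIndex, fpsNum, fpsDen);
}
```

The first value would go into buffer->pts; the second could be stored in buffer->duration, which the code in this article leaves unset.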

Finally, we “push” this buffer into our appsrc:

GstFlowReturn ret = gst_app_src_push_buffer(GST_APP_SRC(data.srcVideo),
buffer);

++frameCount;

Once we have exited the loop (upon the end-of-file), we want to shut down the pipeline gracefully by telling appsrc that the stream has ended, so that end-of-stream (EOS) propagates through the pipeline:

gst_app_src_end_of_stream(GST_APP_SRC(data.srcVideo));

And now look at the code we described so far, and tell me: Is it good? It will run successfully if we start it, or at least seem to. But it has a serious flaw. Can you spot it? Pause for a moment and think carefully before reading any further.

You're thinking, right?

The answer is down below ⬇

Now, the answer. VideoCapture decodes the video file as fast as it can, which can be quite fast on modern computers. However, our GStreamer pipeline is slow due to the sync=1 option (1x playback). Since the pipeline does not signal our C++ code to slow down, the frame loop runs at full speed, pushing more and more frames into the appsrc built-in queue, taking a lot of RAM, and possibly even crashing the application if the video is long enough.

This flaw (which is not obvious at all for beginners; by the way, did you guess it?) shows how tricky designing pipelines (especially real-time ones) is, and why you should plan ahead and not code thoughtlessly. What is the solution? We want the pipeline to signal when it wants data and when it doesn’t. Let’s register a couple of GLib-style signal callbacks on appsrc signals:

g_signal_connect(data.srcVideo, "need-data", G_CALLBACK(startFeed), &data);

g_signal_connect(data.srcVideo, "enough-data", G_CALLBACK(stopFeed), &data);

Since GLib is C and not C++, we cannot use capturing lambdas or std::function in callbacks, only good old function pointers. We supply the pointer &data to our data structure to make it usable by the callback functions. The callback functions simply set a single data flag:

static void startFeed(GstElement *source, guint size, GoblinData *data) {

using namespace std;

if (!data->flagRunV) {

cout << "startFeed !" << endl;

data->flagRunV = true;

}

}

static void stopFeed(GstElement *source, GoblinData *data) {

using namespace std;

if (data->flagRunV) {

cout << "stopFeed !" << endl;

data->flagRunV = false;

}

}

And now, we check this flag in the frame-processing loop and wait if the pipeline tells us to:

if (!data.flagRunV) {

cout << "(wait)" << endl;

this_thread::sleep_for(chrono::milliseconds(10));

continue;

}
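A word on thread safety: the callbacks are invoked from a GStreamer thread, while the frame loop runs in our own thread, so the flag is shared between threads. The snippets above do not show how flagRunV is declared; a safe choice is std::atomic<bool>. The stand-alone sketch below shows the pattern without any GStreamer types; FeedState, onNeedData, onEnoughData and waitForFeed are hypothetical names used only here.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical, stripped-down shared state: just the feed flag.
struct FeedState {
    std::atomic<bool> flagRunV{false};
};

// What the "need-data" and "enough-data" callbacks boil down to.
void onNeedData(FeedState &s)   { s.flagRunV.store(true); }
void onEnoughData(FeedState &s) { s.flagRunV.store(false); }

// Polls the flag until it is set or the timeout expires.
// Returns true if feeding is allowed, false on timeout.
bool waitForFeed(const FeedState &s, std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!s.flagRunV.load()) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}
```

Polling with a short sleep is the same simple approach as in the loop above; a condition variable would avoid the wake-ups, at the cost of a little more code.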

Beautiful, isn’t it? Now we have learned how to use appsrc in addition to appsink and move the data both ways. While there is no direct connection between OpenCV classes and GStreamer (at least not without third-party plugins), we can easily move the data around using raw pointers and a few lines of code. Who needs ready-made code when you can write your own?

More GStreamer C++ appsink + appsrc + OpenCV Examples

My tutorial has a few more examples for you which I will list very briefly.

video3: This is like video1 and video2 combined. Here we have two pipelines, one with appsink (Goblin), the other one with appsrc (Elf) : We decode a video file with Goblin pipeline, process each frame with OpenCV, then send the frame to Elf pipeline to display it. This is the typical example of “decoding, then encoding with GStreamer”.
audio1: The same with audio (no OpenCV in this code).
av1: The same with both audio and video.

Conclusion

In this series of articles, I have introduced GStreamer, explained why it is important, and then showed how it can be used for computer vision and audio processing. Enjoy GStreamer!

GStreamer for Computer Vision and Audio Processing

Posted on November 14, 2022 by admin

You might have heard of something called “GStreamer”. I know what you think. This is some old and boring geek-and-nerd stuff from Linux, right? But what is it? What is the use of GStreamer? If we want computer vision or audio (speech, music) processing, can GStreamer help us?

In this article, I’ll try to answer these questions. This article is beginner-level and assumes no or little previous experience with GStreamer. But I assume that you are interested in computer vision and/or audio processing and know at least a little bit of C++ (for this GStreamer tutorial).

What Is the GStreamer Library?

So, what is GStreamer? The official documentation calls it an “open-source multimedia framework” and gives the following definition:

GStreamer is a library for constructing graphs of media-handling components. The applications it supports range from simple Ogg/Vorbis playback, audio/video streaming to complex audio (mixing) and video (non-linear editing) processing.

Applications can take advantage of advances in codec and filter technology transparently. Developers can add new codecs and filters by writing a simple plugin with a clean, generic interface.

Wikipedia gives the following definition:

GStreamer is a pipeline-based multimedia framework that links together a wide variety of media processing systems to complete complex workflows. For instance, GStreamer can be used to build a system that reads files in one format, processes them, and exports them in another. The formats and processes can be changed in a plug-and-play fashion.

GStreamer supports a wide variety of media-handling components, including simple audio playback, audio and video playback, recording, streaming and editing. The pipeline design serves as a base to create many types of multimedia applications such as video editors, transcoders, streaming media broadcasters and media players.

GStreamer is over 20 years old, and might not be the current “hot topic”. However, as we will see below, it’s very important for computer vision, especially at the “professional” and “deployment” levels, when you progress beyond toy demos and suddenly start to discover that the “real world is not that simple”.

GStreamer is closely associated with the GNOME project (as in the “GNOME desktop”): it is built on GLib and used throughout GNOME, although it is developed independently under the freedesktop.org umbrella. While I (as an experienced Linux user) personally strongly prefer the KDE desktop to the GNOME desktop, GNOME libraries are very nice. Note that GStreamer is also used by the Qt GUI library and thus the KDE desktop.

Some people would think that the word “stream” in GStreamer means network streaming. This is not so. Its primary function is to build local pipelines. However, GStreamer does have plugins for network streaming protocols like RTSP and it is frequently used for designing RTSP server or client applications.

Languages and Platforms

GStreamer is most commonly found on Linux. However, it’s a cross-platform C library available on all major platforms (Windows, macOS, Android, iOS, etc.). Note that “Linux” includes “web back end”, “embedded” and “single board” (Raspberry Pi and friends) among other things. The only platform I couldn’t find a pre-built GStreamer for is the web browser (WASM). Not that it’s impossible in theory, but probably nobody wanted such a heavyweight monster on a very restrictive WASM platform. GStreamer is a huge framework with tons of dependencies, and you should never try to build it from source unless you have no choice.

GStreamer’s native language is C (not C++). It can be directly called from C++ or Objective C. For other languages (Python, Java, etc.), you have a choice of either adding some C++ to your code via language interfaces (such as pybind11 or JNI) or using a GStreamer wrapper for your language. They generally exist but might be out of date and not support the latest GStreamer versions.

A C++ wrapper, gstreamermm, with nice C++ classes, used to exist, but unfortunately, it is not supported anymore.

Python GStreamer wrappers are popular, but again, compared to C/C++, they tend to use outdated versions (sometimes even 0.10 instead of 1.0).

To get the most out of GStreamer, and to understand it fully, you should use it in C or C++ (and not Python or other languages). This is what we will do in this C++ tutorial.

What Is the Use of GStreamer?

GStreamer has many uses, but we are interested in computer vision and audio processing, right? How can GStreamer help us?

Imagine the following situation. You are writing an application that processes video or audio. You need some library that would encode and decode audio and video in various codecs and formats so that you can process raw video or audio in your code. Or maybe you want something even funnier, like integrating your algorithms with RTSP streaming or a web back end, or building sophisticated real-time pipelines. Can we do that?

Wait, weren’t there C libraries for different codecs, like libx264 or libaac? Yes, but there are dozens of codecs and containers, each with its own library, with its own unique API, often clunky and unlike all other APIs. Unfortunately, the end users tend not to care about this fact: they expect your application to work with any audio or video format that exists and will be really surprised and frustrated if it doesn’t. Do you really want to code the low-level logic of decoding various file formats with about 20 different libraries like libx264? Probably not. What we want is “one library to rule them all”. We want a C/C++ library that would work with a large number of audio and video formats and codecs. This is harder than people often think.

Before giving you the answer, let’s mention a few options that do NOT work:

OpenCV is not a good choice, see the section on GStreamer and OpenCV below.

Beginners often tend to avoid this issue by preprocessing the input data. For example, you can open your input video file in a video editor (or, for nerds, with ffmpeg in terminal), extract the audio track as an uncompressed WAV, and then read your video with OpenCV, and your audio with libsndfile. Is it possible? Yes, and it can be sometimes justified for early R&D work. Is it a good idea? Definitely not if you want a finished product or a nice demo.

Often people try to use FFmpeg (or GStreamer) in the terminal, in shell scripts, via Python’s os.system() function, or through pipes like ffmpeg <some options> | python3 mycode.py, but this is really not much different from the previous option.

Now, the options that DO work.

GStreamer vs FFmpeg

Some operating systems (like Android and Windows) have their own OS-specific codecs API, often somewhat limited in formats supported, but in the worlds of Linux and cross platform there are basically only two good choices: FFmpeg and GStreamer. And “cross platform” means you can port your software everywhere (nice !), while, once again, “Linux” means “back end+embedded” (plus I just love Linux and use it for work). Nowadays, with things like AWS and Azure and Docker, Linux finally really moved from the geek-land into the mainstream.

So, FFmpeg and GStreamer, but which of the two is better? Both libraries will do the job. Both libraries are “umbrellas” over multiple low-level libraries like libx264. Both libraries support (at least in theory) various hardware video accelerators and hardware-oriented specs like Video4Linux 2 (used for e.g. camera feed). And GStreamer is not independent of FFmpeg: in fact, it uses FFmpeg for some codecs (the “av” prefix in GStreamer element names like avdec_h264 means FFmpeg).

The two libraries have, however, rather different philosophies. FFmpeg has only low-level encoding-decoding operations, while GStreamer allows you to design and play sophisticated media pipelines. Both are very nice, definitely try FFmpeg (C API) if you haven’t already, but this article is about GStreamer. How do I choose between the two? If you only want encoding-decoding and you are prepared to micromanage the whole pipeline (no easy task!), choose FFmpeg. If you want a pipeline-building library, definitely choose GStreamer. Also, GStreamer has many nice extras, from RTSP streaming to video special effects.

To summarize, here are the main reasons to use GStreamer in your computer vision or audio processing code:

Encoding and decoding a great number of audio and video formats (practically all that exist)
Building sophisticated media pipelines
Using GStreamer extras (network streaming, filters, media playback, etc.)
Using GStreamer-based third-party frameworks like Nvidia DeepStream or GstInference

Interlude: on Codecs and Containers

Audio and video tracks found in media files and streams are typically highly compressed using codecs, such as H265, VP9, or AC3. Encoded data is created from the raw data using encoders, and converted back to raw with decoders.

However, what if we want to put several media tracks into a single file? For example, one video track, several audio tracks (in different languages), and subtitles. Then you will need containers (or formats) such as AVI, QuickTime or MKV. Containers are created by muxers (which join media tracks), while the reverse operation of unpacking a container into separate tracks is performed by demuxers. Most modern media file formats (except for a few simplest ones: WAV, MP3) are containers.

Please do not mix up codecs with containers, they are two different things! For example, OGG is a container, while Vorbis is the codec most often used in OGG. An AVI container can contain a video in H264 or H265, and an audio in AC3 or AAC, or many other codecs.

How Do I Learn GStreamer?

Start with the official documentation. Seriously. It has a tutorial and a manual. There is nothing better. However, it only briefly touches on the topic which is of utmost importance to us: appsrc and appsink elements, or “Short-cutting the pipeline”. You can find numerous examples with appsrc and appsink on GitHub, but I didn’t find any good introductory tutorial on this topic, vital for audio and vision. Thus I wrote my own GStreamer tutorial in C++, and I will briefly cover it in the last section of this article. It also includes various appsrc and appsink examples, including “GStreamer+OpenCV” examples, showing how to use GStreamer and OpenCV in the same code.

GStreamer Pipeline Tutorial

How Does GStreamer Work?

It is covered pretty well in the official tutorial, so I’ll give only a very brief introduction. The basic GStreamer object is a pipeline (Fig. 1).

Fig. 1. GStreamer pipeline, from the official tutorial

It is built from elements (large boxes in Fig. 1), the GStreamer LEGO blocks, which have input-output ports called pads (small blue boxes). The pads can be linked together. The pipeline has a state (NULL, READY, PAUSED or PLAYING; the fifth value of the state enum, VOID_PENDING, only means that no state change is pending). When the pipeline is playing, it does so automatically, in multiple threads created by the GStreamer library.

When you try to link two pads, they negotiate, i.e. try to agree on a common data format, fixing all the little details like frame size, fps, etc. If they fail, the pipeline gives an error. Negotiation in GStreamer is based on capabilities or caps, for example (Note: they are NOT MIME types !):

video/x-raw,format=BGR,width=720,height=576

audio/x-raw,format=S16LE,layout=interleaved

for RAW (unencoded) video or audio data respectively. If negotiation fails, you can often fix it by inserting intermediate elements such as videoconvert, audioconvert and audioresample.
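To get a feeling for what such a caps string contains, here is a deliberately simplified parser for the two examples above. It is only a sketch: real caps can also carry typed values, ranges, lists and several structures, and in real code you would let gst_caps_from_string() do the work. The names SimpleCaps and parseSimpleCaps are my own.

```cpp
#include <map>
#include <sstream>
#include <string>

// A simplified view of one caps structure: a media type plus key=value fields.
struct SimpleCaps {
    std::string mediaType;
    std::map<std::string, std::string> fields;
};

// Splits "type,key=value,key=value" on commas. Handles plain values only.
SimpleCaps parseSimpleCaps(const std::string &caps) {
    SimpleCaps result;
    std::istringstream iss(caps);
    std::string token;
    bool first = true;
    while (std::getline(iss, token, ',')) {
        if (first) {                       // the first token is the media type
            result.mediaType = token;
            first = false;
            continue;
        }
        const std::string::size_type eq = token.find('=');
        if (eq == std::string::npos)
            continue;                      // ignore anything without '='
        result.fields[token.substr(0, eq)] = token.substr(eq + 1);
    }
    return result;
}
```

Negotiation can then be pictured as two pads comparing such structures and keeping only the media type and field values both sides accept.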

GStreamer in Terminal

While for “serious” GStreamer usage you need C or C++, nothing stops you from trying it out using GStreamer console tools. It will help you understand GStreamer basics and learn pipeline syntax and common elements. If you are reading this, we strongly encourage you to install GStreamer on your computer, download a few small audio and video file samples, and try out examples from this chapter. It’s good fun! Once again, the official documentation covers the “console GStreamer” rather well, so I will briefly show a few examples of my own that I find illustrative. The main tools are:

gst-launch-1.0 : Create and launch a GStreamer pipeline, our main tool
gst-play-1.0 : Play a media file (a minimal video player)
gst-inspect-1.0 : Inspect available GStreamer plugins
gst-discoverer-1.0 : Examine a media file, print information on codecs, etc.

Without further ado, let’s buy popcorn and start playing with GStreamer. gst-launch-1.0 receives a single argument: a text string describing the GStreamer pipeline. The syntax is simple: a number of GStreamer elements with optional options (pun intended). The neighboring elements are separated with either exclamation sign ‘!’ when they are linked, or space ‘ ‘, when they are not.

The simplest pipeline uses playbin, a high-level media playback element:

gst-launch-1.0 playbin uri=file:///home/seymour/Videos/suteki.mp4

It needs a URI (network URL or a full path to a file).

Elements audiotestsrc and videotestsrc create simple test audio and video signals. Elements autoaudiosink and autovideosink play the audio (speakers) and video (screen window) respectively on your computer. On some platforms they could be restricted in the caps they accept, so it’s always a good idea to put conversion elements in the middle:

gst-launch-1.0 audiotestsrc ! audioconvert ! audioresample ! autoaudiosink

gst-launch-1.0 videotestsrc ! videoconvert ! autovideosink

gst-launch-1.0 videotestsrc pattern=18 ! videoconvert ! autovideosink

Conversion elements can do simple format conversions like YUV to RGB video, or int16 to float32 audio, for raw audio or video only (NOT codecs). audioresample can resample the audio to a new sampling rate (e.g. from 16000 to 44100 Hz).

The GStreamer pipeline can be visualized using GraphViz software. Type in the console:

GST_DEBUG_DUMP_DOT_DIR=. gst-launch-1.0 videotestsrc pattern=18 ! videoconvert ! autovideosink

It will create a number of .dot files in the current directory (‘.’). Choose the one named “….PAUSED_PLAYING.dot”. The result is shown in Fig. 2 (such figures tend to be cluttered with details).

Fig. 2. A GStreamer pipeline visualized by GraphViz.

Branched Pipelines

You can create branched pipelines in GStreamer. The first type of branching happens when you duplicate a data stream with the tee element:

gst-launch-1.0 videotestsrc ! videoconvert ! tee name=t ! queue ! autovideosink t. ! queue ! autovideosink

It creates two windows with identical videos. Note how we name the tee element as t (any name could be used instead of t, e.g. cyberdemon), and then put space and not ! after autovideosink (no linking), then start another branch with t. (go back to the element named t, and try to link its other still unlinked pad). This pipeline is shown in Fig. 3.

Fig. 3. A branched pipeline visualized by GraphViz.
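
If you later assemble such pipeline strings in C++ (for example, to pass them to gst_parse_launch()), the tee syntax is easy to generate. Below is a minimal sketch using only the standard library. makeTeePipeline is a hypothetical helper name, not a GStreamer function, and it assumes that every branch should start with a queue:

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of GStreamer): builds a gst-launch style
// description with a named tee and one "queue ! <branch>" per output.
std::string makeTeePipeline(const std::string& source,
                            const std::string& teeName,
                            const std::vector<std::string>& branches) {
    // "<source> ! tee name=<teeName>" ends the first chain.
    std::string desc = source + " ! tee name=" + teeName;
    for (const std::string& branch : branches) {
        // A space (no '!') starts a new chain, "<teeName>." goes back to the tee.
        desc += " " + teeName + ". ! queue ! " + branch;
    }
    return desc;
}
```

Calling makeTeePipeline("videotestsrc ! videoconvert", "t", {"autovideosink", "autovideosink"}) produces a description equivalent to the one above, except that both branches refer to the tee by name.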

Another type of branching happens if an element has two or more source (output) pads with different media tracks. For example, let’s take the high-level decoding elements decodebin and uridecodebin. They behave similarly, except that decodebin receives data from a sink (input) pad, while uridecodebin receives data from a URI. So the two lines are very similar:

uridecodebin uri=<file name>

and

filesrc location=<file name> ! decodebin

except that the first one requires a URI with a full path. Let’s try to play a media file with uridecodebin:

gst-launch-1.0 uridecodebin uri=file:///home/seymour/Videos/suteki.mp4 name=u ! audioconvert ! audioresample ! autoaudiosink u. ! videoconvert ! autovideosink

Once again, you have two branches after uridecodebin:

uridecodebin ! audioconvert ! audioresample ! autoaudiosink
             ! videoconvert ! autovideosink

This pipeline behaves similarly to playbin. uridecodebin is a high-level element, which automatically creates a sub-pipeline with appropriate demuxer and decoders.
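
Since playbin and uridecodebin take a URI rather than a plain file name, C++ code usually has to convert one into the other. The sketch below is a simplified illustration under stated assumptions: toFileUri is a hypothetical helper, it accepts only absolute POSIX paths, and it does not percent-encode special characters such as spaces. In a real GStreamer program, gst_filename_to_uri() is the more robust choice.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: turns an absolute POSIX path into a file:// URI.
// Assumption: the path contains no characters that need percent-encoding.
std::string toFileUri(const std::string& absolutePath) {
    if (absolutePath.empty() || absolutePath[0] != '/') {
        throw std::invalid_argument("expected an absolute POSIX path");
    }
    // "file://" plus a path starting with '/' gives the familiar "file:///...".
    return "file://" + absolutePath;
}
```

A relative name like suteki.mp4 is rejected on purpose, because a URI needs the full path.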

Can we go to a really low level? Yes, but there is usually no need to. We can inspect our file with gst-discoverer-1.0 or ffplay. If we know that suteki.mp4 is a QuickTime file with AAC audio and H264 video, we can then play it with:

gst-launch-1.0 filesrc location=suteki.mp4 ! qtdemux name=d ! avdec_h264 ! queue ! videoconvert ! autovideosink d. ! avdec_aac ! queue ! audioconvert ! audioresample ! autoaudiosink

Here the two branches are:

filesrc ! qtdemux ! avdec_h264 ! queue ! videoconvert ! autovideosink
                  ! avdec_aac ! queue ! audioconvert ! audioresample ! autoaudiosink

We see demuxer and decoders, and a new element queue. Note that the queue in GStreamer is called queue, while the word “buffer” means something completely different (I’ll get back to it eventually). It’s always a good idea to use queue in branched pipelines to avoid a possible deadlock, when synchronizing tracks on playback or especially muxing, as GStreamer does not check for deadlocks.

Now let’s try encoding. Here you have no choice but to go to the low level: encoders + muxer, and sometimes also a parser.

Video:

gst-launch-1.0 videotestsrc ! videoconvert ! x264enc ! avimux ! filesink location=out.avi

gst-launch-1.0 videotestsrc ! videoconvert ! x265enc ! h265parse ! matroskamux ! filesink location=out.mkv

gst-launch-1.0 videotestsrc ! videoconvert ! vp9enc ! webmmux ! filesink location=out.webm

Audio:

gst-launch-1.0 audiotestsrc ! audioconvert ! wavenc ! filesink location=out.wav

gst-launch-1.0 audiotestsrc ! audioconvert ! lamemp3enc ! filesink location=out.mp3

gst-launch-1.0 audiotestsrc ! audioconvert ! vorbisenc ! oggmux ! filesink location=out.ogg

gst-launch-1.0 audiotestsrc ! audioconvert ! avenc_wmav2 ! asfmux ! filesink location=out.wma

And now the hardest case. Let’s decode and re-encode:

gst-launch-1.0 filesrc location=zoryana.webm ! decodebin name=d ! queue ! audioconvert ! avenc_aac ! avimux name=m ! filesink location=out.avi d. ! queue ! videoconvert ! x264enc ! m.

What was that? A pipeline with splitting and merging branches!

filesrc ! decodebin ! queue ! audioconvert ! avenc_aac ! avimux ! filesink
                    ! queue ! videoconvert ! x264enc !

Pipeline Tricks, GStreamer Real-Time vs. Offline Pipeline

There are a couple of extra tricks when designing pipelines. If there are multiple pads, the first suitable one is linked. This is not always desired. We can specify the pad name explicitly for linking, but only if we know the pad’s name, e.g. video_0 in demuxers:

gst-launch-1.0 filesrc location=zoryana.webm ! matroskademux name=d d.video_0 ! vp9dec ! videoconvert ! autovideosink

The second trick is the caps filter. If we write caps instead of an element between the two ! signs, then we force the negotiation process to only accept caps compatible with the specified caps. We can often use it to control elements that we cannot control directly, for example:

gst-launch-1.0 videotestsrc ! video/x-raw,format=BGR,width=1024,height=768 ! videoconvert ! autovideosink

Here the caps filter affects the negotiation between videotestsrc and videoconvert. videotestsrc cannot be programmed directly, but it is rather flexible at negotiations, and here we force it to produce 1024×768 BGR video. Similarly, a caps filter can be used to explicitly control the conversion elements, if we want to convert the media into a different sampling rate, frame size etc. Later we will work with appsrc and appsink. They can be configured with either direct caps (preferred) or a caps filter.
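
A caps filter is just text: a media type followed by comma-separated field=value pairs. When the frame size is only known at run time, the string can be generated in C++. This is a minimal sketch with a hypothetical helper name (rawVideoCaps); it covers only the three fields used in the example above.

```cpp
#include <string>

// Hypothetical helper: formats raw-video caps such as
// "video/x-raw,format=BGR,width=1024,height=768".
std::string rawVideoCaps(const std::string& format, int width, int height) {
    return "video/x-raw,format=" + format +
           ",width=" + std::to_string(width) +
           ",height=" + std::to_string(height);
}
```

The result can be placed between two ! signs of a pipeline string, for example "videotestsrc ! " + rawVideoCaps("BGR", 1024, 768) + " ! videoconvert ! autovideosink".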

The final trick is the sync option, present in most sinks, including autovideosink and appsink. Try the following pipeline:

gst-launch-1.0 filesrc location=suteki.mp4 ! decodebin ! videoconvert ! autovideosink sync=true

This is the default. sync=true means that autovideosink plays the video stream at 1x speed, provided that it has correct timestamps. In this case autovideosink actually sets the pace of the entire pipeline, as decodebin could decode the file much faster on modern computers. This is the GStreamer way of creating a real-time pipeline.

Now try changing it to sync=false and see what happens!

If, on the other hand, we used a filesink, like in the re-encoding example above, it has sync=false by default. The pipeline plays as fast as it can (depending on processing speed), usually much faster than 1x. This is the offline file processing, GStreamer way.

The sync option is important for appsink: depending on our computer vision application, either choice can make perfect sense.
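
A sink with sync=true can only keep the 1x pace if every buffer carries a sensible timestamp, and GStreamer expresses timestamps in nanoseconds. The sketch below shows the arithmetic for a constant frame rate given as a fraction (for example 30/1 or 30000/1001). framePtsNs is a hypothetical helper name, and the code assumes the stream starts at time zero.

```cpp
#include <cstdint>

constexpr std::uint64_t kNsPerSecond = 1000000000ULL;

// Hypothetical helper: presentation time (in nanoseconds) of frame number
// frameIndex for a constant frame rate of fpsNum/fpsDen frames per second.
// Note: the intermediate product can overflow for extremely long streams.
std::uint64_t framePtsNs(std::uint64_t frameIndex,
                         std::uint64_t fpsNum,
                         std::uint64_t fpsDen) {
    return frameIndex * kNsPerSecond * fpsDen / fpsNum;
}
```

At 30 frames per second, frame 30 is due exactly one second after the start. With sync=false the sink simply ignores these values and renders each frame as soon as it arrives.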

Read also:

Practical Aspects of Real-Time Video Pipelines

In this section, I will cover a few topics which are neither “GStreamer in terminal” (previous section) nor “GStreamer in C/C++” (next section).

Does OpenCV Use GStreamer? A Tricky Relationship between the Two Libraries

Remember, I promised to explain why OpenCV, a popular computer vision library, is not good for reading and writing video files with VideoCapture and VideoWriter respectively. First, and this is the main reason, OpenCV cannot work with audio tracks at all. Second, it is rather inflexible, for example, try to encode into memory and not a disk file, you cannot! Third, depending on how OpenCV is built, it might have very limited codec support or none at all. There are no guarantees. For example, on modern Ubuntu, apt-installed OpenCV (for C++) is pretty good, while pip-installed Python OpenCV has very limited encoding capabilities.

How does OpenCV work with videos? It uses various backends, which at least for Linux usually means (surprise, surprise) either FFmpeg or GStreamer. And a couple of years ago Ubuntu switched from FFmpeg to GStreamer (1:0 for the latter!). If you are using OpenCV video I/O, you are actually using FFmpeg or GStreamer, so why not cut out the middleman?

There is another topic worth mentioning here, GStreamer in OpenCV. If (and only if) OpenCV was built with GStreamer, you can use GStreamer pipeline strings instead of file names in OpenCV VideoCapture and VideoWriter, terminated with appsink or appsrc respectively. It is discussed a lot in places like Stack Overflow, however, I don’t find the idea especially good. While it can slightly expand OpenCV powers with things like RTSP, you still cannot have audio or pipelines with multiple sources/sinks or anything complicated.

A much better way to combine GStreamer with OpenCV (in my opinion) is presented in the next section. With appsink and appsrc, you can move the raw pixels back and forth between your C++ code and GStreamer pipeline. Once the frame is in your C++ code, you can do anything you want with it. For example, wrap it with OpenCV’s cv::Mat, process it with OpenCV, and send the result back to GStreamer. Or, run a neural network inference or any computer vision code you want.

GStreamer and Deep Learning

Nowadays, “computer vision” and “audio processing” very often means “deep learning”. Owing to Deep Learning (DL) popularity, a number of plugins and frameworks have been proposed to run a neural network inference within the GStreamer pipeline.

Nvidia Deepstream
https://developer.nvidia.com/deepstream-getting-started
This is an Nvidia GPU-only Video Deep Learning framework based on GStreamer. Apart from neural network inference with TensorRT, it also supports Nvidia accelerated encoding+decoding, with an option to run the entire pipeline on the GPU. It is Linux-only and requires strict CUDA and cuDNN versions, so it is better to run it in Docker if you want to try. It also runs on Nvidia Jetson devices.

GstInference
https://github.com/RidgeRun/gst-inference
A GStreamer framework based on R2Inference, which supports inferences with a number of DL frameworks, such as TFLite.

NNStreamer
https://nnstreamer.ai/
Another GStreamer DL framework, this one supposedly runs PyTorch models also.

Intel DL Streamer aka Gst-Video-Analytics
https://github.com/dlstreamer/dlstreamer
Another framework, this time from Intel. It is friendly to Intel stuff like VPU and OpenVINO, and also OpenCV.

Note that you don’t have to use any of these frameworks to do neural network inference. You can always move data to your code with appsink and appsrc, and run the inference yourself in your own C++ code (or even in Python code for that matter), enjoying total programmatic control over how you do the inference and visualization.

GStreamer vs Google MediaPipe

Here I compare GStreamer to another pipeline library, Google MediaPipe. I happened to play with both libraries in C++, and previously wrote a MediaPipe article (part1, part2, part3) in this blog. Let us now compare the two libraries (this is partly my subjective experience). At first glance, the two libraries are similar, as they are both multi-thread pipeline libraries. However, if you dig deeper, you will see numerous differences.

Background: GStreamer is old and time-tested, a freedesktop.org project widely used in GNOME. MediaPipe is relatively new, developed by Google.
Main Goal: Rather different. MediaPipe is mostly about Deep Learning, while GStreamer is mainly about playback, streaming and re-encoding media.
Deep Learning: MediaPipe can do Deep Learning with TensorFlow (Lite). It also has a number of pre-trained TensorFlow Lite-based “solutions”, and many people actually believe (completely mistakenly) that the “solutions” IS MediaPipe. GStreamer can only do DL with third-party frameworks.
Traditional audio and video processing (resampling, reencoding, resizing, filtering): GStreamer does these things much better and has a vast array of standard elements.
Languages: C with GObject for GStreamer, C++ for MediaPipe. Bindings for a few other languages are available, but you’ll need C++ to get the most out of both frameworks.
Platforms: Both support all common platforms, except that GStreamer does not support WASM. However, you’ll have to build MediaPipe from the source in order to use it in C++.
Data: MediaPipe: handles arbitrary data, but two or three special classes are available for images and audio. GStreamer: Highly specialized for audio+video via the caps system.
Data formats and negotiation: GStreamer: a sophisticated caps system and a wide variety of formats. MediaPipe: very few formats and negotiation is virtually non-existent.
Codecs and containers: GStreamer: Pretty much all codecs and containers that exist are supported via plugins. MediaPipe: Limited support based on OpenCV + FFmpeg.
Video with audio tracks: GStreamer: It’s easy to read a video file and split it into audio + video data within the same pipeline, same with writing files. MediaPipe: I am not sure if it’s possible at all with standard calculators, probably not. In other words, it does not qualify as “one library to rule them all”.
Network streaming: GStreamer: has plugins for network streaming. MediaPipe: Does not (if I remember correctly).
Pipeline definition: GStreamer: text string or C++ code. MediaPipe: ProtoBuf text string.
Internal structure: MediaPipe is generally simpler and easier to understand, to micromanage, and to extend with custom “calculators” (similar to GStreamer elements). For GStreamer, it is much harder to go “under the hood” and write custom elements. However, you can use appsrc and appsink, as explained in this article.
Timestamps and synchronization and real-time vs offline: In my opinion, MediaPipe does these things in a clearer and simpler way (offline by default), while in GStreamer default behavior depends on the sink used.
Queues: MediaPipe by default uses unlimited queues at each pipeline link. In GStreamer, you have to always add queue elements manually, with a few exceptions like appsrc and appsink. GStreamer is prone to deadlocks if you are not careful.
Documentation and tutorials: Good for GStreamer, bad for MediaPipe. MediaPipe documentation touts the “solutions” and largely ignores the C++ API.
Bazel factor: MediaPipe requires Bazel to build itself AND your project and it is pretty much incompatible with the “normal” C++ world of CMake and make and apt-installed libraries. This is very inconvenient, and seriously limits the possible uses of MediaPipe. In contrast, GStreamer is easy to install (with apt in Ubuntu) and perfectly friendly to CMake and make and other build systems.

All things considered, GStreamer is much easier to use in C++ projects due to the horrific “Bazel factor”. Otherwise, their goals and typical use cases are rather different.

Let’s Sum Up

So far, we have covered what GStreamer is, how it works, its use cases, and how to run it in the terminal. We have also explained the relationship between GStreamer and OpenCV, what options there are to run a neural network inference within the GStreamer pipeline and compared it with the Google MediaPipe library. Now, let’s do some coding – follow us to the GStreamer C++ tutorial!

GoodFirms: It-Jim Thrives by Focusing on the Intellectual Processing of Visual Information and Technical Solutions

Posted on November 7, 2022 by admin

It-Jim, founded in 2015 by a scientist, is now an R&D firm with 100+ successful projects in its portfolio and 10+ Ph.D.s on the team. The team offers consulting services and technical solutions in computer vision, image and signal processing, machine and deep learning, and augmented and mixed reality.

The company effectively caters to the needs of businesses from various industries and uses cutting-edge technologies to help them grow, thanks to experts in various disciplines such as physics, mathematics, radars, and biophysics on board.

What is unique about the team? A thorough understanding of image and signal processing theory, as well as advanced programming abilities. A combination of classical computer vision methods with various types of machine learning algorithms and cutting-edge deep learning architectures – this is exactly what is needed to deliver the best solution for a given problem based on available hardware and infrastructure. The experts develop a custom methodology for each client that perfectly meets the requirements and business needs, ensuring the robust performance of ML pipelines in production everywhere: mobile and embedded devices, cloud, and so on.

As a machine learning company, It-Jim has run 50+ ML and DL projects and constantly applies the latest achievements and state-of-the-art DL architectures in their research.

Thus, the team’s use of a pool of techniques to build various image processing solutions qualifies It-Jim as one of the top Artificial Intelligence companies in Ukraine on GoodFirms.

About the Author

Working as a Content Writer at GoodFirms, Anna Stark bridges the gap between service seekers and service providers. Anna’s dominant role is to figure out company achievements and critical attributes and put them into words. She strongly believes in the charm of words and leverages new approaches that work, including new concepts that enhance the firm’s identity.

Read also:

It-Jim's blog on computer vision

Talented Researchers Form the Backbone of It-Jim’s Incredible Computer Vision and AI Offerings: Goodfirms Interview

Posted on October 25, 2022 by admin

It-Jim is a renowned artificial intelligence solutions provider offering a wide variety of services, including computer vision, image processing, signal processing, machine learning, and augmented and mixed reality solutions. So far, the company has 100+ successful projects in its portfolio and a highly efficient team comprising 10+ PhDs.

The GoodFirms team interviewed Ievgen Gorovyi, the CEO at It-Jim, to learn more about the company and its values.

“It-Jim is a Ukraine-based company with expertise in visual intelligence and signal processing solutions. The company’s experts possess Ph.D. degrees in various mathematical disciplines,” shared the CEO Ievgen Gorovyi. “We provide technical consulting, R&D, and custom software development services for image and video analysis issues.”

The Commencement Story

“After finishing my Ph.D. in image and signal processing, I started my career as a freelancer,” Ievgen reveals. However, his ambitions grew over time, and he built a company with a group of highly talented and intelligent people with an absolute focus on computer vision.

Ievgen’s team of scientists and developers is highly experienced in analyzing and researching and is backed by complex problem-solving abilities. They focus on quality solutions for multiple platforms and hardware, including mobile devices, embedded boards, and cloud-based distributed systems.

Core Focus: Strategy Development

Regarding his role in the company, Ievgen shares that as a CEO, he focuses mainly on the strategic part: business development, anticipating the company’s growth path, analyzing trends, and more form an integral part of his profile that helps shape their ongoing growth strategy. He is also involved in technical and management-related tasks with multiple R&D and software development teams.

Business-Model

It-Jim’s business model consists of an in-house team of computer vision and deep learning engineers dedicated to researching, developing algorithms, and their deployment. A stickler for deadlines, the company makes sure to deliver quality solutions. The exceptional work offered by the company ensures accuracy and performance, coupled with clear communication and full-on collaboration.

Differentiating Factors

“Academic Excellence backed by solid commercial development experience in a highly complex domain such as AI makes our company stand out,” asserts Gorovyi.

In addition to the above, It-Jim offers a custom computer vision course for freshers.

The employees are chosen carefully because they work in a high-stakes arena. Junior developers often undergo a two-month trainee program under the supervision of experienced professionals in this field. This exposure allows newcomers to work on real projects during their training period.

The organization is well-focused on improving the Ukrainian CV community by providing internships and winter schools and delivering lectures to university students and IT professionals, allowing them to harness their profound skills.

The company caters to various industries:

Healthcare
Automotive
Entertainment
Sports analytics
Surveillance
Retail

Moreover, the company’s non-exhaustive service list is long. It includes customized computer vision development, software development, iOS and web development, deep learning solutions and deployment, extended reality (XR) development, digital signal processing research, and, last but not least, technical consulting.

Incredible Customer Satisfaction Rate

Customer satisfaction is essential to It-Jim’s worldwide success in the IT sector. “Let the clients do the talking for us,” says Ievgen. It’s no wonder happy clients have left praises for the company on the GoodFirms platform, which in a significant way, vouches for their excellent project outcomes.

“Also clear communication, R&D reports of algorithm development, business analysis, excellent product development process, and post-project support helps us create solid relationships with our clients,” he asserts.

GoodFirms Verdict

According to GoodFirms’ reviewers and analysts, It-Jim has the best team of engineers and scientists, who provide unmatched experiences for excellent software solutions through artificial intelligence technology, which places It-Jim among Ukraine’s top artificial intelligence companies in GoodFirms listings.

Future Plans

It-Jim plans to be a leader in computer vision for 3d development in the next ten years. The CEO reveals that the company wishes to become Metaverse’s key partner or contributor and also offer AI-based products for society.

Besides, Ievgen hopes to leave a mark in the sophisticated computer vision space for Ukraine’s community and across the globe.

To read the detailed interview with Ievgen Gorovyi, you can check GoodFirms.

About GoodFirms

Washington, D.C.-based GoodFirms is an innovative B2B Research and Reviews Company that extensively combs the market to find business services agencies amongst many other technology firms that offer the best services to their customers. GoodFirms’ extensive research process ranks the companies, boosts their online reputation, and helps service seekers pick the right technology partner that meets their business needs.

About the Author

Apple RoomPlan API Integration for Innovative AR Apps

Posted on September 22, 2022 (updated September 26, 2025) by admin

How to Integrate Apple’s RoomPlan API into Your iOS App: A Comprehensive Guide

Creating a 3D room model has historically been a lengthy, costly, and error-prone process. Real estate managers wasted hours and money hiring experts to make floor plans.

But here’s what changed everything: Apple introduced the RoomPlan API.

Now, users can visualize rooms using their iOS mobile devices in just minutes with incredible detail. AR app developers, proptech startups, interior design platforms, e-commerce, and real estate professionals can benefit from the RoomPlan API.

“When comparing scan dimensions to actual measurements using Apple’s RoomPlan API, they turned out to be accurate enough, with an error usually staying below 5%. This level of precision makes it viable for many professional applications, from interior design to real estate documentation.”

– Oleg Ponomaryov, CTO at It-Jim

This guide walks you through everything you need to know about integrating Apple’s RoomPlan API into your iOS app, namely:

What is Apple RoomPlan, and how does it work?
How to integrate Apple’s RoomPlan API into your iOS app.
Measurable benefits from the RoomPlan API integration.
Overcoming RoomPlan API limitations with proven workarounds.
Advanced use cases powered by It-Jim.

Let’s start by understanding how the Apple RoomPlan API works and its core properties.

What is the Apple RoomPlan API & Its Workflow?

Apple’s RoomPlan API is a framework that uses augmented reality (AR) and the LiDAR Scanner on iPhone and iPad to create 3D models of indoor spaces. It is built on top of ARKit, Apple’s framework for building AR apps.

LiDAR stands for Light Detection and Ranging and uses laser light to measure distances. The technology sends out beams and checks their reflections. RoomPlan API in iOS creates a parametric model that shows the positions and sizes of walls, doors, windows, furniture, and other appliances.

The RoomPlan functionality provides automatic object recognition, real-time 3D model reconstruction, and easy exports.

Here is a list of potential use cases of an AR app containing the Apple RoomPlan API:

Real estate: create virtual tours of properties and provide accurate floor plans.
Architecture: preview and change room layouts in real-time for faster design decisions.
Interior design: visualize how furniture fits in a room and plan renovations.
Facility management & Logistics: plan office layouts or maintenance paths, inventory space usage in commercial buildings
Home repair: estimate material needs for renovating projects and visualize the results.
Accessibility: help assess room layouts for mobility aids (e.g., wheelchairs), simulate navigation paths for accessible design compliance
Furniture retail: allow customers to visualize furniture in their homes.
Marketing: create engaging advertisements or digital promotions.
Insurance: provide accurate documentation of property layouts and valuable items used for insurance underwriting or claims processing.

Also, RoomPlan’s ability to generate accurate 3D models of indoor spaces makes it well-suited for emergency planning, evacuation modeling, and risk assessment in occupational safety contexts.

Thinking about building a custom AR app and integrating Apple’s RoomPlan API for your business?

Whether you’re building AR apps, using the RoomPlan API to visualize spaces, or creating digital property twins, unlock faster, more innovative development. We don’t just use RoomPlan API – we help enhance it with computer vision services. Reach out to ask questions and receive advice.

Feel free to contact us.

Why Apple’s RoomPlan Outperforms Traditional Methods

RoomPlan API in iOS outperforms older methods, such as Scene Reconstruction and manual CAD modeling. It offers faster results, better accuracy, and greater accessibility, all from one mobile device. Traditional 3D scanning produces unstructured point clouds or meshes.

In contrast, RoomPlan API generates a semantic understanding of interior spaces. Instead of just capturing shapes, it identifies and categorizes room elements. The API produces this comprehensive room data within minutes.

Additionally, previous methods required extensive technical expertise, specialized equipment, and considerable post-processing time.

How Does the Apple RoomPlan API Work?

The process for using the RoomPlan API is straightforward: launch the app, follow the steps to scan the room, and review the results shortly thereafter. You can access and edit the 3D room model anytime.

So, how does Apple RoomPlan API work from a technical perspective?

In brief, the API workflow consists of these three main steps:

1. Scanning

The RoomPlan API uses the device’s camera and LiDAR scanner. It captures the environment and identifies key features: walls, windows, doors, and openings.

2. ML Processing

Sophisticated ML algorithms analyze the captured data to identify room features and create a 3D model of the room.

3. 3D Output

The RoomPlan API in iOS gives results as parametric data. You can export this data in different Universal Scene Description (USD) formats. This property enables developers to easily add 3D models to their apps.

USD is a typical format for AR-based projects. You can edit these files later in tools like AutoCAD, Shapr3D, or Cinema 4D if needed.

RoomPlan API: Data Structure Overview

Apple’s RoomPlan can recognize the following aspects of the captured room:

Structural elements: walls, doors, windows, openings.
Furniture: chairs, tables, sofas, beds, storage units.
Room boundaries: floor plans, room dimensions.
Spatial relationships: object positioning and room layout connections.

Now, let’s see what data is contained in the 3D model produced by the technology. RoomPlan organizes scanned information into two primary categories: surfaces and objects.

4 Surface Types & Their Data Properties

First, let’s elaborate on surface types, properties, and applicable metrics. RoomPlan identifies four distinct types of surfaces that define a room’s structural boundaries:

Wall – primary structural boundary.
Door – entry and exit points.
Window – light sources and viewing areas.
Opening – passages without doors.

Surface Type	Description	Detection Capability	Relevance
Walls	Vertical structural boundaries	Precise positioning and dimensions	Structural layout for AR navigation and space planning
Openings	Doors, windows, passages	Type identification and measurements	Traffic flow analysis and accessibility planning; useful for layout planning and renovation
Floor	Horizontal base surface	Area calculation and boundaries	Foundation for furniture placement and room use
Furniture	Moveable and built-in objects	Category, size, and spatial relationships	Object interaction and interior design applications

Each of these surfaces contains a standardized set of six data properties.

Data Property	Description	Data Type	Notes
Confidence	Detection reliability	Discrete (Low/Medium/High)	Indicates scan quality
Dimensions	Size measurements	Width × Height	Depth always equals 0 (no thickness)
Transform	Position and orientation	4×4 matrix	Standard transformation matrix
Normal	Surface direction	3D vector	Perpendicular to the surface plane
Curve	Surface curvature	Variable/nil	Nil for flat surfaces
Completed edges	Scan completion status	Array	Tracks user scanning progress

15+ Objects & Data Properties

Apple RoomPlan API can detect a variety of objects, namely:

Furniture: bed, chair, sofa, table.
Kitchen appliances: dishwasher, oven, refrigerator, sink, stove.
Bathroom objects: bathtub, toilet, washer, dryer.
Other: fireplace, stairs, storage, television.

Objects share similar properties to surfaces but with key differences.

Property values include confidence, dimensions, and transform.

3D Dimensions: Full width × height × depth values.
Oriented Bounding Boxes: Match object orientation, not world coordinates.
Spatial Relationships: Position relative to walls and other objects.

Together, the dimensions and the transform define a bounding box around an object. The box is oriented: it follows the object’s own orientation rather than the world coordinate axes.

Data Property	Description	Object vs. Surface Difference
Confidence	Detection reliability	Same discrete values (Low/Medium/High)
Dimensions	Size measurements	3D values (width × height × depth)
Transform	Position and orientation	Defines a non-axis-aligned bounding box
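
To make the relationship between dimensions, transform, and the oriented bounding box concrete, here is a language-agnostic sketch written in plain C++. It does not use the RoomPlan API. The types Vec3 and Mat4, the function boxCorners, the row-major matrix layout with translation in the last column, and the choice of the box center as the transform origin are all assumptions made for illustration.

```cpp
#include <array>

using Vec3 = std::array<float, 3>;
using Mat4 = std::array<std::array<float, 4>, 4>;  // m[row][col], assumed layout

// Hypothetical helper: returns the 8 corners of a box of size dims
// (width, height, depth), centered at the origin of transform and
// rotated/translated into world coordinates by it.
std::array<Vec3, 8> boxCorners(const Vec3& dims, const Mat4& transform) {
    std::array<Vec3, 8> corners{};
    for (int i = 0; i < 8; ++i) {
        // Bits of i select the -half or +half extent along each local axis.
        const float x = ((i & 1) ? 0.5f : -0.5f) * dims[0];
        const float y = ((i & 2) ? 0.5f : -0.5f) * dims[1];
        const float z = ((i & 4) ? 0.5f : -0.5f) * dims[2];
        for (int r = 0; r < 3; ++r) {
            corners[i][r] = transform[r][0] * x + transform[r][1] * y +
                            transform[r][2] * z + transform[r][3];
        }
    }
    return corners;
}
```

Because the corners are computed in the object’s local frame and only then transformed, the resulting box rotates together with the object instead of staying aligned with the world axes.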

As shown above, the RoomPlan API can detect and visualize many elements. By default, detected objects are visualized as plain boxes; however, you can replace them with real furniture models to make the scan more detailed and valuable.

This structured method makes RoomPlan data ideal for creating top-notch AR apps, architectural analysis, and automated space planning.

Want to explore how the RoomPlan API can transform your project?

Let’s build a solution that goes beyond the RoomPlan API’s standard features and addresses its limitations. Such a custom solution uses advanced computer vision algorithms to enhance object recognition and layout accuracy, especially in cluttered or irregular rooms.

Reach out for a consultation.

Getting Started with the Apple RoomPlan API: Tech Perspective

Developers can seamlessly use the RoomPlan API in their iOS apps using one of two approaches:

Basic integration: add the RoomPlan API with minimal effort and without customization. Users will interact with the built-in scanning experience only.
Advanced integration: gain complete control over scanning parameters and real-time data processing. From this perspective, users can build detailed room plans and edit specific elements as needed.

Thus, the easiest way to integrate Apple’s RoomPlan API into your iOS app is to use the default RoomCaptureView in a Storyboard.

RoomPlan follows a clear component hierarchy:

The RoomCaptureView handles all visualizations and interactions with the end user.
A user scans the room with RoomCaptureSession, accessed via the corresponding view’s property.
The RoomCaptureSession itself utilizes the standard ARSession from ARKit.

How Do AR Solutions Benefit from RoomPlan API?

When building AR apps and using the RoomPlan API, focus on how to streamline your business processes.

According to Statista, revenue in the AR & VR market is expected to reach $46.6 billion in 2025. Companies are investing in AR technology because customers expect immersive, interactive experiences from the services they use.

Integrating Apple RoomPlan into your app improves development and user experience in several key ways:

Reduce software development time by more than 50% with this unique iOS technology.
Generate 3D floor plans in under 2 minutes.
Offer intuitive room capture in real estate, interior design, or AR apps.
Cut the need for specialized 3D modeling expertise on your team.

“RoomPlan integration cuts our MVP development cycle from 4 months to 6 weeks. We could focus on user experience and advanced functionality instead of building a scanning solution from scratch.”

– Yurij Gapon, Head of iOS at It-Jim

We can help you test the RoomPlan API integration in your project and ensure accuracy in real-world conditions, such as varying lighting and complex furniture setups.

The true strength of using the Apple RoomPlan API is not only in its scanning features. It also provides valuable business insights.

Here’s how RoomPlan translates technical capabilities into competitive advantages:

1. Faster Time to Market

Scan-based 3D room models eliminate the need for manual drawing, significantly speeding up product release timelines.

Teams can now iterate and deploy features in days instead of waiting weeks for professional architectural drawings. There is no need for CAD programs and similar tools, or for an external expert to take measurements.

2. Improved App Experience

Integrating RoomPlan into your iOS app offers an immersive AR experience with engaging, personalized interactions. They feel real because they fit the user’s actual environment, not just a theoretical space.

Users interact with real-world spatial data instead of static, generic layouts. Planning renovations, arranging furniture, or evaluating properties all become easier and more intuitive.

3. Optimized Processes

RoomPlan API streamlines floor plan creation by automating the process, reducing manual effort, and minimizing errors. It’s a valuable tool for professionals who need fast, reliable, and accurate results.

4. Data-Rich Outputs

RoomPlan output provides detailed object metadata, including:

Type classification.
Precise positioning.
Accurate dimensions.
Spatial relationships.

This structured data can be fed directly into analytics pipelines, AI training datasets, or modeling applications, with little or no extra processing.

5. Ready-to-Export Formats

The API provides seamless workflow integration with various export options:

USDZ files for AR apps.
Structured JSON for BIM/CAD tools.
Standard formats for cross-platform use.
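
As a rough illustration of the export options, the Swift sketch below writes one scan as both USDZ and JSON. It assumes a finished CapturedRoom; the helper name and the file names are hypothetical, and producing the JSON through the type's Codable conformance is one possible route rather than the only one.

```swift
import Foundation
import RoomPlan

// Hypothetical helper: saves a finished scan as USDZ and JSON in a folder.
func export(_ room: CapturedRoom, to folder: URL) throws {
    let usdzURL = folder.appendingPathComponent("Room.usdz")
    let jsonURL = folder.appendingPathComponent("Room.json")

    // USDZ model for AR viewers and 3D tools.
    try room.export(to: usdzURL)

    // CapturedRoom is Codable, so the structured data can be encoded as JSON.
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    try encoder.encode(room).write(to: jsonURL)
}
```

A BIM or CAD pipeline would typically need its own converter on top of this JSON, since the schema is RoomPlan-specific.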

Building AR Apps with Apple RoomPlan: It-Jim Experience

Our team has rolled out the RoomPlan API across various industries, enabling businesses to tackle spatial computing challenges in innovative ways.

Here are proven applications that we helped to implement:

Architecture: 3D Floor Plan Layout

Project Focus: The goal was to create precise 2D floor plans and 3D layouts of real spaces with actual dimensions.

Solution: A system captures spatial data from iPhone LiDAR scans, then creates 3D models and scaled 2D floor plans.

Result: A mobile app turns spaces into digital layouts in minutes, saving architects and real estate professionals a whole day of manual work.

Solution for Real Estate

Challenge: A client asked our team to enhance their prototype, which was built with the Apple RoomPlan API, and to support further improvements to the app’s functionality.

Solution: We developed a tool that creates 3D room models to help with buying and selling real estate properties.

Furniture Fitting AR App

Challenge: Customers struggled to see how furniture would look in their own spaces, which led to high return rates and unhappy customers.

Solution: We developed an AR app using the Apple RoomPlan API that lets users place furniture in real time with precise spatial awareness. Users scan their room once, then virtually place items with confidence in scale and fit.

“After It-Jim added RoomPlan to our property management platform, we cut manual surveying costs by 70% and improved accuracy. Property listings now include interactive 3D models generated in minutes, not days.”

– Feedback from our client.

RoomPlan API integration is therefore a solid investment. The technology gives businesses a way to stand out: it helps them create a better user experience and visual materials with less effort.

Have an AR-based project concept in mind and want to use the RoomPlan API?

Let’s build it together with Apple RoomPlan technology and advanced computer vision expertise. With a proven track record of implementing the RoomPlan API in iOS apps across diverse industries, we help you turn powerful technology into real-world solutions that deliver results.

Book a call to discuss the implementation strategy.

Overcoming Apple’s RoomPlan API Limitations

Note that RoomPlan is still a new API. Some things may change, and issues might get fixed in future updates.

For now, RoomPlan delivers useful but not perfectly accurate results. Measurements and object positions may have minor errors, which is acceptable for quick scans but not reliable enough for precision-oriented tasks.

Our team has extensive hands-on experience with the RoomPlan API, so we’ve identified its key limitations and know how to work around them.

For more on how we work with the existing API limitations, read also:

RoomPlan is Awful, and it’s Great!

For example, the RoomPlan framework has significant constraints, including the requirement for rectangular simplifications. The system attempts to reduce all objects and surfaces to a set of rectangles.

Additionally, the technology does not capture data from ceilings or skylights.

The current version of the RoomPlan API in iOS has several constraints, such as:

Limited object recognition – detects only a fixed set of common household items (e.g., chairs, tables, sofas). It does not identify less typical objects, such as water boilers or industrial equipment.
Multiple or large rooms – not designed for scanning several rooms or very large spaces in one go. Apple recommends a maximum of about 9×9 m (30×30 ft). Longer scans degrade tracking accuracy, risk overheating, and may lead to drift.
Measurement errors – measurements drift, with errors of up to ±5 cm per wall.
Incorrect wall thickness – models all walls with a uniform thickness of around 16 cm, regardless of real measurements. Exterior walls are always that thin; structures over ~50 cm break into two separate thin walls.
Door and window flaws – merges double doors or door-window combinations incorrectly.
Mirrored surface issues – large mirrors and mirrored wardrobes can confuse LiDAR, leading to missing geometry or phantom objects.
Surface shape limitations – assumes surfaces are rectangular or slightly curved, so it misrepresents angled walls, arched openings, or detailed trim.
Phantom (ghost) geometry – occasionally “sees” surfaces or objects that don’t exist; LiDAR noise can lead to phantom walls or objects.
No ceiling or skylight capture – does not capture ceilings or skylights, making it unsuitable for tasks requiring lighting design or accurate volume measurements.
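
Two of the numbers in this list, the ~9×9 m scan area and the ±5 cm wall error, are easy to turn into simple guards in app code. The Swift sketch below is purely illustrative: the type and function names are hypothetical and not part of RoomPlan, and the thresholds are simply the figures quoted above.

```swift
// Hypothetical footprint of a room, e.g. estimated before or after a scan.
struct RoomFootprint {
    var widthMeters: Double
    var depthMeters: Double
}

// Figure quoted in the list above: about 9×9 m per scan.
let recommendedMaxSideMeters = 9.0

// True when either side of the room exceeds the recommended scan size.
func exceedsRecommendedSize(_ room: RoomFootprint) -> Bool {
    return max(room.widthMeters, room.depthMeters) > recommendedMaxSideMeters
}

// Range in which the real wall length may lie, given the ±5 cm error
// quoted in the list above.
func lengthBounds(measuredMeters: Double,
                  toleranceMeters: Double = 0.05) -> ClosedRange<Double> {
    return (measuredMeters - toleranceMeters)...(measuredMeters + toleranceMeters)
}
```

An app could use the first check to suggest splitting a large space into several scans, and the second to show measurements as a range instead of a single figure.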

While powerful, RoomPlan isn’t perfect in every environment. With years of vision-based R&D, we know how to work with these and other limitations.

Apple continuously improves RoomPlan API capabilities with each iOS platform update. Our AI iOS development services and approach account for this evolution.

“We are leveraging new RoomPlan features as they’re released. Our clients enjoy Apple’s upgrades while keeping current features intact. This is the benefit of working with a team that knows the framework’s roadmap and even beyond.”

– Yurij Gapon, Head of iOS at It-Jim

Conclusion: Is Apple’s RoomPlan API Right for Your Project?

Apple’s RoomPlan API simplifies the creation of accurate 3D room models. You can use it with a LiDAR-enabled iPhone or iPad.

It simplifies floor plan creation, reduces errors, and enhances app features across various industries, namely:

Real Estate: virtual property tours and instant floor plan generation.
Interior Design: AR-powered furniture placement, estimating materials, and space planning.
Retail: store layout optimization and virtual showrooms.
Property Management: digital twin creation and facility maintenance.
Architecture: rapid as-built documentation and renovation planning.
and much more.

Its ability to quickly generate accurate room layouts also makes it a valuable asset for broader AR applications. The technology is excellent for:

Fast residential room scans within ~9×9 m.
Quick, parametric 3D models and object layouts.

What’s Next: The RoomPlan API is still new, and future updates are expected to improve its accuracy and stability.

In the future, we can look forward to improvements such as support for non-rectangular surfaces, scanning multiple rooms, and better detection of floors and ceilings. These upgrades will expand the capabilities of what we can achieve with spatial understanding in AR.

Please note: The technology is not suitable for high-precision needs, complex structural analysis, industrial settings, or multi-floor scanning.

Ready to advance your business with the RoomPlan API and computer vision?

We help you benefit from the RoomPlan API by testing, customizing, and integrating it into your iOS app. Contact our team to share your needs and find out how we can speed up your RoomPlan implementation.

Lecture: Introduction to Computer Vision

Posted on September 22, 2022 by admin

As part of the Guest Edu project of Kharkiv IT Cluster, our Chief Learning Officer Serge Yerin will deliver his talk, Introduction to Computer Vision, and tell you about:

✅ what Computer Vision is and its role in the AI field,
✅ real-world cases and applications,
✅ the main tasks a CV engineer faces and the means to solve them,
✅ where to start if you want to study and work in CV

📅 September 28 | 12:00
🌐 Online | Zoom
📝 Registration | https://forms.gle/hVDKFQSaKyuH8n26A

Computer Vision: What’s under the Hood

Posted on September 20, 2022 by admin

As Artificial Intelligence enthusiasts, we are always happy to support innovative activities and motivate youth to self-develop in the domain. This time we are contributing to the IASA Data Science champ – an online team competition for students interested in Data Science, Machine Learning, Deep Learning, and Artificial Intelligence, organized by IASA Student Council and hosted by Igor Sikorsky Kyiv Polytechnic Institute.

The lecture days of the event will include a talk by our Chief Learning Officer Serge Yerin – he will dive into what is under the hood of Computer Vision and how to start a career in this field 😉

Computer Vision Trainee Program 2022

Posted on September 5, 2022 by admin

The program is a perfect match for:
🎓 engineering and computer science students
💻 software developers who want to switch to the CV / ML / DL domain

It lasts up to 2 months and gives you a chance to work on a real CV project under the personal guidance of one of the best It-Jim experts.

At the end of the program, successful candidates will have the opportunity to continue working at the company 🤝

🔥 Apply with the form by September 23, 2022