AI 3D Generation: From Prototype to Production

Today, there is almost no digital field where 3D assets haven’t found their place: CGI, games, VR/AR, physics simulations, fashion design, marketplace product renders – you name it. Over the past decades, this spread into everyday life and transformed the 3D modeler role from a programmer in an engineering lab to an artist using specialized software.

With the controversial, yet revolutionary AI advancements in generating text, images, video, and music, it seems only natural that 3D generation should be product-ready out of the box, requiring at most a thin wrapper over an API call. But, here’s the twist: that is simply not the case.

So, how big exactly is the gap between expectations and reality in AI-based 3D asset generation, and is there a way to bridge it right now?

Reference ring: expectation vs. real inference

 

What Makes a 3D Asset Usable? Geometry, Materials, and Metadata

Let’s start from afar. The concept of 3D assets didn’t spring from computer science alone. The first mentions in a modern context date back to the Bézier curves that French automotive engineers came up with in the 1960s, followed by… Okay, stop. Maybe this is too far away from what we need.

What we really need is to understand how to represent the spatial position of an object, as well as its relevant properties, conveniently. Akin to 2D image raster and vector formats, there are two approaches for storing spatial information.

Meshes are the 3D equivalent of the raster format. We describe objects with discrete building blocks, but instead of a pixel, the minimal block here is a polygon in a 3D coordinate system, usually a triangle, though sometimes a four-cornered shape (quad) is more convenient. A single point of a mesh is called a vertex, a connection between two vertices is an edge, and a closed loop of edges that forms a polygon is a face. Simple enough. Meshes are used when artistic control, organic shapes, or real-time performance is required. A few typical mesh formats are OBJ, STL, and FBX.
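To make the vertex/edge/face vocabulary concrete, here is a minimal sketch of a mesh stored as plain Python lists. The unit square and the helper name unique_edges are illustrative choices, not part of any particular file format or library.

```python
# A unit square split into two triangles.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 1.0, 0.0),
]
# Each face is a triple of indices into the vertex list.
faces = [
    (0, 1, 2),
    (0, 2, 3),
]

def unique_edges(faces):
    """Return every undirected edge once, as a sorted list of index pairs."""
    edges = set()
    for a, b, c in faces:
        # A triangle (a, b, c) contributes the edges ab, bc and ca.
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return sorted(edges)

edges = unique_edges(faces)
print(len(vertices), len(edges), len(faces))  # 4 5 2
```

The diagonal (0, 2) is shared by both triangles, which is why the square has five edges rather than six.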

On the other hand, analogous to vector 2D images, we have CAD-like formats. These use precise, parametric, mathematical definitions such as NURBS, an evolution of the Bézier curves we mentioned earlier. Use CAD formats (STEP, IGES, SolidWorks) when dimensions and manufacturing precision matter. Native CAD generation and AI mesh-to-CAD converters aren’t even close to production-ready yet, so we’ll leave them off the table for now.

We covered geometry, but it’s merely the skeleton; to become a functional 3D asset, it must be fleshed out with properties. The standard for visual realism is Physically Based Rendering (PBR), which defines the metalness and roughness of a material, not just its color, while interactivity demands physical data like collision bounds and rigging. Without this metadata, a generated model is just a static mathematical shell, not a usable item for a game engine.

This “packaging” step is the true bottleneck. While AI can easily hallucinate an acceptable raw shape, it struggles to organize it into complex production formats (like FBX or USD) that bundle geometry, materials, and hierarchy together. The gap between “Expectation” and “Reality” lies here: users want a plug-and-play asset, but AI currently delivers only the raw, unpolished geometry.

 

CAD and mesh representations compared

 

How AI 3D Generation Works: From a Photo to a Mesh

A “3D generation model” is rarely a single model; it is usually a complex AI system comprising several specialized networks and pre/post-processing scripts disguised under a shared name. It takes a simple input (like a text prompt or an image) and attempts to return usable geometry and textures. To understand what is actually going on under the hood, we have to look at shape and appearance separately, as they are traditionally treated as sequential tasks. We’ll look at this from the “How it started vs. How it’s going” angle.

Brief historical overview

Initially, acquiring 3D shapes was a pure engineering challenge. Techniques like Structure from Motion (SfM) and Multi-View Stereo (MVS) relied on handcrafted feature detectors (like SIFT, Lowe 2004) to mathematically triangulate points from hundreds of high-overlap images.

The Deep Learning Era shifted this approach. 3D-GANs (Wu et al., 2016) attempted to map 2D images directly to 3D voxel grids, but they hit a “cubic memory wall”: doubling the resolution increased memory usage eightfold. A workaround arrived with Neural Radiance Fields (NeRF) in 2020: by focusing on novel view synthesis, the network learns to “picture” the object from any direction rather than storing its physical volume. With these generated views, mesh creation boils down to the classic reconstruction tools mentioned earlier, removing the need for the network to handle complex 3D geometry directly. Our separate NeRF in 2023: Theory and Practice guide covers the full training and rendering workflow in NeRFStudio, including the limitations that pushed the field toward generative methods.
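The “cubic memory wall” mentioned above is easy to verify with back-of-the-envelope arithmetic. The sketch below assumes a dense grid with one float32 value per voxel; real networks store more channels, which only makes the numbers worse.

```python
def voxel_grid_bytes(resolution, channels=1, bytes_per_value=4):
    """Memory needed for a dense cubic grid (float32 by default)."""
    return resolution ** 3 * channels * bytes_per_value

for res in (64, 128, 256, 512):
    mib = voxel_grid_bytes(res) / 2 ** 20
    print(f"{res}^3 grid: {mib:.0f} MiB")
# 64^3 grid: 1 MiB
# 128^3 grid: 8 MiB
# 256^3 grid: 64 MiB
# 512^3 grid: 512 MiB
```

Every doubling of the resolution multiplies the footprint by 2^3 = 8, regardless of the channel count.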

All these experiments across a couple of decades laid the groundwork, but practical 3D generation required a fundamental shift in approach. The generative branch began sprouting with DreamFusion (Poole et al., 2022). It introduced Score Distillation Sampling (SDS), allowing 2D diffusion models (like Stable Diffusion) to “judge” and optimize a 3D shape until it looked correct from every angle. This was “dreaming” 3D from 2D priors and was painfully slow, but it was the final missing piece that allowed for revolutionary progress. 

So, by 2022, the paradigm itself had changed. 3D generation was no longer the deterministic triangulation problem of the SfM era, which demanded lots of multi-view data. Instead, neural networks were taught some degree of 3D space awareness, even if via indirect means. With these semantic priors, as little as a single image can be turned into a mesh. Unfortunately, the quality of those generations was far from useful.

Unsettling DreamFusion generations

 

Path to The Modern Generation

With all the priors in place, a major practical shift followed in 2023 with the rise of multi-view diffusion. As 2D generation already showed great results, researchers tried “forcing” those models to understand 3D. Zero-1-to-3 pioneered this by fine-tuning Stable Diffusion on camera angles and 3D data, while One-2-3-45++ later fleshed out the concept into a complete 3D mesh generation pipeline.

One-2-3-45++ generations

 

Building on this in late 2023, the LRM architecture pioneered a family of Large Reconstruction Models, enabling near-instant, feed-forward generation. In this workflow, a 2D diffusion model first generates 4-6 consistent views from a single image, which LRM “stitches” into a Triplane-NeRF in about five seconds. The mesh still needs to be extracted via an algorithm like Marching Cubes, but the overall speed-up was impressive. While models like InstantMesh (2024), TripoSR (2024), and Hunyuan3D v1.0 (2024) refined this process, they couldn’t overcome the inherent flaws of Triplanes. Because Triplanes rely on 2D axis-aligned projections, they struggle with complex “concave” geometry or occlusions. Furthermore, if the initial views lack perfect pixel-alignment, the 3D output collapses into a blurry mess. Finally, Triplanes suffer from quadratic memory scaling: the same kind of problem that earlier approaches faced with 3D voxel grids, though less severe. Together, these limitations capped the level of detail these models could achieve and stalled their further development.
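A rough sketch of how a Triplane lookup works, and why it scales quadratically, may help here. It assumes single-channel planes, nearest-cell lookup, and summation of the three features; real models use learned multi-channel planes with interpolation, and the function names below are hypothetical.

```python
def make_plane(res, fill):
    """A res x res single-channel plane (a stand-in for learned features)."""
    return [[fill for _ in range(res)] for _ in range(res)]

def to_index(coord, res):
    """Map a coordinate in [0, 1] to the index of the cell that contains it."""
    return min(res - 1, max(0, int(coord * res)))

def query_triplane(planes, point):
    """Project a 3D point onto the XY, XZ and YZ planes and sum the features."""
    x, y, z = point
    res = len(planes["xy"])
    ix, iy, iz = (to_index(c, res) for c in (x, y, z))
    return planes["xy"][ix][iy] + planes["xz"][ix][iz] + planes["yz"][iy][iz]

res = 256
planes = {
    "xy": make_plane(res, 1.0),
    "xz": make_plane(res, 2.0),
    "yz": make_plane(res, 4.0),
}
feature = query_triplane(planes, (0.5, 0.25, 0.75))

triplane_cells = 3 * res ** 2   # grows quadratically with resolution
voxel_cells = res ** 3          # grows cubically with resolution
print(feature, triplane_cells, voxel_cells)  # 7.0 196608 16777216
```

Note that two points differing only in z read the same XY cell, which hints at why occluded or concave regions are hard to represent with axis-aligned projections.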

InstantMesh generations

This path of trial and error leads to the current state-of-the-art: the Native 3D Era (Late 2024-Present). These models generate assets directly in high-dimensional latent space, skipping the “stitching” phase and the “memory wall” entirely, much like how modern 2D models create images. Two families currently lead the field. The TRELLIS Family prioritizes topological flexibility; its latest version utilizes compact structured latents (SLAT) to encode geometry and PBR materials into a sparse grid. By being “field-free” (moving away from the Signed Distance Fields (SDFs) that earlier models in this class used), it can represent complex, non-manifold geometry like open clothing or thin leaves with ease. Meanwhile, the Hunyuan3D Family focuses on massive scale and precision. Hunyuan3D v3.0 leverages a 10-billion-parameter Diffusion Transformer (DiT), versus nearly 4 billion parameters in the latest TRELLIS, to treat 3D generation like a language problem, using a hierarchical “sculpting” approach. It generates a global coarse shape first and progressively refines it to ultra-high resolutions, effectively eliminating the surface noise and artifacts that plagued earlier generations.

 
Hunyuan3D v2.0 shapes comparison with analogs

 

Texture & Material Generation

Once the shape is solidified, it needs to be colored. Historically, this meant projecting 2D images onto a mesh (Texture Mapping). Early models like TRELLIS 1 and Hunyuan 2.0 treated this as a separate, decoupled task, which often resulted in a “plastic” look and noticeable texture drift. Modern versions have shifted toward Native PBR Generation, predicting Albedo, Metalness, and Roughness simultaneously by analyzing both the input image and local geometric details like curvature or sharpness.

TRELLIS 2 utilizes Structured Latents to predict material parameters for specific 3D points directly. By generating material tokens that are perfectly aligned with geometry tokens within the same grid, the materials are “baked in.” This provides superior localization, preventing “bleeding” between different surfaces (think about keeping a gemstone’s high-gloss shine strictly separate from a matte metal ring).

Hunyuan 2.1/3.0 uses a sequence-based latent instead of a fixed grid, so its authors came up with Mesh-Conditioned Multi-View Diffusion. It essentially “paints” the asset by observing the geometry from multiple angles. To solve the localization problem, it injects 3D coordinates into the 2D process using Normal Maps, CCM, and 3D-Aware RoPE. This ensures the model recognizes a single 3D point across different views. The model also keeps color and material data in parallel branches that stay synchronized through shared attention.

 

Hunyuan3D vs TRELLIS: Comparing the SOTA AI 3D Generation Models

A model’s true value often depends on its adaptability. While proprietary systems often work well out of the box, you have no means to customize them for niche requirements. Having established that the SOTA comes down to two leading “Native 3D” families, the next question is accessibility: how open are these systems? TRELLIS is completely open-source. Both training and inference code are available, making it very developer-friendly. In contrast, Hunyuan3D v3.0 is currently closed and available only via API. Its predecessor, v2.1, has released weights and code, but the training script requires significant tweaking to work reliably (or to work at all), so we can consider it an “open-weights” model at best.

Simple object generation with Hunyuan v2.1 (Utah Teapot): input image and a still of the generated 3D model

 

When we put standard shapes into these models, they come out nearly perfect, but such simple results barely provide additional value. The real test is where this technology meets industry-specific requirements, and currently, the winner is determined by how much “mess” your pipeline can tolerate. Advertising is the immediate sweet spot; since the goal is a beautiful view rather than perfect polygons, marketers can use heavy, unoptimized AI meshes for offline renders. E-commerce and entertainment sit in the middle, successfully using AI as a “prop factory” for rigid objects and background fillers where complex animation isn’t required. However, sectors such as gaming and high-end fashion remain the biggest laggards. Due to noisy topology and the absence of true physics simulation capabilities, AI still struggles to generate characters that deform correctly or dresses that realistically flow in the wind, leaving these areas stuck in a “prototype-only” phase until the next breakthrough. Video generation went through this same transition roughly two years earlier. For context on where 2D generation stands now, our AI video generation tools overview covers the current landscape; the trajectory in both fields is nearly identical. To back this up, let’s consider two industry examples:

Example 1
Industry: Advertising/E-commerce (Jewelry)
Input: A Gemstone Ring
Challenges: Requires sharp, hard-surface precision with reflective textures.
Expected result: A model that represents the input image 1 to 1 without hallucinations, with smooth surfaces and geometrically correct gems.

Example 2
Industry: Manufacturing (3D printing)
Input: A Tabletop Monster Miniature
Challenges: Features thin parts and a complex organic silhouette.
Expected result: The model has a distinct base, thin details represented well enough to be printable, and the whole mesh is watertight.

 

 

Advertising/E-commerce (Jewelry): input reference ring compared with Trellis v2, Hunyuan v2.1, and Hunyuan v3.0 generations

 

 

 

Manufacturing (3D printing): input figure compared with Trellis v2, Hunyuan v2.1, and Hunyuan v3.0 generations

 

At first glance, all models appear to perform well. There is an obvious dependency: the larger the model, the more satisfactory the results seem (and the longer inference takes). However, it is worth asking whether this trend truly holds.

Hunyuan v3.0 (left) and Trellis v2 (right) generations’ wireframes

 

Under a digital microscope, the seemingly clean surface dissolves into a wireframe of millions of chaotic triangles. When a human models a mesh, components like gems are built with minimal polygons and a logical structure, making minor edits as simple as moving a few vertices. You can also predict a skilled modeler’s quality based on their portfolio. Generative AI is different. Most algorithms extract meshes from latents with extremely high-density topology, making manual sculpting or editing a nightmare. It’s often faster to remodel the entire sample than to fix the topology by hand. Furthermore, performance is inconsistent: while models handle familiar shapes well, they often struggle with specific details or views, as Trellis does with gems.

 

Trellis v2

 

Hunyuan v3.0

 

 

Close-up of Hunyuan v3.0 textures

On to textures. While Hunyuan v3.0 wins on fidelity, Trellis v2 is better at keeping materials from bleeding into one another. However, both models struggle with PBR. Shadows and highlights are incorrectly painted onto the albedo rather than being handled by metalness and roughness channels. Because distinct materials are merged into a single layer, reusability and editing are nearly impossible. Often, what appears to be complex geometry is actually a “hallucinated” flat texture. For instance, Hunyuan v3.0 makes small diamonds look more like apricot seeds. 

 

So, what are we left with?

The current state of 3D AI follows an 80-20 rule. For familiar objects and clean camera angles, you can expect nearly perfect results. However, that final 20% of accuracy is incredibly difficult to reach. Unlike 2D AI, where you can easily fall back on Photoshop or inpainting, 3D models produce assets with such high-density, “messy” topology that polishing turns into a nightmare.

While all this may sound critical, the field has, in fact, skyrocketed recently. The leap from DreamFusion to models like Hunyuan v3 is massive, with new major breakthroughs still dropping every few months. We can get immense value from these tools right now, but they aren’t “plug and play.” They require specific tailoring, heavy post-processing, and cleanup to be production-ready. If you don’t think through your workflow, the costs for computing and polishing labor will add up quickly.

 

Custom AI 3D Generation Pipelines: How IT-JIM Builds Them

Turning raw AI output into a professional asset is where the actual engineering happens. You need a structured pipeline that takes raw generations as “digital clay” and refines them into production-ready geometry through five essential stages. The same mesh problems show up in device-based photogrammetry. Our 3D reconstruction on iOS guide covers the ObjectCapture workflow, which hits the same mesh density and topology walls described here.

 

 

The process begins with the first stage: Fine-tuning Generation. Since 3D models (TRELLIS, Hunyuan) are more temperamental than 2D tools like Stable Diffusion, you often need to adapt open-source models to your specific category or camera angles. This requires significant data and compute, as standard training scripts rarely run optimally (or correctly, in the case of Hunyuan) without heavy modification. In case anything goes wrong, as it often does with open-source code, pay attention to issues in the source repo. Even then, fine-tuning has limits and cannot fix every edge case. But it is almost always worth the effort as each step down the line depends on the quality of the initial generation.

The next stage is Segmentation, where you split a single, unmanageable mesh into components. Open-sourced models like P3-SAM, SAMesh, or PartField serve as the backbone here. Instead of fighting one massive blob, segmenting allows you to run specialized logic on different parts: fixing an important part with high fidelity while keeping the base as-is, or replacing it with some preset. This stage is the foundation for organized editing and material management, but it often takes more resources than the generation itself.
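Once a segmentation model has labeled the mesh, the actual split is mostly bookkeeping. The sketch below assumes per-face labels are already available (for example, exported from one of the models above); the function name and the label strings are illustrative, not any model’s real output format.

```python
from collections import defaultdict

def split_by_label(vertices, faces, face_labels):
    """Split one mesh into per-label sub-meshes with re-indexed vertices."""
    grouped = defaultdict(list)
    for face, label in zip(faces, face_labels):
        grouped[label].append(face)

    parts = {}
    for label, part_faces in grouped.items():
        remap = {}           # old vertex index -> new index inside this part
        part_vertices = []
        new_faces = []
        for face in part_faces:
            new_face = []
            for old in face:
                if old not in remap:
                    remap[old] = len(part_vertices)
                    part_vertices.append(vertices[old])
                new_face.append(remap[old])
            new_faces.append(tuple(new_face))
        parts[label] = (part_vertices, new_faces)
    return parts

# Toy input: a square made of two triangles, each assigned to a different part.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
parts = split_by_label(vertices, faces, ["gem", "band"])
print(sorted(parts))  # ['band', 'gem']
```

Vertices on the boundary between two parts are duplicated into both, and each part is left open along the cut, which is exactly the gap the next stage has to close.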

Because standard splitting often leaves “holes” where segments connect, Completion becomes the next vital stage, powered by models like X-part or HoloPart (both are open-source). This phase uses the initial mesh to “hallucinate” missing geometry, ensuring every segmented part is a separate, manifold object. This makes the assets physically viable rather than just visual shells, but comes at the cost of high memory requirements and additional time.
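A quick way to tell whether a part still has holes is to count how many triangles share each edge. The sketch below is a simplified check under that one criterion: it does not test face orientation or self-intersections, which a production validator would also need to cover.

```python
from collections import Counter

def is_watertight(faces):
    """True if every undirected edge is shared by exactly two triangles."""
    edge_count = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[(min(u, v), max(u, v))] += 1
    return bool(edge_count) and all(n == 2 for n in edge_count.values())

# A closed tetrahedron, and the same shape with one face removed.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
open_tetrahedron = tetrahedron[:-1]
print(is_watertight(tetrahedron), is_watertight(open_tetrahedron))  # True False
```

Edges that appear only once mark the boundary of a hole, so the same counter can also be used to locate where completion is needed.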

Texture Assignment follows as the next stage in the pipeline, because AI textures generated for the whole mesh rarely satisfy production requirements. By having distinct geometry for different components, you can assign PBR materials from a high-quality database or run painting models on individual parts, ensuring that metal looks like metal and cloth looks like cloth.
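With parts separated, assigning materials from a database can be as simple as a lookup. The sketch below is a hypothetical preset library: the material names and numeric values are illustrative placeholders, not measured data or any specific product’s schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PBRMaterial:
    base_color: tuple   # RGB in [0, 1]
    metalness: float    # 0 = dielectric, 1 = metal
    roughness: float    # 0 = mirror-like, 1 = fully diffuse

# Hypothetical preset library; the values are illustrative only.
MATERIAL_LIBRARY = {
    "gold": PBRMaterial((1.0, 0.77, 0.34), metalness=1.0, roughness=0.2),
    "gemstone": PBRMaterial((0.9, 0.1, 0.2), metalness=0.0, roughness=0.05),
    "fabric": PBRMaterial((0.3, 0.3, 0.35), metalness=0.0, roughness=0.9),
}

def assign_materials(part_to_material):
    """Resolve each segmented part name to a preset from the library."""
    return {
        part: MATERIAL_LIBRARY[name]
        for part, name in part_to_material.items()
    }

assignment = assign_materials({"band": "gold", "stone": "gemstone"})
print(assignment["band"].metalness, assignment["stone"].roughness)  # 1.0 0.05
```

Because each part carries its own material record, swapping gold for silver later is a one-line change rather than a texture repaint.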

Finally, the Cleanup stage addresses the chaotic topology. By using sequence-based tools like MeshAnything v2 that place vertices similar to a human artist, or applying remeshing algorithms to redistribute polygons, you can turn dense, uneditable topology into clean, even surfaces. It is far more reliable to retopologize these simple, segmented pieces than to attempt to fix a complex, branching mesh all at once.
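To give a feel for what redistributing polygons means in practice, here is a minimal sketch of vertex clustering, one of the simplest classical decimation schemes. It is a stand-in chosen for brevity, not what MeshAnything v2 or production remeshers do, and the grid mesh is a toy input.

```python
def cluster_decimate(vertices, faces, cell_size):
    """Merge all vertices that fall into the same grid cell, then drop
    triangles that collapsed or became duplicates."""
    cell_to_new = {}
    sums, counts, old_to_new = [], [], []
    for x, y, z in vertices:
        cell = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        idx = cell_to_new.setdefault(cell, len(sums))
        if idx == len(sums):          # first vertex seen in this cell
            sums.append([0.0, 0.0, 0.0])
            counts.append(0)
        for axis, value in enumerate((x, y, z)):
            sums[idx][axis] += value
        counts[idx] += 1
        old_to_new.append(idx)

    # Each new vertex is the average of the vertices merged into it.
    new_vertices = [tuple(s / n for s in total) for total, n in zip(sums, counts)]

    new_faces, seen = [], set()
    for face in faces:
        a, b, c = (old_to_new[i] for i in face)
        if a == b or b == c or a == c:
            continue                  # triangle collapsed to an edge or a point
        key = frozenset((a, b, c))
        if key not in seen:
            seen.add(key)
            new_faces.append((a, b, c))
    return new_vertices, new_faces

# Toy input: a flat 4 x 4 grid of quads (25 vertices, 32 triangles).
n = 4
grid_vertices = [(i / n, j / n, 0.0) for i in range(n + 1) for j in range(n + 1)]
grid_faces = []
for i in range(n):
    for j in range(n):
        a, b = i * (n + 1) + j, (i + 1) * (n + 1) + j
        grid_faces += [(a, b, b + 1), (a, b + 1, a + 1)]

coarse_vertices, coarse_faces = cluster_decimate(grid_vertices, grid_faces, 0.5)
print(len(grid_vertices), "->", len(coarse_vertices), "vertices")
print(len(grid_faces), "->", len(coarse_faces), "triangles")
```

Clustering reduces density but does nothing for edge flow, which is why artist-like retopology tools are still needed for assets that must be edited or animated.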

The Current State of AI 3D Generation: Strengths, Limits, and What to Expect Next

Tracing the evolution of 3D AI from basic photogrammetry to sophisticated “Native 3D” models makes one thing clear: the technology has not yet achieved “out-of-the-box” usability for professional workflows. The gap between a chaotic AI hallucination and a production-ready asset cannot be bridged by simply waiting for a better model; it requires fine-tuning and the implementation of a robust post-generation workflow. This pipeline might involve segmenting the mesh into essential components, replacing parts with high-quality presets, or retopologizing for clean geometry. It may also require thickening thin parts to meet 3D printing requirements, assigning realistic PBR textures for a final render, or many other steps, depending on the specific use case.

 

While current technology excels at generating visual prototypes suitable for advertising and e-commerce, it still falls short of the rigorous topological and physical standards required for high-end manufacturing or gaming. In conclusion, we are not waiting for a model that does everything, but rather building the workflows that yield results here and now. As AI quality improves, it will open doors to previously unreachable industries. This progress creates a paradox: the better the models get, the more crucial the adaptation layer becomes to move from a simple visual prototype to a complex, physics-ready asset. If you are evaluating whether to build this kind of pipeline in-house or with a partner, our computer vision development services page covers how we work.

AI 3D Generation for Games, Jewelry, Fashion, and Advertising 

The state of 3D AI plays out very differently depending on what you need to do with the output. In some industries the current models are good enough to ship; in others, a custom post-processing pipeline is the only thing that separates a prototype from a product.

Game studios

Game studios spend real money on environmental assets, props, and background characters that players rarely look at directly. AI 3D generation can already automate much of that work: rigid objects, furniture, vehicles. That frees artists for the characters and hero items that require real craft. The catch is topology: game engines want clean, low-poly meshes with proper UV maps, and raw AI output is neither. A post-generation pipeline that covers segmentation, retopology, and LOD generation turns that raw geometry into something a game engine can actually use. IT-JIM builds that kind of pipeline for studios that need to grow asset output without growing the team.

Jewelry and accessories retail

Jewelry e-commerce is a strong fit. Brands that add 3D and AR product views see higher conversion rates, and rings, pendants, and watches are exactly the rigid, hard-surface objects current models handle best. The problem is precision: gemstones and metallic settings require accurate PBR material separation that out-of-the-box models consistently get wrong. A pipeline fine-tuned on a brand’s own SKU catalog, paired with a curated material database for metals and stones, can produce render-ready assets at scale. The ring example from earlier in this article shows exactly what the starting point looks like. A custom pipeline is what closes that gap.

Fashion and apparel

Fashion is split. Anything that drapes or stretches still defeats current generation models. Rigid accessories (bags, shoes, eyewear) are a different story. A well-designed pipeline can triage SKUs by category: send accessories through the automated route and send garments to a 3D artist. That split alone cuts the cost per digital asset significantly.

Product visualization and advertising

Campaign renders only need to look right in a still image or short video. Nobody inspects the wireframe. AI-generated geometry, even with dense topology, is perfectly usable for offline rendering. A pipeline that goes from product photography to render-ready 3D asset is a practical, lower-cost alternative to commissioning a 3D modeler for each SKU.

 

Building production-ready 3D pipelines

The post-generation workflow in this article is real engineering work. Getting from a raw AI mesh to something production-ready requires expertise across computer vision, 3D geometry processing, and model fine-tuning. Clean topology, proper materials, and the right output format for your renderer or game engine do not come automatically. IT-JIM builds that pipeline for clients who need working output, not proof-of-concept demos.

We have fine-tuned open-source models like TRELLIS and Hunyuan on domain-specific data, built segmentation tools, and written retopology and material assignment steps that turn a dense AI mesh into something a game engine or e-commerce renderer can actually use. Several of those projects are in our portfolio.

If you are scoping out an AI 3D generation project in product visualization, game asset production, jewelry e-commerce, or another area and need a team that has already solved the post-generation side, get in touch.

 

 

SAM Audio: Redefining Sound with Multi-Modal AI

The Moment Audio Caught Up with Vision

If you’ve ever tapped on a person in an Instagram photo and watched them lift cleanly from the background as a sticker – that’s Meta’s Segment Anything Model working quietly behind the scenes. Instagram’s Cutouts feature, confirmed by Meta to be built on SAM, turned what used to be a careful manual masking job into a single tap. And with SAM 3, released in late 2025, the model took another step: it added text prompting. Instead of clicking, you can now type “yellow school bus” into a video and the model finds it, segments it, and tracks it through every frame. Now Meta has brought the same thinking to audio.

SAM Audio, released by Meta’s Superintelligence Labs in December 2025, does for sound what SAM did for images. Describe it, point to it in a video frame, or mark a moment in the timeline of a waveform where it’s clearly audible, and the model separates it from everything else. One model covers tasks that previously each required their own specialized system: speech denoising, stem separation, speaker diarisation, instrument isolation, arbitrary sound extraction – the full range of audio processing services now addressable through a single promptable interface. As a foundation in multi-modal AI, it unifies text, vision, and temporal cues into a single generative system for sound. In this blog post, we’ll walk through what the model can do, how well it actually performs in practice, and what the paper contributes to the audio AI field – including some details worth paying close attention to if you’re thinking about integrating or building on this technology.

 

Tell It What You Want: Three Ways to Talk to SAM Audio

The most natural way to think about SAM Audio is as a model with three different input channels, each suited to a different situation. They can be used independently or combined for more precise control.

Text prompt 

The simplest starting point. You type what you want to extract (“acoustic guitar,” “crowd noise,” “male speech,” “violin section”), and the model does its best to find and isolate that sound within the mix. In practice, this turns the system into a powerful AI audio separation engine that behaves almost like a search bar for sound: most of the time, good enough phrasing gets you what you need. This works well when the target is easy to describe and acoustically distinct from everything else around it. Where it starts to struggle is when the target is hard to pin down in words, such as close variants of similar sounds or a specific effect within a dense soundscape, where text alone can’t fully disambiguate what you’re after. For similar-sounding targets coexisting in the same mix, the video mask prompt is usually the handiest approach.

Video mask 

When there’s accompanying video, this prompting mode becomes genuinely powerful. Instead of describing the sound, you show the model where it’s coming from. Click on the drummer in the frame, draw around the guitar amplifier, tap on the speaker whose voice you want to isolate and the model extracts the audio associated with that visual region. For anyone working with video content, this replaces what would previously have required access to the original multitrack sessions. You just point. Although, for musicians searching for a particular sound effect, span prompting is where it gets most interesting and truly becomes an audio-native tool.

Span prompt 

This is the most distinctive feature, and the one most worth understanding if you’re a sound engineer or producer. Instead of describing the sound or pointing to a visual source, you highlight a short segment of the audio timeline – a moment where the target sound is clearly audible and relatively isolated. That segment becomes a fingerprint. The model uses it to find and extract acoustically similar material throughout the rest of the recording. Think of a field recording where a specific bird call appears and disappears across 10 minutes of material, or a backyard capture where you want to pull out just the sound of wind rattling a piece of metal, a texture you heard once clearly, buried somewhere in the mix. Find one clean moment where it’s audible, mark it, and the model hunts it across the whole track, isolating it from the residual sounds.

All three modes can be used on their own or combined in any pairing.

 

How Good Does It Actually Sound?

This is the question that matters most for anyone considering using the model in a real workflow, and it deserves an honest answer rather than a recitation of benchmark numbers.

The first thing to understand is that SAM Audio is a generative model. It doesn’t carve a sound out of a mix the way a mask-prediction model would. It synthesizes a new version of the target sound, conditioned on the mixture and the prompt. This is an important distinction when it comes to evaluation. The standard objective metric for separation quality is SDR (Signal-to-Distortion Ratio), which works by comparing the model’s output sample-by-sample against a clean reference recording of the target sound. But since SAM Audio generates rather than extracts, even minor differences in timing between the output and the reference, things that are completely imperceptible to a human listener, cause SDR scores to drop. Comparing SAM Audio’s raw SDR numbers against those of a mask-based discriminative model is like comparing two different kinds of instruments on a scale calibrated for only one of them. The paper acknowledges this directly, which is part of why the authors developed new evaluation methods (more on that in the next section).
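The timing sensitivity described above can be reproduced with a toy experiment. The sketch below uses the basic energy-ratio form of SDR on a pure 1 kHz tone; benchmark implementations are more elaborate, and the sample rate, frequency, and two-sample delay are arbitrary assumptions chosen for illustration.

```python
import math

def sdr_db(reference, estimate):
    """Basic SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / error)

sample_rate = 16_000
freq = 1_000.0          # Hz
num_samples = 16_000    # one second of audio

def tone(delay_samples=0, gain=1.0):
    """A sine tone, optionally delayed by a few samples or scaled in level."""
    return [
        gain * math.sin(2 * math.pi * freq * (i - delay_samples) / sample_rate)
        for i in range(num_samples)
    ]

reference = tone()
slightly_quiet = tone(gain=0.99)    # 1% level error, perfectly aligned
shifted = tone(delay_samples=2)     # identical sound, 0.125 ms late

print(round(sdr_db(reference, slightly_quiet), 1))  # about 40 dB
print(round(sdr_db(reference, shifted), 1))         # about 2.3 dB
```

A listener would struggle to tell either estimate from the reference, yet the delayed copy scores almost 38 dB worse. A pure tone exaggerates the effect, but the mechanism is the same for any sample-wise comparison.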

With that caveat in mind, across the benchmarks where comparison is meaningful (general sound separation, speech, music, and instrument separation in both in-the-wild and professionally produced recordings), SAM Audio outperforms prior general-purpose models and reaches competitive or state-of-the-art performance against specialized systems that were purpose-built for a single domain. That’s a significant result: one model beating tools that were designed to do only one thing.

In real-world use, though, there’s a practical nuance worth knowing. The training data included not only high-quality studio recordings but also a large volume of medium-quality video audio, the kind of material you’d find on the internet, recorded with phone microphones in noisy environments. This is actually what gives the model its broad generalization ability. But it also means the model has learned many sounds as they typically exist in the wild: embedded in their natural context, surrounded by ambient noise. When you try to isolate something like an ambulance siren, the sound of a passing train, or distant crowd noise (sounds that in real recordings almost never appear in silence), you’ll often find that the extracted result carries some of that environmental texture with it. The model isn’t doing something wrong; it’s producing what it learned is a realistic version of how that sound exists in the world. It’s something to factor in when choosing whether SAM Audio is the right tool for a given task, versus a more specialized model trained on cleaner data for a specific domain.

Under the Hood. What the Paper Actually Contributes

From here, we’re getting more technical. This section is aimed primarily at deep learning engineers and researchers. The paper makes contributions in four areas worth looking at closely: the generative architecture, span prompting, a new automatic evaluation model, and a new approach to subjective evaluation.

Architecture

SAM Audio is built on a Diffusion Transformer trained with flow matching. The core framing is important: the source audio mixture is not the direct input being processed – it’s a conditioning signal. The model learns to generate the target stem by iteratively refining from noise, guided jointly by the audio mixture and the user’s prompt. This is fundamentally different from the mask-prediction paradigm that has dominated audio separation (Demucs, MossFormer2, and most production-grade separation tools fall into this category). Those models effectively learn a filter: given the mixture, predict a mask that when applied recovers the target. SAM Audio doesn’t filter – it generates. The architecture is much closer to how modern music or image generation models work, where the source material conditions the generation process rather than being directly transformed. Prompt encodings – text via a language encoder, video via visual features from masked regions, spans by tokens aligned to audio frames – are all injected as conditioning into the Diffusion Transformer alongside the mixture representation.
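The idea of generating by iteratively refining from noise can be sketched as Euler integration of a velocity field. This is a toy illustration only, not SAM Audio’s actual sampler: the oracle_velocity function is a hypothetical stand-in for the trained network and cheats by looking at the target, whereas the real model has to predict the velocity from the mixture and the prompt.

```python
import random

def oracle_velocity(x, t, target):
    """Hypothetical stand-in for the trained network.

    On a straight-line path from noise to target, the velocity that brings
    the current point x to the target by t = 1 is (target - x) / (1 - t).
    """
    return [(tg - xi) / (1.0 - t) for xi, tg in zip(x, target)]

def sample(target, steps=8, seed=0):
    """Euler integration of dx/dt = v(x, t) from t = 0 (noise) to t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from Gaussian noise
    for k in range(steps):
        t = k / steps
        v = oracle_velocity(x, t, target)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x

target_stem = [0.0, 0.5, -0.25, 1.0]   # toy "audio" samples
generated = sample(target_stem)
print([round(g, 6) for g in generated])
```

The mixture never passes through a filter in this loop: it would only enter as an extra input to the velocity function, which is the sense in which it acts as conditioning rather than as the signal being transformed.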

Span prompting

Technically, span conditioning works by converting the selected time intervals into a frame-synchronous binary token sequence. Each frame is marked as either silent (<sil>) or active (+) depending on whether the target event is present. This sequence is embedded via a learnable embedding table and concatenated channel-wise with the audio features before entering the DiT backbone, giving the model an explicit temporal prior about when to extract rather than what the sound is. This is fundamentally different from reference-based speaker extraction methods, which require a clean enrollment recording of the target source. SAM Audio needs only a timestamp.
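
A minimal sketch of the interval-to-token step, assuming a fixed frame rate and treating any frame that overlaps a selected interval as active. The function name, the frame rate, and the integer ids are hypothetical; in the model, the resulting ids would index the learnable embedding table, which is not shown here.

```python
import math

SIL, ACTIVE = "<sil>", "+"
TOKEN_IDS = {SIL: 0, ACTIVE: 1}  # hypothetical ids for the embedding lookup


def spans_to_frame_tokens(spans, duration_s, frame_rate_hz):
    """Convert (start_s, end_s) intervals into one token per audio frame."""
    num_frames = math.ceil(duration_s * frame_rate_hz)
    tokens = [SIL] * num_frames
    for start_s, end_s in spans:
        # A frame is active if it overlaps the interval at all.
        first = max(0, math.floor(start_s * frame_rate_hz))
        last = min(num_frames, math.ceil(end_s * frame_rate_hz))
        for frame in range(first, last):
            tokens[frame] = ACTIVE
    return tokens


# A 2-second clip at 4 frames per second with one selected span.
tokens = spans_to_frame_tokens([(0.5, 1.25)], duration_s=2.0, frame_rate_hz=4)
token_ids = [TOKEN_IDS[t] for t in tokens]
```

Because the sequence has exactly one token per frame, it can be aligned with the audio features frame by frame, which is what makes the channel-wise concatenation described above possible.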

One particularly useful implementation detail from the paper: when a user provides only a text prompt, the system can use a secondary model called PEA-Frame, which automatically predicts the frame-level activity of that sound event within the mixture and feeds it back as span conditioning. The paper shows that joint text + span conditioning consistently outperforms text-only prompting, so when you describe a sound in words, the model can quietly upgrade itself to also know when that sound is happening.

SAM Audio Judge (SAJ)

Evaluating generative separation models is genuinely hard. SDR requires a clean reference, which doesn’t exist for real-world recordings, and is also biased against generative outputs for the timing reasons described above. CLAP similarity – another reference-free metric that measures how well the output matches the text description in an embedding space – turns out to correlate poorly with actual human judgment of separation quality, as the paper demonstrates. SAJ is a model trained directly on human perceptual quality annotations. It takes a mixture, a text prompt, and a separated output, and predicts a quality score without needing a reference. Its predictions correlate much more strongly with human listening test results than CLAP similarity does. This matters well beyond SAM Audio itself – SAJ is a genuinely useful contribution for anyone working on audio separation evaluation, providing a way to assess real-world separation quality without the constraints of synthetic benchmarks.
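
To make “correlates with human judgment” concrete, here is a small sketch of how such a comparison can be computed with a Pearson correlation. The numbers are made-up placeholders, not results from the paper, and `metric_a_scores` / `metric_b_scores` are hypothetical stand-ins for any two automatic metrics being compared against listener ratings.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Illustrative placeholder data: one entry per separated clip.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]     # mean listener rating
metric_a_scores = [1.1, 2.0, 2.9, 4.2, 4.8]  # tracks the listeners closely
metric_b_scores = [3.0, 1.0, 4.0, 2.0, 3.0]  # only loosely related

corr_a = pearson(human_scores, metric_a_scores)
corr_b = pearson(human_scores, metric_b_scores)
```

A metric is only useful as a proxy for listening tests when its correlation with the human ratings is high, which is the property the paper reports for SAJ relative to CLAP similarity.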

Subjective evaluation protocol 

The paper also rethinks how listening tests are designed for this class of model. This is a harder problem than it looks, and the paper redesigns the evaluation protocol from scratch rather than reaching for the standard tools.

MUSHRA (the traditional approach) requires a clean reference stem to compare against, which simply doesn’t exist for real in-the-wild recordings. Single-stimulus MOS tests avoid that dependency but are not very sensitive to small differences between models and are prone to anchoring drift across a session. The paper sidesteps both with a side-by-side Absolute Category Rating protocol with an always-on preference tie-breaker. Listeners see two model outputs for the same input alongside the original mixture and prompt, scoring each independently across three dimensions: Recall (how much target made it into the extraction), Precision (how much non-target leaked in), and Faithfulness (how similar the extracted sound is to the original mix). They then answer a forced-choice preference question regardless of whether their numerical scores already differ. The SAJ model described earlier was trained to predict these exact human scores, which is why it correlates so much more strongly with listener judgment than SDR or CLAP similarity. The subjective protocol isn’t just a validation step; it’s the foundation the automatic metric was built on.

Taken together, these contributions mark a departure for target sound separation, a field that has been dominated by discriminative, mask-predicting models. SAM Audio is a generative pipeline end to end: the first foundation model of this kind that unifies text, visual, and temporal span prompting in a single framework, and that matches or exceeds specialized systems across multiple domains simultaneously.

Conclusions

SAM Audio is a meaningful step for the audio AI field as a genuine foundation model that consolidates tasks previously requiring separate specialized systems into one flexible, promptable interface, covering speech, music, and general sound.

The real excitement, though, is what comes next. A model like this is a building block. It enables agents that can understand audio contextually and respond to natural language instructions – a mix engineer describing what they want rather than reaching for a plugin, a post-production workflow that can process field recordings based on spoken instructions, a creative tool for musicians that lets them explore separation as part of composition. The combination of span prompting with generative quality opens up use cases that simply weren’t possible before.

MediaPipe Pose Estimation for Sports Apps: Deep Dive, Deployment, and Limitations

Human pose estimation has emerged as a cornerstone technology in next-gen fitness and video analysis applications. For sports startups and developers building AI-powered coaching or performance tracking apps, real-time posture and movement analysis unlocks new value far beyond step counters or GPS tracking. Pose estimation is becoming foundational in AI-powered sports training where real-time feedback and motion tracking outperform legacy wearables.

Among available pose estimation frameworks, Google’s MediaPipe stands out as a popular choice for mobile-first MVPs. It’s fast, lightweight, and surprisingly production-ready, but it also comes with its own quirks and architectural trade-offs. This article explores:

  • Why MediaPipe is often chosen for sports AI prototypes
  • How it compares to alternatives like OpenPose, ARKit Vision, and MoveNet
  • Common pitfalls when deploying MediaPipe on iOS and beyond
  • How to turn raw pose data into actionable video analysis and performance insights

Why MediaPipe Is a Go-To for Sports MVPs

MediaPipe is more than a pose estimation model. It’s a graph-based perception framework optimized for real-time mobile video analysis pipelines. Sports app developers choose MediaPipe for its:

  • Fast experimentation loop: Python prototype to mobile integration in days
  • Out-of-the-box pipelines: Pose, hands, face, and holistic models ready to deploy
  • Efficient on mobile: Runs on CPU or GPU with low latency (30+ FPS on mid-range phones)
  • On-device privacy: Enables edge-based video analysis without cloud compute

That said, MediaPipe only provides raw pose estimation landmarks. To deliver sports-specific insights, developers must implement domain logic like biomechanical metrics, rep counting, and performance scoring.
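
As an example of that domain logic, here is a minimal sketch of two common building blocks: a joint angle computed from three 2D landmarks, and a rep counter with hysteresis on that angle. The function names and the thresholds (`down_below`, `up_above`) are hypothetical and would need tuning per exercise; the landmark coordinates are assumed to be already extracted as `(x, y)` tuples.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle at point b, in degrees, between segments b->a and b->c."""
    bax, bay = a[0] - b[0], a[1] - b[1]
    bcx, bcy = c[0] - b[0], c[1] - b[1]
    dot = bax * bcx + bay * bcy
    cross = bax * bcy - bay * bcx
    return math.degrees(math.atan2(abs(cross), dot))


def count_reps(angles, down_below=100.0, up_above=160.0):
    """Count down-then-up cycles; two thresholds avoid double counting jitter."""
    reps, is_down = 0, False
    for angle in angles:
        if not is_down and angle < down_below:
            is_down = True       # athlete reached the bottom position
        elif is_down and angle > up_above:
            is_down = False      # athlete returned to the top: one full rep
            reps += 1
    return reps


# Knee angle from hip, knee, and ankle landmarks (image coordinates).
knee_angle = joint_angle_deg((0.0, 0.0), (0.0, 1.0), (1.0, 1.0))
squat_reps = count_reps([170, 150, 95, 90, 120, 165, 170, 98, 165])
```

Using separate “down” and “up” thresholds, rather than a single cutoff, keeps a noisy angle that hovers near one value from registering as several reps.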

What MediaPipe Enables in Sports Apps

  • Posture and alignment feedback through pose estimation
  • Phase segmentation (e.g. analyzing stages of a golf swing)
  • Timing, symmetry, and video analysis for athlete movement

What MediaPipe Doesn’t Do Natively

  • Detect or interpret equipment (e.g. racket, club, bat)
  • Deliver actionable coaching feedback out of the box
  • Provide sports semantics – those must be manually built

Pose vs Holistic Models

  • Pose model: Easier to integrate, supports multiple people, but can misidentify limbs
  • Holistic model: More stable anatomically, includes face and hands, but single-person only

Framework Comparison: MediaPipe vs OpenPose vs ARKit vs MoveNet

Feature | MediaPipe (BlazePose) | OpenPose | Apple Vision Framework | MoveNet
Platform Support | Android, iOS, Web, Desktop | Cross-platform (GPU required) | iOS only | Android, iOS, Web
Performance | Real-time (30+ FPS) | GPU-dependent, slower | 60 FPS on iPhones | Real-time mobile
Accuracy & Keypoints | 33 landmarks | 25+ landmarks | 19 joints | 17 landmarks
Multi-Person Tracking | Limited | Excellent | Single person | Single person
3D Depth Capability | 2.5D relative depth | Mostly 2D | Full 3D if LiDAR | 2D only
Ease of Integration | Easy for basics; harder for custom pipelines | Complex; research-focused | Seamless for iOS only | Developer-friendly

Tech Pitfalls: Limitations of MediaPipe in Sports Video Analysis

While MediaPipe performs impressively in ideal conditions, developers should be aware of its limitations in real-world sports contexts:

Model Accuracy and Depth

  • BlazePose is not anatomically constrained
  • Depth estimates can be unstable or noisy
  • 2D overlays look clean but don’t translate well to biomechanics
  • Limb orientation may flip (e.g. arm facing camera vs away), which breaks 3D interpretation

Multi-Person and Occlusion Challenges

  • Holistic model only tracks one person
  • Pose model handles multiple users but can randomly flip a limb’s direction when it points toward or away from the camera
  • Hands may appear disconnected or incorrectly matched when athletes overlap
  • Extra limbs or partial hands in frame cause rapid degradation; if three hands are visible, expect failure

Sports-Specific Edge Cases

  • Detection range is limited for distant players (e.g. on a tennis court)
  • Equipment (rackets, clubs, bats) often confuses the model
  • Two close hands holding a device may appear unrealistically far apart in 3D, which suggests such cases are underrepresented in the training data
  • Side angles or fast spins reduce tracking fidelity

Eye and Face Tracking

  • Assumes symmetrical eye movement
  • Model may mirror or misinterpret one eye
  • Uneven gaze tracking due to dataset imbalance

For developers working on face-centered features like eye state or attention detection, Apple’s native Vision Framework offers an alternative with better depth and stability in iOS-only contexts.

These gaps are why advanced sports and healthcare apps often evolve past MediaPipe’s stock models. 

From Pose Estimation to Sports Insights

Landmarks alone don’t deliver value. A production-grade video analysis or fitness AI solution needs:

  • Smoothing and stabilization to remove jitter
  • Multi-view merging for more accurate 3D insights
  • Sport-specific metrics (e.g. joint symmetry, sequencing)
  • Pose retargeting to compare against an ideal motion pattern

MediaPipe offers great signal quality, but interpretation still happens downstream. Turning that signal into a sports product requires domain expertise and product design.
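
As one example of that downstream work, here is a minimal sketch of jitter smoothing with an exponential moving average over per-frame landmarks. This is a deliberately simple choice, not the only option, and the function name and the `alpha` value are assumptions for illustration; landmarks are assumed to arrive as lists of `(x, y)` tuples in a consistent order.

```python
def ema_smooth(frames, alpha=0.5):
    """Smooth landmark positions across frames.

    frames: list of frames, each a list of (x, y) landmark tuples.
    alpha: weight of the newest frame; lower values smooth more but lag more.
    """
    smoothed = []
    previous = None
    for landmarks in frames:
        if previous is None:
            previous = list(landmarks)  # first frame passes through unchanged
        else:
            previous = [
                (alpha * x + (1 - alpha) * px, alpha * y + (1 - alpha) * py)
                for (x, y), (px, py) in zip(landmarks, previous)
            ]
        smoothed.append(previous)
    return smoothed


# One landmark that jumps from (0, 0) to (1, 1) and then stays there.
raw = [[(0.0, 0.0)], [(1.0, 1.0)], [(1.0, 1.0)]]
stable = ema_smooth(raw, alpha=0.5)
```

The trade-off is latency: heavier smoothing hides jitter but makes the overlay trail behind fast movements, which matters for sports with quick swings or jumps.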

Techniques adapted for athlete motion have also proven effective in assessing motor symptoms during neurological movement analysis.

Deploying MediaPipe to iOS and Other Platforms

MediaPipe’s cross-platform promise includes iOS, Android, web, and desktop. But iOS presents specific technical challenges:

  • Uses Bazel for builds, not CocoaPods or SwiftPM
  • Requires building custom frameworks in C++
  • Real-time inference runs at ~30 FPS, but UX suffers without smoothing and gating logic

Still, the benefits are compelling: no cloud, fast on-device pose estimation, and strong privacy compliance. Developers experimenting with MediaPipe’s native API can explore a minimal working example in this C++ starter tutorial.

For Android, MediaPipe is available via Gradle or precompiled AARs. On the web, MediaPipe.js runs directly in-browser using WebAssembly, making lightweight video analysis available with no installation.

Signs You’ve Outgrown MediaPipe

When do you need something beyond MediaPipe?

  • Multi-person tracking must be stable and identity-aware
  • Sport-specific insights require complex kinematics
  • Quality must be predictable across hardware
  • Precision is essential for clinical or rehabilitation use

Building sports apps that can handle real-world edge cases, from occlusion to equipment interference, often requires more than off-the-shelf tools.

Conclusion

MediaPipe is a flexible, efficient, and surprisingly powerful tool for real-time pose estimation and video analysis in sports and fitness apps. Its unique architecture and mobile-first design make it ideal for MVPs and on-device intelligence. But successful deployment requires more than plugging in a model. You’ll need to:

  • Build post-processing for stability and interpretability
  • Design your analytics pipeline for real-world edge cases
  • Understand when MediaPipe is sufficient and when it’s time to go custom

For companies scaling beyond MVPs, we help deliver robust sports AI solutions that translate motion into measurable outcomes.

Beyond Smartwatches: Building AI Athlete Performance Apps with Pose Estimation

Wearable fitness trackers and smartwatches have become standard gear for athletes, collecting metrics like heart rate, speed, and distance. However, these devices often fail to provide insights into how the movement was performed. Enter pose estimation, a computer vision technique that extracts biomechanical data from video, opening up new frontiers in sports performance tracking and sports analytics software.

From What to How: The Shift Toward Movement Understanding

Traditional tracking answers “what happened”: steps taken, GPS distance, or calories burned. But high-performance training and injury prevention demand deeper insight into how the athlete moved. Pose estimation provides it by analyzing joint angles, symmetry, sequencing, and compensation patterns, adding new capabilities to sports analytics software.
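
As a small illustration of a symmetry metric, here is a sketch that compares a left-side and right-side measurement (for example, peak knee angles) as a percentage difference relative to their mean. This is one common way to define a symmetry index, not a standard the article prescribes, and the function name is hypothetical.

```python
def symmetry_index(left, right):
    """Percentage difference between sides; 0 means perfectly symmetric."""
    mean = (left + right) / 2.0
    if mean == 0:
        return 0.0  # both sides are zero, so there is no asymmetry to report
    return abs(left - right) / mean * 100.0


# Peak knee flexion in degrees for the left and right leg.
asymmetry = symmetry_index(110.0, 90.0)
```

Tracked over sessions, a value like this turns “the left side looks different” into a number a coach can monitor.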

Using just smartphone video, pose estimation software can turn visual input into a detailed map of joint movement without expensive motion capture labs. This opens up scalable applications in sports performance tracking, enabling remote coaching, video analysis, and rehab monitoring.

Why Pose Estimation Matters for Sports Training

Pose-based video analysis enables:

  • Detailed Technique Analysis
    Coaches can see exactly how an athlete moves, such as knee angle during squats, arm swing symmetry, and hip extension. This enables targeted feedback and performance optimization.
  • Injury Risk Detection
    Pose tracking flags risky mechanics such as valgus knee movement and asymmetries that often precede injuries. This is critical for both elite athletes and youth programs.
  • Personalized Coaching and Rehab
    Training programs can be tailored to individual biomechanics. In rehab, pose-based tracking ensures symmetry and progress through recovery.
  • Progress You Can See
    Improvements in sprint angles, stroke straightness, or jump technique can be quantified and tracked over time, motivating athletes to train smarter, not just harder.
  • Remote and Scalable Coaching
    Athletes can record practice on their phone, receive pose-driven feedback, and compare their form to elite models without requiring an in-person session.

Pose estimation complements wearables. Together, they deliver a 360-degree view of athlete performance.

Core Features of a Sports Performance Tracking App

Let’s break down what goes into a successful sports analytics software solution.

Key Features

  • Video Capture and Upload
    Easy-to-use camera interface with guidance for framing athletic movement.
  • Pose Estimation and Metrics Extraction
    AI analyzes frames for biomechanical data such as knee angles, stride symmetry, and joint velocities.
  • Visualization and Feedback
    Skeleton overlays, comparison videos, angle charts, and coach-style feedback like “Your landing angle is 12 degrees below optimal.”
  • Session History and Tracking
    Progress tracking over time is ideal for athletes and coaches measuring training impact.
  • Multi-Athlete Coaching Interfaces
    For teams or training platforms, coach dashboards allow reviewing, commenting, and planning across users.
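
To show how a measured metric can become coach-style feedback like the example above, here is a minimal sketch. The function name, the tolerance, and the message wording are hypothetical; a real app would draw the optimal values from sport-specific reference data.

```python
def angle_feedback(metric_name, measured_deg, optimal_deg, tolerance_deg=3.0):
    """Turn a measured angle into a short, readable coaching message."""
    diff = measured_deg - optimal_deg
    if abs(diff) <= tolerance_deg:
        return f"Your {metric_name} is within the optimal range."
    direction = "above" if diff > 0 else "below"
    return f"Your {metric_name} is {abs(diff):.0f} degrees {direction} optimal."


message = angle_feedback("landing angle", measured_deg=33.0, optimal_deg=45.0)
```

The tolerance band keeps the app from nagging athletes about differences smaller than the pose model can reliably measure.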

Tech Stack Considerations

  • Pose Estimation Engine
    MediaPipe is a popular choice for MVPs thanks to efficient on-device inference. 
  • Mobile App Development
    Native iOS and Android apps offer tight performance, but cross-platform tools like Flutter or React Native are also viable depending on the team.
  • Backend Infrastructure
    Cloud services such as Firebase support user management, video uploads, and session history. For heavier analysis or multi-angle processing, cloud-based inference pipelines using GPUs can be deployed alongside mobile apps.
  • Post-Processing Logic
    Jitter filtering, confidence gating, and biomechanics-driven insights are essential for user trust.
  • Front-End Visualization
    Clean, intuitive UI for displaying movement metrics, skeleton overlays, and progress graphs is key to user adoption.
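
As an illustration of confidence gating, here is a minimal sketch that only releases landmarks for metric computation when every required joint is visible enough. It assumes each landmark has been converted to a plain dict with `x`, `y`, and `visibility` keys; the function name, the joint names, and the `min_visibility` threshold are hypothetical.

```python
def gate_landmarks(landmarks, required, min_visibility=0.6):
    """Return (x, y) for each required joint, or None if any is unreliable."""
    selected = {}
    for name in required:
        landmark = landmarks.get(name)
        if landmark is None or landmark["visibility"] < min_visibility:
            return None  # skip this frame rather than report a bad metric
        selected[name] = (landmark["x"], landmark["y"])
    return selected


frame = {
    "left_hip": {"x": 0.50, "y": 0.40, "visibility": 0.95},
    "left_knee": {"x": 0.52, "y": 0.60, "visibility": 0.90},
    "left_ankle": {"x": 0.51, "y": 0.80, "visibility": 0.30},  # occluded
}
leg = gate_landmarks(frame, ["left_hip", "left_knee", "left_ankle"])
thigh = gate_landmarks(frame, ["left_hip", "left_knee"])
```

Dropping a frame is usually less damaging to user trust than showing a confident-looking angle computed from a guessed ankle position.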

What Strong Apps Do Differently in Sports Analytics

Apps that succeed in the market tend to:

  • Prioritize post-processing and biomechanics, not just raw pose data.
  • Implement sport-specific insights, going beyond generic joint tracking.
  • Gracefully handle real-world camera variability, including occlusion, angles, equipment, and lighting conditions.

Top developers understand that pose detection is infrastructure. The real value lies in the insight layer built on top.

Strategy: From MVP to Custom Models

A smart development path:

  1. Start with MediaPipe to validate product-market fit.
  2. Test real-world video conditions such as occlusion, body type variance, and sports equipment.
  3. Replace components as needed with custom-trained models or cloud inference once requirements mature.

Out-of-the-box tools like MediaPipe work best for:

  • Single subject in frame
  • Controlled camera distance and angle
  • Minimal occlusion

They struggle with:

  • Sports involving gear such as bats or rackets
  • Partial-body or first-person inputs
  • Medical-grade accuracy or neuro assessments

When you’re ready to scale beyond MVP, consult our MediaPipe Tech Review to understand performance trade-offs.

Choosing the Right Sports App Development Company

If you’re planning to build a next-generation sports analytics software app, choose a sports app development company with:

  • Mobile deployment expertise with strong performance on iOS
  • Computer vision and machine learning capabilities
  • A track record converting noisy pose data into clean, actionable feedback

Whether your application demands lightweight edge processing on mobile or high-fidelity, multi-angle analysis in the cloud, your development partner should be equipped to handle both extremes.

The biggest challenge is not pose detection but creating consistent, trusted insights from variable video environments. That is why domain expertise and real-world UX testing are essential.

To explore our solutions, check out our full offering on the Sports Industry Page.

Final Thoughts

Pose estimation shifts the focus from activity tracking to true movement understanding. For startups, coaches, and sports tech brands, it enables scalable coaching, personalized feedback, and smarter training apps.

Whether you’re launching a niche product or building the next big coaching platform, combining pose estimation with biomechanics intelligence and mobile UX is your ticket to standing out.

Now is the time for fitness innovators and sports app developers to move beyond smartwatches and into the era of AI-enhanced sports performance tracking.

AI for Musicians: The New Creative Process

AI Music Technology Democratizes Creativity

In 2020, Billie Eilish swept the Grammys: Album of the Year, Record of the Year, Song of the Year, Best New Artist. The album that won it all was recorded in her brother’s childhood bedroom on a setup that cost less than $3,000. Accepting the award, producer Finneas dedicated it to “all the kids making music in your bedroom.” 

A decade earlier, this would have been unthinkable. For most of recording history, making a professional record meant booking expensive studio time, hiring engineers, and accessing gear that cost more than a house. Digital audio workstations changed that. Today, a laptop and a $200 microphone can produce Grammy-winning music.

Beyond cost, something bigger shifted: who gets to create. And every step of the way, skeptics asked the same questions: Is this real music? Is this cheating? Will this replace real artists? Now, AI is the next wave. And the questions sound exactly the same.

This shift is now being accelerated by AI music technology, as a new generation of AI-powered music tools becomes part of everyday creative workflows for musicians.

The AI Music Landscape in 2025

The change is already happening. In 2025, Deezer disclosed that over 20,000 AI-generated tracks were uploaded to its platform daily, representing 18% of all uploads. Suno, the leading AI music platform, has reached nearly 100 million users and raised $250 million at a $2.45 billion valuation. Major labels are taking notice: Warner Music Group settled its copyright lawsuit with Suno and entered a licensing deal, signaling a shift from resistance to collaboration.

But so far, the more interesting story seems to be how AI is becoming part of the creative workflow, not replacing it. Suno’s latest product, Suno Studio, illustrates this perfectly. Described as the world’s first generative audio workstation, it blends generative features with professional multi-track audio editing. Musicians can upload samples, edit in a multitrack timeline, control BPM, volume, and pitch, generate unlimited stem variations (vocals, drums, synths), and export everything as audio and MIDI to continue working in their existing DAW. As Suno’s CEO put it: “Studio was built to expand the toolkit for musicians; it intentionally does not prescribe workflows so that human talent can remain front and center.” This reflects a broader trend in AI music processing, where tools are designed to augment human creativity rather than automate it away.

A recent survey of 1,200 music creators found that 87% of artists have incorporated AI into at least one part of their process, from songwriting and production to promotion. The ability to fill skill gaps is the most celebrated benefit of AI, driving the rise of self-sufficient creators who can handle every stage of their release cycle. These results highlight how quickly AI tools for musicians are becoming a standard part of modern music production.

In this blog, we’ll review some of the most interesting AI products and open-source models that are extending the possibilities of how artists create music – tools that enhance creativity, without replacing it. Fair warning: some of these products are pretty specialized, and explaining audio tools in text has its limits. 

Music AI Products that Extend the Artist’s Toolkit

Synplant 2: Creating Truly New Sounds with Neural Networks

One common criticism of generative AI is its perceived lack of originality. And it actually makes sense: generative models learn the underlying probability distribution of their training data and optimize to sample from that distribution given a prompt. In music generation, this means outputs often recombine existing patterns instead of creating something fundamentally new.

Creating truly new sounds has traditionally required significant effort: hours spent tweaking synthesizer parameters, recording and processing real-world audio, or experimenting with unconventional techniques. Synplant 2, developed by Sonic Charge, offers a different approach that uses neural networks not to generate music, but to accelerate the discovery of new timbres.

At its core, Synplant 2 is a two-operator FM synthesizer with a unique “genetic” interface. The main feature is Genopatch: a machine learning system trained not on existing music, but on the synthesizer engine itself. The neural network learns the inverse mapping – from sound back to parameters – so when you feed it any audio recording, it makes educated guesses about which settings would produce a similar result.
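
To make the “sound back to parameters” idea concrete, here is a toy sketch: a generic two-operator FM tone generator, plus a brute-force search that picks the parameter pair whose output is closest to a target recording. The search is a swapped-in stand-in for the learned inverse mapping, not how Genopatch works internally, and the parameter names (`ratio`, `index`), the candidate grid, and the waveform-distance loss are all assumptions; Synplant’s actual engine and parameter set are not described here.

```python
import math


def fm_two_operator(carrier_hz, ratio, index, num_samples=400, sample_rate=8000):
    """Render a two-operator FM tone: one modulator driving one carrier."""
    samples = []
    for n in range(num_samples):
        t = n / sample_rate
        modulator = math.sin(2 * math.pi * carrier_hz * ratio * t)
        samples.append(math.sin(2 * math.pi * carrier_hz * t + index * modulator))
    return samples


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def match_patch(target, carrier_hz, ratios, indices):
    """Brute-force stand-in for the learned inverse mapping (sound -> parameters)."""
    candidates = [(r, i) for r in ratios for i in indices]
    return min(
        candidates,
        key=lambda p: mean_squared_error(target, fm_two_operator(carrier_hz, p[0], p[1])),
    )


# Pretend this is a recording; here it is rendered with known settings.
target = fm_two_operator(220.0, ratio=2.0, index=3.0)
best_ratio, best_index = match_patch(
    target, 220.0, ratios=[1.0, 2.0, 3.0], indices=[0.0, 1.5, 3.0]
)
```

A search like this becomes impractical as the number of parameters grows, which is the motivation for training a network to guess good settings directly from the audio.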

The output from Genopatch is a playable synth patch, not just an audio clip. Each generated patch becomes a starting point for further exploration through Synplant’s mutation system or DNA Editor. This makes Synplant 2 a strong example of AI music technology focused on sound design rather than composition.

Project LYDIA: New Ways to Interact with Sound

Another compelling example of AI opening new possibilities for instrument interaction is Project LYDIA, a collaboration between Roland Future Design Lab and Tokyo-based AI studio Neutone, announced in November 2025. Project LYDIA demonstrates how audio AI and machine learning can redefine how musicians interact with instruments in real time.

Project LYDIA (named as a nod to both “DIY” and “AI”) is a hardware prototype built on a Raspberry Pi 5 that brings Neutone’s Morpho technology into a compact, stage-ready pedal format. The core concept is what Neutone calls “neural sampling”: using an autoencoder neural network to learn the tonal characteristics of any sound source, then applying those characteristics to incoming audio in real time.

The technology works by training a model on a collection of sounds – this could be a traditional instrument like a violin, but also field recordings, environmental textures, or any audio you can capture. The neural network learns a compressed representation of how those sounds behave: their frequency content, how harmonics rise and fall, their overall timbral fingerprint. Once trained, you can feed any live input (your voice, a guitar, a synthesizer) through the model, and it will reshape that input to carry the timbral qualities of the training material while preserving your original pitch, dynamics, and articulation.

What makes Project LYDIA interesting is the shift in how musicians can interact with sound. Traditional effects process audio through fixed algorithms; samplers trigger pre-recorded material. Neural sampling does something different: it lets you play through a learned understanding of sound. You’re not triggering a recording of a djembe – you’re transforming your input through what the model has learned about how djembes sound. The result is something that responds to your playing in real time while inhabiting a completely different sonic space.

This approach also removes the traditional boundaries of what can become an “instrument.” Users can train models on sounds that were never meant to be musical (the texture of a busy street, the hum of machinery, the ambiance of a specific location) and perform with them on stage. The choice of training material becomes a creative decision in itself.

ACE Studio AI Violin: Beyond the Limitations of Sampling

Speaking of moving beyond discrete sample triggering – ACE Studio’s AI Violin, released in beta in May 2025, applies a similar principle to virtual instruments.

Traditional digital music production relies heavily on sampling: large libraries of pre-recorded instrument performances triggered by MIDI data. A violin sample library might contain hundreds of gigabytes of recordings of different notes, articulations, dynamics, and bowing techniques, all stitched together when you play. The challenge is that real musical performance is continuous and deeply contextual. A violinist doesn’t think in discrete samples; they shape phrases, transition between notes, and apply expression in ways that are difficult to replicate by sequencing isolated recordings. Making sampled instruments sound natural requires significant skill: programming keyswitches, drawing expression curves, and manually compensating for the inherent discontinuities between samples.

ACE Studio’s AI Violin takes a different approach. Instead of triggering pre-recorded samples, it uses machine learning to synthesize performances directly from MIDI input. The neural network has learned the characteristics of violin performance (bowing, vibrato, dynamics, tonal color, phrasing) and generates audio that exhibits these qualities in context. You input a melody, and the AI produces a performance with natural transitions, appropriate articulation, and expressive nuance, without requiring the producer to manually program every detail. It represents a new class of AI-powered music production tools that move beyond traditional sampling.

BIAS X: Recreating Any Guitar Tone in Seconds

Beyond creative possibilities, AI can also save significant time in the practice and production workflow. Consider the process of practicing electric guitar. Part of mastering a song involves dialing in the right tone, the specific combination of amp settings, effects, and cabinet characteristics that define how the guitar sounds. When you want to play along with a particular track, recreating that exact tone traditionally requires hours of tweaking: adjusting gain staging, experimenting with EQ curves, layering effects, and comparing against the reference. For many players, this technical overhead becomes a barrier to simply playing.

BIAS X, released by Positive Grid in September 2025, addresses this directly with AI-powered tone matching. The software offers two approaches: “Text-to-Tone” lets you describe the sound you’re after in natural language, like “creamy blues lead with a hint of delay” or “90s Swedish death metal rhythm”, and the AI builds a complete signal chain. “Music-to-Tone” goes further: drop in a guitar track or full song, and BIAS X analyzes the tonal characteristics and reconstructs a matching preset. This is a practical example of AI music processing reducing technical friction in everyday creative workflows.

The system was trained on over one million tones and analyzed more than 200 amplifiers to understand the nuances of genre, era, and playing technique. When you upload a reference, it examines spectral and dynamic profiles to approximate the amp, cabinet, and effects chain. The result isn’t always a perfect one-to-one match, but it provides an excellent starting point that you can refine either conversationally, by asking for “more bite,” “less reverb,” or “tighter low end,” or by adjusting parameters directly.

This represents a shift in how guitarists interact with their sound. Instead of translating a tonal idea into technical specifications like “I need a mid-scooped high-gain amp with a tube screamer in front and a plate reverb”, you can simply describe what you hear in your head or point to an example. The AI handles the translation, letting you focus on playing.

Open-Source AI Music Technology and Models for Music Creation

Beyond commercial products, a thriving ecosystem of open-source models has emerged, giving musicians and developers access to state-of-the-art deep learning architectures and the ability to customize them further. These projects often push the boundaries of what’s technically possible, and many commercial tools build on or are inspired by this open research.

Demucs: Stems for Everyone

Ever wanted to isolate the bassline from a track to learn it by ear? Or pull out vocals for a remix? Stem separation (splitting a mixed song into its individual parts) used to require access to the original recording sessions. Demucs, developed by Meta AI Research, changed that by making high-quality separation available to anyone. It has become a foundational audio AI model for music source separation.

Drop in a song, and Demucs splits it into vocals, drums, bass, and everything else. The results are clean enough that musicians actually use them, not just as a curiosity, but as part of their workflow. Producers sample isolated elements into new tracks. Guitarists mute the original guitar to practice over the rest of the band. DJs create acapellas and instrumentals on the fly. Teachers extract individual parts for students to study.

What makes Demucs different from earlier separation tools is that it works directly on the audio waveform rather than on spectrograms. Without getting too deep into the technical weeds: this approach preserves more detail and produces fewer of those watery, artificial artifacts that plagued older methods.

There are plenty of commercial stem separators now, many built on similar technology, but Demucs remains a go-to for anyone who wants a free, open-source option they can run locally and customize. For developers and researchers, it’s become the baseline that everyone else measures against.

For more structured music tasks like AI music transcription, newer architectures such as Mamba are enabling faster and more accurate piano transcription systems.

 

RVC v2: The Voice Conversion That Went Viral

If you’ve stumbled across a YouTube video of Frank Sinatra singing “Blinding Lights” or SpongeBob performing death metal, you’ve heard RVC in action. Retrieval-based Voice Conversion took the internet by storm with AI covers, some hilarious, some eerily convincing, and introduced millions of people to what AI could do with music. RVC illustrates how accessible music AI models can rapidly shape creative culture.

But behind the memes, RVC is a genuinely useful tool. It takes a voice recording and transforms it to sound like someone else while keeping the original timing, phrasing, and emotion intact. Think of it as a real-time voice skin: you sing, and the output sounds like the target voice.

The creative applications go well beyond novelty covers. Artists can produce songs in languages they don’t speak by hiring a native singer to perform the vocals, then using voice conversion to transform the performance into their own voice. Solo producers can create male-female duets without hiring a second vocalist. Some musicians have even trained RVC on instrument samples instead of voices: you sing a melody, and it comes out as a saxophone or violin, which is useful for sketching ideas quickly.

What made RVC v2 particularly significant was accessibility. The model can be fine-tuned on as little as 10 minutes of clean audio, meaning anyone with a decent microphone and some patience can create a custom voice model. This low barrier helped RVC become the most widely used voice conversion tool in the open-source community – the baseline that newer models are still compared against.

There are more advanced commercial alternatives now, but RVC’s popularity and community support keep it relevant. For many musicians experimenting with voice conversion for the first time, it’s still the starting point.

 

ACE-Step: A Foundation Model for Song Generation

The music generation tools that grabbed headlines – Suno, Udio – didn’t emerge from nowhere. They built on years of open-source research, with models like Meta’s MusicGen and Stability AI’s Stable Audio pushing the technology forward in public. For anyone wanting to experiment with music generation, these open models have been essential.

The current open-source state of the art for full song generation with vocals is ACE-Step, developed jointly by ACE Studio and StepFun. Think of it as the Stable Diffusion of music: a foundation model designed to be flexible, fast, and customizable.

What sets ACE-Step apart technically is its diffusion-based architecture. Without diving too deep: this makes it significantly faster than models that generate audio token-by-token. It can produce up to 4 minutes of music in roughly 20 seconds on professional hardware.

But speed isn’t the main appeal. Because it’s diffusion-based, ACE-Step supports features that sequential models struggle with: inpainting (regenerating just a section of a song while keeping the rest), remixing existing audio, and using your own content to influence generations. So it’s not only for “type a prompt, get a song” – artists can feed in their own material and shape the output.

For those wanting to go further, ACE-Step supports LoRA fine-tuning. In plain terms: you can train a lightweight adaptation of the model on your own music, so generations come out closer to your style without needing massive computing resources or starting from scratch.

Released under an open Apache 2.0 license with support for 19 languages, ACE-Step gives independent developers and musicians a foundation to build on. It won’t match the polish of commercial products yet, but it’s where a lot of experimentation is happening.

 

SAM Audio: Point at a Sound and Isolate It

We already covered Demucs, which splits music into fixed categories: vocals, drums, bass, and the rest. SAM Audio, released by Meta in December 2025, takes separation much further: it isolates whatever sound you ask for. This kind of targeted extraction is an emerging capability in advanced audio AI systems.

Want just the violin from an orchestral recording? Type “violin” and it pulls it out. Working with video and need to isolate a specific sound effect? Click on the object in the frame making the sound. Have a section of audio where the target sound is clear? Mark that segment as a reference, and the model finds and extracts similar sounds throughout the mix.

This flexibility, whether through text prompts, time-range references, or even clicks on objects in video, opens up new possibilities for sampling and sound design. Grab a specific percussion hit buried in a complex mix. Isolate a texture you like from someone else’s track to study how it was made. Extract a sound from a video clip to build a custom sample library.

The technology builds on Meta’s “Segment Anything” approach from computer vision, adapted for audio. It runs faster than real-time, so you’re not waiting around for results.

Demucs remains the go-to for straightforward stem separation. SAM Audio is for when you need something more specific – when the predefined categories aren’t enough and you know exactly what you’re after.

Many of these approaches extend beyond music into audio AI services, including speech, voice, and general sound processing.

The Toolkit Keeps Growing

The products and models in this post are just a snapshot, a few interesting cases that show how AI tools for musicians are evolving. This field moves fast, and there’s certainly more impressive stuff on the horizon. Together, these tools illustrate how rapidly AI music technology and audio AI services are evolving.

This overview is not exhaustive. It focuses on tools that enhance musicians’ capabilities rather than replace their strengths. Seeing AI positioned as a creative partner, not a substitute, is what makes this space particularly compelling.

Want to shape this future too? If you have an idea for a music AI tool and need a team to bring it to life, we’d love to hear from you. Our AI music services for the music industry page outlines the kinds of systems we help teams build – from creative tools to production-ready music AI solutions.

Have a project you’d like to discuss? Contact us below or reach out at hello@it-jim.com.

AI Music Technology for Piano Transcription

AI Music Tech is Booming

AI music technology is becoming a critical layer in modern digital products. Music education platforms, creative software, rights management systems, and large-scale media services increasingly rely on accurate and fast transcription to unlock value from audio data.

In our work with AI music processing systems, we see that traditional transformer-based approaches, while accurate, are often expensive to operate and difficult to scale. This article outlines how Mamba, a selective state-space model architecture, enables high-performance, cost-efficient AI music technology suitable for real-time and production environments. The hands-on research and experimentation behind this work were carried out by Roman Bernikov, who focused on model architecture choices, evaluation methodology, and performance optimization.

 

AI Music Technology Use Cases and Business Impact

Efficient transcription underpins many real-world AI music technology products, including:

  • Real-time music education and performance feedback
  • Large-scale music catalog indexing and analytics
  • Creative AI tools and digital audio workstations
  • Edge and on-device AI music technology
  • Enterprise-scale batch music processing

These are representative of the AI music processing challenges encountered in production audio AI services.

The Scalability Challenge in AI Music Technology

Automatic Music Transcription (AMT) converts raw audio into symbolic representations such as MIDI or piano rolls. Transformer-based models dominate benchmarks but scale poorly with sequence length, leading to increased latency and infrastructure cost.

In production audio AI services, this directly impacts feasibility for long recordings, real-time interaction, and cost-efficient deployment.
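To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Swift. The frame rate of 100 feature frames per second is an illustrative assumption, not a figure from this work, and the counts ignore constant factors such as model width and number of layers.

```swift
// Back-of-the-envelope comparison of how work grows with recording length.
// Assumption: 100 feature frames per second (hypothetical, for illustration only).
let framesPerSecond = 100

/// Number of feature frames for a recording of the given length.
func frameCount(minutes: Int) -> Int {
    return minutes * 60 * framesPerSecond
}

/// Full self-attention compares every frame with every other frame.
func attentionPairs(frames: Int) -> Int {
    return frames * frames
}

/// A recurrent state-space scan visits each frame once.
func scanSteps(frames: Int) -> Int {
    return frames
}

for minutes in [1, 10, 30] {
    let frames = frameCount(minutes: minutes)
    let pairs = attentionPairs(frames: frames)
    let steps = scanSteps(frames: frames)
    print("\(minutes) min: \(frames) frames, \(pairs) attention pairs, \(steps) scan steps")
}
```

Going from 1 to 30 minutes multiplies the number of scan steps by 30, but the number of attention pairs by 900, which is why long recordings are where the difference shows up.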

Why Mamba Is a Strong Fit for AI Music Technology

Mamba is a selective State Space Model (SSM) designed for linear-time sequence processing. Rather than relying on attention across all time steps, it maintains a learned internal state that evolves dynamically with incoming audio frames.

From a deployment perspective, this provides:

  • Linear inference time
  • Stable memory consumption
  • Efficient GPU execution

These characteristics are especially valuable when building scalable audio AI services.
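The idea of a state that evolves frame by frame can be illustrated with a toy linear state-space scan. This is a deliberately simplified sketch, not Mamba itself: the state here is a single number and the coefficients `a`, `b`, and `c` are fixed, hypothetical values, whereas a selective SSM uses a vector state and makes its parameters depend on the input.

```swift
/// Toy linear state-space model:
///   h_t = a * h_(t-1) + b * x_t
///   y_t = c * h_t
/// One pass over the input, constant memory regardless of sequence length.
struct ToyStateSpace {
    let a: Double  // how much of the previous state is kept
    let b: Double  // how strongly the new input is written into the state
    let c: Double  // how the state is read out

    func scan(_ inputs: [Double]) -> [Double] {
        var state = 0.0
        var outputs: [Double] = []
        outputs.reserveCapacity(inputs.count)
        for x in inputs {
            state = a * state + b * x
            outputs.append(c * state)
        }
        return outputs
    }
}

let toyModel = ToyStateSpace(a: 0.5, b: 1.0, c: 2.0)
let toyOutputs = toyModel.scan([1.0, 0.0, 0.0, 1.0])
print(toyOutputs)
```

The loop does a fixed amount of work per frame and keeps only `state` between frames, which is where the linear inference time and stable memory consumption come from.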

 

Designing a Production-Oriented AI Music Transcription System

mamba model architecture

The transcription system follows a design approach we commonly apply in client projects:

  1. Spectrogram-based feature extraction
  2. Lightweight convolutional preprocessing
  3. Efficient long-range sequence modeling using Mamba
  4. Task-specific prediction heads

While more complex variants were explored, a unidirectional Mamba architecture with skip connections delivered the most reliable balance between accuracy, simplicity, and inference efficiency, which are key requirements for production systems.

 

Accuracy Evaluation 

System outputs were evaluated against ground truth MIDI annotations from the MAESTRO dataset. Metrics included onset timing, offset accuracy, frame-level alignment, and velocity estimation.
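As an illustration of what an onset metric measures, the sketch below scores predicted note onsets against reference onsets. It is not the evaluation code used in this work. The 50 ms tolerance is a common convention in transcription evaluation and is assumed here, and the greedy one-to-one matching is a simplification of the optimal matching that dedicated evaluation libraries perform.

```swift
struct OnsetScore {
    let precision: Double
    let recall: Double
    let f1: Double
}

/// Scores estimated onset times (in seconds) against reference onsets.
/// An estimate counts as correct if it lies within `tolerance` of an
/// unmatched reference onset. Matching is greedy over sorted times.
func scoreOnsets(reference: [Double],
                 estimated: [Double],
                 tolerance: Double = 0.05) -> OnsetScore {
    let ref = reference.sorted()
    let est = estimated.sorted()
    var matched = 0
    var i = 0
    var j = 0
    while i < ref.count && j < est.count {
        let difference = est[j] - ref[i]
        if abs(difference) <= tolerance {
            matched += 1
            i += 1
            j += 1
        } else if difference < 0 {
            j += 1  // estimate is too early to match anything: false positive
        } else {
            i += 1  // reference onset has no estimate nearby: missed note
        }
    }
    let precision = est.isEmpty ? 0.0 : Double(matched) / Double(est.count)
    let recall = ref.isEmpty ? 0.0 : Double(matched) / Double(ref.count)
    let f1 = precision + recall == 0 ? 0.0 : 2 * precision * recall / (precision + recall)
    return OnsetScore(precision: precision, recall: recall, f1: f1)
}

let onsetScore = scoreOnsets(reference: [0.5, 1.0, 1.5, 2.0],
                             estimated: [0.52, 1.2, 1.49])
print(onsetScore)
```

In this made-up example two of the three estimates land within tolerance, so precision is 2/3, recall is 2/4, and F1 is 4/7.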

The model consistently captured musical structure and timing, with most errors occurring in ambiguous regions such as soft notes or dense passages. While absolute benchmark scores did not exceed heavily optimized transformer models, performance was competitive given the significantly simpler architecture.

For client-facing AI services, this level of consistency and predictability is often more valuable than marginal benchmark gains.

 

Efficiency Gains That Enable Scalable AI Music Technology

Inference benchmarks show that the Mamba-based system transcribed 30 minutes of audio in approximately 1.3 seconds, an order-of-magnitude improvement over transformer-based approaches and more than 1,300 times faster than real time.
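The real-time factor behind that figure is simply audio duration divided by processing time. A minimal Swift sketch using the numbers quoted above (the function name is just an illustration):

```swift
/// How many seconds of audio are processed per second of compute.
/// Values above 1 mean faster than real time.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// 30 minutes of audio transcribed in about 1.3 seconds.
let transcriptionSpeedup = realTimeFactor(audioSeconds: 30 * 60, processingSeconds: 1.3)
print("Real-time factor: \(transcriptionSpeedup)")
```

Since the 1.3-second figure is approximate, the resulting factor of roughly 1,385 should be read as an estimate rather than an exact measurement.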

In production AI services, these gains translate directly into higher throughput, lower cloud costs, and the ability to support real-time user experiences.

 

Conclusion

Mamba-based architectures enable a shift toward deployable, efficient AI music technology. By combining competitive transcription quality with exceptional inference efficiency, this approach aligns well with the requirements of scalable audio AI services operating in real-world environments.

If you’re exploring how advanced architectures like Mamba can be applied beyond research, at It-Jim we help teams turn music AI ideas into real products. Our work spans music product prototyping and validation, music generation, music and video synchronisation, and custom plugin and tool creation. Learn more about our work on the Music Tech industry services page.

Have a project you’d like to discuss? Contact us below or reach out at hello@it-jim.com.

 

RoomPlan is Awful and it’s Great!

RoomPlan is a powerful framework from Apple designed for the fast and convenient creation of 3D models of rooms, using augmented reality (AR) technologies and LiDAR scanning capabilities. In our previous article, we reviewed the basic functions of RoomPlan, such as session setup, the structure of core components, and the specifics of output data. We explored how this tool can interact with the surrounding space to transform your rooms into a 3D model.

At first glance, RoomPlan is an impeccable tool for modeling rooms and indoor spaces. Its features might seem exhaustive for many tasks: automatic object recognition, real-time 3D model creation, and export capabilities. All this provides broad possibilities for developers, interior designers, and AR enthusiasts seeking a tool for quick and efficient work with room spaces, visualization, and presentation.

RoomPlan Framework by Apple

However, like many modern technologies, RoomPlan has darker sides worth considering. Despite its progressive features, this framework has several limitations and drawbacks that can significantly impact the final result and may require developers to put in extra effort to overcome them. In this article, we will look at the key issues one might encounter when working with RoomPlan and explain why this tool may not be as perfect as it appears.

Today, we’ll attempt to look beyond the mirror of RoomPlan and examine its limitations. This is an important step for everyone planning to use this tool in their projects, as understanding RoomPlan’s shortcomings will help you prepare for potential problems in advance and devise ways to address them.

Approximately correct, almost accurate

Although RoomPlan is positioned as a tool for professional spatial measurement tasks, in practice, its capabilities are limited by several important aspects that affect the final accuracy of the models.

Apple claims:

“RoomPlan outputs in USD or USDZ file formats that include dimensions of each component recognized in the room, such as walls or cabinets, as well as the type of furniture detected.” (https://developer.apple.com/augmented-reality/roomplan/)

In practice, various factors greatly distort the scanning results.

Limited Object Recognition

Although RoomPlan offers automatic object recognition, its capabilities in this area are quite limited. The tool can only identify basic interior elements, such as tables, chairs, sofas, and some household appliances.

Object Category Detection by RoomPlan

However, more complex or less common objects – like air conditioners, boilers, shelves, wall lamps, or decorative elements – remain beyond RoomPlan’s detection capabilities. Consequently, these objects simply do not appear in the model or are replaced with simplified shapes, leading to significant detail loss and affecting the overall spatial accuracy.

Example of Limited Object Recognition by RoomPlan Apple

Rectangular Simplifications

A significant limitation of RoomPlan is that the system attempts to reduce all objects and surfaces to a set of rectangles. This approach ensures processing speed but significantly impacts the quality and detail of the 3D model.

For instance, unique architectural elements, such as semicircular arches and sloped or non-flat walls, are simplified into primitive rectangular blocks, which noticeably distorts the model and reduces its actual accuracy.

Additionally, there is an issue with handling height variations, sloped ceilings, moldings, and baseboards, as these elements are almost always ignored when creating the model.

Rectangular Simplifications by RoomPlan

Ceilings and Skylights

RoomPlan does not capture any ceiling data, meaning you won’t be able to include ceilings in your model. This limitation is especially critical for tasks involving lighting design or calculations of room volume, as ceiling data is essential for these applications.

Ceilings and Skylights Recognition by RoomPlan

Furthermore, RoomPlan does not detect skylights, which are often integral to the functionality and aesthetics of attic or loft spaces. This lack of ceiling and skylight recognition further reduces RoomPlan’s applicability for projects requiring comprehensive architectural detail.

Measurement Errors

RoomPlan has accuracy issues when absolute precision, rather than relative precision, is required, resulting in dimensional discrepancies. An error of ±5 cm in a 1-meter wall may seem minor, but it’s important to remember that such errors accumulate. For example, in a space with multiple partitions, a divided bathroom, or a hallway, the deviation in each wall/window/door compounds, leading to a much more pronounced distortion overall.
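To get a feel for how per-element errors add up, here is a small sketch with two simple models. Both are assumptions for illustration: the worst case assumes every element is off by the full tolerance in the same direction, while the root-sum-square estimate assumes independent random errors of similar size. How RoomPlan's errors actually correlate is not something we can state from the data here.

```swift
/// Worst case: every segment is off by the full error in the same direction.
func worstCaseError(segments: Int, perSegmentError: Double) -> Double {
    return Double(segments) * perSegmentError
}

/// Independent random errors partially cancel and grow with the
/// square root of the number of segments (root-sum-square).
func rootSumSquareError(segments: Int, perSegmentError: Double) -> Double {
    return (Double(segments) * perSegmentError * perSegmentError).squareRoot()
}

// Four consecutive elements along one side of a flat, each within ±5 cm.
let segmentCount = 4
let perSegment = 0.05  // meters
let worstCase = worstCaseError(segments: segmentCount, perSegmentError: perSegment)
let typicalCase = rootSumSquareError(segments: segmentCount, perSegmentError: perSegment)
print("Worst case: \(worstCase) m, root-sum-square: \(typicalCase) m")
```

Even under the more forgiving assumption, four elements already produce an expected deviation of about 10 cm, and up to 20 cm in the worst case.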

In the example below, you can see the dimensions of a wall with a window embedded within it.

Measurement Errors by RoomPlan Framework

For the demo space in this article, the length deviation reached more than 37 cm, with the actual length being 6.45 meters, compared to RoomPlan’s measurement of 6.821 meters.
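Expressed as a quick calculation, using only the two lengths quoted above (the type and property names are illustrative):

```swift
struct MeasurementCheck {
    let actualMeters: Double
    let measuredMeters: Double

    /// Signed deviation of the scan from the ground truth, in meters.
    var absoluteError: Double { measuredMeters - actualMeters }

    /// Deviation relative to the true length, in percent.
    var relativeErrorPercent: Double { absoluteError / actualMeters * 100 }
}

let wallCheck = MeasurementCheck(actualMeters: 6.45, measuredMeters: 6.821)
print("Deviation: \(wallCheck.absoluteError * 100) cm (\(wallCheck.relativeErrorPercent) %)")
```

That is a deviation of about 37.1 cm, or roughly 5.8% of the true length.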

Incorrect Wall Thickness Representation

RoomPlan sometimes fails to calculate the actual thickness of walls, simplifying them to standard partitions (~16 cm), and only in cases where merging is performed can thicknesses be increased to better match the actual geometry.

Additionally, all exterior walls in your space are guaranteed to be represented as 16 cm thick. As a result, thick exterior or interior walls appear too thin in the model, which can distort scale and other aspects of the model critical for accurate interior planning.

Incorrect Wall Thickness Representation by RoomPlan

Incorrect Wall Thickness Representation by Apple RoomPlan Framework

Issues with Doors and Windows

When it comes to working with doors, whether they are combined door-window units or double doors, RoomPlan may interpret them as a single plane or merge them incorrectly, compromising the model’s realism. Although RoomPlan does differentiate between doors and openings, this distinction is not visually represented in the 3D model. In 3D, an “opening” is merely a hole in the wall, while a “door” is intended to represent an actual door. However, in practice, both appear identical, offering no distinction in the data or model view.

Apple RoomPlan Issues with Doors and Windows

To get data on openings, such as their sizes, positions, and parent components, you need to work with the CapturedRoom JSON data file.
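A minimal sketch of reading openings from such a file might look like this. The key names mirror the `dimensions`, `transform`, and `parentIdentifier` properties of RoomPlan surfaces, but the exact JSON layout assumed here (a three-element `dimensions` array and a flattened, column-major 16-element `transform`) is an assumption, and the sample data and identifiers are made up. Inspect your own export before relying on it.

```swift
import Foundation

// Hypothetical, trimmed-down view of one opening in an exported room JSON.
// Assumptions: `dimensions` is [width, height, depth] in meters and
// `transform` is a flattened column-major 4x4 matrix, so the translation
// sits at indices 12...14. Keys that are not listed here are ignored.
struct OpeningRecord: Decodable {
    let identifier: String
    let parentIdentifier: String?
    let dimensions: [Double]
    let transform: [Double]

    var isWellFormed: Bool { dimensions.count == 3 && transform.count == 16 }
    var width: Double { dimensions[0] }
    var height: Double { dimensions[1] }
    var position: (x: Double, y: Double, z: Double) {
        (transform[12], transform[13], transform[14])
    }
}

struct RoomRecord: Decodable {
    let openings: [OpeningRecord]
}

/// Decodes the openings and drops records with unexpected array sizes.
func decodeOpenings(from json: String) throws -> [OpeningRecord] {
    let room = try JSONDecoder().decode(RoomRecord.self, from: Data(json.utf8))
    return room.openings.filter { $0.isWellFormed }
}

// Made-up sample data in the assumed layout.
let sampleRoomJSON = """
{
  "openings": [
    {
      "identifier": "A1",
      "parentIdentifier": "W7",
      "dimensions": [0.9, 2.1, 0.0],
      "transform": [1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  1.5, 1.05, -2.0, 1]
    }
  ]
}
"""

let openings = (try? decodeOpenings(from: sampleRoomJSON)) ?? []
for opening in openings {
    print("Opening \(opening.identifier): \(opening.width) x \(opening.height) m,",
          "parent \(opening.parentIdentifier ?? "none"), position \(opening.position)")
}
```

The `parentIdentifier` is what lets you attach an opening back to the wall it belongs to.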

Additionally, for doors, factors such as the direction they swing open or even the exact placement within the opening are not captured. This impacts the model’s accuracy and can create mismatched expectations, as knowing the door’s orientation and position is crucial for many professional applications. The lack of this information diminishes the usefulness of the model, as the distinction between doors and openings becomes almost meaningless when there are no visual or data differences.

A further complication arises with double doors when one side is open and the other closed; in this case, RoomPlan often visualizes the closed side as part of the wall. Conversely, if both doors are open, creating a wide passage, it may register this as an opening rather than a door. This leads to inconsistencies in the representation, affecting both the visual model and spatial data.

For windows, RoomPlan often trims frames if they are sectioned or multi-level.

In cases where doors have a complex configuration or non-standard design, the tool may fail to represent them accurately, adding difficulties in further work with the model.

Doors and Windows Recognition by Apple Roomplan

Large Mirror Surfaces

Floor-to-ceiling mirrors and mirrored wardrobe doors pose a particular challenge for RoomPlan. Due to their optical properties, LiDAR often fails to accurately process these reflective surfaces, resulting in significant distortions or errors in the scan.

For example, large mirrors can cause “gaps” in the model, their absence (as if the wardrobe isn’t there), or the creation of phantom objects that don’t exist in the real space.

Each of these issues reduces the accuracy and reliability of models created using RoomPlan and requires developers to invest additional effort to refine and adjust the completed 3D scenes.

Walls Encroaching on Space

In iOS 17, walls in RoomPlan may encroach on the interior space, covering objects that are placed closely against them. This is especially noticeable when furniture or other items are flush with the walls.

This behavior has been improved in iOS 18, where wall boundaries are handled more accurately.

Wall Thickness Limitations

RoomPlan has a restriction on wall thickness, which cannot exceed approximately 50 cm. Walls that are thicker than this limit are treated as two separate thin walls, which can result in incorrect structural representation for spaces with very thick walls.

Inconsistent Wall Heights

Wall heights within a single room can vary, especially at corners where walls of different heights may converge. This issue is primarily seen in rooms with decorative elements, arches, or transitions near the ceiling, which cause height discrepancies.

Inconsistent Wall Heights by RoomPlan

Curved Walls and Floor Gaps

RoomPlan struggles with accurately representing curved walls. The system simplifies floors by aligning to the wall’s extreme points, resulting in gaps between the wall and the floor where a curve exists.

Curved Walls and Floor Gaps by RoomPlan

Simplification of Columns and Niches

Columns, niches, and other structural details are typically simplified or removed entirely in the RoomPlan model, which affects the accuracy of the final representation and loses critical architectural elements.

Native merge

One of RoomPlan’s features is the automatic process of merging individual elements of a room or space into a unified 3D model. However, while this function seems beneficial, in practice, it introduces considerable distortions, as RoomPlan attempts to optimize the final model’s appearance, often at the expense of accuracy. As a result, individual rooms may appear reasonably accurate and detailed after scanning, but the combined model often exhibits serious distortions. This makes the final 3D model less suitable for professional use, where precise measurements and proportions are critical.

Merging Floors of Different Rooms

RoomPlan automatically combines all floors into a single plane, which can significantly compromise the model’s realism. This merging largely depends on wall parameters and on how accurately the walls are combined into a shared space.

Merging Floors of Different Rooms with RoomPlan by Apple

Another issue arises from how RoomPlan treats level differences – it does not account for steps or platforms within rooms. In these cases, each room may look reasonably accurate, but upon merging, all these simplifications create additional discrepancies and mismatches between the separate areas. The combined floor gives the impression that all rooms are on the same level and share a uniform appearance.

Lack of Support for Multi-level Structures

RoomPlan is limited to working within a single floor, with merging possible only within a single horizontal plane. This means that for multi-story buildings, it is necessary to create separate models for each floor, treating each as an independent model.

The inability to merge floors into a single model complicates projects where it’s essential to represent all levels of a structure. This limitation makes RoomPlan less convenient for tasks requiring an overall view or when calculating volumes across multiple floors.

Automatic Wall Angle Alignment

RoomPlan automatically adjusts wall angles to make them perpendicular if there are minor deviations, even if, in real space, the angles are not perfectly right. This optimization is aimed at standardizing the model, but it often distorts the geometry of the room. Consequently, the model loses unique architectural features that may be essential for preserving the individuality and accuracy of the space.

Automatic Wall Angle Alignment with Apple RoomPlan

The problem becomes even more pronounced when dealing with spaces featuring complex structures or non-standard wall geometries, such as oval or slanted walls (like those in attics), where automatic angle straightening changes the room’s appearance and is not suitable.

Thus, although RoomPlan’s automatic merging aims to simplify and streamline the model creation process, in practice, it can significantly reduce accuracy. This requires users to put in extra effort to adjust the merged model so that it aligns with real conditions and architectural requirements.

Developers’ suffering

Preview Customization

RoomPlan provides a built-in preview view during scanning, but it is fixed and does not support customization. By default, you always get an AR session with a visualization of the scanned space and a preview at the bottom center of the screen. You can only add elements on top of the standard view, such as buttons and indicators.

For real-world tasks, you might want to go beyond the standard RoomCaptureView. In that case, you can create your own custom view from scratch (we’ve already presented this in a previous article).

With a custom view, you fully control the appearance, corners, and colors: for example, you can color the floor and walls separately, or ignore objects if you are only interested in the outline of the room.

Preview Customization with RoomPlan

Export Issues

When attempting to export data after working with RoomPlan, be prepared for potential errors if file names start with numbers, such as “1234,” or if UUIDs are used for name generation. This issue results in failed exports.

To fix this, add any Latin letter or word to the beginning of the file name, for example, *export_*.

While this bug is resolved starting with iOS 18, earlier versions still exhibit this problem, so it’s important to be cautious with file naming when exporting RoomPlan data on older iOS versions.
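One way to apply this workaround consistently is to route every export name through a small helper. This is a sketch with hypothetical names. It always adds the prefix, since both digit-led names and UUID-based names are reported as problematic, and it assumes a USDZ export.

```swift
/// Builds an export file name that never starts with a digit or a raw UUID
/// by always placing a fixed Latin prefix in front of the base name.
func exportFileName(base: String,
                    prefix: String = "export_",
                    fileExtension: String = "usdz") -> String {
    let name = base.hasPrefix(prefix) ? base : prefix + base
    return name + "." + fileExtension
}

print(exportFileName(base: "1234"))
print(exportFileName(base: "3F2504E0-4F89-11D3-9A0C-0305E82C3301"))
print(exportFileName(base: "export_kitchen"))
```

Applying the prefix unconditionally keeps the behavior identical across iOS versions, so you don't need a separate code path for devices below iOS 18.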

Custom AR Session problem

If you want to integrate a custom AR session to work with your own configurations and pass it into the RoomCaptureView initializer, you may encounter several issues once your application runs, including:

  • Incorrect operation due to missing depth data
  • Stuttering and lag
  • Premature session termination if the app is minimized

This bug is also resolved starting from iOS 18, but it remains on earlier versions. If you need to use a custom AR session, it may be best to create a fully custom preview to ensure stable functionality.

Separate Coordinate Systems for Rooms

Each room scanned by RoomPlan has its own local coordinate system, which complicates integrating rooms within a unified space.

Developers must resort to workaround solutions to handle these transformations, making it challenging to work with multiple rooms cohesively in a single environment.
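Conceptually, such a workaround boils down to applying a rigid transform to every point of a room before combining it with the others. The sketch below shows the idea on a 2D floor plan. The types are hypothetical, and how the rotation and offset of each room are obtained is the hard part and is outside this sketch.

```swift
import Foundation

/// A point on the floor plane (x to the right, z forward), in meters.
struct FloorPoint {
    var x: Double
    var z: Double
}

/// Hypothetical placement of one room in a shared floor-plan frame:
/// a rotation about the vertical axis followed by an offset.
struct RoomPlacement {
    let rotationRadians: Double
    let offset: FloorPoint

    /// Converts a point from the room's local frame to the shared frame.
    func toShared(_ local: FloorPoint) -> FloorPoint {
        let c = cos(rotationRadians)
        let s = sin(rotationRadians)
        return FloorPoint(
            x: c * local.x - s * local.z + offset.x,
            z: s * local.x + c * local.z + offset.z
        )
    }
}

// A room rotated by 90 degrees and shifted 4 m along x.
let hallway = RoomPlacement(rotationRadians: Double.pi / 2,
                            offset: FloorPoint(x: 4.0, z: 0.0))
let sharedCorner = hallway.toShared(FloorPoint(x: 1.0, z: 0.0))
print("Corner in shared frame: (\(sharedCorner.x), \(sharedCorner.z))")
```

The same idea extends to full 3D by using a 4x4 transform per room instead of a rotation angle and a 2D offset.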

Summary

RoomPlan is an innovative framework that offers the ability to quickly create 3D models of spaces but brings with it many significant challenges. Although it is marketed as a convenient tool for design and visualization, its functions have notable limitations that should be considered.

The simplification of shapes, measurement inaccuracies, merging issues, and lack of easy preview customization make RoomPlan less versatile than it might initially seem. For professional use, where high precision and detail are required, RoomPlan may prove insufficiently reliable and demand additional processing of the generated models or even the development of custom post-processing solutions.

Fortunately, there are ways to enhance RoomPlan’s capabilities. By combining RoomPlan’s output with raw data from iOS sensors, refining RoomPlan’s data structures through custom C++ integrations, or applying advanced computer vision algorithms, it’s possible to achieve higher accuracy and improve the reliability of the generated models. Some solutions addressing these issues are already emerging, providing a pathway for those looking to maximize RoomPlan’s potential in their applications.

It’s worth noting, however, that this tool is relatively new, and Apple continues to improve it. Even now, we see a significant difference in RoomPlan’s performance between iOS 17 and iOS 18, with the latter offering noticeable improvements. Despite current shortcomings, RoomPlan has great potential and will likely become more functional as technology advances and updates are released.

Thus, using RoomPlan today requires a thorough assessment of its capabilities and limitations, as well as a willingness from developers to adapt to its specific requirements. For those prepared to put in the extra effort, this tool may still open up new possibilities in creating interactive and rich AR experiences.

Barcode Safari: Exploring the iOS Scan Frontier

Recently, we encountered a task in one of our projects involving the development of a product management system for a large warehouse. The system needed accurate and efficient barcode detection to streamline inventory tracking, reduce human errors, and optimize workflows.

We have different options to tackle this problem. Should we use a dedicated barcode detection technology, or integrate barcode detection within an Optical Character Recognition (OCR) framework? Let’s try both and find out!

After thorough investigation, we selected four libraries for detailed research: Vision, MLKit, ZXingObjC, and SwiftyTesseract.  The main challenge was ensuring that the system could scan and identify multiple types of barcodes quickly and with high accuracy. Given the scale of the warehouse operations, performance and reliability were critical factors.

During our investigation, we faced several challenges, including:

  • Accurately identifying different types of barcodes
  • Determining the position of barcodes in photos
  • Handling scenarios where multiple barcodes appear in the same frame
  • Achieving high performance with minimal lag during scanning
  • Ensuring that the selected solution is well-supported and actively maintained for compatibility with future Swift and iOS updates
  • Considering cross-platform compatibility for potential future Android implementation

Picking the right barcode detection solution is key. Every project has its own needs, and by understanding them, we can decide on the best technology for barcode detection in iOS.

Vision

The Vision framework, provided by Apple, offers built-in support for barcode detection, allowing easy implementation with minimal code and no additional dependencies. It integrates seamlessly with AVCaptureSession, making it straightforward to add barcode scanning capabilities to iOS apps.

One of the major advantages of Vision is its seamless integration with the Apple ecosystem, ensuring that you don’t need to rely on external libraries or frameworks. It also provides high performance, with an average barcode detection processing time of just 0.07 seconds, which makes it highly efficient. Additionally, it generally offers high accuracy in barcode detection. However, in some cases, Vision may add a leading zero to the decoded value, so a barcode that already starts with zero is reported with two leading zeros. This behavior may require additional handling.

For example, the barcode in the picture below is detected as 0036000291452.

Barcode example
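One possible way to handle this in code is sketched below. It assumes the extra zero appears because a 12-digit UPC-A code is reported in its 13-digit EAN-13 form, which is our interpretation rather than something confirmed above. The function names are illustrative. The helper strips the leading zero only when the remaining 12 digits pass the standard UPC-A check-digit test.

```swift
/// Returns true when the last digit of a 12-digit UPC-A code is a valid
/// check digit (digits in odd positions are weighted 3, even positions 1).
func hasValidUPCACheckDigit(_ code: String) -> Bool {
    let digits = code.compactMap { $0.wholeNumberValue }
    guard code.count == 12, digits.count == 12 else { return false }
    var sum = 0
    for (index, digit) in digits.dropLast().enumerated() {
        sum += index % 2 == 0 ? digit * 3 : digit
    }
    return (10 - sum % 10) % 10 == digits[11]
}

/// Converts a 13-digit payload with a leading zero back to 12-digit UPC-A.
/// Any other payload is returned unchanged.
func normalizedUPCA(fromEAN13 payload: String) -> String {
    guard payload.count == 13, payload.hasPrefix("0") else { return payload }
    let candidate = String(payload.dropFirst())
    return hasValidUPCACheckDigit(candidate) ? candidate : payload
}

print(normalizedUPCA(fromEAN13: "0036000291452"))
print(normalizedUPCA(fromEAN13: "4006381333931"))
```

Whether you want the 12-digit or the 13-digit form depends on what your inventory database stores, so it is worth normalizing in one place and keeping the rest of the app unaware of the difference.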

For the demo app, we created a minimalist UI with a choice of recognition modes and an on-screen display of results or errors for instant feedback.

The framework supports a wide variety of barcode formats, including both linear and 2D barcodes, and provides useful extra details, such as the bounding box and symbology of detected barcodes.
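The bounding box that Vision returns is normalized to the 0...1 range with its origin in the lower-left corner of the image, so it usually needs converting before you can draw it over a photo. The sketch below shows the arithmetic with plain, hypothetical structs so that it runs without any Apple frameworks. In an app you would work with `CGRect`, and Vision's `VNImageRectForNormalizedRect` can do the scaling part for you.

```swift
/// A rectangle in Vision-style normalized coordinates:
/// values in 0...1, origin in the lower-left corner.
struct NormalizedRect {
    let x: Double
    let y: Double
    let width: Double
    let height: Double
}

/// A rectangle in pixel coordinates with a top-left origin,
/// as used when drawing over an image in UIKit.
struct PixelRect: Equatable {
    let x: Double
    let y: Double
    let width: Double
    let height: Double
}

/// Scales a normalized box to pixels and flips the vertical axis.
func pixelRect(from box: NormalizedRect,
               imageWidth: Double,
               imageHeight: Double) -> PixelRect {
    return PixelRect(
        x: box.x * imageWidth,
        y: (1 - box.y - box.height) * imageHeight,
        width: box.width * imageWidth,
        height: box.height * imageHeight
    )
}

let normalizedBox = NormalizedRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let overlayRect = pixelRect(from: normalizedBox, imageWidth: 400, imageHeight: 200)
print(overlayRect)
```

Forgetting the vertical flip is a common source of overlays that appear mirrored relative to the actual barcode position.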

Another benefit is that Vision allows the detection of multiple barcodes within the same frame, which can be crucial for scanning large volumes of barcodes. Furthermore, Vision gives you the ability to specify a region of interest for barcode detection, which removes the need to crop the image beforehand.

You can also customize the barcode detection to focus on specific barcode symbologies or image orientation, which helps reduce unnecessary processing and false positives. With abundant resources like tutorials and official documentation available, integration and troubleshooting are made easier.

Barcodes Recognition with Apple Vision Framework

However, Vision does come with some limitations. It is exclusive to iOS, so if you’re aiming for cross-platform compatibility, it may not be the best fit. Additionally, handling edge cases, such as damaged barcodes, can be challenging. Also the issue of handling leading zeros in certain barcodes might require extra coding effort to ensure accuracy in all cases.

To use barcode detection in your app, you only need to import the Vision framework and add the code for barcode detection.

Here’s a simple example that demonstrates how to implement barcode detection with Vision.


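import UIKit
import Vision

// Detects barcodes in the image at `photo` and passes the first payload with its
// bounding box to `completion`, or nil if nothing could be decoded.
func detectWithVision(
    photo: URL,
    completion: @escaping ((String, CGRect)?) -> ()
) {
    guard
        let image = UIImage(contentsOfFile: photo.path),
        let cgImage = image.cgImage
    else {
        completion(nil)
        return
    }

    let request = VNDetectBarcodesRequest { (request, error) in
        guard
            let results = request.results as? [VNBarcodeObservation],
            error == nil
        else {
            completion(nil)
            return
        }

        // Keep only observations with a readable payload. `boundingBox` uses
        // normalized coordinates with the origin in the lower-left corner.
        let detectedBarcodes: [(String, CGRect)] = results.compactMap {
            guard let payloadStringValue = $0.payloadStringValue else {
                return nil
            }
            return (payloadStringValue, $0.boundingBox)
        }

        completion(detectedBarcodes.first)
    }

    // `cgImagePropertyOrientation` is not part of UIKit; it is assumed to be a
    // project extension mapping UIImage.Orientation to CGImagePropertyOrientation.
    let handler = VNImageRequestHandler(
        cgImage: cgImage,
        orientation: image.cgImagePropertyOrientation
    )
    try? handler.perform([request])
}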

MLKit

MLKit, developed by Google, provides robust barcode detection for both iOS and Android, offering a cross-platform solution that supports multiple barcode formats.

One of its standout features is the ability to handle multiple barcodes in a single frame, making it ideal for scanning several items at once.

In addition to detecting barcodes, MLKit also provides detailed information for each result, including the barcode’s frame, format, and any specific data it contains – such as URLs, phone numbers, emails, or Wi-Fi credentials. It supports a wide range of barcode formats, covering both linear and 2D types.

The framework provides solid performance with an average processing time of around 0.16 seconds, which is still relatively fast. It has high accuracy and, unlike Vision, does not add extra leading zeros to barcodes. Additionally, it performs well in detecting damaged barcodes, making it a versatile choice for real-world scenarios. MLKit also offers comprehensive documentation and is regularly updated by Google. You can also specify the specific barcode formats you’re interested in, helping optimize performance and reduce unnecessary processing.

For example, with damaged barcodes like those shown below, MLKit still works reliably, whereas other solutions might struggle.

Damaged Barcode Detection with MLKit by Google

However, MLKit does come with some drawbacks. For iOS, integration requires using CocoaPods, as it is not available through Swift Package Manager (SPM), which can make the initial setup more complicated.

Additionally, while it supports multiple barcode detections, if you need to specify a region of interest, you will either have to crop the image beforehand to focus on that area or implement additional filtering logic after detection. This extra step can increase the complexity of the handling process.
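The post-detection filtering mentioned above can be quite small. The sketch below keeps only the barcodes whose frame center falls inside a region; the function `barcodes(_:inside:)` and the sample values are hypothetical, and it assumes the region and the frames are expressed in the same image coordinate space.

```swift
import Foundation

/// Hypothetical post-filter: keeps detections whose frame center lies inside `region`.
func barcodes(
    _ detected: [(value: String, frame: CGRect)],
    inside region: CGRect
) -> [(value: String, frame: CGRect)] {
    detected.filter {
        region.contains(CGPoint(x: $0.frame.midX, y: $0.frame.midY))
    }
}

let detected: [(value: String, frame: CGRect)] = [
    (value: "4006381333931", frame: CGRect(x: 40, y: 40, width: 120, height: 60)),
    (value: "0036000291452", frame: CGRect(x: 400, y: 300, width: 120, height: 60))
]
let region = CGRect(x: 0, y: 0, width: 200, height: 200)
let kept = barcodes(detected, inside: region).map { $0.value }
// kept contains only "4006381333931"
```

Testing the center rather than the whole frame is a design choice: it tolerates barcodes that slightly overlap the region’s edge.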

To integrate MLKit into your iOS project, follow the official documentation. Once MLKit is integrated into your project, you can implement barcode detection using the following code example.

import MLKitBarcodeScanning
import MLKitVision
import UIKit

// Detects barcodes in the image at `photo` and passes the first display value
// with its frame to `completion`, or nil if nothing could be decoded.
func detectWithMLKit(
    photo: URL,
    completion: @escaping ((String, CGRect)?) -> ()
) {
    guard let image = UIImage(contentsOfFile: photo.path) else {
        completion(nil)
        return
    }

    let visionImage = VisionImage(image: image)
    // Pass the photo's own orientation so the image is interpreted upright.
    visionImage.orientation = image.imageOrientation

    let barcodeScanner = BarcodeScanner.barcodeScanner()

    barcodeScanner.process(visionImage) { (barcodes, error) in
        guard let barcodes = barcodes, error == nil else {
            completion(nil)
            return
        }

        // Keep only barcodes that carry a displayable value.
        let detectedBarcodes: [(String, CGRect)] = barcodes.compactMap {
            guard let value = $0.displayValue else {
                return nil
            }
            return (value, $0.frame)
        }

        completion(detectedBarcodes.first)
    }
}

ZXingObjC

ZXingObjC is an open-source library for barcode scanning on iOS, and it’s part of the broader ZXing (Zebra Crossing) project. It supports a wide range of barcode formats, including some not covered by Vision or MLKit, such as RSS-14 and Maxicode, making it a good fit for projects that need specialized or legacy barcode support.

To integrate ZXingObjC, you can use CocoaPods or Carthage. For barcode-focused apps, the ZXCapture class offers a straightforward way to implement real-time scanning without setting up your own AVCapture session.

However, the integration process is more complex compared to other solutions. ZXingObjC can also add leading zeros to barcodes and struggles with detecting damaged barcodes. Its performance is slower than Vision and MLKit, with an average processing time of 0.3 seconds.

The accuracy of barcode detection can also be inconsistent, especially when the barcode is at a non-optimal angle, so the user may have to adjust the angle before a code is detected. ZXingObjC does not support detecting multiple barcodes simultaneously, and it lacks the ability to specify image rotation. Furthermore, while it can provide the coordinates of a detected barcode, it only returns two points, meaning you don’t get the full bounding box or frame of the barcode.

Another downside is that ZXingObjC is no longer actively maintained, and there have been no updates for some time, which raises concerns about its long-term reliability and future compatibility. As shown in the example below, the detection results can vary depending on the angle, lighting, and visibility of smaller elements.

Barcodes scanning on iOS with ZXingObjC Open-Source Library
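Because only two points are returned, a rough frame has to be estimated if the UI needs one. The sketch below treats the two points as the ends of the scan line and pads them vertically; the helper `approximateFrame` and its `thickness` parameter are hypothetical, and the thickness is a guess rather than something the library reports.

```swift
import Foundation

/// Hypothetical helper: builds an approximate frame from two result points.
/// `thickness` is an assumed barcode height, not a value reported by ZXingObjC.
func approximateFrame(from start: CGPoint, to end: CGPoint, thickness: CGFloat) -> CGRect {
    let minX = min(start.x, end.x)
    let minY = min(start.y, end.y)
    let width = abs(end.x - start.x)
    let height = abs(end.y - start.y)
    // Pad vertically so a horizontal scan line still yields a visible rectangle.
    return CGRect(x: minX, y: minY - thickness / 2, width: width, height: height + thickness)
}

let frame = approximateFrame(
    from: CGPoint(x: 20, y: 100),
    to: CGPoint(x: 220, y: 100),
    thickness: 40
)
// frame is (x: 20, y: 80, width: 200, height: 40)
```

This is only good enough for a highlight overlay; it should not be used where an exact bounding box matters.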

To add ZXingObjC to your project, follow the instructions in the GitHub repository. Once you have ZXingObjC integrated into your project, you can use the following code example to implement barcode detection.

import UIKit
import ZXingObjC

// Decodes a single EAN-13 barcode from the image at `photo` on a background queue.
// ZXingObjC gives no full frame here, so the rectangle is always `.null`.
func detectWithZXing(
    photo: URL,
    completion: @escaping ((String, CGRect)?) -> ()
) {
    guard
        let image = UIImage(contentsOfFile: photo.path),
        let cgImage = image.cgImage
    else {
        completion(nil)
        return
    }

    DispatchQueue.global().async {
        // Convert the image into the binary bitmap the reader works on.
        let source = ZXCGImageLuminanceSource(cgImage: cgImage)
        let binarizer = ZXHybridBinarizer(source: source)
        let bitmap = ZXBinaryBitmap(binarizer: binarizer)
        let reader = ZXMultiFormatReader()

        // Spend more time on decoding and look for EAN-13 only.
        let hints = ZXDecodeHints()
        hints.tryHarder = true
        hints.addPossibleFormat(kBarcodeFormatEan13)

        reader.hints = hints

        do {
            let result = try reader.decode(bitmap, hints: hints)
            if let value = result?.text {
                completion((value, .null))
            } else {
                completion(nil)
            }
        } catch {
            // `log` is the project's own logger, defined elsewhere.
            log.error(error: error)
            completion(nil)
        }
    }
}

SwiftyTesseract

SwiftyTesseract, built on Google’s Tesseract OCR, is primarily designed for optical character recognition (OCR), but it can be adapted for extracting barcode numbers when that is the main goal. It integrates easily with Swift Package Manager (SPM), but it requires additional setup, such as downloading the appropriate language training files and adding them to your project. Since SwiftyTesseract is not specifically tailored for barcode detection, its capabilities are quite limited in this context. To achieve optimal results, the image must first be cropped to the region containing the barcode, and it should be free of additional text. Furthermore, the image quality must be high; otherwise, the results may be inconsistent or inaccurate.

However, even when the image is cropped properly and of good quality, it may still miss some numbers or produce completely inaccurate results. Its performance is also a major concern, with an average processing time of around 2 seconds for a cropped image and approximately 12 seconds for the original image, making it unsuitable for real-time or high-performance barcode detection.

Additionally, it cannot be used for non-text-based barcodes. The library is quite old and is no longer actively maintained, further limiting its reliability and support.

In the example below, it sometimes reads a text-based barcode correctly, but other times it produces an entirely incorrect result.

Barcodes Recognition on iOS with SwiftyTesseract
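Since OCR output can silently contain wrong digits, it helps to reject results that fail the barcode’s own checksum. The sketch below validates the standard EAN-13 check digit; the function name `isValidEAN13` is our own, and it assumes the codes you scan are EAN-13 (other symbologies use different rules).

```swift
import Foundation

/// Returns true if `text` is exactly 13 ASCII digits (ignoring surrounding
/// whitespace) and the last digit matches the EAN-13 check digit.
func isValidEAN13(_ text: String) -> Bool {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    let digits = trimmed.compactMap { $0.isASCII ? $0.wholeNumberValue : nil }
    guard trimmed.count == 13, digits.count == 13 else { return false }

    // Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    let sum = digits.prefix(12).enumerated().reduce(0) { partial, item in
        partial + item.element * (item.offset % 2 == 0 ? 1 : 3)
    }
    return (10 - sum % 10) % 10 == digits[12]
}

let looksValid = isValidEAN13("4006381333931\n")   // true
let looksBroken = isValidEAN13("4006381333932")    // false: wrong check digit
```

A checksum catches most single-digit misreads, but it cannot prove that the decoded number is the one printed on the label.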

To integrate this library into your project, follow the steps outlined in the GitHub repository. Be sure to pay attention to the “Additional configuration” section, as you will need to add language training files to your project.

After completing the setup, you can use the following code example to implement barcode detection.

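import SwiftyTesseract
import UIKit

// Runs digit-only OCR on the `rectOfInterest` region of the image at `photo`.
// OCR gives no barcode frame, so the rectangle is always `.null`.
func detectWithTesseract(
    photo: URL,
    rectOfInterest: CGRect,
    completion: @escaping ((String, CGRect)?) -> ()
) {
    // UIImage has no cropping method of its own, so crop the backing CGImage.
    // `rectOfInterest` is therefore expected in pixel coordinates.
    guard
        let image = UIImage(contentsOfFile: photo.path),
        let croppedCGImage = image.cgImage?.cropping(to: rectOfInterest)
    else {
        completion(nil)
        return
    }
    let croppedImage = UIImage(cgImage: croppedCGImage)

    let tesseract = Tesseract(
        language: .english,
        dataSource: Bundle.main
    )
    // Restrict recognition to digits.
    tesseract.allowList = "0123456789"

    DispatchQueue.global().async {
        let result = tesseract.performOCR(on: croppedImage)
        switch result {
        case .success(let text):
            completion((text, .null))
        case .failure(let error):
            // `log` is the project's own logger, defined elsewhere.
            log.error(error: error)
            completion(nil)
        }
    }
}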

 

Final Comparison

Based on the challenges we faced and the requirements for our barcode detection system, we developed a list of criteria to compare each technology.

After researching and testing the selected technologies, we are able to conduct a comparative analysis of their performance.

Let’s do this in the form of a bar chart, with the horizontal axis showing the time taken to process the image, and the vertical axis showing the selected technologies and their results:

Solutions Comparison for Barcode Recognition on iOS

The difference between the results is significant and, in some cases, critical.

If we project these results onto the user experience, Vision and MLKit show high performance and can confidently be recommended for inclusion in a project. ZXingObjC, by contrast, needs about 300 ms per image, which is significantly slower than the other two but can still provide a comfortable user experience in real time.

SwiftyTesseract shows the worst frame processing time, so it cannot be used for real-time processing, though it remains usable for still photos or background tasks. This follows from the general OCR approach: it recognizes all characters first and only then filters for the ones we selected.
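If you want to compare the libraries on your own images, a small timing helper is enough to get average figures. This is a generic sketch, not the measurement code used for the chart above; `averageDuration` is a hypothetical name, and it only times synchronous work.

```swift
import Foundation

/// Hypothetical benchmark helper: runs `work` `runs` times and returns the
/// mean wall-clock duration of one run in seconds.
func averageDuration(runs: Int, of work: () -> Void) -> TimeInterval {
    precondition(runs > 0, "runs must be positive")
    let start = Date()
    for _ in 0..<runs {
        work()
    }
    return Date().timeIntervalSince(start) / Double(runs)
}

let seconds = averageDuration(runs: 5) {
    // Placeholder workload; replace with a call into the detector under test.
    _ = (0..<10_000).reduce(0, +)
}
```

For the callback-based detectors shown earlier, the closure would need to block until the completion handler fires, for example with a `DispatchSemaphore`, so that the whole detection is included in the measurement.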

Below is a detailed comparison of Vision, MLKit, ZXingObjC, and SwiftyTesseract based on key factors:

| Criteria | Vision | MLKit | ZXingObjC | SwiftyTesseract |
| --- | --- | --- | --- | --- |
| Ease of integration | High | Medium | Medium | Medium |
| Supported formats | Codabar, Code 39, Code 93, Code 128, EAN-8, EAN-13, ITF, UPC-A, UPC-E, Aztec, Data Matrix, PDF417, QR-code | Codabar, Code 39, Code 93, Code 128, EAN-8, EAN-13, ITF, UPC-A, UPC-E, Aztec, Data Matrix, PDF417, QR-code | Codabar, Code 39, Code 93, Code 128, EAN-8, EAN-13, ITF, UPC-A, UPC-E, Aztec, Data Matrix, PDF417, QR-code, Maxicode, RSS-14 | Only text-based |
| Performance | 0.07 sec | 0.16 sec | 0.3 sec | 2 sec |
| Accuracy | High | High | Medium | Low |
| Cross-platform | No | Yes | Yes | Yes |
| Additional info | Barcode format + frame | Barcode format + frame | Only barcode format | None |
| Multiple detection | Yes | Yes | No | No |
| Tutorials / docs | High | High | Low | Medium |
| Library support and updates | Yes | Yes | No | No |

Barcodes Recognition on iOS: Conclusion

Each barcode detection library for iOS has its advantages and disadvantages, making the choice dependent on specific project requirements.

Vision: Ideal for projects that prioritize ease of integration, high performance, and simplicity over cross-platform support and ultra-high accuracy. It offers a seamless experience with good results, making it the best choice for applications that don’t require support for multiple platforms and where barcode detection is essential but not necessarily perfect.

MLKit: The go-to solution for cross-platform applications, especially when accuracy is critical and the ability to detect even damaged barcodes is required. It is highly supported with comprehensive documentation and frequent updates, making it an excellent choice for applications that need reliable performance across both iOS and Android.

ZXingObjC: A solid option for projects needing support for barcode formats not available in Vision or MLKit, such as Maxicode and RSS-14. However, the integration is more complex, and the lack of ongoing support could lead to issues in the future. It is a good option for projects with specific barcode format requirements but less ideal for projects requiring long-term stability and maintenance.

SwiftyTesseract: Not recommended for traditional barcode detection. It’s more suitable for projects where OCR is the primary focus, with barcode detection as a secondary task. It can handle only text-based barcodes and has slower performance, making it unsuitable for high-performance barcode scanning.

Ultimately, the choice depends on your project’s goals and constraints. Will you opt for the simplicity and speed of Vision, the cross-platform power of MLKit, the extended format support of ZXingObjC, or the OCR focus of SwiftyTesseract? The decision is yours.

This exploration has been a real challenge, showing us that a seemingly simple question can lead to complex answers. Which solution would you choose?