SAM Audio: Redefining Sound with Multi-Modal AI
The Moment Audio Caught Up with Vision
If you’ve ever tapped on a person in an Instagram photo and watched them lift cleanly from the background as a sticker – that’s Meta’s Segment Anything Model working quietly behind the scenes. Instagram’s Cutouts feature, confirmed by Meta to be built on SAM, turned what used to be a careful manual masking job into a single tap. And with SAM 3, released in late 2025, the model took another step: it added text prompting. Instead of clicking, you can now type “yellow school bus” into a video and the model finds it, segments it, and tracks it through every frame. Now Meta has brought the same thinking to audio.
SAM Audio, released by Meta’s Superintelligence Labs in December 2025, does for sound what SAM did for images. Describe it, point to it in a video frame, or mark a moment in the timeline of a waveform where it’s clearly audible and the model separates it from everything else. One model covering tasks that previously each required their own specialized system: speech denoising, stem separation, speaker diarization, instrument isolation, arbitrary sound extraction – the full range of audio processing services now addressable through a single promptable interface. As a foundation model for multi-modal AI, it unifies text, vision, and temporal cues into a single generative system for sound. In this blog, we’ll walk through what the model can do, how well it actually performs in practice, and what the paper contributes to the audio AI field – including some details worth paying close attention to if you’re thinking about integrating or building on this technology.
Tell It What You Want: Three Ways to Talk to SAM Audio
The most natural way to think about SAM Audio is as a model with three different input channels, each suited to a different situation. They can be used independently or combined for more precise control.
Text prompt
The simplest starting point. You type what you want to extract – “acoustic guitar,” “crowd noise,” “male speech,” “violin section” – and the model does its best to find and isolate that sound within the mix. In practice, this turns the system into a powerful AI audio separation engine that behaves almost like a search bar for sound: most of the time, good-enough phrasing gets you what you need. It works well when the target is easy to describe and acoustically distinct from everything else around it. Where it starts to struggle is when the target is hard to pin down in words – close variants of similar sounds, for example, or a specific effect within a dense soundscape where text alone can’t fully disambiguate what you’re after. For similar-sounding targets coexisting in the same mix, the video mask prompt is usually the handiest approach.
Video mask
When there’s accompanying video, this prompting mode becomes genuinely powerful. Instead of describing the sound, you show the model where it’s coming from. Click on the drummer in the frame, draw around the guitar amplifier, tap on the speaker whose voice you want to isolate – and the model extracts the audio associated with that visual region. For anyone working with video content, this replaces what would previously have required access to the original multitrack sessions. You just point. For musicians hunting a particular sound effect, though, span prompting is where the model gets most interesting and becomes a truly audio-native tool.
Span prompt
This is the most distinctive feature, and the one most worth understanding if you’re a sound engineer or producer. Instead of describing the sound or pointing to a visual source, you highlight a short segment of the audio timeline – a moment where the target sound is clearly audible and relatively isolated. That segment becomes a fingerprint. The model uses it to find and extract acoustically similar material throughout the rest of the recording. Think of a field recording where a specific bird call appears and disappears across ten minutes of material, or a backyard capture where you want to pull out just the sound of wind rattling a piece of metal – a texture you heard once clearly, buried somewhere in the mix. Find one clean moment where it’s audible, mark it, and the model hunts it across the whole track, isolating it from the residual sounds.
All three modes can be used on their own or combined in any pairing.
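To make the three channels concrete, here is a minimal sketch of what a combined request might look like. Every name here (`TextPrompt`, `SpanPrompt`, `SeparationRequest`, and so on) is illustrative only – SAM Audio’s actual interface may be structured quite differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical prompt shapes -- illustrative only, NOT SAM Audio's published API.
@dataclass
class TextPrompt:
    description: str                         # e.g. "acoustic guitar"

@dataclass
class VideoMaskPrompt:
    frame_index: int                         # which video frame the mask lives on
    polygon: List[Tuple[float, float]]       # (x, y) points around the visual source

@dataclass
class SpanPrompt:
    intervals: List[Tuple[float, float]]     # (start_s, end_s) where the target is audible

@dataclass
class SeparationRequest:
    """One request may carry a single prompt or any pairing of them."""
    text: Optional[TextPrompt] = None
    video_mask: Optional[VideoMaskPrompt] = None
    span: Optional[SpanPrompt] = None

    def active_modes(self) -> List[str]:
        modes = [("text", self.text), ("video_mask", self.video_mask), ("span", self.span)]
        return [name for name, prompt in modes if prompt is not None]

# Text + span -- the pairing the paper reports as stronger than text alone:
req = SeparationRequest(text=TextPrompt("bird call"), span=SpanPrompt([(12.4, 14.1)]))
print(req.active_modes())  # ['text', 'span']
```

The point of the sketch is simply that the modes are orthogonal inputs: each one narrows down *what* or *when*, and a pairing narrows down both at once.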
How Good Does It Actually Sound?
This is the question that matters most for anyone considering using the model in a real workflow, and it deserves an honest answer rather than a recitation of benchmark numbers.
The first thing to understand is that SAM Audio is a generative model. It doesn’t carve a sound out of a mix the way a mask-prediction model would. It synthesizes a new version of the target sound, conditioned on the mixture and the prompt. This is an important distinction when it comes to evaluation. The standard objective metric for separation quality is SDR (Signal-to-Distortion Ratio), which works by comparing the model’s output sample-by-sample against a clean reference recording of the target sound. But since SAM Audio generates rather than extracts, even minor differences in timing between the output and the reference, things that are completely imperceptible to a human listener, cause SDR scores to drop. Comparing SAM Audio’s raw SDR numbers against those of a mask-based discriminative model is like comparing two different kinds of instruments on a scale calibrated for only one of them. The paper acknowledges this directly, which is part of why the authors developed new evaluation methods (more on that in the next section).
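The timing sensitivity is easy to demonstrate numerically. Below is a toy sketch using the plain SDR definition (not the full BSS Eval variant): a near-perfect copy of a tone scores very high, while the *same* tone delayed by a single millisecond – perceptually identical – scores below zero.

```python
import numpy as np

def sdr(reference, estimate):
    """Plain signal-to-distortion ratio in dB: 10 * log10(||ref||^2 / ||ref - est||^2)."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

sr = 16_000
t = np.arange(sr) / sr
ref = np.sin(2 * np.pi * 440.0 * t)   # one second of a 440 Hz tone as the "clean stem"

near_copy = 0.999 * ref               # tiny gain error: inaudible, SDR stays very high
shifted = np.roll(ref, 16)            # the identical sound, delayed by 1 ms (16 samples)

print(f"near copy: {sdr(ref, near_copy):.1f} dB")   # around 60 dB
print(f"1 ms late: {sdr(ref, shifted):.1f} dB")     # collapses below 0 dB
```

This is exactly the failure mode the paper is guarding against: a generative model can emit audio a listener would judge as an excellent extraction while sample-level comparison declares it worse than silence.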
With that caveat in mind, across the benchmarks where comparison is meaningful (general sound separation, speech, music, and instrument separation in both in-the-wild and professionally produced recordings) – SAM Audio outperforms prior general-purpose models and reaches competitive or state-of-the-art performance against specialized systems that were purpose-built for a single domain. That’s a significant result: one model beating tools that were designed to do only one thing.
In real-world use, though, there’s a practical nuance worth knowing. The training data included not only high-quality studio recordings but also a large volume of medium-quality video audio, the kind of material you’d find on the internet, recorded with phone microphones in noisy environments. This is actually what gives the model its broad generalization ability. But it also means the model has learned many sounds as they typically exist in the wild: embedded in their natural context, surrounded by ambient noise. When you try to isolate something like an ambulance siren, or the sound of a passing train, or distant crowd noise (sounds that in real recordings almost never appear in silence) – you’ll often find that the extracted result carries some of that environmental texture with it. The model isn’t doing something wrong; it’s producing what it learned is a realistic version of how that sound exists in the world. It’s something to factor in when choosing whether SAM Audio is the right tool for a given task, versus a more specialized model trained on cleaner data for a specific domain.
Under the Hood: What the Paper Actually Contributes
From here, we’re getting more technical. This section is aimed primarily at deep learning engineers and researchers. The paper makes contributions in four areas worth looking at closely: the generative architecture, span prompting, a new automatic evaluation model, and a new approach to subjective evaluation.
Architecture
SAM Audio is built on a Diffusion Transformer trained with flow matching. The core framing is important: the source audio mixture is not the direct input being processed – it’s a conditioning signal. The model learns to generate the target stem by iteratively refining from noise, guided jointly by the audio mixture and the user’s prompt. This is fundamentally different from the mask-prediction paradigm that has dominated audio separation (Demucs, MossFormer2, and most production-grade separation tools fall into this category). Those models effectively learn a filter: given the mixture, predict a mask that when applied recovers the target. SAM Audio doesn’t filter – it generates. The architecture is much closer to how modern music or image generation models work, where the source material conditions the generation process rather than being directly transformed. Prompt encodings – text via a language encoder, video via visual features from masked regions, spans by tokens aligned to audio frames – are all injected as conditioning into the Diffusion Transformer alongside the mixture representation.
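The conditioning-vs-input distinction can be sketched in a few lines. This is a schematic of a single flow-matching training step on toy arrays, not the actual model (which operates on learned latent audio representations); every shape and name here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for latent audio frames: (frames, channels).
target_stem = rng.standard_normal((100, 8))   # x1: the stem the model learns to generate
mixture     = rng.standard_normal((100, 8))   # conditioning -- NOT the tensor being denoised
prompt_emb  = rng.standard_normal((1, 8))     # text/video/span embedding (illustrative)

# Flow matching: sample a time t, interpolate between noise x0 and data x1...
x0 = rng.standard_normal(target_stem.shape)
t = rng.uniform()
x_t = (1.0 - t) * x0 + t * target_stem

# ...and regress the network toward the constant velocity x1 - x0 along that path.
velocity_target = target_stem - x0

# The DiT sees the noisy interpolant alongside the mixture representation;
# here we mimic that with a simple channel-wise concatenation.
model_input = np.concatenate([x_t, mixture], axis=-1)

# Training objective (schematic):
#   loss = || f_theta(model_input, t, prompt_emb) - velocity_target ||^2
print(model_input.shape)  # (100, 16)
```

The contrast with a mask-prediction model is visible in the data flow: nothing here multiplies the mixture by a learned filter. The mixture only steers generation from noise toward the target stem.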
Span prompting
Technically, span conditioning works by converting the selected time intervals into a frame-synchronous binary token sequence. Each frame is marked as either silent (<sil>) or active (+) depending on whether the target event is present. This sequence is embedded via a learnable embedding table and concatenated channel-wise with the audio features before entering the DiT backbone, giving the model an explicit temporal prior about when to extract rather than what the sound is. This is fundamentally different from reference-based speaker extraction methods, which require a clean enrollment recording of the target source. SAM Audio needs only a timestamp.
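The interval-to-token conversion is simple enough to sketch directly. The function name and the 50 Hz frame rate below are my assumptions for illustration; the `<sil>`/`+` vocabulary follows the paper’s description.

```python
def spans_to_tokens(intervals, num_frames, frame_rate_hz=50):
    """Convert (start_s, end_s) spans into a frame-synchronous token sequence.

    Frames inside any span are marked active ('+'), all others silent ('<sil>'),
    mirroring the binary span conditioning described in the paper. The frame
    rate is an assumed value for illustration.
    """
    tokens = ["<sil>"] * num_frames
    for start_s, end_s in intervals:
        first = max(0, int(start_s * frame_rate_hz))
        last = min(num_frames, int(end_s * frame_rate_hz))
        for i in range(first, last):
            tokens[i] = "+"
    return tokens

# A 0.2 s clip at 50 frames/s with the target audible from 0.04 s to 0.10 s:
print(spans_to_tokens([(0.04, 0.10)], num_frames=10))
```

In the real model, this token sequence is then embedded through the learnable table and concatenated channel-wise with the audio features, so the temporal prior travels through the DiT frame-aligned with the mixture.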
One particularly useful implementation detail from the paper: when a user provides only a text prompt, it can use a secondary model called PEA-Frame that automatically predicts the frame-level activity of that sound event within the mixture and feeds it back as span conditioning. The paper shows that joint text + span conditioning consistently outperforms text-only prompting, so when you describe a sound in words, the model can quietly upgrade itself to also know when that sound is happening.
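The text-to-span upgrade path can be sketched as a small pipeline step. PEA-Frame’s interface is not published, so `predict_activity` below is a hypothetical stand-in for it, and the helper name is mine.

```python
def upgrade_text_prompt(text_prompt, num_frames, predict_activity):
    """Derive frame-level span conditioning when the user gave only text.

    `predict_activity` stands in for PEA-Frame: a callable returning a
    per-frame 0/1 activity estimate for the described sound event
    (hypothetical interface -- the paper does not publish PEA-Frame's API).
    """
    activity = predict_activity(text_prompt, num_frames)
    span_tokens = ["+" if a else "<sil>" for a in activity]
    return {"text": text_prompt, "span_tokens": span_tokens}

# A dummy predictor that marks the middle third active, just to exercise the flow:
dummy = lambda text, n: [1 if n // 3 <= i < 2 * n // 3 else 0 for i in range(n)]
out = upgrade_text_prompt("ambulance siren", 9, dummy)
print(out["span_tokens"])  # middle three frames active
```

The net effect matches the paper’s finding: the downstream separator always receives joint text + span conditioning, even when the user typed nothing but a description.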
SAM Audio Judge (SAJ)
Evaluating generative separation models is genuinely hard. SDR requires a clean reference, which doesn’t exist for real-world recordings, and is also biased against generative outputs for the timing reasons described above. CLAP similarity – a reference-free metric that measures how well the output matches the text description in an embedding space – turns out to correlate poorly with actual human judgment of separation quality, as the paper demonstrates. SAJ is a model trained directly on human perceptual quality annotations. It takes a mixture, a text prompt, and a separated output and predicts a quality score without needing a reference, and its predictions correlate much more strongly with human listening test results than CLAP similarity does. This matters well beyond SAM Audio itself – SAJ is a genuinely useful contribution for anyone working on audio separation evaluation, providing a way to assess real-world separation quality without the constraints of synthetic benchmarks.
Subjective evaluation protocol
The paper also rethinks how listening tests are designed for this class of model. This is a harder problem than it looks, and the paper redesigns the evaluation protocol from scratch rather than reaching for the standard tools.
MUSHRA (the traditional approach) requires a clean reference stem to compare against, which simply doesn’t exist for real in-the-wild recordings. Single-stimulus MOS tests avoid that dependency but are poorly sensitive to small differences between models and prone to anchoring drift across a session. The paper sidesteps both with a side-by-side Absolute Category Rating protocol with an always-on preference tie-breaker. Listeners see two model outputs for the same input alongside the original mixture and prompt, scoring each independently across three dimensions: Recall (how much target made it into the extraction), Precision (how much non-target leaked in), and Faithfulness (how similar the extracted sound is to the original mix). They then answer a forced-choice preference question regardless of whether their numerical scores already differ. The SAJ model described earlier was trained to predict these exact human scores – which is why it correlates so much more strongly with listener judgment than SDR or CLAP similarity. The subjective protocol isn’t just a validation step; it’s the foundation the automatic metric was built on.
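The per-trial logic can be sketched as follows. The three dimension names come from the paper; the aggregation rule below (summing the scores, falling back to the forced choice on ties) is an illustrative assumption, not the paper’s exact analysis.

```python
from dataclasses import dataclass

@dataclass
class ACRRating:
    """One listener's scores for one model output (e.g. on a 1-5 scale)."""
    recall: int        # how much of the target made it into the extraction
    precision: int     # how little non-target leaked in
    faithfulness: int  # how close the extraction is to the sound in the mix

def trial_winner(a: ACRRating, b: ACRRating, forced_choice: str) -> str:
    """Decide a per-trial winner between systems 'A' and 'B'.

    Listeners score both outputs independently, then ALWAYS answer a
    forced-choice preference; this sketch falls back to it when the
    totals tie. Illustrative aggregation -- the paper's analysis may differ.
    """
    total_a = a.recall + a.precision + a.faithfulness
    total_b = b.recall + b.precision + b.faithfulness
    if total_a != total_b:
        return "A" if total_a > total_b else "B"
    return forced_choice  # the always-on tie-breaker earns its keep here

print(trial_winner(ACRRating(4, 4, 5), ACRRating(4, 4, 5), forced_choice="B"))
```

This is why the tie-breaker is asked unconditionally: with coarse category scales, two competitive systems tie often, and the forced choice recovers a preference signal the numerical scores alone would discard.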
Taken together, these contributions mark a break with how the field has operated. Target sound separation has been dominated by discriminative, mask-predicting models; SAM Audio is a fully generative pipeline end-to-end – the first foundation model of its kind to unify text, visual, and temporal span prompting in a single framework, and to match or exceed specialized systems across multiple domains simultaneously.
Conclusions
SAM Audio is a meaningful step for the audio AI field as a genuine foundation model that consolidates tasks previously requiring separate specialized systems into one flexible, promptable interface, covering speech, music, and general sound.
The real excitement, though, is what comes next. A model like this is a building block. It enables agents that can understand audio contextually and respond to natural language instructions – a mix engineer describing what they want rather than reaching for a plugin, a post-production workflow that can process field recordings based on spoken instructions, a creative tool for musicians that lets them explore separation as part of composition. The combination of span prompting with generative quality opens up use cases that simply weren’t possible before.