Sony Patents an AI That Builds Music by Filling In Blanks, Over and Over
Sony is patenting an AI audio engine that works like a digital mad-lib — it deliberately blanks out parts of a sound file, fills them in, then repeats the process until finished music emerges. The result, the company claims, is audio generation fast enough to happen in real time.
How Sony's iterative audio-patching approach generates music
Imagine writing a story by first placing placeholder blanks where words should go, then filling each one in — and doing that over and over until the whole thing reads naturally. Sony's patent applies that same idea to music and audio.
In concrete terms, the system starts with audio that has intentional 'gaps': chunks of sound that are missing or masked. An AI model fills those gaps in, producing a repaired version. Then the system looks at that repaired audio, decides which parts still need work, masks them again, and fills them in once more. The cycle runs a fixed number of times, and the output of the last pass is the final audio.
The goal is to make AI-generated audio fast enough that it can happen as you listen — or close to it. That matters for any tool where waiting around for a clip to render would break the creative flow, whether that's a game, a music app, or a production studio.
How the mask-and-repair loop assembles final audio output
The system Sony describes is built around a technique called iterative masked synthesis — a process borrowed from image-generation AI but applied here to audio spectrograms (visual maps of sound frequencies over time).
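To make "spectrogram" concrete, here is a minimal sketch using only the Python standard library. It is an illustration and is not taken from the patent: the function name `spectrogram`, the frame size, the hop length, and the test tone are all assumptions chosen for the example, and a real system would use an optimized FFT rather than this naive transform.

```python
import cmath
import math

def spectrogram(samples, frame_size, hop):
    """Magnitude spectrogram via a naive DFT: one row per time frame,
    one column per non-negative frequency bin."""
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        row = []
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            row.append(abs(total))
        rows.append(row)
    return rows

# Hypothetical input: a 1 kHz tone sampled at 8 kHz, 64 samples long.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(64)]
spec = spectrogram(tone, frame_size=16, hop=8)

# "Masking" here just means blanking whole time frames (rows 2 and 3).
mask = [t in (2, 3) for t in range(len(spec))]
masked_spec = [None if hidden else row for row, hidden in zip(spec, mask)]
```

With a 16-sample frame at 8 kHz, the bins are 500 Hz apart, so the tone's energy sits in bin 2 of every row. The masked rows are the "blanks" a repair model would be asked to fill.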
The loop has three parts, all of which run on the CPU:
- Repairs masked audio data — the AI model fills in the blanked-out sections of a sound representation, making a best guess at what should be there.
- Extracts new mask positions — after the repair, the system identifies which parts of the audio are still uncertain or low-quality and marks those as the next targets.
- Repeats — this repair-and-remask cycle runs a set number of times, progressively refining the audio with each pass.
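The three steps above can be sketched as a short loop. This is a toy under stated assumptions, not Sony's implementation: `repair` stands in for the AI model and simply averages neighbouring values, the confidence scores are random placeholders, and the shrinking re-mask schedule is an invented choice that guarantees nothing is left masked after the last pass.

```python
import random

def repair(frames, mask, rng):
    """Stand-in for the model: fill each masked frame with a rough guess.

    Returns (repaired_frames, confidence). The confidence values are
    random placeholders for how sure a real model would be.
    """
    repaired = list(frames)
    confidence = [1.0] * len(frames)
    for i, is_masked in enumerate(mask):
        if is_masked:
            left = repaired[i - 1] if i > 0 else 0.0
            right = frames[i + 1] if i + 1 < len(frames) else 0.0
            repaired[i] = (left + right) / 2 + rng.uniform(-0.05, 0.05)
            confidence[i] = rng.random()
    return repaired, confidence

def extract_mask(confidence, previous_mask, remask_ratio):
    """Re-mask the least confident of the frames that were just repaired."""
    candidates = [i for i, m in enumerate(previous_mask) if m]
    candidates.sort(key=lambda i: confidence[i])
    remask = set(candidates[:int(len(candidates) * remask_ratio)])
    return [i in remask for i in range(len(confidence))]

def generate(frames, mask, num_passes, seed=0):
    """Run the repair-and-remask cycle a fixed number of times."""
    rng = random.Random(seed)
    for step in range(num_passes):
        frames, confidence = repair(frames, mask, rng)
        # Assumed schedule: the re-masked share falls to zero on the last pass.
        remask_ratio = 1.0 - (step + 1) / num_passes
        mask = extract_mask(confidence, mask, remask_ratio)
    return frames, mask

# Hypothetical input: 8 frames, 5 of them masked (their values are ignored).
frames = [0.2, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.5]
mask = [False, True, True, True, False, True, True, False]
audio, final_mask = generate(frames, mask, num_passes=4)
print(audio)
```

Frames that were never masked pass through unchanged. Only the masked positions are revised, and each pass revisits fewer of them.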
The patent emphasizes that this runs on a CPU rather than requiring specialized AI chips, which is notable because most generative audio models lean heavily on GPUs. Running on a CPU lowers the hardware barrier significantly. The iterative design also means the system can trade quality for speed by simply running fewer loops — useful for real-time applications where a slightly imperfect result delivered instantly beats a perfect one delivered too late.
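The quality-for-speed trade comes down to simple arithmetic: a fixed time budget divided by the cost of one pass gives the number of passes that fit. The sketch below assumes every pass costs about the same, and the function name and millisecond figures are hypothetical, not numbers from the patent.

```python
def passes_within_budget(budget_ms, ms_per_pass, max_passes):
    """Largest whole number of passes that fits the time budget.

    Capped at max_passes, and never below 1 so that some audio
    is always produced.
    """
    if ms_per_pass <= 0:
        raise ValueError("ms_per_pass must be positive")
    affordable = int(budget_ms // ms_per_pass)
    return max(1, min(max_passes, affordable))

# Hypothetical figures: a 20 ms deadline and 6 ms per pass leave room for 3 passes.
print(passes_within_budget(budget_ms=20, ms_per_pass=6, max_passes=8))
```

An offline render with a loose deadline would hit the `max_passes` cap instead, which is where the extra refinement goes.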
What this means for real-time AI music tools
Real-time AI audio generation is one of the harder problems in creative tech. Most current tools require you to wait — sometimes seconds, sometimes longer — for a generated clip to render. Sony's looping approach is designed to cut that wait down to something imperceptible, which could matter a lot for interactive applications like games, live performances, or adaptive soundtracks that change with what's happening on screen.
Sony's music and audio division is one of the largest in the world, and the company has been quietly building AI music tools under its research arm. A real-time generation engine that runs on standard CPU hardware — rather than expensive cloud GPUs — would make that technology far more accessible to developers and creators who don't have specialized infrastructure. For you as a user, that could eventually show up as a 'generate background music' button that actually works instantly inside a consumer app.
Masked generative modeling for audio is well established in research circles, so the technique itself is not the news. Sony's specific claim of real-time performance on a CPU is the part worth watching. If it delivers on that promise, it closes a meaningful gap between what AI audio can do in a lab and what works in a shipped product.
Editorial commentary on a publicly published patent application. Not legal advice.