Adobe · Filed Nov 18, 2024 · Published May 21, 2026 · verified — real USPTO data

Adobe Patents a Fix for AI Images That Mix Up Objects and Locations

By Patentlyze Team · Updated Jul 10, 2026

Ask an AI image generator to put 'a red ball on the left and a blue cube on the right' and you'll often get the ball on the right, the cube on the left, or both crammed into the same corner. Adobe's new patent targets exactly that failure mode.

Figure from the official USPTO publication.

Publication number: US 2026/0141572 A1

Applicant: ADOBE INC.

Filing date: Nov 18, 2024

Publication date: May 21, 2026

Inventors: Aravindan Kamatchi Sundaram, Ujjayan Pal, Abhimanyu Chauhan, Aishwarya Agarwal, Srikrishna Karanam

CPC classification: 345/418

Grant likelihood: Medium

Examiner: BROWN, SHEREE N (Art Unit 2612)

Status: Docketed New Case - Ready for Examination (Dec 23, 2024)

Document: 20 claims

Why Adobe's AI image fix matters for text prompts

Imagine you type a prompt like 'a cat sitting on a chair next to a dog lying on a rug' into an AI image tool. The AI understands the words, but when it renders the image, the cat and dog end up overlapping, or the chair and rug swap positions. It's one of the most frustrating limitations of today's text-to-image tools.

Adobe's patent describes a process that catches this problem before the final image is drawn. Instead of just running the prompt through the model and hoping for the best, the system generates a rough intermediate version first, then checks whether each object is landing in the right place. If something is off, it adjusts that intermediate output — a kind of structured noise — until the layout matches the prompt.

Only after that corrected layout is confirmed does the model render the final synthetic image. Think of it like a director blocking actors on a stage before the cameras roll, rather than fixing it in post-production.

How attention contrast loss guides the noise optimization

Text-to-image diffusion models (the kind that power tools like Adobe Firefly, Midjourney, and DALL·E) work by starting from random noise and gradually refining it into a coherent image. The attention mechanism inside these models — the part that decides which pixels 'belong to' which words in your prompt — often struggles to keep distinct objects spatially separated, especially when the prompt describes two or more elements at specific locations.
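
To make the 'which pixels belong to which words' idea concrete, here is a minimal sketch of how per-word attention maps can be computed. It is a toy illustration, not code from the patent: the function name cross_attention_maps, the four-pixel strip, and the two-dimensional query and key vectors are all assumptions chosen to keep the example small.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_maps(pixel_queries, token_keys):
    """Return one map per token: maps[t][p] is the weight pixel p gives token t.

    Hypothetical helper: scaled dot-product scores between each pixel's
    query vector and each prompt token's key vector, softmaxed over tokens.
    """
    scale = 1.0 / math.sqrt(len(token_keys[0]))
    per_pixel = []
    for q in pixel_queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in token_keys]
        per_pixel.append(softmax(scores))
    # Transpose from [pixel][token] to [token][pixel].
    return [[row[t] for row in per_pixel] for t in range(len(token_keys))]

# Made-up values: four pixels in a row, two prompt tokens ("ball", "cube").
pixel_queries = [[4.0, 0.0], [3.0, 0.0], [0.0, 3.0], [0.0, 4.0]]
token_keys = [[1.0, 0.0], [0.0, 1.0]]
ball_map, cube_map = cross_attention_maps(pixel_queries, token_keys)
```

In this toy setup the two left pixels attend mostly to "ball" and the two right pixels to "cube". The failure the article describes corresponds to maps like these overlapping heavily instead of splitting cleanly.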

Adobe's approach introduces an attention contrast loss: a penalty score that measures how well the model's internal attention maps separate the first described element from the second. A high loss means the attention for the two elements overlaps, so the model is confusing where things should go; a low loss means each element's attention is concentrated in its own region. The loss is calculated on an intermediate output, which in diffusion-model terms is the partially denoised latent (a compressed numerical representation of the image in progress, not yet a visible picture).
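
The patent summary does not spell out the formula, so the following is only one plausible way to score separation, assumed here for illustration: normalize each element's attention map so it sums to 1, then add up the overlap between the two maps. The function names and the 2x2 example maps are hypothetical.

```python
def normalize(attn_map):
    """Scale a 2D attention map so that all of its values sum to 1."""
    total = sum(sum(row) for row in attn_map)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [[v / total for v in row] for row in attn_map]

def attention_contrast_loss(map_a, map_b):
    """Assumed overlap score in [0, 1]: 0 = fully separated, 1 = identical maps.

    This is an illustrative stand-in, not the loss defined in the patent.
    """
    a = normalize(map_a)
    b = normalize(map_b)
    return sum(min(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

# Element A attends only to the left column, element B only to the right.
left_only = [[1.0, 0.0], [1.0, 0.0]]
right_only = [[0.0, 1.0], [0.0, 1.0]]
# A map smeared evenly over the whole image.
smeared = [[1.0, 1.0], [1.0, 1.0]]

loss_separated = attention_contrast_loss(left_only, right_only)
loss_partial = attention_contrast_loss(left_only, smeared)
loss_confused = attention_contrast_loss(smeared, smeared)
```

Under this assumed definition the score rises as the two maps overlap more, which matches the article's reading that a high loss signals a confused layout.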

The system then runs an optimization loop on that intermediate output:

1. Generate an intermediate noisy latent from the input prompt
2. Measure the attention contrast loss to see if elements are spatially separated
3. Adjust the latent to reduce the loss (push each element toward its intended location)
4. Feed the corrected latent back into the image generation model to produce the final image
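
The loop above can be sketched with a deliberately tiny stand-in for the real model. Everything here is an assumption made for illustration: the "latent" is just two rows of scores over a four-position strip, the "attention map" is a softmax of each row, the loss adds a placement penalty to an overlap term, and gradients come from finite differences rather than backpropagation through a diffusion model.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layout_loss(latent, left, right):
    """Toy loss: overlap between the two maps plus mass outside each target region."""
    a = softmax(latent[0])  # attention for element A (should sit on the left)
    b = softmax(latent[1])  # attention for element B (should sit on the right)
    overlap = sum(x * y for x, y in zip(a, b))
    misplaced_a = 1.0 - sum(a[i] for i in left)
    misplaced_b = 1.0 - sum(b[i] for i in right)
    return overlap + misplaced_a + misplaced_b

def numeric_grad(f, latent, eps=1e-5):
    """Central finite-difference gradient of f with respect to every latent entry."""
    grad = [[0.0] * len(row) for row in latent]
    for r, row in enumerate(latent):
        for c in range(len(row)):
            original = row[c]
            row[c] = original + eps
            up = f(latent)
            row[c] = original - eps
            down = f(latent)
            row[c] = original
            grad[r][c] = (up - down) / (2 * eps)
    return grad

def optimize_latent(latent, left, right, steps=200, lr=1.0):
    """Plain gradient descent on the latent; returns the result and the loss history."""
    latent = [list(row) for row in latent]  # work on a copy
    f = lambda z: layout_loss(z, left, right)
    history = [f(latent)]
    for _ in range(steps):
        grad = numeric_grad(f, latent)
        for r in range(len(latent)):
            for c in range(len(latent[r])):
                latent[r][c] -= lr * grad[r][c]
        history.append(f(latent))
    return latent, history

# Start from a "swapped" layout: A leans right, B leans left.
initial = [[0.0, 0.1, 0.3, 0.2], [0.2, 0.3, 0.1, 0.0]]
left, right = [0, 1], [2, 3]
corrected, history = optimize_latent(initial, left, right)
mass_a_left = sum(softmax(corrected[0])[i] for i in left)
mass_b_right = sum(softmax(corrected[1])[i] for i in right)
```

In a real pipeline the corrected latent would then be handed back to the image model for the final denoising pass (step 4), and the gradient would be computed by automatic differentiation through the model's attention layers. This sketch stops at the corrected latent.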

The key insight is that fixing the layout at the noise/latent stage is cheaper and more reliable than trying to fix a fully rendered image after the fact. The final image then inherits the corrected spatial structure.

What this means for Firefly's prompt-following accuracy

For anyone who uses Adobe Firefly or similar generative tools professionally — designers, marketers, content teams — the inability to reliably place objects where you ask for them is a real workflow bottleneck. Right now, getting a correctly composed image often means iterating through dozens of generations or falling back to manual editing in Photoshop. A system that bakes spatial fidelity into the generation process could meaningfully cut that iteration time.

This points to a wider problem the industry is still working through: text-to-image models are fluent in style and texture but clumsy at spatial reasoning. Adobe's patent takes a targeted, optimization-based approach rather than training a new model from scratch, which suggests it could be layered on top of existing Firefly infrastructure without a full model overhaul.

Editorial take

This is a genuinely useful engineering contribution to a well-documented pain point in generative AI. It's not flashy research (attention-based spatial control has been an active area since 2023's Attend-and-Excite paper), but Adobe is translating that research direction into a concrete, patentable pipeline that fits its existing products. If this ships in Firefly, most users won't know it's there; they'll just notice that their prompts stop lying to them.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.