Google · Filed Apr 11, 2025 · Published May 21, 2026

Google Patents a Position-Aware Slot Attention System for Object-Centric AI

Most AI vision systems struggle when objects move around or appear at different scales — they have to re-learn what they already knew. Google's new patent tackles that by giving each detected 'object slot' its own local coordinate system that travels with the object.

Google Patent: Translation & Scaling Equivariant Slot Attention — figure from US 2026/0141705 A1
FIG. 1A — rendered from the official USPTO publication PDF.
Publication number: US 2026/0141705 A1
Applicant: Google LLC
Filing date: Apr 11, 2025
Publication date: May 21, 2026
Inventors: Aravindh Mahendran, Ondrej Biza, Thomas Kipf, Simon Jacob van Steenkiste, Gamaleldin Elsayed, Seyed Mohammad Mehdi Sajjadi
CPC classification: 382/156
Grant likelihood: Medium
Examiner: CENTRAL, DOCKET (Art Unit OPAP)
Status: Docketed New Case - Ready for Examination (Feb 20, 2026)
Parent application: National Stage Entry of PCTUS2022079903 (filed 2022-11-15)
Document: 20 claims

What Google's slot attention fix actually does

Imagine you're teaching an AI to recognize a dog in a photo. The system learns the dog just fine — but slide the dog to the other side of the image, or zoom in, and it can get confused because it memorized the dog's position as much as its appearance. That's a real problem for AI models that try to separate a scene into individual objects.

Google's patent describes a way to fix this using something called slot attention — a technique where the AI carves up a scene into a set of "slots," each one representing a distinct object. The twist here is that each slot gets its own local reference frame: a floating coordinate system centered on that object, not on the whole image.

When the scene changes — because the object moves or the camera zooms — the slot's coordinate system updates automatically. That means the AI's understanding of each object stays consistent even as its position in the frame shifts. It's the difference between remembering a face versus remembering where in the photo that face appeared.

How entity-centric position vectors anchor each slot

The patent describes a neural network layer that processes feature vectors (compressed representations of patches in an image or video) alongside absolute positional encodings — numbers that describe where each patch sits in the full image grid.

For each object "slot" (called an entity-centric latent representation), the system maintains a position vector that tracks that object's center in the scene. Rather than comparing slots to patches using absolute coordinates, it computes relative positional encodings — essentially, "how far is this patch from where I think this object is?" That relative framing is what gives the system its equivariance property (meaning the output transforms predictably when the input is translated or scaled, rather than breaking entirely).
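
The relative framing can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name, the [-1, 1] grid coordinates, and the per-slot scale divisor are assumptions made for illustration (the scale term is inferred from the "Translation & Scaling" title, not from the description above).

```python
def relative_positions(patch_positions, slot_position, slot_scale=1.0):
    """Express absolute patch positions in one slot's local frame.

    Hypothetical helper: subtract the slot's center, then divide by an
    assumed per-slot scale factor.
    """
    sx, sy = slot_position
    return [((px - sx) / slot_scale, (py - sy) / slot_scale)
            for px, py in patch_positions]


# Absolute positions of four patches on a coarse grid (assumed coordinates).
patches = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
slot_center = (0.25, 0.0)
rel = relative_positions(patches, slot_center)

# Translate the whole scene: patches and slot center move together.
shift = (0.25, -0.125)
moved_patches = [(px + shift[0], py + shift[1]) for px, py in patches]
moved_center = (slot_center[0] + shift[0], slot_center[1] + shift[1])
rel_moved = relative_positions(moved_patches, moved_center)

# Zoom the whole scene by 2x: positions and the slot's scale double together.
zoomed_patches = [(2 * px, 2 * py) for px, py in patches]
zoomed_center = (2 * slot_center[0], 2 * slot_center[1])
rel_zoomed = relative_positions(zoomed_patches, zoomed_center, slot_scale=2.0)

print(rel == rel_moved, rel == rel_zoomed)
```

In this toy setup the relative encodings are identical before and after the shift and the zoom, provided the slot's own position and scale follow the object. That is the sense in which the slot "sees" the same thing wherever the object sits.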

The attention matrix, a table of weights that determines how strongly each slot "pays attention to" each patch, is built from three ingredients:

  • Feature vectors passed through a key function
  • Slot representations passed through a query function
  • The relative positional encodings for each slot
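
One way the three ingredients above might combine is sketched below in plain Python. The article does not say exactly how they are combined, so every detail here is an assumption: the key and query functions are identity maps, the relative position is added to each feature through a hypothetical linear projection `pos_proj`, and the weights are normalized across slots, as in standard slot attention.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_matrix(features, patch_positions, slots, slot_positions, pos_proj):
    """Return attn[i][j], the weight slot i places on patch j.

    Assumptions (not from the patent text): identity key/query functions,
    additive position term, softmax taken over slots for each patch.
    """
    dim = len(features[0])
    logits = []
    for slot, (sx, sy) in zip(slots, slot_positions):
        row = []
        for feat, (px, py) in zip(features, patch_positions):
            rel = (px - sx, py - sy)  # patch position in this slot's frame
            key = [f + dot(w, rel) for f, w in zip(feat, pos_proj)]
            row.append(dot(slot, key) / math.sqrt(dim))
        logits.append(row)

    # Softmax over slots for each patch, so slots compete for patches.
    attn = [[0.0] * len(features) for _ in slots]
    for j in range(len(features)):
        column = [logits[i][j] for i in range(len(slots))]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for i, e in enumerate(exps):
            attn[i][j] = e / total
    return attn


# Two patches and two slots with 2-dimensional toy features.
features = [[1.0, 0.0], [0.0, 1.0]]
patch_positions = [(-0.5, 0.0), (0.5, 0.0)]
slots = [[1.0, 0.0], [0.0, 1.0]]
slot_positions = [(-0.5, 0.0), (0.5, 0.0)]
pos_proj = [[0.1, 0.0], [0.0, 0.1]]  # hypothetical projection weights

attn = attention_matrix(features, patch_positions, slots, slot_positions, pos_proj)
```

In this toy example each slot places more weight on the patch whose features match it, and the weights for any single patch sum to 1 across slots.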

Finally, the slot's position vector is updated each iteration as a weighted average of all patch positions, weighted by how much attention the slot gave each patch. This is an iterative refinement loop: slots home in on objects over multiple rounds.
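
The position update described above amounts to an attention-weighted mean. The sketch below shows that arithmetic for a single slot; the function name and the toy numbers are illustrative, and dividing by the sum of the weights is an assumption about how the average is normalized.

```python
def update_slot_position(attn_row, patch_positions):
    """Attention-weighted mean of patch positions for one slot."""
    total = sum(attn_row)  # assumed normalization: divide by total attention
    x = sum(w * px for w, (px, _) in zip(attn_row, patch_positions)) / total
    y = sum(w * py for w, (_, py) in zip(attn_row, patch_positions)) / total
    return (x, y)


# Four patches on a coarse grid; the slot attends only to the right-hand column.
patch_positions = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
attn_row = [0.0, 0.5, 0.0, 0.5]

new_position = update_slot_position(attn_row, patch_positions)
print(new_position)  # (0.5, 0.0)
```

Repeating this step after each round of attention pulls the slot's center toward the patches it attends to, which is how a slot can home in on an object over several iterations.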

Why this matters for object-recognition AI models

Object-centric learning — training AI to decompose scenes into discrete entities rather than treating an image as one big blob — is an active area of research with direct implications for robotics, video understanding, and autonomous systems. A robot arm that can track individual objects even as they move around a cluttered table is far more useful than one that only works when everything is perfectly positioned.

A lack of built-in equivariance (the property this patent specifically targets) is a known weak spot in standard slot attention models. By making the attention mechanism explicitly position-relative and scale-aware, Google is building in a form of geometric common sense that current models have to learn from scratch via expensive training. If this approach holds up, it could mean faster training, better generalization, and models that don't quietly fall apart when objects shift.

Editorial take

This is genuinely interesting research infrastructure, not a product feature. Slot attention is a relatively niche but important technique in AI scene understanding, and the equivariance problem it addresses is a real limitation — not a manufactured one. Whether this ends up in a Google product or stays a research contribution, it's the kind of principled architectural fix that tends to propagate through the field.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.