Nvidia · Filed Jan 6, 2026 · Published May 21, 2026 · verified — real USPTO data

Nvidia Patents a Vision Transformer That Learns Object Relationships Without Supervision

By Patentlyze Team · Updated Jul 10, 2026

Teaching a machine to understand that a cup is *on* a table — not just that a cup and a table both exist — is a surprisingly hard problem. Nvidia's new patent tackles exactly that, by training vision models to reason about relationships between objects without needing a mountain of labeled data.

Figure from the official USPTO publication.

Publication number US 2026/0141239 A1

Applicant NVIDIA Corporation

Filing date Jan 6, 2026

Publication date May 21, 2026

Inventors Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Anima Anandkumar

CPC classification 706/25

Grant likelihood Medium

Examiner CENTRAL, DOCKET (Art Unit OPAP)

Status Docketed New Case - Ready for Examination (Feb 10, 2026)

Parent application Continuation of 17893026 (filed Aug 22, 2022)

Document 26 claims

AI vision

What Nvidia's relational vision model actually does

Imagine showing a child two photos: one where a dog is jumping over a fence, and one where a cat is jumping over a wall. A smart kid quickly sees the shared concept — something leaping over something else. Teaching a computer to do the same thing has traditionally required enormous piles of hand-labeled examples. Nvidia's patent describes a way to make this much easier.

The system trains a vision model using two complementary exercises. The first groups images that share the same conceptual relationship — all the "object A over object B" images get clustered together. The second pushes the model to find the exact parts of each image that make the relationship meaningful — it learns to spot the relevant objects, not just the general scene.

Together, these two training signals help the model generalize — meaning it can recognize relational concepts in real-world photos it's never seen, not just in the controlled synthetic environments where this kind of AI has historically been stuck.

How the global and local tasks train the transformer

At the core of this patent is a vision transformer (ViT) — a type of deep learning model that breaks images into patches and uses attention mechanisms (essentially, a way for the model to weigh which parts of an image are relevant to each other) to understand visual content.
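To make the patch-and-attention idea concrete, here is a minimal sketch in plain Python. It is not the patent's architecture: the tiny 4x4 "image", the function names, and the choice to use raw patch values as both queries and keys are all assumptions made for illustration. A real ViT applies learned projections to each patch before computing attention.

```python
import math

def split_into_patches(image, patch):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(patches):
    """Scaled dot-product attention weights between every pair of patches.

    Assumption: raw patch vectors stand in for queries and keys; a trained
    model would use learned projections instead.
    """
    scale = math.sqrt(len(patches[0]))
    weights = []
    for query in patches:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in patches]
        weights.append(softmax(scores))
    return weights

# Hypothetical 4x4 image: bright blocks in two opposite corners.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
patches = split_into_patches(image, 2)   # four 2x2 patches, flattened
weights = attention_weights(patches)     # one row of weights per patch
```

In this toy case the top-left patch puts more weight on the matching bottom-right patch than on the two empty ones, which is the "which parts are relevant to each other" behavior described above.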

Nvidia's twist is adding two specialized training objectives on top of the standard ViT architecture:

Global task: Uses a concept-feature dictionary — a lookup table linking visual features to abstract relational concepts like "above," "holding," or "left of" — to cluster images that share the same concept. This pushes the model to build consistent, semantically meaningful internal representations.
Local task: Guides the model to identify the specific objects within each image that define the relationship. If the concept is "carrying," the model learns to zero in on the carrier and the carried object, not just the background context.
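The global task can be pictured with a small sketch. The article does not spell out how the patent's dictionary is built or what clustering objective it uses, so the code below substitutes a simple stand-in: each concept gets one prototype vector, and each image is assigned to the concept whose prototype it is most similar to. The feature values, image names, and function names are all hypothetical.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_concept(feature, dictionary):
    """Return the concept whose prototype is most similar to the feature."""
    return max(dictionary, key=lambda concept: cosine(feature, dictionary[concept]))

def cluster_by_concept(image_features, dictionary):
    """Group image names by their nearest concept in the dictionary."""
    clusters = defaultdict(list)
    for name, feature in image_features.items():
        clusters[assign_concept(feature, dictionary)].append(name)
    return dict(clusters)

# Hypothetical concept-feature dictionary: concept name -> prototype vector.
concept_dictionary = {
    "above": [1.0, 0.0, 0.0],
    "holding": [0.0, 1.0, 0.0],
    "left of": [0.0, 0.0, 1.0],
}

# Hypothetical image-level features, as a trained model might produce them.
image_features = {
    "dog_over_fence": [0.9, 0.1, 0.0],
    "cat_over_wall": [0.8, 0.0, 0.2],
    "hand_with_cup": [0.1, 0.95, 0.05],
}

clusters = cluster_by_concept(image_features, concept_dictionary)
```

Here the dog and cat images land in the same cluster even though they show different animals and obstacles, which is the kind of grouping the global task is meant to encourage.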

The concept-feature dictionary acts as a kind of weak supervision signal — you don't need a human to label every pixel, just a structured vocabulary of concepts and representative features. This allows the model to learn object-centric semantic correspondence (finding the same type of object-relationship across different images) without full annotation overhead, and to operate on real-world imagery rather than only synthetic training sets.
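Semantic correspondence across images can also be sketched in a few lines. This is an illustration under stated assumptions, not the patent's local objective: it takes per-patch feature vectors as given and, for each patch in one image, picks the most similar patch in another. The three-dimensional features (roughly "carrier-like", "carried-like", "background-like") are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_matches(patches_a, patches_b):
    """For each patch in image A, return the index of its closest patch in image B."""
    matches = []
    for feat_a in patches_a:
        sims = [cosine(feat_a, feat_b) for feat_b in patches_b]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches

# Hypothetical patch features for two "carrying" images.
# Image A patch order: carrier, carried object, background.
image_a = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Image B patch order: background, carried object, carrier.
image_b = [[0.0, 0.0, 1.0], [0.2, 0.8, 0.1], [0.95, 0.05, 0.1]]

matches = best_matches(image_a, image_b)
```

With these made-up features, the carrier patch in image A pairs with the carrier patch in image B, and likewise for the carried object and the background, even though the patches sit in different positions in each image.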

What this means for robots and vision AI at scale

Visual relational reasoning is a core bottleneck for robotics and embodied AI. A robot arm that can only identify objects — not how they relate spatially or functionally — will fail at most real manipulation tasks. Nvidia's robotics and autonomous systems work (through platforms like Isaac and Cosmos) depends heavily on models that generalize from simulation to the real world, which is exactly the gap this patent addresses.

More broadly, if vision models can learn relational concepts with less labeled data, training pipelines get cheaper and faster for anyone building visual AI, from warehouse automation to medical imaging to self-driving systems. For Nvidia, whose business now runs on selling the compute that trains these models, having proprietary techniques that make training more efficient is a meaningful strategic asset.

Editorial take

This is solid, publishable research territory — the challenge of relational reasoning in vision AI is real and well-documented, and Nvidia's dual global/local training approach is a clean conceptual contribution. The concept-feature dictionary as a lightweight supervision mechanism is the interesting piece here, and it fits neatly into Nvidia's broader push to make AI training work better at scale with less human annotation effort. Worth paying attention to if you follow robotics or vision AI.


Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.
