Nvidia · Filed Feb 17, 2025 · Published May 28, 2026 · verified — real USPTO data

Nvidia Patents a Foundation Model for Zero-Shot Stereo Depth Estimation

Nvidia has applied to patent a depth-estimation model that uses a pair of cameras to work out how far away objects are, without being trained specifically for that camera setup or scene. That's the "zero-shot" part, and it matters for robotics and autonomous systems, where retraining for every new sensor or environment is costly.

Nvidia's Zero-Shot Stereo Matching Foundation Model Patent — figure from US 2026/0148405 A1
FIG. 1A — rendered from the official USPTO publication PDF.
Publication number: US 2026/0148405 A1
Applicant: NVIDIA Corporation
Filing date: Feb 17, 2025
Publication date: May 28, 2026
Inventors: Bowen Wen, Matthew Trepte, Orazio Gallo, Jan Kautz, Stanley Thomas Birchfield
CPC classification: 348/43
Grant likelihood: Medium
Examiner: ADAMS, EILEEN M (Art Unit 2481)
Status: Publications -- Issue Fee Payment Received (May 22, 2026)
Parent application: Claims priority from provisional application 63/726,916 (filed Dec 2, 2024)
Document: 29 claims

What Nvidia's stereo depth model actually does

Imagine you're building a robot and you want it to understand how far away things are. One reliable way to do that is to give it two cameras, like your two eyes, and compare what each one sees. The slight shift in an object's position between the two views (called disparity) tells you the distance: the bigger the shift, the closer the object. But most learned stereo models need to be fine-tuned when you change cameras or environments.
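For a rectified stereo pair, the standard pinhole relation is depth = focal length × baseline / disparity. Here is a minimal sketch of that conversion. The function name and the rig numbers (700 px focal length, 12 cm baseline) are made up for illustration and do not come from the patent.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity in pixels to metric depth for a rectified stereo pair."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point infinitely far away.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 0.12 m between the two cameras.
for d in (70.0, 35.0, 7.0):
    depth = disparity_to_depth(d, focal_length_px=700.0, baseline_m=0.12)
    print(f"disparity {d:5.1f} px -> depth {depth:.2f} m")
```

Halving the disparity doubles the estimated depth, which is why small disparity errors hurt most for distant objects.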

Nvidia's patent describes a foundation model — a general-purpose AI trained broadly enough to work across many different setups without extra training. You point it at a new stereo camera pair in a new environment, and it can still produce a reliable depth map (a picture where each pixel encodes how far away that part of the scene is).

The clever part is how it blends two types of AI architecture — a vision transformer (good at understanding global context in an image) and a convolutional neural network (good at detecting fine local details) — to get the best of both. The result is a system that's both flexible and precise.

How the STA adapters and AHCF filtering work together

The system takes two images from a stereo camera pair and runs each through its own Side-Tuning Adapter (STA) — a lightweight module that sits alongside a pre-trained Vision Transformer (ViT) monocular depth network. A ViT is a type of neural network that treats an image as a sequence of patches and finds relationships between them across the whole frame. The STA adds a convolutional neural network (CNN) branch alongside it, letting the model pick up on sharp edges and fine-grained local texture that transformers sometimes miss.
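The side-tuning idea can be shown with scalar stand-ins: a pre-trained function that is never updated, plus a small trainable branch whose output is added to it. This is only a toy sketch. It assumes the pre-trained network stays frozen and that the two branches are fused by addition, neither of which the article states, and every name and number below is hypothetical.

```python
def frozen_backbone(x):
    """Stand-in for a pre-trained network; its behaviour is never updated."""
    return 2.0 * x  # hypothetical fixed mapping


def side_branch(x, w, b):
    """Small trainable branch that runs alongside the frozen backbone."""
    return w * x + b


def combined(x, w, b):
    """Assumed fusion rule: add the side branch output to the frozen output."""
    return frozen_backbone(x) + side_branch(x, w, b)


def fit_side_branch(samples, lr=0.05, epochs=200):
    """Plain SGD on squared error; only the side branch parameters change."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            err = combined(x, w, b) - target
            w -= lr * err * x
            b -= lr * err
    return w, b


# The new task needs 2.5*x + 1.0, so the side branch must learn the gap
# between that and what the frozen backbone already provides.
samples = [(x, 2.5 * x + 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
w, b = fit_side_branch(samples)
print(round(w, 3), round(b, 3))
```

Because the backbone is untouched, whatever it learned in pre-training is preserved, and the side branch only has to model the difference the new task requires.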

The two feature maps (one per image) are then combined into a hybrid cost volume: essentially a 3D data structure that encodes, for every pixel in the left image, how well it matches each candidate pixel along the same row of the right image, one candidate per disparity value. This is the core of how stereo depth works.
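To make the data structure concrete, here is a sketch of the simplest kind of cost volume: a plain dot-product correlation between left and right feature vectors at each candidate disparity. It is not the patent's hybrid construction, which the article does not spell out. The function names and toy features are illustrative assumptions, and the images are assumed to be rectified.

```python
def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Build volume[d][y][x] = dot(left[y][x], right[y][x - d]).

    left_feat and right_feat are nested lists shaped [H][W][C] holding
    per-pixel feature vectors from a rectified stereo pair.
    """
    height, width = len(left_feat), len(left_feat[0])
    volume = [[[0.0] * width for _ in range(height)] for _ in range(max_disp)]
    for d in range(max_disp):
        for y in range(height):
            for x in range(d, width):  # x - d must stay inside the right image
                left_vec = left_feat[y][x]
                right_vec = right_feat[y][x - d]
                volume[d][y][x] = sum(a * b for a, b in zip(left_vec, right_vec))
    return volume


def winner_take_all(volume):
    """Pick, per pixel, the disparity with the highest matching score."""
    max_disp, height, width = len(volume), len(volume[0]), len(volume[0][0])
    return [[max(range(max_disp), key=lambda d: volume[d][y][x])
             for x in range(width)] for y in range(height)]


def one_hot(i, n):
    return [1.0 if k == i else 0.0 for k in range(n)]


# Toy one-row images: the left view is the right view shifted by 2 pixels.
right = [[one_hot(x, 5) for x in range(5)]]
left = [[[0.0] * 5, [0.0] * 5] + [one_hot(x - 2, 5) for x in range(2, 5)]]

volume = correlation_cost_volume(left, right, max_disp=4)
disparity = winner_take_all(volume)
print(disparity)
```

In this toy case the three pixels with a valid match recover the shift of 2, while the two left-edge pixels with no counterpart fall back to 0. Real scenes are far more ambiguous, which is where the filtering stage comes in.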

That cost volume gets processed by Attentive Hybrid Cost Filtering (AHCF), which again runs two parallel branches, one transformer-based and one CNN-based, to clean up ambiguous or noisy matches. Think of it like two editors cross-checking the same manuscript.

Finally, the model refines its initial depth guess iteratively using a convolutional Gated Recurrent Unit (GRU) — a type of recurrent loop that progressively corrects errors in the disparity estimate, pass by pass, until the output map is stable.
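The refinement loop can be illustrated for a single pixel with a scalar GRU cell. This is a toy sketch, not the patent's convolutional GRU: a real model learns its weights as convolution kernels over feature maps and reads matching features from the cost volume, while here the weights are hand-picked (with the gate weights at zero, both gates sit at 0.5) and the input is the local slope of a made-up matching curve.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matching_score(d, true_disp):
    """Toy matching curve for one pixel: peaks where d equals the true disparity."""
    return -0.5 * (d - true_disp) ** 2


def gru_step(h, x, p):
    """One scalar GRU update: gates z and r, candidate state, blended new state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h))
    return (1.0 - z) * h + z * candidate


def refine(disp, true_disp, params, iterations=32):
    """Repeatedly nudge the disparity estimate using the GRU hidden state."""
    h = 0.0
    history = [disp]
    for _ in range(iterations):
        # Input feature: local slope of the matching curve at the current estimate.
        x = matching_score(disp + 0.5, true_disp) - matching_score(disp - 0.5, true_disp)
        h = gru_step(h, x, params)
        disp += params["step"] * h  # residual update to the disparity
        history.append(disp)
    return history


# Hand-picked illustrative weights, not learned values.
PARAMS = {"wz": 0.0, "uz": 0.0, "bz": 0.0,
          "wr": 0.0, "ur": 0.0, "br": 0.0,
          "wh": 1.0, "uh": 0.0, "step": 1.0}

history = refine(disp=2.0, true_disp=10.0, params=PARAMS)
print(round(history[0], 3), round(history[-1], 3))
```

Each pass moves the estimate by a bounded amount, so a bad initial guess is corrected over several iterations rather than in one jump, and the updates shrink as the estimate approaches the peak of the matching curve.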

What this means for robotics and autonomous perception

The zero-shot capability is the real headline here. Most stereo matching models need fine-tuning on data from the specific sensors and environments they'll be deployed in. A foundation model approach means Nvidia's robotics and autonomous vehicle customers could drop this into new hardware without a retraining pipeline — which is a significant operational saving at scale.

Nvidia has been aggressively building out its robotics stack (Isaac, Cosmos, Omniverse), and a general-purpose stereo depth model fits squarely into that infrastructure. Reliable zero-shot depth perception remains one of the core unsolved problems in giving robots spatial awareness in uncontrolled environments, so this patent sits at the heart of where that platform needs to go.

Editorial take

This is substantive work, not a defensive filing. Combining ViT and CNN architectures through side-tuning to preserve a pre-trained monocular depth prior while adapting it for stereo is a real architectural insight — and the iterative GRU refinement on top is a proven pattern from optical flow that adds credibility to the design. If Nvidia ships this into Isaac or a future sensor stack, it could meaningfully lower the barrier to deploying stereo-based robots in novel environments.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.