Nvidia · Filed Feb 9, 2026 · Published Jun 18, 2026 · verified — real USPTO data

Nvidia Patents an AI That Rebuilds 3D Scenes From Flat Photos

By Patentlyze Team · Updated Jun 19, 2026

Nvidia is teaching an AI to take a flat photo, paired with a depth image of the same view, and reconstruct the full three-dimensional scene it came from. The problem sounds simple, but it has challenged computer vision researchers for decades.

FIG. 1A — rendered from the official USPTO publication PDF.

Publication number US 2026/0170763 A1

Applicant NVIDIA CORPORATION

Filing date Feb 9, 2026

Publication date Jun 18, 2026

Inventors Yang FU, Sifei LIU, Jan KAUTZ, Xueting LI, Shalini DE MELLO, Amey KULKARNI, Milind NAPHADE

US classification 345/418

Grant likelihood Medium

Examiner Not yet assigned (Central Docket, Art Unit OPAP)

Status Docketed New Case - Ready for Examination (Mar 13, 2026)

Parent application Continuation of 18497938 (filed Oct 30, 2023)

Document 20 claims

AI/ML

What Nvidia's 2D-to-3D photo reconstruction AI actually does

Imagine you take a photo of a living room. That photo is flat — it has no real sense of how far away the couch is, or how deep the bookshelf goes. Nvidia's patent describes an AI model trained to look at that photo (plus a depth image, the kind a depth-sensing camera produces) and figure out the full 3D shape and color of everything in the scene.

The system learns this by training on pairs of images: a regular color photo and a matching depth map. It teaches two separate 'halves' of the AI — one focused on shape and geometry, one focused on color and texture — to work together and agree on a single, consistent 3D picture of the world.

The practical upside is that you don't need a fancy 3D scanner or a film studio's worth of equipment. Any depth-aware camera — like the ones already built into modern phones and robots — could feed this system enough data to reconstruct a usable 3D model.

How the encoder-decoder pipeline handles geometry and color

The patent describes a training framework that teaches a machine learning model to convert a 2D image (plus depth data) into a 3D representation of that scene.

The system works in two parallel tracks:

Geometry track: A depth image and a camera viewpoint are fed into a geometry encoder and decoder, which output signed distance function (SDF) values. An SDF value says how far a given point in space is from the nearest surface, with the sign indicating whether the point lies inside or outside an object. The surface itself is wherever that distance is zero, so the AI is in effect tracing an invisible 3D shell around everything it sees.
Texture track: A standard RGB color image is fed into a separate texture encoder and decoder, which estimate the radiance values (how bright and what color each point in space should appear) at those same 3D locations.
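
To make the signed distance idea concrete, here is a minimal sketch, assuming a single sphere stands in for the whole scene. The function names, the sign convention (negative inside, positive outside), and the toy color function are illustrative assumptions, not details taken from the patent, which learns both quantities with neural networks.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from `point` to the surface of a sphere.

    Convention assumed here: negative inside, zero on the surface,
    positive outside.
    """
    return math.dist(point, center) - radius


def toy_radiance(point):
    """Hypothetical stand-in for the texture track.

    Maps a 3D location to an (r, g, b) tuple with each channel in [0, 1].
    """
    return tuple(0.5 + 0.5 * math.tanh(c) for c in point)


# Query both "tracks" at the same 3D locations, as the two-track design does.
for p in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]:
    print(p, sphere_sdf(p), toy_radiance(p))
```

The point of the sketch is only that geometry and color are two separate functions evaluated at shared 3D points, which is what lets the two tracks be trained to agree.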

The two tracks are reconciled using an RGBD reconstruction loss, a measure of how far the model's combined color-plus-depth prediction is from the real color and depth images. The model is penalized for errors and adjusts itself accordingly during training.
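
A loss of this kind can be sketched as follows. This is a minimal illustration, assuming mean absolute error on both color and depth and a hypothetical `depth_weight` parameter; the article does not say which error measure or weighting the patent uses.

```python
def rgbd_reconstruction_loss(pred_rgb, pred_depth, true_rgb, true_depth,
                             depth_weight=1.0):
    """Mean absolute color error plus weighted mean absolute depth error.

    pred_rgb / true_rgb: lists of (r, g, b) tuples, one per pixel.
    pred_depth / true_depth: lists of floats, one per pixel.
    depth_weight: assumed knob balancing depth error against color error.
    """
    n = len(pred_rgb)
    if n == 0 or not (n == len(true_rgb) == len(pred_depth) == len(true_depth)):
        raise ValueError("inputs must be non-empty and cover the same pixels")

    # Average absolute difference over every color channel of every pixel.
    color_err = sum(
        abs(p - t)
        for pred_px, true_px in zip(pred_rgb, true_rgb)
        for p, t in zip(pred_px, true_px)
    ) / (3 * n)

    # Average absolute difference between predicted and measured depth.
    depth_err = sum(abs(p - t) for p, t in zip(pred_depth, true_depth)) / n

    return color_err + depth_weight * depth_err


# One-pixel example: color is off in two channels, depth is off by 1.0.
loss = rgbd_reconstruction_loss(
    pred_rgb=[(0.5, 0.5, 0.5)], pred_depth=[2.0],
    true_rgb=[(1.0, 0.5, 0.0)], true_depth=[1.0],
)
print(loss)
```

A perfect prediction gives a loss of zero, and any mismatch in either color or depth raises it, which is the signal training uses to correct both tracks at once.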

Notably, the geometry encoder and decoder start from pre-trained weights (meaning Nvidia is reusing knowledge from an existing model), while the texture components start from scratch. This hybrid approach can speed up training and improve how well the final model generalizes to new scenes it has never seen before.
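
The initialization scheme can be sketched as below. This is a toy illustration, assuming weights are plain lists of floats and the pre-trained checkpoint is a dictionary; the names `build_model` and `init_weights` and the checkpoint layout are hypothetical, and a real system would use a deep learning framework.

```python
import random


def init_weights(size, rng):
    """Fresh random weights for a component that trains from scratch."""
    return [rng.gauss(0.0, 0.02) for _ in range(size)]


def build_model(pretrained_geometry, texture_size, seed=0):
    """Assemble the four components with the hybrid initialization.

    pretrained_geometry: assumed checkpoint dict with "encoder" and
    "decoder" weight lists from an existing model.
    """
    rng = random.Random(seed)
    return {
        # Geometry track: copy weights from the existing checkpoint.
        "geometry_encoder": list(pretrained_geometry["encoder"]),
        "geometry_decoder": list(pretrained_geometry["decoder"]),
        # Texture track: no prior knowledge, start from random values.
        "texture_encoder": init_weights(texture_size, rng),
        "texture_decoder": init_weights(texture_size, rng),
    }


checkpoint = {"encoder": [0.1, -0.2, 0.3], "decoder": [0.5, 0.4]}
model = build_model(checkpoint, texture_size=4)
print(model["geometry_encoder"])
print(len(model["texture_encoder"]))
```

Copying the geometry weights rather than sharing the same list keeps the original checkpoint unchanged when the assembled model is trained further.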

What this means for robotics, simulation, and digital twins

The ability to reliably reconstruct 3D scenes from ordinary camera inputs is foundational to several of Nvidia's biggest business areas: robotics (robots need a 3D model of the world to navigate and manipulate objects), autonomous vehicles (same principle), and digital twins (virtual replicas of real-world environments used in industrial simulation). A general-purpose AI that can do this from commodity depth cameras — rather than requiring expensive LiDAR or structured-light scanners — would lower the cost and complexity of building those systems.

For you as a consumer, the more immediate downstream effect could be better spatial understanding in apps that run on devices with depth cameras, like augmented reality features or 3D scanning tools on phones and tablets.

Editorial take

This is serious infrastructure-level AI research, not a flashy demo. The specific contribution — combining pre-trained geometry encoders with freshly trained texture encoders and training them jointly on color-plus-depth images — is the kind of incremental-but-meaningful engineering that actually ships inside products like Nvidia's Omniverse or Isaac robotics platform. It won't make headlines at a press event, but it's exactly the kind of patent that shows up quietly in a product three years later.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.
