Nvidia · Filed Aug 15, 2025 · Published May 21, 2026 · verified — real USPTO data

Nvidia Patents a 4D Scene Generator That Builds Synthetic Worlds for Self-Driving AI

By Patentlyze Team · Updated Jul 9, 2026

Training a self-driving system typically demands vast amounts of real-world driving footage, unless you can generate those worlds synthetically. Nvidia's new patent application describes a pipeline that takes a single image and manufactures a full four-dimensional scene, complete with 3D geometry, camera perspective, and motion over time.

Figure from the official USPTO publication.

Publication number US 2026/0141618 A1

Applicant NVIDIA CORPORATION

Filing date Aug 15, 2025

Publication date May 21, 2026

Inventors Jiageng MAO, Yue WANG, Yuxiao CHEN, Boris IVANOVIC, Marco PAVONE, Boyi LI, Yan WANG, Chaowei XIAO, Danfei XU, Yurong YOU

CPC classification 345/419

Grant likelihood Medium

Examiner CENTRAL, DOCKET (Art Unit OPAP)

Status Docketed New Case - Ready for Examination (Sep 12, 2025)

Parent application Claims priority from provisional application 63721343 (filed Nov 15, 2024)

Document 20 claims

Automotive

How Nvidia conjures full 4D worlds from a single photo

Imagine trying to teach a student driver using only photos from one dashboard camera. You'd want to show them the same intersection from multiple angles, in different weather, with pedestrians cutting across at unexpected moments — but you only have one shot. That's essentially the data problem facing self-driving AI systems today.

Nvidia's patent describes a way to take that single input image and synthesize an entire 4D scene — three dimensions of space plus time — that a self-driving system can actually train in. The process uses two separate AI models: the first one generates multiple views of the scene from different camera angles, and the second one reconstructs the 3D geometry and figures out exactly where those cameras are positioned.
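
As a rough illustration of that two-stage flow, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `build_scene`, `Reconstruction`, `fake_view_generator`, and `fake_reconstructor` are hypothetical, and the two stand-in functions only mimic the shape of the data that real models would produce. It is not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in: a grayscale image as rows of pixel values.
Image = List[List[float]]
Vec3 = Tuple[float, float, float]


@dataclass
class Reconstruction:
    """Output of stage two: 3D points plus one camera position per view."""
    points: List[Vec3]
    camera_positions: List[Vec3]


def build_scene(
    image: Image,
    view_generator: Callable[[Image], List[Image]],
    reconstructor: Callable[[List[Image]], Reconstruction],
) -> Reconstruction:
    """Run the two stages in order: single image, then views, then geometry and cameras."""
    views = view_generator(image)
    if not views:
        raise ValueError("view generator returned no views")
    return reconstructor(views)


# Toy stand-ins so the sketch runs; real models would replace both.
def fake_view_generator(image: Image) -> List[Image]:
    # Pretend to synthesize three camera views by copying the input.
    return [image, image, image]


def fake_reconstructor(views: List[Image]) -> Reconstruction:
    # Place one camera per view along the x-axis and return a single dummy point.
    cameras = [(float(i), 0.0, 0.0) for i in range(len(views))]
    return Reconstruction(points=[(0.0, 0.0, 5.0)], camera_positions=cameras)


scene = build_scene([[0.0, 1.0], [1.0, 0.0]], fake_view_generator, fake_reconstructor)
print(len(scene.camera_positions))  # prints 3
```

Passing the two models in as plain callables keeps the ordering explicit: the second stage only ever sees what the first stage generated.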

The end result is a richly detailed, navigable simulation that includes both the static stuff (roads, buildings, parked cars) and dynamic elements (moving vehicles, pedestrians). Your self-driving AI gets a full synthetic world to practice in, built almost entirely from a single starting photo.

How the two-model pipeline builds space and time

The pipeline has two main machine learning stages working in sequence. The first model is a video diffusion model — think of it as a generative AI that's been trained on driving footage and can extrapolate a single image into a plausible multi-view video sequence, showing the scene from several camera perspectives simultaneously.

Those synthesized multi-view images are then handed off to a multi-view stereo model, which reconstructs the 3D geometry of the scene and estimates the parameters of each camera (position, orientation, focal length). This is the step where flat pixels become a structured 3D space.
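
To see why those camera parameters matter, consider the standard pinhole camera model, which maps a 3D point to a pixel once position, orientation, and focal length are known. The sketch below is a generic textbook projection, not code from the patent; the `Camera` class, the `project` function, and the `principal_point` field are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass
class Camera:
    position: Vec3                        # camera center in world coordinates
    rotation: Mat3                        # world-to-camera rotation, as rows of a 3x3 matrix
    focal_length: float                   # in pixels
    principal_point: Tuple[float, float]  # image center in pixels (assumed known)


def project(cam: Camera, point: Vec3) -> Tuple[float, float]:
    """Project a world-space point to pixel coordinates with a pinhole model."""
    # Translate into the camera's frame, then rotate.
    d = tuple(p - c for p, c in zip(point, cam.position))
    x, y, z = (sum(r[i] * d[i] for i in range(3)) for r in cam.rotation)
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Perspective divide, then shift to the image center.
    u = cam.focal_length * x / z + cam.principal_point[0]
    v = cam.focal_length * y / z + cam.principal_point[1]
    return u, v


identity: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
cam = Camera(position=(0.0, 0.0, 0.0), rotation=identity,
             focal_length=1000.0, principal_point=(960.0, 540.0))
print(project(cam, (1.0, 0.5, 10.0)))  # prints (1060.0, 590.0)
```

Reconstruction runs this relationship in reverse: given where the same point lands in several views, it solves for the cameras and the 3D point that best explain those pixels.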

From there, the system builds a 4D spatio-temporal scene representation using a technique called Gaussian Splatting, which represents a 3D scene as a cloud of fuzzy ellipsoids rather than hard polygons. That representation is typically fast to render and well suited to gradient-based optimization. The patent distinguishes between:

Static Gaussians — background elements like roads and buildings that don't move
Dynamic Gaussians — moving objects like vehicles and pedestrians, tracked over time
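
One way to picture the static/dynamic split is as two closely related data structures. The sketch below is an assumption-heavy simplification: the class names, the fields, and especially the per-timestep `offsets` motion model are hypothetical, and a full Gaussian Splatting representation would also carry a rotation for each ellipsoid, which is omitted here.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian:
    mean: Vec3      # center of the ellipsoid
    scale: Vec3     # extent along each local axis
    opacity: float  # 0.0 (invisible) to 1.0 (opaque)
    color: Vec3     # RGB in [0, 1]

    def position_at(self, t: int) -> Vec3:
        # A static Gaussian sits at the same place at every timestep.
        return self.mean


@dataclass
class DynamicGaussian(Gaussian):
    # Hypothetical motion model: a per-timestep offset from the base mean.
    offsets: Dict[int, Vec3] = field(default_factory=dict)

    def position_at(self, t: int) -> Vec3:
        # Timesteps without a recorded offset fall back to the base position.
        dx, dy, dz = self.offsets.get(t, (0.0, 0.0, 0.0))
        return (self.mean[0] + dx, self.mean[1] + dy, self.mean[2] + dz)


road = Gaussian(mean=(0.0, 0.0, 0.0), scale=(5.0, 0.1, 5.0),
                opacity=1.0, color=(0.3, 0.3, 0.3))
car = DynamicGaussian(mean=(2.0, 0.0, 10.0), scale=(1.0, 0.8, 2.0),
                      opacity=0.9, color=(0.8, 0.1, 0.1),
                      offsets={1: (0.0, 0.0, 1.5)})

print(road.position_at(1))  # prints (0.0, 0.0, 0.0)
print(car.position_at(1))   # prints (2.0, 0.0, 11.5)
```

Because both kinds answer the same `position_at(t)` question, a renderer could step through time without caring which Gaussians move.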

A self-supervised scene decomposition module automatically separates the static and dynamic parts without needing manual labels. A cluster-based grouping module then organizes dynamic objects into coherent groups, and a dynamic score prediction module estimates motion trajectories — producing a full scene that can be "played forward" in time.
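
The modules described above are learned components, and the patent's actual methods are not reproduced here. As a loose illustration of the idea only, the sketch below assumes each Gaussian already carries a scalar "dynamic score" (higher meaning more likely to be moving), splits the scene with a fixed threshold, and groups the dynamic points by simple spatial proximity. The function names, the 0.5 threshold, and the radius-based grouping are all assumptions swapped in for illustration.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def split_by_score(scores: List[float], threshold: float = 0.5) -> Tuple[List[int], List[int]]:
    """Return (static_indices, dynamic_indices) given per-Gaussian dynamic scores."""
    static = [i for i, s in enumerate(scores) if s < threshold]
    dynamic = [i for i, s in enumerate(scores) if s >= threshold]
    return static, dynamic


def group_by_proximity(points: List[Vec3], indices: List[int], radius: float) -> List[List[int]]:
    """Group indices so that each member lies within `radius` of another member of its group."""
    unvisited = set(indices)
    groups = []
    for start in indices:
        if start not in unvisited:
            continue  # already absorbed into an earlier group
        unvisited.remove(start)
        group, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[current], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return groups


points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.5, 0.0, 0.0),
          (30.0, 0.0, 0.0), (30.2, 0.0, 0.1)]
scores = [0.1, 0.9, 0.8, 0.7, 0.95]

static_ids, dynamic_ids = split_by_score(scores)
print(static_ids)                                       # prints [0]
print(group_by_proximity(points, dynamic_ids, 1.0))     # prints [[1, 2], [3, 4]]
```

In this toy scene the two groups could stand for two separate vehicles, each of which would then move as a unit.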

What this means for autonomous vehicle training data

The biggest bottleneck in autonomous vehicle development isn't algorithms — it's data. Real-world edge cases (a child darting into the road, a car drifting into your lane on a foggy overpass) are rare and expensive to capture at scale. A system that can generate photorealistic, physically plausible 4D training environments from minimal inputs could dramatically reduce that dependency on real-world miles.

For Nvidia specifically, this fits squarely with its DRIVE simulation platform, through which the company already offers synthetic training infrastructure to automakers. A patent application like this signals that Nvidia is pushing toward richer, more temporally coherent synthetic worlds rather than static scenes alone, which is the harder and more valuable problem to solve.

Editorial take

This is genuinely interesting infrastructure work, not a flashy consumer feature. The combination of video diffusion for view synthesis and Gaussian Splatting for scene reconstruction is a smart pairing: both techniques have been maturing fast in academic research, and Nvidia is essentially trying to patent a production-grade integration of the two for AV training. The fact that it handles dynamic objects separately (rather than baking motion into a static scene) is the detail that makes it practically useful.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.