Nvidia · Filed Jul 25, 2025 · Published May 21, 2026

Nvidia Patents a Dynamic 3D Scene Reconstruction System Using Gaussian Splatting and ML-Driven Motion Tokens

By Patentlyze Team · Updated Jul 10, 2026

Nvidia has filed a patent application describing a machine learning pipeline that takes a series of ordinary images captured over time and reconstructs a fully dynamic 3D scene, motion included, without needing any explicit geometry data up front.

Figure from the official USPTO publication.

Publication number: US 2026/0141631 A1
Applicant: NVIDIA CORPORATION
Filing date: Jul 25, 2025
Publication date: May 21, 2026
Inventors: Yue WANG, Jiahui HUANG, Boris IVANOVIC, Yuxiao CHEN, Yan WANG, Boyi LI, Yurong YOU, Apoorva SHARMA, Maximilian IGL, Peter KARKUS, Danfei XU, Marco PAVONE, Jiawei YANG
CPC classification: 345/419
Grant likelihood: Medium
Examiner: CENTRAL, DOCKET (Art Unit OPAP)
Status: Docketed New Case - Ready for Examination (Aug 21, 2025)
Parent application: Claims priority from provisional application 63721348 (filed Nov 15, 2024)
Document: 20 claims

AI simulation

How Nvidia rebuilds moving 3D worlds from flat images

Imagine you point a camera at a busy intersection and take a burst of photos over a few seconds. Normally, turning those flat photos into a 3D model that also captures movement — cars pulling away, pedestrians crossing — is brutally hard. You'd need expensive depth sensors or a lot of manual work.

Nvidia's patent describes a system that does this automatically. It chops each photo into small patches, converts those patches into data tokens a neural network can understand, then mixes in special motion tokens that are designed to carry velocity information. The model learns both what things look like in 3D and how they're moving at the same time.
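To make the tokenization step concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not code from the patent: the names `patchify` and `build_sequence`, the grayscale list-of-lists image format, and the zero-initialized motion tokens are all hypothetical. A real model would also pass each patch through a learned embedding before the transformer sees it.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values (assumed format)


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles, each flattened."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens


def build_sequence(frames: List[Image], patch: int, num_motion_tokens: int) -> List[List[float]]:
    """Image tokens from every timestep, followed by placeholder motion tokens."""
    dim = patch * patch
    sequence = [tok for frame in frames for tok in patchify(frame, patch)]
    # Hypothetical: motion tokens start as zero vectors; a real model would learn them.
    sequence += [[0.0] * dim for _ in range(num_motion_tokens)]
    return sequence


# Two 4x4 frames with 2x2 patches give 4 image tokens each, plus 3 motion tokens.
frame = [[float(4 * row + col) for col in range(4)] for row in range(4)]
sequence = build_sequence([frame, frame], patch=2, num_motion_tokens=3)
print(len(sequence))  # 11
```

Appending the motion tokens to the same sequence is what lets the transformer attend across appearance and motion in a single pass.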

The result is a dynamic 3D scene that you can navigate and that reflects real motion in the world. The system trains itself by rendering the scene back into a 2D image and comparing it with the original photo; the difference between the two is the error signal that drives training.

How motion tokens and 3D Gaussians drive the pipeline

The core idea combines two powerful techniques: 3D Gaussian Splatting (a fast rendering method that represents 3D geometry as clouds of fuzzy ellipsoids instead of hard polygons) and a transformer-based ML model (the kind that powers large language models, but applied to visual data).
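For readers who want a feel for what a "fuzzy ellipsoid" is, the sketch below models one in Python. It is a simplification and an assumption on our part: the `Gaussian3D` class is hypothetical, it is axis-aligned (production Gaussian Splatting also stores a rotation and color), and it only evaluates how strongly the blob contributes at a point in space.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian3D:
    """One axis-aligned 3D Gaussian: a soft blob rather than a hard surface."""
    mean: Vec3      # centre of the blob
    scale: Vec3     # standard deviation along x, y, z
    opacity: float  # peak contribution at the centre

    def density(self, point: Vec3) -> float:
        """Contribution at `point`, falling off smoothly with distance from the mean."""
        d2 = sum(((p - m) / s) ** 2
                 for p, m, s in zip(point, self.mean, self.scale))
        return self.opacity * math.exp(-0.5 * d2)


blob = Gaussian3D(mean=(0.0, 0.0, 0.0), scale=(1.0, 2.0, 1.0), opacity=0.8)
print(blob.density((0.0, 0.0, 0.0)))  # 0.8 at the centre
```

Because the blob is stretched along y, a point two units away on the y axis is weighted the same as a point one unit away on the x axis.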

Here's the pipeline step by step:

1. Each input image, captured at one of multiple timesteps, is divided into patches, and each patch becomes an image token (a compact numeric representation the model can process).
2. One or more motion tokens are appended to these image tokens, giving the model dedicated slots to encode velocity information.
3. The transformer processes all tokens together and outputs two types of decoded results. Each image token produces a 3D Gaussian (geometry) and a motion key. Each motion token produces a velocity basis and a motion query.
4. Velocity vectors are assembled by matching motion queries against motion keys. Think of it as a soft lookup table for how fast each part of the scene is moving.
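The "soft lookup" in the last step can be sketched as follows. This is one plausible reading, not the patent's stated formula: we assume the match is a dot product between each Gaussian's motion key and every motion query, turned into weights by a softmax and used to blend the velocity bases. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def assemble_velocities(keys, queries, bases) -> List[List[float]]:
    """Blend velocity bases per Gaussian, weighted by key/query similarity.

    keys:    one motion key per Gaussian (from image tokens)
    queries: one motion query per motion token
    bases:   one 3D velocity basis per motion token
    """
    velocities = []
    for key in keys:
        weights = softmax([dot(key, query) for query in queries])
        velocities.append([sum(w * basis[axis] for w, basis in zip(weights, bases))
                           for axis in range(3)])
    return velocities


# Two motion tokens: one "moving along x" basis and one "static" basis.
queries = [[1.0, 0.0], [0.0, 1.0]]
bases = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
keys = [[10.0, 0.0], [0.0, 10.0]]  # first Gaussian matches "moving", second "static"
velocities = assemble_velocities(keys, queries, bases)
```

In this example the first Gaussian ends up with a velocity close to one unit along x and the second stays nearly still, because each key lines up strongly with only one query.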

The model renders those 3D Gaussians plus velocity vectors into a 2D image, compares it against the real photo from that timestep, and uses the error to train itself. Once trained, it outputs optimized 3D Gaussians that together form a fully dynamic, navigable 3D reconstruction.
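The render-and-compare loop can be illustrated with a deliberately tiny toy. Everything here is an assumption made for illustration: the "scene" is a single blob on a 16-pixel row, the only thing being learned is one velocity value, and the gradient is estimated by finite differences. The system described in the filing would instead train a transformer's weights through a differentiable Gaussian renderer.

```python
import math
from typing import List

PIXELS = 16   # width of the toy 1D "image"
SIGMA = 2.0   # blob width (hypothetical value)
START = 4.0   # blob centre at t = 0, assumed known in this toy


def render(velocity: float, t: float) -> List[float]:
    """Render one Gaussian blob, moved by velocity * t, onto a row of pixels."""
    centre = START + velocity * t
    return [math.exp(-0.5 * ((x - centre) / SIGMA) ** 2) for x in range(PIXELS)]


def photometric_loss(velocity: float, frames: List[List[float]]) -> float:
    """Mean squared error between rendered and observed frames over all timesteps."""
    total = 0.0
    for t, observed in enumerate(frames):
        rendered = render(velocity, t)
        total += sum((r - o) ** 2 for r, o in zip(rendered, observed))
    return total / (len(frames) * PIXELS)


def fit_velocity(frames: List[List[float]], steps: int = 100,
                 lr: float = 5.0, eps: float = 1e-4) -> float:
    """Gradient descent on the velocity, using a central-difference gradient."""
    velocity = 0.0  # start by assuming the scene is static
    for _ in range(steps):
        grad = (photometric_loss(velocity + eps, frames)
                - photometric_loss(velocity - eps, frames)) / (2 * eps)
        velocity -= lr * grad
    return velocity


# "Observed" frames come from a blob moving 2 pixels per timestep.
observed = [render(2.0, t) for t in range(3)]
estimate = fit_velocity(observed)
print(round(estimate, 2))
```

No velocity label is ever supplied. The only supervision is the set of observed frames, which is the property that makes this style of training self-supervised.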

What this means for autonomous vehicles and simulation

For autonomous vehicle simulation, a system like this addresses a concrete bottleneck. Nvidia's DRIVE and Omniverse platforms need to generate realistic, dynamic training environments from real-world sensor data, and a model that can reconstruct moving 3D scenes directly from camera images could make that pipeline faster and cheaper than approaches requiring LiDAR or manual annotation.

Beyond self-driving, dynamic 3D reconstruction from images has applications in robotics, film visual effects, and mixed reality. The use of Gaussian Splatting specifically is notable: it's much faster to render than traditional neural radiance fields (NeRF), which suggests Nvidia is optimizing not just for accuracy but for the kind of real-time or near-real-time performance you'd need in a simulation loop or on-device inference.

Editorial take

This is a technically dense but genuinely meaningful patent, squarely aimed at one of Nvidia's core strategic bets: synthetic data generation for autonomous systems. The combination of Gaussian Splatting with a learned motion-token architecture is a real engineering choice, not a vague idea — and the self-supervised training loop (rendering back to 2D and comparing) is an elegant way to avoid needing expensive labeled 3D ground truth. Worth watching closely.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.