New Google Patents · Filed Jan 27, 2026 · Published Jun 4, 2026 · verified — real USPTO data

Google Patents an AI That Builds a 3D Model of Your Specific Object From a Few Photos

Give Google's system a handful of photos of your coffee mug and a text prompt, and it will generate a full 3D model of that specific mug — not a generic one. That's the core promise of this patent.

[Figure: FIG. 1A of US 2026/0154903 A1, "Subject-Driven Text-to-3D Model Generation", rendered from the official USPTO publication PDF.]
Publication number: US 2026/0154903 A1
Applicant: Google LLC
Filing date: Jan 27, 2026
Publication date: Jun 4, 2026
Inventors: Yuanzhen Li, Amit Raj, Varun Jampani, Benjamin Joseph Mildenhall, Benjamin Michael Poole, Jonathan Tilton Barron, Kfir Aberman, Michael Niemeyer, Michael Rubinstein, Nataniel Ruiz Gutierrez, Shiran Elyahu Zada, Srinivas Kaza
CPC classification: 345/419
Grant likelihood: Medium
Examiner: CENTRAL, DOCKET (Art Unit OPAP)
Status: Docketed New Case - Ready for Examination (Feb 27, 2026)
Parent application: Continuation of 18611236 (filed 2024-03-20)
Document: 20 claims

What Google's photo-to-3D generation pipeline actually does

Imagine you want a 3D model of your actual sneaker — not some generic sneaker shape, but your specific pair, with the right colorway, worn soles, and brand logo. Today, building that from scratch takes a 3D artist and a lot of time. AI tools can generate generic 3D objects from text, but they struggle to capture the specific details of a real-world subject.

Google's patent describes a system that tackles this by combining two AI models in a careful training loop. You feed it a few photos of your subject and a text description, and it learns to generate that specific object from any angle — including angles your photos never showed.

The clever part is the staged approach: the system does a partial training pass first, uses those partial results to bootstrap a 3D model, then uses that rough 3D model to generate new synthetic views, and finally trains everything to completion. It's a back-and-forth refinement loop that helps both models teach each other.

How the two-model training loop builds the 3D output

The patent describes a pipeline that connects two distinct AI models: a generative image model (think of a fine-tuned diffusion model, like those behind Stable Diffusion or Imagen) and a 3D implicit representation model (a neural network that stores a scene's shape and appearance as a continuous function of 3D position rather than as an explicit mesh, similar to NeRF, or Neural Radiance Fields).
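
To make "implicit representation" concrete, here is a minimal sketch in Python. It is not taken from the patent: `toy_field` is a hypothetical stand-in for the trained network (a hard-coded sphere), and `render_ray` uses the standard NeRF-style quadrature to turn densities and colors sampled along a camera ray into a pixel color. The density value, sample count, and ray bounds are arbitrary assumptions.

```python
import math

def toy_field(x, y, z):
    """Hypothetical implicit field: returns (density, rgb) at a 3D point.
    A real system would query a trained neural network here; this stand-in
    is a solid reddish sphere of radius 1 centered at the origin."""
    inside = x * x + y * y + z * z <= 1.0
    density = 10.0 if inside else 0.0
    return density, (1.0, 0.2, 0.2)

def render_ray(origin, direction, near=0.0, far=6.0, n_samples=128):
    """Approximate the color seen along one ray by sampling the field at
    evenly spaced depths and alpha-compositing front to back."""
    delta = (far - near) / n_samples
    transmittance = 1.0  # fraction of light not yet absorbed
    rgb = [0.0, 0.0, 0.0]
    for i in range(n_samples):
        t = near + (i + 0.5) * delta
        point = [o + t * d for o, d in zip(origin, direction)]
        density, color = toy_field(*point)
        alpha = 1.0 - math.exp(-density * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            rgb[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return tuple(rgb), 1.0 - transmittance  # (color, accumulated opacity)

# One ray through the sphere's center, one that misses it entirely.
hit_color, hit_opacity = render_ray((0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
miss_color, miss_opacity = render_ray((2.0, 0.0, -3.0), (0.0, 0.0, 1.0))
```

Because the scene is just a function you can query at any point, rendering it from a new camera angle only requires casting rays from a new origin, which is what makes this kind of representation a natural fit for generating unseen viewpoints.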

The key innovation is the staged, interleaved training process:

  • Fractional image training: The generative image model is first fine-tuned on your subject photos — but only partially, not to completion.
  • Fractional 3D optimization: That partially trained image model is then used to partially optimize the 3D implicit representation model.
  • Pseudo multi-view generation: The partially optimized 3D model and a fully trained image model collaborate to generate synthetic images of the subject from viewpoints not present in the original photos.
  • Full training: Both models are then trained to completion using the original photos plus those synthetically generated views.
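
The four stages above can be sketched as a simple schedule. This is a control-flow illustration only, under loud assumptions: `StubModel`, `staged_pipeline`, the 25% "fraction", the step counts, and the eight placeholder view names are all hypothetical, and no real training happens. The article does not say what fraction the patent uses or how many views are generated.

```python
from dataclasses import dataclass, field

@dataclass
class StubModel:
    """Hypothetical stand-in for a model; it only records what it was fed."""
    name: str
    steps_done: int = 0
    data: list = field(default_factory=list)

    def train(self, steps, data):
        self.steps_done += steps
        self.data = list(data)

def staged_pipeline(photos, prompt, total_steps=1000, fraction=0.25):
    """Run the four stages in order with stub models (assumed schedule)."""
    image_model = StubModel("generative image model")
    field_model = StubModel("3D implicit representation model")
    partial = int(total_steps * fraction)

    # Stage 1: fine-tune the image model on the subject photos, but stop early.
    image_model.train(partial, photos)

    # Stage 2: partially optimize the 3D model, guided by the partial image model.
    field_model.train(partial, [prompt])

    # Stage 3: produce pseudo multi-view images. The article attributes this to
    # the partially optimized 3D model plus a fully trained image model; here
    # the output is just a list of placeholder names, one per assumed angle.
    pseudo_views = [f"pseudo_view_{angle:03d}deg" for angle in range(0, 360, 45)]

    # Stage 4: train both models to completion on photos plus pseudo views.
    remaining = total_steps - partial
    image_model.train(remaining, photos + pseudo_views)
    field_model.train(remaining, photos + pseudo_views)
    return image_model, field_model, pseudo_views

image_model, field_model, views = staged_pipeline(["mug_1.jpg", "mug_2.jpg"], "a photo of a mug")
```

The point of the sketch is the ordering: neither model is finished before the synthetic views exist, so the final training pass sees more viewpoints than the user ever photographed.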

The multi-view image model — a separate model trained to predict what an object looks like from multiple camera angles given a text prompt — acts as a supervisor and data augmentor throughout. By generating plausible views the camera never captured, it gives the 3D model enough information to build a coherent three-dimensional representation.
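
One way to picture "views the camera never captured" is to place virtual cameras on an orbit around the object and keep only the angles that are far from any real photo. The sketch below is an assumption for illustration, not the patent's method: the function names, the 45-degree candidate spacing, the 30-degree minimum gap, and the orbit radius and elevation are all hypothetical.

```python
import math

def angular_gap(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def uncovered_azimuths(captured_deg, step_deg=45, min_gap_deg=30.0):
    """Candidate orbit angles at least min_gap_deg away from every photo."""
    return [
        c for c in range(0, 360, step_deg)
        if all(angular_gap(c, a) >= min_gap_deg for a in captured_deg)
    ]

def camera_position(azimuth_deg, elevation_deg=15.0, radius=2.5):
    """Camera location on a sphere around the object (object at the origin)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        radius * math.cos(el) * math.cos(az),
        radius * math.cos(el) * math.sin(az),
        radius * math.sin(el),
    )

# Suppose the user only photographed the front of the object.
captured = [0.0, 20.0, 340.0]
novel = uncovered_azimuths(captured)
novel_cameras = [camera_position(a) for a in novel]
```

With three front-facing photos, the selected angles cover the sides and back of the object, which are exactly the regions where the 3D model would otherwise have no supervision.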

What this means for AI-generated 3D content creation

The hardest problem in text-to-3D generation isn't making a 3D object; it's making your 3D object. Subject-driven generation, where the output must match a specific real-world item, is significantly harder because the model needs to preserve identity while still generalizing across viewpoints and lighting, usually from only a handful of photos. Google's staged pipeline targets that data scarcity directly: you don't need dozens of photos from every angle because the system generates the missing views itself.

For product visualization, gaming asset pipelines, AR/VR content creation, and e-commerce, a system like this could dramatically lower the barrier to creating personalized 3D assets. If this ships in a Google product — say, a future version of a Google Labs tool or integrated into Android's AR stack — it could let everyday users create 3D models from phone snapshots.

Editorial take

This is a genuinely interesting technical approach to a real and stubborn problem in generative AI. The staged fractional training loop — where partial models bootstrap each other before full training — is a thoughtful solution to the chicken-and-egg problem of needing multi-view data to train a 3D model you don't have yet. Google has the research firepower here (several of these inventors are behind DreamFusion and DreamBooth), so this isn't speculative work from a team without a track record.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.