New Google Patents · Filed Dec 20, 2024 · Published Jun 25, 2026 · verified — real USPTO data

Google Patent Reveals AI That Reasons Through Images Before Producing Text Answers

What if an AI answered your question by first sketching a picture in its head? Google is patenting exactly that: a system where the AI generates internal images or videos as a thinking step, even when you only asked for text.

FIG. 1A of US 2026/0179261 A1, rendered from the official USPTO publication PDF.
Publication number: US 2026/0179261 A1
Applicant: GOOGLE LLC
Filing date: Dec 20, 2024
Publication date: Jun 25, 2026
Inventors: Agoston Weisz, Ivor Rendulic
US classification (class/subclass): 345/619
Grant likelihood: Medium
Examiner: RICHER, AARON M (Art Unit 2617)
Status: Docketed New Case - Ready for Examination (Jan 22, 2025)
Document: 20 claims

How Google's AI uses internal images to answer questions

Imagine you ask a question like "Which of these two routes has more curves?" Most AI assistants process that as pure text. Google's patent application describes a different approach: the AI first generates a visual internally, such as a map or diagram, uses that image to work out the answer, and then responds to you in plain text. You never see the image. It's just how the AI works through the problem.

Think of it like a person who scribbles a quick sketch on scratch paper before explaining something out loud. The sketch never gets handed to you, but it helped them think. Google's system does the same thing digitally, generating an image or short video as a reasoning step, not as the final output.

The key detail: you don't have to ask for any image or video. Your question can be completely text-based. The AI decides on its own that a visual would help it reason better, creates one internally, analyzes it, and delivers a text answer based on what it "saw."

Inside Google's visual chain-of-thought pipeline

The patent describes a method called visual chain-of-thought reasoning. A chain of thought (CoT) is a technique where an AI breaks a problem into intermediate steps before giving a final answer. Normally those steps are text. This patent extends that idea to images and video.

Here's the flow the patent outlines:

  • A user sends a request to a system running a generative model (GM). The request asks for a text answer and does not ask for any image or video.
  • The GM processes the request and, as part of generating a response, produces a generative image or video as an intermediate output. This is the "scratch paper" step.
  • The system then analyzes that generated visual to determine the final response, which is text delivered back to the user.

The generative model here is doing double duty: it functions both as an image/video generator and as a visual analyzer, all within a single reasoning pipeline. The patent notes this is especially useful for questions that are inherently spatial or visual, where text-only reasoning tends to miss things.

Importantly, the generated images or videos are internal. They are not shown to the user unless the system is separately instructed to surface them.
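
The three steps above can be sketched in a few lines of Python. This is only an illustration of the flow as the article describes it, not the patent's implementation: the names answer_with_visual_step, fake_generate, fake_analyze, and surface_visual are hypothetical, the "image" is a toy list of text rows, and the two stub functions stand in for what the patent describes as a single generative model doing both jobs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Image = List[str]  # toy "image": rows of characters standing in for pixels


@dataclass
class Answer:
    text: str                        # the only part returned to the user
    scratch: Optional[Image] = None  # internal visual, kept only if surfaced


def answer_with_visual_step(
    request: str,
    generate_visual: Callable[[str], Image],
    analyze_visual: Callable[[str, Image], str],
    surface_visual: bool = False,
) -> Answer:
    """Text in, text out, with a generated visual as an intermediate step."""
    visual = generate_visual(request)       # intermediate "scratch paper"
    text = analyze_visual(request, visual)  # read the answer off the visual
    return Answer(text=text, scratch=visual if surface_visual else None)


# Hypothetical stubs so the sketch runs without a real model.
def fake_generate(request: str) -> Image:
    # Pretend the model drew two routes: A is straight, B zigzags.
    return ["A: -----", "B: /\\/\\/"]


def fake_analyze(request: str, visual: Image) -> str:
    # Count slanted strokes per row as a crude proxy for curves.
    curves = {row[0]: sum(ch in "/\\" for ch in row) for row in visual}
    winner = max(curves, key=curves.get)
    return f"Route {winner} has more curves."


result = answer_with_visual_step(
    "Which of these two routes has more curves?", fake_generate, fake_analyze
)
print(result.text)
```

By default the returned Answer carries only text, mirroring the point that the visual stays internal. Passing surface_visual=True is this sketch's stand-in for a separate instruction to show it.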

What visual AI reasoning means for Google Search and Gemini

Most AI reasoning today happens entirely in text: the model works through a problem by predicting its intermediate steps as words. For questions about physical space, object layout, motion, or anything that benefits from a picture, that text-only approach has real limits. A system that can generate a visual internally and then "look at it" before answering is closer to how people often solve spatial problems.

For Google, this is relevant across multiple products: Gemini, Google Search's AI Overviews, and any assistant-style interface. If the technique works as described, it could improve accuracy on the kinds of questions where today's AI most visibly stumbles, things like directions, comparisons of physical objects, or interpreting diagrams. The gains would be invisible to the user but meaningful in output quality.

Editorial take

This is one of the more genuinely interesting AI reasoning patent applications filed recently, because it's not about making a model bigger or faster. It's about giving the model a new cognitive tool. The analogy to human sketch-before-you-explain thinking is apt, and the application is clear about the mechanism. Whether Google can make this reliable at scale is the real question, but the direction is worth paying attention to.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.