Nvidia · Filed Sep 8, 2025 · Published Jun 18, 2026 · verified — real USPTO data

Nvidia Patent: Teaching AI to Read Images and Answer in Plain Text

By Patentlyze Team · Updated Jun 19, 2026

Nvidia has filed a patent application for a training approach that breaks images into tiles, runs each tile through multiple specialized AI vision systems, and then feeds all of that visual information into a language model (the kind of system that powers chatbots) so it can respond in plain text.

FIG. 1A — rendered from the official USPTO publication PDF.

Publication number: US 2026/0170228 A1

Applicant: NVIDIA CORPORATION

Filing date: Sep 8, 2025

Publication date: Jun 18, 2026

Inventors: Zhiding YU, Zhiqi LI, Guo CHEN, Shilong LIU, Shihao WANG, Vibashan VISHNUKUMAR SHARMINI, Shiyi LAN, Hao ZHANG, Yilin ZHAO, Subhashree RADHAKRISHNAN, Nai Chen CHANG, Karan SAPRA, Amala Sanjay DESHMUKH, Tuomas RINTAMAKI, Matthieu LE, De-An HUANG, Jose Manuel ALVAREZ LOPEZ, Bryan CATANZARO, Jan KAUTZ, Andrew J. TAO, Guilin LIU

CPC classification: 715/256

Grant likelihood: Medium

Examiner: CENTRAL, DOCKET (Art Unit OPAP)

Status: Docketed New Case - Ready for Examination (Oct 7, 2025)

Parent application: Claims priority from provisional application 63733405 (filed Dec 12, 2024)

Document: 20 claims

AI/ML

How Nvidia's image-tiling AI vision system works

Imagine asking an AI to look at a crowded photo and describe what's happening. Many AI systems today treat the whole image as one big chunk, which can cause them to miss fine details in the corners or small text buried in a complex scene.

Nvidia's patent describes a different approach: break the image into smaller tiles, like cutting a photo into puzzle pieces. Each tile gets analyzed by multiple AI systems that were each trained to look for different things — one might be great at reading text, another at identifying objects, another at understanding spatial relationships. All of those impressions get combined into a set of tokens (small data units a language model can read), and the language model then writes a response.
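To make the puzzle-piece idea concrete, here is a minimal sketch of the tiling step alone. It is an illustration, not the method claimed in the filing: the image is a hypothetical stand-in (a grid of numbers), the function name split_into_tiles is invented for this example, and the choice to let edge tiles come out smaller is an assumption.

```python
from typing import List

# Hypothetical stand-in: a grayscale image as rows of pixel values.
Image = List[List[int]]


def split_into_tiles(image: Image, tile_h: int, tile_w: int) -> List[Image]:
    """Cut an image into non-overlapping tiles, in row-major order.

    Assumption for this sketch: edge tiles are simply smaller when the
    image size is not a multiple of the tile size.
    """
    if tile_h <= 0 or tile_w <= 0:
        raise ValueError("tile dimensions must be positive")
    height = len(image)
    width = len(image[0]) if height else 0
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # Slice the rows first, then the columns within each row.
            tile = [row[left:left + tile_w] for row in image[top:top + tile_h]]
            tiles.append(tile)
    return tiles


# A 4x6 toy image whose pixel value encodes its row and column.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
tiles = split_into_tiles(image, tile_h=2, tile_w=3)
print(len(tiles))  # 4
print(tiles[0])    # [[0, 1, 2], [10, 11, 12]]
```

Each tile keeps its original pixels at full detail, which is the point of tiling: small text in a corner is no longer squeezed into a shrunken copy of the whole picture.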

The training process itself is also structured in stages, moving from a broad dataset to a smaller, more refined one. The idea is that the final model ends up both well-rounded and precisely tuned — something like going from a general education to a specialized graduate program.

How the tile encoders feed into the language model

The patent describes a three-stage training pipeline for building a multimodal model — an AI that can take in images and text and produce text responses (think: the kind of system behind tools like GPT-4o or Google Gemini).

Stage one trains only the connector layer — the bridge between the vision side and the language model side — keeping both endpoints frozen. This is like calibrating the translation layer before touching the underlying languages.
Stage two trains the full model on a large, diverse dataset to build broad capability.
Stage three fine-tunes the model on a smaller, curated dataset to sharpen its performance on specific tasks.
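The three stages can be written down as a small schedule that records which parts of the model are trainable at each step. This is a sketch under stated assumptions, not the filing's implementation: the component names, stage names, and dataset labels are hypothetical, and treating every component as trainable in stage three is a guess, since the summary above only says the model is fine-tuned.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Hypothetical component names for the three parts described above.
COMPONENTS = ("vision_encoders", "connector", "language_model")


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str               # descriptive label only, not a real dataset
    trainable: FrozenSet[str]  # components that receive updates in this stage


def build_schedule() -> List[Stage]:
    everything = frozenset(COMPONENTS)
    return [
        # Stage one: only the connector learns; both endpoints stay frozen.
        Stage("connector_alignment", "not specified here", frozenset({"connector"})),
        # Stage two: the full model trains on a large, diverse dataset.
        Stage("broad_training", "large and diverse", everything),
        # Stage three: fine-tuning on a smaller, curated dataset.
        # Assumption: all components remain trainable.
        Stage("curated_finetune", "small and curated", everything),
    ]


def frozen_flags(stage: Stage) -> Dict[str, bool]:
    """Map each component to True when it is frozen during this stage."""
    return {name: name not in stage.trainable for name in COMPONENTS}


for stage in build_schedule():
    print(stage.name, frozen_flags(stage))
```

In a real training framework, the frozen flags would translate into disabling gradient updates for the matching parameters before each stage begins.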

The first independent claim focuses on a specific inference technique: splitting an input image into tiles and passing each tile through multiple vision encoders (specialized AI systems trained on different tasks — object detection, text recognition, depth estimation, and so on). The outputs are merged into tokens that a language model then processes to generate a text answer.
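A toy version of that flow, from tiles to tokens, might look like the sketch below. Everything in it is a stand-in: the two "encoders" are simple statistics rather than neural networks, the names mean_encoder, range_encoder, and tiles_to_tokens are invented for this example, and merging by concatenation is an assumption, since the summary here does not say how the filing combines encoder outputs.

```python
from typing import Callable, List, Sequence

Tile = List[List[float]]
Encoder = Callable[[Tile], List[float]]


def mean_encoder(tile: Tile) -> List[float]:
    """Toy stand-in for one vision encoder: the average pixel value."""
    pixels = [p for row in tile for p in row]
    return [sum(pixels) / len(pixels)]


def range_encoder(tile: Tile) -> List[float]:
    """Toy stand-in for a second encoder: the darkest and brightest pixel."""
    pixels = [p for row in tile for p in row]
    return [min(pixels), max(pixels)]


def tiles_to_tokens(tiles: Sequence[Tile], encoders: Sequence[Encoder]) -> List[List[float]]:
    """Run every encoder on every tile and build one token per tile.

    Assumption: each tile's encoder outputs are concatenated into a
    single feature vector, which serves as that tile's token.
    """
    tokens = []
    for tile in tiles:
        token: List[float] = []
        for encode in encoders:
            token.extend(encode(tile))
        tokens.append(token)
    return tokens


tiles = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[1.0, 1.0], [1.0, 5.0]],
]
tokens = tiles_to_tokens(tiles, [mean_encoder, range_encoder])
print(tokens)  # [[3.0, 0.0, 6.0], [2.0, 1.0, 5.0]]
```

In a full system, the token list would pass through the connector layer and into the language model, which is left out of this sketch.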

This multi-encoder approach is meaningful because no single vision model is equally good at all tasks. Combining their outputs gives the language model a richer, more complete picture of what's in the image — literally and figuratively.

What this means for AI that sees and speaks

Many current AI vision-language systems rely on a single vision encoder, which means they inherit whatever blind spots that encoder has. Nvidia's tiling-plus-multiple-encoder approach could yield AI assistants that are more accurate on complex visual questions, such as reading a chart, spotting a crack in machinery, or parsing a dense medical image, without needing one impossibly capable vision model.

For Nvidia, this also fits squarely into its push to own not just the hardware AI runs on, but the training recipes that make frontier models work. A patented, structured training pipeline could become a selling point for Nvidia's AI development platforms, giving enterprise customers a reproducible method for building their own multimodal systems.

Editorial take

This is a serious research patent, not a product announcement — but it reflects exactly where the AI field is heading: richer visual understanding baked into the same systems that already handle text. The multi-encoder tiling approach is a real architectural choice that distinguishes this from generic multimodal training filings. If Nvidia's internal research bears out the approach, expect it to show up in future model releases tied to its AI platform.


Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.
