Nvidia · Filed Sep 8, 2025 · Published Jun 18, 2026 · verified — real USPTO data

Nvidia Patents an AI That Reads Images the Way You'd Describe a Painting

Nvidia has filed a patent for a training approach that teaches AI to understand images by slicing them into tiles, running each piece through multiple specialized analyzers, and then feeding all that visual information into a language model — the kind of AI that generates text.

Nvidia Patent: Teaching AI to See Images in Pieces — figure from US 2026/0170269 A1
FIG. 1A — rendered from the official USPTO publication PDF.
Publication number US 2026/0170269 A1
Applicant NVIDIA CORPORATION
Filing date Sep 8, 2025
Publication date Jun 18, 2026
Inventors Zhiding YU, Zhiqi LI, Guo CHEN, Shilong LIU, Shihao WANG, Vibashan VISHNUKUMAR SHARMINI, Shiyi LAN, Hao ZHANG, Yilin ZHAO, Subhashree RADHAKRISHNAN, Nai Chen CHANG, Karan SAPRA, Amala Sanjay DESHMUKH, Tuomas RINTAMAKI, Matthieu LE, De-An HUANG, Jose Manuel ALVAREZ LOPEZ, Bryan CATANZARO, Jan KAUTZ, Andrew J. TAO, Guilin LIU
CPC classification 382/159
Grant likelihood Medium
Examiner CENTRAL, DOCKET (Art Unit OPAP)
Status Docketed New Case - Ready for Examination (Oct 3, 2025)
Parent application Claims priority from a provisional application 63733405 (filed 2024-12-12)
Document 20 claims

How Nvidia's image-tiling AI model actually works

Imagine trying to describe a crowded painting to a friend over the phone. It's easier if you talk about it section by section — top-left corner, center, bottom-right — rather than trying to capture everything at once. Nvidia's patent takes a similar approach with AI.

Instead of feeding a whole image into an AI in one shot, this system chops the image into smaller tiles and runs each tile through multiple vision analyzers, each trained to notice different things (shapes, objects, text in the image, etc.). The results are then combined and handed off to a language model — the same kind of AI behind chatbots — which turns all of that into a written response.

The training process is also staged: the system first trains just the "bridge" connecting the visual and language parts, then does a big training run, then finishes with a smaller, more focused one. The goal is an AI that can genuinely answer questions about images, not just describe them in generic terms.

How the tile-encoder pipeline feeds the language model

The patent describes a three-stage training pipeline for a multimodal model — an AI that handles both images and text. The key architectural move is in how images are prepared before the language model ever sees them.

  • Image tiling: An input image is split into a grid of smaller tiles, similar to how map apps load a big map in chunks. This lets the model pay close attention to fine details that might get washed out when processing a whole image at once.
  • Multiple vision encoders: Each tile is processed by several vision encoders (specialized neural networks trained to detect different features — one might be good at reading text in images, another at identifying objects). Running the same tile through different encoders generates richer, more varied feature data.
  • Token generation: The visual features from all those encoders are converted into tokens (the basic units a language model reads, similar to words in a sentence), which the language model then uses to generate a text output.

The training itself runs in three phases: first, a connector (the bridge between the vision and language components) is trained alone; then the full model trains on a large dataset; finally, it fine-tunes on a smaller, higher-quality dataset. This staged approach is a common way to squeeze better performance out of large AI systems without starting from scratch each time.

What this means for AI that reads images and text together

Multimodal AI — systems that understand both images and text — is one of the most competitive areas in AI right now. The ability to ask an AI "what's wrong with this X-ray?" or "summarize this chart" depends on how well the visual and language sides of the model work together. Nvidia's tile-and-encode approach is designed to preserve fine-grained detail that gets lost when a single encoder tries to process a full image at once.

For Nvidia, which sells the chips that power most AI training infrastructure, patenting the training methods themselves is a way to extend influence beyond hardware. If this technique ends up in Nvidia's AI software stack, it could shape how customers build vision-language models on Nvidia's platforms.

Editorial take

This is a legitimate engineering contribution to a genuinely hard problem — getting AI to look carefully at images rather than just glancing at them. The tile-plus-multiple-encoders approach is well-motivated and the staged training is practical. That said, multimodal training methods are a crowded field, and Nvidia is one of dozens of organizations filing patents in this space. Whether this particular combination is distinct enough to matter legally or technically is a real question.

Get one Big Tech patent every Sunday

Plain English, intelligent commentary, no hype. Free.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.