New Google Patents · Filed Dec 18, 2025 · Published Jun 25, 2026 · verified — real USPTO data

Google Patent Reveals Virtual Batch Processing to Boost On-Device AI Efficiency

When an AI model works through a complex answer, it generates dozens of intermediate steps that depend on each other in complicated ways. Google's new patent describes a way to map out those dependencies upfront and bundle the steps into tidy groups so the model can process them more efficiently, especially on a phone or tablet.

Figure: FIG. 1A from US 2026/0178626 A1, rendered from the official USPTO publication PDF.
Publication number: US 2026/0178626 A1
Applicant: Google LLC
Filing date: Dec 18, 2025
Publication date: Jun 25, 2026
Inventors: Michael Christian Butler
CPC classification: 704/9
Grant likelihood: Medium
Examiner: ADESANYA, OLUJIMI A (Art Unit 2658)
Status: Non Final Action Mailed (Mar 25, 2026)
Parent application: Continuation of PCTUS2024062418 (filed 2024-12-31)
Document: 20 claims

What Google's virtual-batch AI inference actually does

Imagine you're solving a multi-step math problem. Some steps have to happen in order (you can't finish step 4 before step 2), but others are independent and could be done at the same time. A good teacher would organize those steps into sensible groups so students work in parallel wherever possible. Google's patent applies that same idea to AI language models.
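The classroom analogy can be made concrete with a few lines of Python. This is only an illustration of the general idea (grouping steps so that everything in a group can run at the same time), not the method claimed in the patent. The step names and the `parallel_groups` helper are made up for this example; `graphlib.TopologicalSorter` is part of the Python standard library.

```python
from graphlib import TopologicalSorter

def parallel_groups(deps):
    """Group steps into rounds; every step in a round can run at the same time.

    `deps` maps each step to the set of steps it has to wait for.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose prerequisites are all done
        groups.append(ready)
        sorter.done(*ready)                 # mark the whole round as finished
    return groups

# Hypothetical math problem: steps 2 and 3 both need step 1, step 4 needs both.
steps = {
    "step1": set(),
    "step2": {"step1"},
    "step3": {"step1"},
    "step4": {"step2", "step3"},
}
print(parallel_groups(steps))
# [['step1'], ['step2', 'step3'], ['step4']]
```

Steps 2 and 3 land in the same group because neither waits on the other, which is exactly the "work in parallel wherever possible" part of the analogy.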

When an AI generates a long or complex response, it produces it as a sequence of small units called tokens, which this patent treats as individual inference steps. Today, models typically generate those tokens one after another without an explicit record of which tokens depend on which. This patent describes a system where the model first builds a "dependency map" showing which tokens rely on which others, then organizes them into "virtual batches," each representing a self-contained chunk of reasoning the model can evaluate on its own.

The key detail is that these batches can include placeholder "masked" positions where no active token exists yet, keeping the structure consistent even when some slots are empty. The model then picks the best batch (or combination of batches) as its final answer.
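To picture what a masked position looks like, here is a minimal sketch, assuming a batch is just a fixed-width list of slots. The `MASK` sentinel, the `pad_to_width` helper, and the width of 5 are all hypothetical choices for illustration; the patent summary does not say how masked slots are actually represented.

```python
MASK = None  # hypothetical placeholder for a masked (inactive) position

def pad_to_width(tokens, width):
    """Return a fixed-width batch plus a flag per slot saying whether it is active."""
    if len(tokens) > width:
        raise ValueError("batch has more tokens than the fixed width allows")
    slots = list(tokens) + [MASK] * (width - len(tokens))
    active = [slot is not MASK for slot in slots]
    return slots, active

slots, active = pad_to_width(["2", "+", "3"], width=5)
print(slots)   # ['2', '+', '3', None, None]
print(active)  # [True, True, True, False, False]
```

Every batch comes out the same length no matter how many real tokens it holds, and the `active` flags record which slots are placeholders.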

How the dependency map builds and fills virtual batches

The patent describes a three-stage process that runs inside a large language model, potentially on a device like a phone rather than a remote server.

  • Token generation: The LLM produces a set of inference tokens, which can come from the same input prompt, multiple different prompts, or a mix of both.
  • Dependency mapping: The model builds a structured map that assigns each token one or more index markers (essentially a position ID) and a correlation marker (a tag showing which other tokens it depends on). This map captures the logical order of reasoning steps.
  • Virtual batch construction: Using the dependency map, the model groups tokens into virtual batches. Each batch represents a discrete, self-contained inference step. Batches can contain masked positions, meaning placeholder slots where no token is active, which keeps the batch sizes uniform and predictable for the underlying hardware.
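The three stages above can be sketched in code, with heavy caveats. The sketch below assumes that "self-contained" means no dependency crosses a batch boundary, so it groups tokens that are linked through their correlation markers. The `MapEntry` class, its field names, and `build_virtual_batches` are hypothetical names invented for this example, and the grouping rule is one plausible reading, not the patent's stated algorithm. It also assumes every dependency refers to an index that exists in the map.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MapEntry:
    index: int              # index marker: the token's position ID
    token: str              # the inference token itself (stage 1 output)
    depends_on: tuple = ()  # correlation marker: indices this token relies on

def build_virtual_batches(entries):
    """Group tokens so that no dependency crosses a batch boundary."""
    parent = {e.index: e.index for e in entries}

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Stage 2: the dependency map links each token to the tokens it relies on.
    for e in entries:
        for dep in e.depends_on:
            parent[find(e.index)] = find(dep)

    # Stage 3: one virtual batch per linked group, tokens kept in index order.
    groups = {}
    for e in sorted(entries, key=lambda e: e.index):
        groups.setdefault(find(e.index), []).append(e.token)
    return list(groups.values())

# Two hypothetical reasoning paths, A and B, interleaved in one token stream.
entries = [
    MapEntry(0, "A1"),
    MapEntry(1, "B1"),
    MapEntry(2, "A2", depends_on=(0,)),
    MapEntry(3, "B2", depends_on=(1,)),
    MapEntry(4, "A3", depends_on=(2,)),
]
print(build_virtual_batches(entries))
# [['A1', 'A2', 'A3'], ['B1', 'B2']]
```

The two paths come out as separate batches of different lengths, which is the situation the masked positions are meant to even out.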

Finally, the model selects one or more of these virtual batches as the definitive output. The masking system is notable because it lets the model handle multiple reasoning paths of different lengths in uniformly sized batches. Because the empty slots are explicitly marked as inactive, they are less likely to waste computation or to confuse the model's attention mechanism (the part of the AI that decides which pieces of information to focus on) the way unmarked padding could.
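One common way to keep placeholder slots from influencing attention is a masked softmax, where inactive positions get exactly zero weight. The sketch below shows that standard technique in plain Python; whether the patent handles masked positions this way is an assumption, and the `masked_softmax` name and the example scores are made up for illustration.

```python
import math

def masked_softmax(scores, active):
    """Softmax over `scores` that gives masked (inactive) slots exactly zero weight."""
    live = [s for s, a in zip(scores, active) if a]
    if not live:
        return [0.0] * len(scores)  # nothing active, so nothing to attend to
    peak = max(live)                # subtract the max for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, active)]
    total = sum(exps)
    return [e / total for e in exps]

# The third slot is a masked placeholder; its large score is ignored entirely.
print(masked_softmax([1.0, 1.0, 5.0], [True, True, False]))
# [0.5, 0.5, 0.0]
```

Without the mask, the placeholder's score of 5.0 would dominate the result, which is the kind of confusion that explicit masking avoids.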

What this means for AI running directly on your phone

On-device AI is a crowded space right now, with major phone makers racing to run language models locally rather than sending your prompts to the cloud. The bottleneck is almost always compute efficiency: phones have limited processing power and battery capacity, so every wasted calculation costs you time and power. A system that maps dependencies before processing and batches work intelligently could meaningfully reduce how much work a model has to repeat when generating complex answers.

This patent also hints at handling multiple concurrent prompts, not just one at a time. If a phone's AI assistant is juggling several background tasks simultaneously, smarter batching could keep all of them responsive without one hogging the chip. That's a practical engineering problem Google clearly wants to solve at the model level, not just the hardware level.
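As a rough picture of how batching can keep several prompts responsive, here is a round-robin scheduler that takes one pending token from each prompt in turn when filling a batch. Round-robin is a swapped-in, textbook scheduling policy used purely for illustration; the patent summary does not say how concurrent prompts are balanced. The prompt names, the `round_robin_batches` helper, and the batch size are all hypothetical.

```python
from collections import deque

def round_robin_batches(queues, batch_size):
    """Fill fixed-size batches by taking one pending item from each prompt in turn."""
    pending = deque((name, deque(items)) for name, items in queues.items() if items)
    batches = []
    while pending:
        batch = []
        while pending and len(batch) < batch_size:
            name, items = pending.popleft()
            batch.append((name, items.popleft()))
            if items:                       # prompt still has work: back of the line
                pending.append((name, items))
        batches.append(batch)
    return batches

# Three hypothetical background tasks with different amounts of pending work.
queues = {
    "summarize": ["s1", "s2", "s3"],
    "translate": ["t1"],
    "reply": ["r1", "r2"],
}
for batch in round_robin_batches(queues, batch_size=2):
    print(batch)
# [('summarize', 's1'), ('translate', 't1')]
# [('reply', 'r1'), ('summarize', 's2')]
# [('reply', 'r2'), ('summarize', 's3')]
```

The longest task never gets two consecutive slots while another prompt is still waiting, which is one simple way to avoid a single task hogging the chip.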

Editorial take

This is genuinely useful infrastructure work rather than a flashy AI feature. The problem it solves, inefficient token scheduling during on-device inference, is real and limits how capable local AI can be on consumer hardware. It's not the kind of patent that makes headlines, but if this approach works as described, it's the type of optimization that makes everything else better.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.