Google · Filed Oct 31, 2024 · Published Apr 30, 2026 · verified — real USPTO data

Google Seeks to Patent a Speech Recognition System That Anticipates Your Words

By Patentlyze Team · Updated May 4, 2026

Most speech recognition systems wait for you to finish talking before they figure out what you said. Google's newly published patent application describes a system that starts narrowing down likely words and phrases while you're still speaking, before transcription even begins.

FIG. 1A — rendered from the official USPTO publication PDF.

Publication number US 2026/0120685 A1

Applicant Google LLC

Filing date Oct 31, 2024

Publication date Apr 30, 2026

Inventors Zhiqi Huang, Diamantino Antonio Caseiro, Christopher Li, Zelin Wu, Patrick Maxim Rondon, Kandarp Joshi, Petr Zadrazil, Lillian Qiaohui Zhou, Petar Aleksic

US classification 704/232

Grant likelihood Medium

Examiner SIDDO, IBRAHIM (Art Unit 2681)

Status Non Final Action Counted, Not Yet Mailed (Apr 25, 2026)

How Google's speech system predicts words mid-sentence

Imagine you're asking your phone about a restaurant: "Can you get me directions to Nobu Malibu?" A standard voice assistant has to hear the whole thing, then look up every possible word before deciding what you said. That's slow, and it struggles with unusual names, places, or brands.

What Google's patent describes is more like a well-read co-worker who's already thought of the five most likely things you might say next. While your voice is still coming in, the system is already scoring a big list of candidate phrases — things like proper nouns, contact names, or app titles — and picking the most relevant ones based on the audio so far.

Those top candidates then get fed into the transcription process as extra context, nudging the system toward the right answer. The result is a recognizer that's better at handling rare or custom vocabulary without having to slow down or search through millions of options at the last second.

How the neural retrieval module ranks and injects phrases

The system works in three linked stages, each handled by a different component, before a speech recognizer produces the final transcription.

First, an audio encoder converts incoming speech into a sequence of audio embeddings — dense numerical vectors that capture the meaning and sound of what's being said, frame by frame. Think of these as a compact mathematical fingerprint of your voice input.

Second, a neural retrieval module runs in parallel. It takes a large candidate phrase corpus (a library of words or phrases the system might need to recognize, like contacts, app names, or location names) and scores each entry against the audio embeddings using a scoring function. Each phrase has been pre-encoded into its own phrase embedding and also broken into wordpiece embeddings, which are embeddings of sub-word tokens, the standard unit in modern language models. The K highest-scoring phrases, the top-K, are kept as the most contextually relevant candidates.
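To make the retrieval step concrete, here's a minimal Python sketch. The description above doesn't say which scoring function is used, so cosine similarity against a mean-pooled audio embedding is our assumption. The function names and the toy 3-dimensional vectors are also made up for illustration and don't come from the filing.

```python
import heapq
import math


def mean_pool(frames):
    """Average a sequence of per-frame audio embeddings into one vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity. An assumed stand-in for the unspecified scoring function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(audio_frames, phrase_embeddings, k):
    """Return the k phrases whose embeddings score highest against the audio."""
    query = mean_pool(audio_frames)
    scored = ((cosine(query, emb), phrase) for phrase, emb in phrase_embeddings.items())
    return [phrase for _, phrase in heapq.nlargest(k, scored)]


# Toy embeddings, hand-picked for illustration only.
audio_frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
phrase_embeddings = {
    "Nobu Malibu": [1.0, 0.1, 0.0],
    "Noble Mall": [0.6, 0.6, 0.1],
    "Call Mom": [0.0, 0.2, 1.0],
}

top_phrases = retrieve_top_k(audio_frames, phrase_embeddings, k=2)
print(top_phrases)
```

Because the phrase embeddings are computed ahead of time, the only work done per utterance is one pooled query and one score per phrase, which is what lets the shortlist be ready before transcription starts.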

Third, a biaser module combines the audio embeddings with the wordpiece sequences from those top-K phrases to produce a context vector — essentially a hint to the final recognizer about what vocabulary is most likely. The speech recognizer then uses both that context vector and the raw speech features to generate the final transcription.
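One plausible way to "combine" audio embeddings with wordpiece embeddings is attention, so the sketch below uses scaled dot-product attention. That choice is our assumption, as are the function names, the toy 2-dimensional vectors, and the decision to produce one context vector per audio frame rather than a single vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention where the keys also serve as the values."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]


def bias_context(audio_frames, wordpiece_embeddings):
    """Build one context vector per audio frame from the shortlisted wordpieces."""
    return [attend(frame, wordpiece_embeddings) for frame in audio_frames]


# Toy data: two audio frames and two wordpiece embeddings, for illustration only.
audio_frames = [[2.0, 0.0], [0.0, 2.0]]
wordpieces = [[1.0, 0.0], [0.0, 1.0]]

contexts = bias_context(audio_frames, wordpieces)
print(contexts)
```

Each context vector leans toward the wordpiece that best matches its audio frame, which is the "hint" the recognizer receives alongside the speech features.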

Audio encoder → audio embeddings
Neural retrieval → top-K biasing phrases
Biaser module → context vector
Speech recognizer → final transcription
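The four arrows above can be wired together in a toy Python sketch that also shows the shortlist being refreshed as audio arrives. Every component here is a hypothetical stand-in: the "encoder" treats each chunk as a ready-made embedding, the "biaser" simply averages the shortlisted phrase embeddings, and the "recognizer" picks a phrase from the corpus instead of decoding text. None of the names or numbers come from the filing.

```python
# Hypothetical toy corpus: phrase -> pre-computed 2-d phrase embedding.
CORPUS = {
    "Nobu Malibu": [1.0, 0.0],
    "Noble Mall": [0.7, 0.7],
    "Call Mom": [0.0, 1.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def retrieve(frames, k):
    """Stage 2: rank every phrase against the audio heard so far, keep k."""
    query = mean(frames)
    ranked = sorted(CORPUS, key=lambda p: dot(query, CORPUS[p]), reverse=True)
    return ranked[:k]


def bias(shortlist):
    """Stage 3 stand-in: pool the shortlisted phrase embeddings into one context vector."""
    return mean([CORPUS[p] for p in shortlist])


def recognize(frames, context):
    """Stage 4 stand-in: pick the corpus phrase closest to audio plus context.

    A real recognizer would decode text from speech features instead.
    """
    combined = [a + c for a, c in zip(mean(frames), context)]
    return max(CORPUS, key=lambda p: dot(combined, CORPUS[p]))


def run(chunks, k=2):
    frames, history = [], []
    for chunk in chunks:
        frames.append(list(chunk))           # Stage 1 stand-in: chunk is already an embedding
        history.append(retrieve(frames, k))  # shortlist is refreshed as audio arrives
    context = bias(history[-1])
    return history, recognize(frames, context)


history, transcript = run([[0.6, 0.4], [1.0, 0.0], [1.0, 0.0]])
for step, shortlist in enumerate(history, start=1):
    print(step, shortlist)
print(transcript)
```

In this toy run the first chunk is ambiguous and "Noble Mall" ranks first, but by the second chunk "Nobu Malibu" has moved to the top of the shortlist.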

What this means for Google Assistant and Pixel voice

Voice assistants have always had a known weak spot: rare words. Proper nouns, brand names, niche medical terms, and the names in your contact list all trip up generic models trained on broad, general-purpose data. The usual fix is brute force: check every possible word after the fact. That's slow and doesn't scale.

Google's approach here — doing the vocabulary narrowing before transcription, using audio-based relevance scoring — could meaningfully improve accuracy for exactly the cases where voice recognition frustrates you most. If this lands in Pixel phones or Google Assistant, it would be most noticeable when you're asking for something specific: a person's name, a local business, or a niche command. That's also the category where getting it wrong is most annoying.

Editorial take

This is a genuinely smart systems patent, not a vague AI grab. The key insight — rank your vocabulary candidates against live audio before transcription, not after — is the kind of pipeline-level improvement that compounds over time. Google has been iterating on contextual biasing in speech recognition for years, and this looks like a meaningful architectural step forward, not incremental tuning.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice. Patentlyze may earn a commission if you click an affiliate link and make a purchase. This doesn't affect what we cover or how we cover it.