Google · Filed Nov 12, 2025 · Published May 21, 2026 · verified — real USPTO data

Waymo Patents an LLM System That Predicts What Self-Driving Cars Will See Next

Waymo is patenting a system that makes its self-driving car describe what it sees in plain text — then uses those text descriptions as inputs to a separate neural network that predicts what's about to happen on the road.

Waymo Patent: Multimodal LLMs for Autonomous Driving Predictions — figure from US 2026/0141246 A1
FIG. 1A — rendered from the official USPTO publication PDF.
Publication number US 2026/0141246 A1
Applicant Waymo LLC
Filing date Nov 12, 2025
Publication date May 21, 2026
Inventors Katie Luo, Jingwei Ji, Mingxing Tan
CPC classification 706/25
Grant likelihood Medium
Examiner CENTRAL, DOCKET (Art Unit OPAP)
Status Docketed New Case - Ready for Examination (Dec 3, 2025)
Parent application Claims priority from a provisional application 63721396 (filed 2024-11-15)
Document 21 claims

What Waymo's LLM-assisted scene prediction actually does

Imagine a self-driving car approaching a busy intersection. Instead of just feeding raw camera and lidar data into a prediction model and hoping for the best, Waymo's system first has a language model describe the scene in words — something like 'a cyclist is signaling left near a stopped bus' — and then passes that description alongside the raw sensor context into a second model that figures out what everyone is likely to do next.

The key idea is that large multimodal language models (think GPT-4V-style systems that understand both images and text) are already very good at interpreting complex visual scenes. Waymo wants to tap into that capability as a scene understanding layer that sits upstream of its dedicated prediction system.

The result is a two-stage pipeline: an LLM that reads the world and narrates it, and a prediction network that uses that narration — plus raw sensor context — to forecast how pedestrians, cyclists, and other vehicles will behave. You get the language model's broad world knowledge combined with the prediction model's precise trajectory-forecasting ability.

How the MLM converts sensor data into prediction inputs

The patent describes a multimodal language model (MLM) neural network — a model that can process visual inputs like camera frames alongside text — generating one or more text outputs that describe what's happening in a driving scene at a given moment.

Those text outputs are then combined with scene context data (structured information about the environment: road geometry, traffic signals, agent states, etc.) to form a prediction input to a separate, dedicated prediction neural network. That second network is what actually outputs predictions — think trajectory forecasts for every agent (pedestrian, cyclist, car) in the scene.

The architecture is deliberately modular:

  • Sensor data (camera, lidar, radar) goes into the MLM, which narrates the scene in natural language.
  • Those text descriptions are fused with structured scene context data.
  • The combined input is processed by a prediction network tuned specifically for trajectory or behavior forecasting.

This separation matters because the MLM handles open-ended scene interpretation — it can describe unusual or ambiguous situations in a way that rigid perception pipelines might miss — while the prediction network focuses on the precise, safety-critical task of forecasting motion. The training system described in the patent also allows the prediction network's parameters to be optimized using perception outputs from training data, keeping both stages jointly grounded.

What this means for how self-driving cars understand the world

Self-driving prediction models have historically struggled with rare or ambiguous scenarios — the edge cases that structured sensor pipelines can't reliably classify. By routing scene understanding through an LLM first, Waymo is betting that language models' exposure to vast amounts of real-world visual and textual data gives them a richer, more generalizable grasp of what's actually happening on the road — especially in weird situations a narrow perception model might misread.

For Waymo's commercial robotaxi operations, better scene prediction directly translates to smoother, safer rides. If this pipeline proves robust, it also hints at a broader architectural shift: using foundation models not as end-to-end drivers, but as interpretive middleware that feeds specialized downstream models — a design pattern that could show up across the autonomous driving industry.

Editorial take

This is a genuinely interesting architectural bet. Rather than trying to make one giant model do everything, Waymo is threading LLM scene understanding into an existing prediction pipeline as a discrete, swappable layer. That modularity is practical engineering, not flashy AI branding — and it lines up with how serious ML teams actually ship safety-critical systems. Worth watching closely as the multimodal model space matures.

Get one Big Tech patent every Sunday

Plain English, intelligent commentary, no hype. Free.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.