Waymo · Filed Oct 31, 2024 · Published Apr 30, 2026

Waymo Patents Transformer-Based AI System for Building Robotaxi Navigation Maps

Waymo is borrowing the same AI architecture that powers ChatGPT to help its self-driving cars understand what's around them — and turn that understanding directly into navigation decisions.

Waymo Patent: AI Transformer Maps for Robotaxi Navigation — figure from US 2026/0118129 A1
FIG. 1A — rendered from the official USPTO publication PDF.
Publication number US 2026/0118129 A1
Applicant Waymo LLC
Filing date Oct 31, 2024
Publication date Apr 30, 2026
Inventors Congrui Hetang, Guan Sun, Yan Jiao, Xiaohan Jin, Yue Shen, Ningshan Zhang, Guohao Zhang
US classification 701/446
Grant likelihood Medium
Examiner LEE, BRANDON DONGPA (Art Unit 3662)
Status Docketed New Case - Ready for Examination (Dec 9, 2024)

How Waymo's AI reads the road in real time

Imagine you're driving through an unfamiliar neighborhood. You glance around, mentally note the stop sign ahead, the cyclist on your right, the parked truck blocking the lane — and instantly build a mental picture you use to steer. Waymo's patent describes a system that does something very similar, but with AI.

Sensors on the robotaxi — cameras, lidar, radar — feed raw data into an AI model. That model, using the same transformer architecture behind modern large language models, converts the raw sensor soup into a compact, structured map of the driving environment. That map is then handed off to the car's navigation system to decide where to go next.

The clever part is the targeted querying: instead of processing every sensor reading with equal attention, the system uses specific "queries" to direct the AI toward the parts of the scene that actually matter for navigation — think of it like a spotlight that focuses on relevant details rather than drowning in noise.
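The "spotlight" intuition maps directly onto scaled dot-product attention, the core operation of a transformer. As a toy sketch (not Waymo's implementation; the scene features and query below are invented values), a query vector scores every scene feature, a softmax turns the scores into focus weights, and the output is a blend dominated by whatever the query was aimed at:

```python
import numpy as np

def attend(query, keys, values):
    """Scaled dot-product attention: one query over a set of scene features."""
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)      # relevance of each feature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax -> focus weights
    return weights @ values, weights        # weighted blend of the features

# Toy scene: 5 feature vectors (stop sign, cyclist, parked truck, ...), 8-dim each
scene = 3.0 * np.eye(5, 8)
query = scene[1]                            # a query aimed at feature 1 ("the cyclist")

out, w = attend(query, scene, scene)
print(w.round(3))   # weights concentrate on feature 1 -- the spotlight
```

The point of the sketch is the shape of the computation, not the numbers: one query pulls out one focused summary of the scene instead of a uniform average over everything the sensors saw.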

How the transformer decoder queries build driving embeddings

The patent describes a mapping subsystem that sits between a self-driving car's raw sensors and its navigation planner. Here's the pipeline:

  • Input embedding generation: Sensor data (lidar point clouds, camera frames, etc.) is converted into an input embedding — a dense mathematical representation that encodes the full driving environment as a high-dimensional vector.
  • Transformer decoder queries: The system selects one or more decoder queries — think of these as questions directed at specific portions of the embedding, like "what's in the lane ahead?" or "is there a pedestrian at the crosswalk?" This is attention from the same transformer family as GPT, but applied as cross-attention over a spatial scene rather than self-attention over text.
  • Driving environment embeddings: The transformer decoder processes those queries against the input embedding and outputs driving environment embeddings — compact vector representations of individual features in the scene (a lane boundary, an obstacle, a traffic light state).
  • Navigation handoff: Those embeddings are passed directly to the navigation system, which uses them to plan the vehicle's path.
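The four stages above can be sketched end to end. This is a schematic reconstruction, not code from the filing: `embed_sensors`, `decode_queries`, and `plan_path` are invented names, the decoder is collapsed to a single cross-attention layer, and the weights are random where a real system would use trained encoders and planners:

```python
import numpy as np

rng = np.random.default_rng(42)
D = 16  # embedding width (toy value)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def embed_sensors(sensor_points):
    """Stage 1 (stub): project raw sensor readings into scene tokens.
    A real system fuses lidar/camera/radar through trained encoders."""
    W = rng.normal(size=(sensor_points.shape[1], D)) / np.sqrt(sensor_points.shape[1])
    return sensor_points @ W                  # (num_tokens, D) input embedding

def decode_queries(queries, scene):
    """Stages 2-3: one cross-attention layer. Each decoder query attends over
    the scene tokens and comes back as a driving-environment embedding."""
    scores = queries @ scene.T / np.sqrt(D)
    return softmax(scores, axis=-1) @ scene   # (num_queries, D)

def plan_path(env_embeddings):
    """Stage 4 (stub): the navigation planner consumes the embeddings; here we
    just pick the query whose embedding has the largest norm."""
    return int(np.argmax(np.linalg.norm(env_embeddings, axis=-1)))

# Toy run: 50 raw sensor readings, 3 queries (say, lanes / obstacles / signals)
sensor = rng.normal(size=(50, 6))
queries = rng.normal(size=(3, D))
env = decode_queries(queries, embed_sensors(sensor))
print(env.shape)          # (3, D): one compact embedding per query
plan_path(env)
```

Note the compression: 50 raw readings in, 3 navigation-ready vectors out — which is the whole argument for handing embeddings, rather than raw sensor data, to the planner.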

The key architectural insight is using transformer decoders with learnable queries — a technique also seen in object detection models like DETR — to produce structured, navigation-ready scene representations rather than generic feature maps.
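A minimal numpy sketch of that DETR-style idea (invented shapes, random stand-ins for trained weights): the queries are a parameter matrix learned once during training, so the same fixed set of "questions" is asked of every scene, while the answers change with the scene:

```python
import numpy as np

rng = np.random.default_rng(7)
D, NQ = 16, 4

# Learnable queries: trained once, then shared across every scene the car sees.
# (Random here, purely as a stand-in for trained parameters.)
learned_queries = rng.normal(size=(NQ, D))

def cross_attend(queries, scene):
    """One cross-attention pass: queries read from the scene tokens."""
    scores = queries @ scene.T / np.sqrt(D)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ scene

scene_a = rng.normal(size=(20, D))   # two different driving scenes
scene_b = rng.normal(size=(20, D))

emb_a = cross_attend(learned_queries, scene_a)
emb_b = cross_attend(learned_queries, scene_b)
# Same queries, different scenes -> different structured outputs
print(np.allclose(emb_a, emb_b))
```

That separation — fixed learned questions, scene-dependent answers — is what makes the output structured and navigation-ready rather than a generic feature map.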

What this means for the future of Waymo's robotaxi maps

For Waymo, the ability to generate compact, query-targeted scene representations matters because it potentially replaces or augments the painstaking process of pre-building HD maps of every street the car will ever drive. If the car can reliably build its own real-time map from sensor data, it becomes less dependent on those expensive, frequently outdated static maps — which is a genuine bottleneck to scaling a robotaxi fleet into new cities.

More broadly, applying transformer-based attention to AV perception is a bet that the same scaling laws that made LLMs so capable will also improve autonomous driving. You wouldn't notice any of this as a Waymo passenger — but under the hood, it represents a meaningful architectural shift in how the car understands the world around it.

Editorial take

This is a technically meaningful filing, not marketing fluff. Using transformer decoders with learned queries for AV scene understanding is a real architectural choice with real tradeoffs — it aligns Waymo's perception stack more closely with the broader ML research mainstream, which makes it easier to benefit from advances in foundation models. Whether it beats their existing approaches in production is the interesting open question.

Get one Big Tech patent every Sunday

Plain English, intelligent commentary, no hype. Free.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice. Patentlyze may earn a commission if you click an affiliate link and make a purchase. This doesn't affect what we cover or how we cover it.