Qualcomm · Filed Dec 3, 2024 · Published Jun 4, 2026 · verified — real USPTO data

Qualcomm Patents a Way to Track 3D Objects Using Only 2D Camera Feeds

Most autonomous vehicle systems do heavy geometric lifting to convert camera images into a 3D "bird's-eye view" of the world. Qualcomm's newly published patent application tries to skip that step entirely, going from flat 2D video straight to 3D object positions.

FIG. 1A of US 2026/0154826 A1, rendered from the official USPTO publication PDF.
Publication number: US 2026/0154826 A1
Applicant: QUALCOMM Incorporated
Filing date: Dec 3, 2024
Publication date: Jun 4, 2026
Inventors: Narayanan Elavathur Ranganatha, Amin Ansari, Sai Madhuraj Jadhav
CPC classification: 382/103
Grant likelihood: Medium
Examiner: CENTRAL, DOCKET (Art Unit OPAP)
Status: Docketed New Case - Ready for Examination (Jan 7, 2025)
Document: 21 claims

How Qualcomm skips the 3D geometry math

Imagine a car surrounded by eight cameras, each capturing a flat 2D video stream. Right now, most self-driving software has to take all those flat images and stitch them together into a 3D overhead map of the car's surroundings before it can say "there's a pedestrian 4 meters ahead." That stitching step is computationally expensive and requires precise knowledge of exactly where each camera is mounted.

Qualcomm's patent describes a system that skips the overhead map entirely. Instead, it tracks objects across all the 2D camera feeds simultaneously, pulls out visual features from each view, and combines those features to infer where objects actually sit in 3D space — all in one pass. A special "spatial token" acts like a placeholder that gets filled in with 3D location information as the system processes each camera's input.

The result is a pipeline that's potentially lighter to run, more flexible about camera placement, and able to maintain consistent object tracking frame-to-frame without rebuilding a full 3D model every time.

How 2D features become 3D spatial parameters

The system starts with 2D multi-object tracking running in parallel across several cameras — think of it as each camera independently following every car, pedestrian, or cyclist it can see. For each tracked object, the pipeline extracts image features (learned visual descriptors that encode shape, texture, and motion).
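To make the per-camera tracking step concrete, here is a minimal sketch of one common way to follow objects in 2D: greedily matching last frame's track boxes to this frame's detections by intersection-over-union (IoU). The patent summary does not say which 2D tracker is used, so the `associate` function, the `min_iou` threshold, and the box format are all assumptions for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Greedily match existing tracks to new detections within one camera.

    tracks: dict mapping track_id -> last known box (hypothetical format)
    detections: list of boxes found in the current frame
    Returns (matches, unmatched): matches maps track_id -> detection index,
    unmatched lists detection indices that would start new tracks.
    """
    # Score every track/detection pair, best overlap first.
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched


# One camera, two known tracks, three detections in the new frame.
tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches, unmatched = associate(tracks, detections)
```

In this toy frame, each existing track picks up the detection that overlaps it most, and the third detection is left over as a candidate new track. Running the same routine independently per camera gives the parallel 2D tracks the rest of the pipeline builds on.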

When the same object appears in more than one camera view, those per-view features are aggregated — combined into a single, richer representation of that object. This is where the cross-view reasoning happens. The system also generates a spatial representation token for each object, built from the raw 2D tracking data and a set of reference vectors (geometric anchors that encode possible positions in space without requiring a full coordinate-frame transformation).
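The two ideas in this step can be sketched in a few lines of plain Python. Averaging is used here as the simplest possible aggregation, and the spatial token is built by softly weighting reference vectors according to the object's normalized 2D box. Both choices are assumptions made for illustration: the patent summary does not spell out the aggregation function or how the token is constructed, and `aggregate_views`, `normalize_box`, and `spatial_token` are hypothetical names.

```python
import math


def aggregate_views(view_features):
    """Average one object's feature vectors across the cameras that see it."""
    n = len(view_features)
    dim = len(view_features[0])
    return [sum(f[d] for f in view_features) / n for d in range(dim)]


def normalize_box(box, width, height):
    """Encode a pixel box (x1, y1, x2, y2) as (cx, cy, w, h) in [0, 1]."""
    x1, y1, x2, y2 = box
    return [((x1 + x2) / 2) / width, ((y1 + y2) / 2) / height,
            (x2 - x1) / width, (y2 - y1) / height]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_token(box, width, height, reference_vectors):
    """Toy spatial token: a softmax-weighted sum of reference vectors.

    Each reference vector is scored against the normalized box encoding,
    so the token leans toward the anchors that best match the 2D track.
    No camera pose or coordinate-frame transform is involved.
    """
    enc = normalize_box(box, width, height)
    scores = [sum(e * r for e, r in zip(enc, ref)) for ref in reference_vectors]
    weights = softmax(scores)
    dim = len(reference_vectors[0])
    return [sum(w * ref[d] for w, ref in zip(weights, reference_vectors))
            for d in range(dim)]


# The same pedestrian seen by two cameras, each giving a 3-dim feature.
fused = aggregate_views([[0.2, 0.4, 0.6], [0.4, 0.8, 1.0]])

# Hypothetical 4-dim reference vectors standing in for learned anchors.
refs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
token = spatial_token((100, 50, 300, 150), 400, 200, refs)
```

In a trained system the reference vectors and the scoring would be learned rather than hand-written, but the shape of the computation is the point: 2D track data goes in, a fixed-size token comes out.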

Attention operations (a mechanism borrowed from transformer neural networks that weighs how relevant different pieces of information are to each other) then analyze relationships between the aggregated image features and the spatial tokens to produce final 3D spatial parameters — position, size, and orientation in the real world.
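For readers who have not seen attention written out, the sketch below shows standard scaled dot-product cross-attention followed by a linear output head. It assumes the spatial tokens act as queries and the aggregated image features as keys and values; the patent summary does not specify that assignment, and the `regress_3d` head with its seven outputs and placeholder weights is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    attended = []
    for q in queries:
        # Relevance of each key to this query, normalized to sum to 1.
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return attended


def regress_3d(vector, head):
    """Hypothetical linear head with one output per row of `head`
    (here: x, y, z, width, length, height, yaw)."""
    return [dot(row, vector) for row in head]


# Toy inputs: 2 spatial tokens as queries, 3 aggregated features as keys/values.
spatial_tokens = [[1.0, 0.0], [0.0, 1.0]]
object_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(spatial_tokens, object_features, object_features)

# Untrained placeholder weights: 7 outputs, 2 inputs each.
head = [[0.1 * (i + 1), -0.05 * i] for i in range(7)]
params = [regress_3d(vec, head) for vec in attended]
```

Each entry of `params` is a seven-number vector standing in for one object's position, size, and heading. With untrained weights the numbers are meaningless; the sketch only shows how tokens and features flow through attention into a 3D output.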

Critically, the patent claims this works without computing a bird's-eye-view (BEV) feature map, the expensive intermediate 3D representation most current perception stacks rely on. That should make the approach more portable across different camera rig configurations.
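To see why geometry-based lifting is so sensitive to camera placement, consider the standard pinhole camera model that relates a 3D point to a pixel. The sketch below is textbook projection, not anything taken from the patent; the focal lengths, principal point, and the 2-degree mounting error are made-up values chosen for illustration.

```python
import math


def project(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates (pinhole model).

    rotation: 3x3 world-to-camera rotation as a list of rows
    translation: 3-vector world-to-camera translation
    fx, fy, cx, cy: focal lengths and principal point in pixels
    Returns (u, v), or None if the point is behind the camera.
    """
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(rotation, translation)
    )
    if zc <= 0:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def yaw_rotation(degrees):
    """Rotation about the camera's vertical (y) axis."""
    a = math.radians(degrees)
    return [[math.cos(a), 0.0, math.sin(a)],
            [0.0, 1.0, 0.0],
            [-math.sin(a), 0.0, math.cos(a)]]


pedestrian = (1.0, 0.5, 4.0)  # 4 m ahead, 1 m to the side
no_offset = (0.0, 0.0, 0.0)

# Correctly calibrated camera.
u_true, v_true = project(pedestrian, yaw_rotation(0.0), no_offset,
                         1000.0, 1000.0, 640.0, 360.0)

# Same camera, but the assumed mounting angle is off by 2 degrees.
u_off, v_off = project(pedestrian, yaw_rotation(2.0), no_offset,
                       1000.0, 1000.0, 640.0, 360.0)

pixel_shift = abs(u_off - u_true)
```

With these numbers a 2-degree error in the assumed yaw moves the projected pedestrian by tens of pixels horizontally. Pipelines that rely on this mapping to place image features onto a BEV grid inherit that error, which is why skipping the explicit geometric step is attractive.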

What this means for self-driving and robotics cameras

For autonomous vehicles and robots, perception pipelines are a major bottleneck for both compute cost and latency. BEV-based systems require precise camera calibration and significant processing overhead; a system that can produce accurate 3D tracking directly from 2D inputs could run on less powerful hardware or free up compute budget for other tasks. That's a real advantage for Qualcomm, whose Snapdragon Ride chips power automotive compute platforms for many automakers.

This approach also has a practical flexibility angle: if you change your camera layout, a BEV system needs expensive recalibration. A camera-configuration-agnostic tracker is easier to deploy across different vehicle models or robot form factors — which matters a lot when you're a chip vendor selling to dozens of hardware partners.

Editorial take

This is a genuinely interesting systems-level patent from Qualcomm, not a routine filing. BEV feature computation is a known pain point in the AV perception stack, so eliminating it is a meaningful goal, and the attention-based aggregation approach maps cleanly onto Qualcomm's existing AI accelerator hardware. Whether the accuracy holds up against full BEV pipelines in real-world conditions is the open question, but the architecture is worth watching.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.