Nvidia · Filed Nov 27, 2024 · Published May 28, 2026 · verified — real USPTO data

Nvidia's New Patent Teaches Machines to See Objects by Listening to Them

By Patentlyze Team · Updated Jun 5, 2026

Nvidia is patenting a way to use neural networks to extract image-like object features directly from audio — essentially letting a system 'see' objects by listening to them. It's an unusual cross-modal trick that could matter for robotics and autonomous perception.

FIG. 1A — rendered from the official USPTO publication PDF.

Publication number US 2026/0148747 A1

Applicant NVIDIA Corporation

Filing date Nov 27, 2024

Publication date May 28, 2026

Inventors Xianchao Wu, Scott Nunweiler

CPC classification 704/500

Grant likelihood Medium

Examiner DESIR, PIERRE LOUIS (Art Unit 2659)

Status Docketed New Case - Ready for Examination (Jan 2, 2025)

Document 20 claims

AI/ML

What Nvidia's audio-to-image feature trick actually does

Imagine a self-driving car or a robot that can figure out what an object is — not just by looking at it with a camera, but by listening to the sounds it makes. Nvidia's new patent describes exactly that kind of system: a neural network that takes audio signals and produces image object features, the same kind of descriptors normally extracted from a photograph.

The twist here is that the system handles audio recorded at variable sample rates, meaning it doesn't need every recording to arrive at one fixed, standardized rate. That flexibility is important in the real world, where microphones and sensors are rarely configured identically.

In plain terms, you could think of it like teaching an AI to recognize a cup not because it can see the cup, but because it can hear the cup — whether you tap it, drop it, or drag it across a surface.

How the neural net maps variable-rate audio to image features

The patent describes a processor containing circuits that feed one or more audio signals into one or more neural networks. The networks' job is to output image object features — structured numerical representations (like embeddings or feature vectors) that traditionally come from processing visual data like photos or video frames.

The key engineering detail is support for a variable sample rate on the incoming audio. Sample rate refers to how many audio measurements are captured per second (for example, 44,100 samples per second, or 44.1 kHz, for CD-quality audio). Variable sample rate means the system can accept audio that isn't locked to one fixed rate, which is a practical advantage when inputs come from diverse hardware or heterogeneous real-world pipelines.
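To make the sample-rate point concrete, here is a minimal Python sketch of one conventional way to cope with mixed rates: resampling every clip to a common rate before it reaches a model. This is an illustration only. The patent as summarized here does not say how it handles variable rates, and the names `resample_linear`, `tone`, and `TARGET_RATE` are hypothetical.

```python
import math

def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal from src_rate to dst_rate by linear interpolation.

    Illustrative only: real pipelines low-pass filter before downsampling
    to avoid aliasing, which this sketch does not do.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    duration = len(samples) / src_rate            # seconds of audio
    n_out = max(1, round(duration * dst_rate))    # sample count at the new rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate             # fractional index into the source
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

def tone(freq_hz, rate_hz, seconds):
    """Generate a sine tone sampled at rate_hz (hypothetical test input)."""
    n = round(rate_hz * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

# The same 0.1 s tone captured by two devices at different sample rates.
clip_a = tone(440.0, 44_100, 0.1)   # CD-quality rate
clip_b = tone(440.0, 8_000, 0.1)    # telephone-quality rate

TARGET_RATE = 16_000                # assumed common rate, chosen for illustration
a16 = resample_linear(clip_a, 44_100, TARGET_RATE)
b16 = resample_linear(clip_b, 8_000, TARGET_RATE)
```

After resampling, both clips contain the same number of samples for the same duration, so a downstream network sees a consistent time scale regardless of the recording device. A system that accepts variable rates natively would avoid this preprocessing step.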

The underlying idea is cross-modal feature generation — using one sensory modality (sound) to produce representations normally associated with another (vision). This is related to a broader research area called audio-visual correspondence, where AI systems learn that certain sounds reliably correlate with certain visual objects or scenes.
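A rough sketch of what audio-visual correspondence looks like in practice, assuming audio and image features live in a shared embedding space where similar things sit close together. The vectors, labels, and function names below are made up for illustration and are not taken from the patent.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match(audio_embedding, image_features):
    """Return the label whose image feature is closest to the audio embedding."""
    return max(
        image_features,
        key=lambda label: cosine_similarity(audio_embedding, image_features[label]),
    )

# Hypothetical image-domain feature vectors for three objects.
IMAGE_FEATURES = {
    "ceramic cup": [0.9, 0.1, 0.2],
    "metal pan": [0.1, 0.9, 0.3],
    "cardboard box": [0.2, 0.2, 0.9],
}

# Hypothetical vector a network might produce from the sound of a tapped cup.
audio_embedding = [0.8, 0.2, 0.1]

match = best_match(audio_embedding, IMAGE_FEATURES)
```

Real systems use learned embeddings with hundreds or thousands of dimensions, but the matching step is often this simple: find the nearest visual representation to what was heard.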

The patent is light on architectural specifics in its published form, but the claim structure suggests the core novelty is in the combination of neural-network-driven processing and variable-rate audio tolerance for producing image-domain outputs.

What this means for audio-visual AI perception systems

For robotics and autonomous systems — areas where Nvidia's compute platforms are heavily deployed — the ability to infer visual object properties from audio alone could fill gaps when cameras are occluded, lighting is poor, or visual sensors fail. A robot on a factory floor, for instance, might confirm what an object is by listening to how it sounds when contacted, cross-checking that against its vision system.
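The factory-floor scenario can be sketched as a simple decision rule. This is a hypothetical illustration of cross-checking two perception sources, not logic described in the patent; `Guess`, `cross_check`, and the 0.6 threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Guess:
    label: str
    confidence: float  # 0.0 to 1.0

def cross_check(vision: Optional[Guess], audio: Optional[Guess],
                min_conf: float = 0.6) -> str:
    """Combine a vision guess and an audio-derived guess into one label.

    Returns "unknown" when both sources are weak or when two confident
    sources disagree.
    """
    vision_ok = vision is not None and vision.confidence >= min_conf
    audio_ok = audio is not None and audio.confidence >= min_conf
    if vision_ok and audio_ok:
        # Both sources are confident: accept only if they agree.
        return vision.label if vision.label == audio.label else "unknown"
    if vision_ok:
        return vision.label
    if audio_ok:
        # Camera occluded, dark, or failed: fall back to the audio-derived guess.
        return audio.label
    return "unknown"

# Camera is occluded (low confidence), but the contact sound is distinctive.
result = cross_check(Guess("metal pan", 0.2), Guess("ceramic cup", 0.9))
```

Here the audio path fills the gap left by an unreliable camera, which is the kind of redundancy the paragraph above describes.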

More broadly, this kind of cross-modal AI is increasingly important as systems move beyond single-sensor pipelines. If Nvidia can bake this capability into its GPU and inference hardware at a low level, it becomes a building block for richer multi-modal perception in everything from autonomous vehicles to embodied AI agents — without requiring developers to stitch together separate audio and vision models themselves.

Editorial take

The abstract and claims here are unusually sparse for a company with Nvidia's engineering depth, which makes it hard to judge how differentiated the actual implementation is. The core concept, deriving image features from audio using neural networks, is real and active in academic research, but the variable-sample-rate framing feels more like an engineering constraint being claimed than a fundamental insight. Worth a bookmark if you're tracking Nvidia's multi-modal AI stack, but this one needs a fuller disclosure before you get excited.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.
