Nvidia · Filed Dec 20, 2024 · Published Jun 25, 2026 · verified — real USPTO data

Nvidia Patents an AI That Watches and Listens to a Video Before Answering Your Questions

By Patentlyze Team · Updated Jun 26, 2026

Many AI tools that let you ask questions about a video only look at the frames. Nvidia's newly published patent application describes a system that also listens to the audio, then combines both to give you a more complete answer.

FIG. 1A — rendered from the official USPTO publication PDF.

Publication number: US 2026/0178642 A1
Applicant: NVIDIA Corporation
Filing date: Dec 20, 2024
Publication date: Jun 25, 2026
Inventors: Mansata KAMAL, Vishesh Gupta, Rohit Singh
CPC classification: 715/254
Grant likelihood: Medium
Examiner: CENTRAL, DOCKET (Art Unit OPAP)
Status: Docketed New Case - Ready for Examination (Feb 3, 2025)
Document: 20 claims

AI/ML

What Nvidia's audio-visual question-answering AI actually does

Imagine uploading a video of a product demo and asking your AI assistant, "What problem does this solve?" If the AI only scans the frames, it might miss a key explanation the presenter spoke out loud. Nvidia's patent describes a system that analyzes both the visual content and the audio of a video before forming a reply.

The idea is that you ask a question about a video, and the AI pulls from what it saw and what it heard, not just one or the other. That means spoken context, background sounds, or narration can all factor into the answer, the same way a person watching and listening at the same time would understand more than someone who only skimmed the pictures.

This kind of approach could apply to tools like video search, media analysis platforms, or any assistant that lets you query a library of recorded content.

How the system combines audio and video into one AI answer

The patent describes a processing system that handles multimodal queries, meaning questions that draw on more than one type of media at once.

At the core of the system are two types of embeddings (compressed numerical representations of data that an AI can work with):

A first embedding representing the audio track of a video
A second embedding representing the visual frames of a video

When a user submits a question, the system combines representations of both embeddings and feeds them, along with the query text, into a model that generates a response. The model doesn't choose between audio or video; it uses a combined representation of both. This is sometimes called multimodal fusion (merging different data types before the AI reasons over them).
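As a rough illustration of the fusion step described above, here is a minimal Python sketch. It is not taken from the patent: the function names, the toy vectors, and the choice to fuse by normalizing each embedding and then concatenating are all assumptions made for the example. Real systems may fuse through learned projections or attention instead.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # leave an all-zero vector unchanged
    return [x / norm for x in v]


def fuse_embeddings(audio: Vector, visual: Vector) -> Vector:
    """Hypothetical fusion: normalize each modality, then concatenate them."""
    return l2_normalize(audio) + l2_normalize(visual)


# Toy embeddings (assumed values; real embeddings have hundreds of dimensions).
audio_embedding = [3.0, 4.0]
visual_embedding = [0.0, 5.0, 0.0]

fused = fuse_embeddings(audio_embedding, visual_embedding)
print(fused)  # [0.6, 0.8, 0.0, 1.0, 0.0]
```

The fused vector carries information from both the audio and the frames, so a downstream model sees a single combined representation rather than choosing one modality.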

The first claim is broad: it covers any processor setup that receives a user query about a video, retrieves those two embeddings, and produces an answer by processing the query against both data sources together.
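The three steps in that claim (receive a query, retrieve the two embeddings, generate an answer from both) can be sketched end to end. Everything named here is hypothetical: `EMBEDDING_STORE`, `answer_query`, and `stub_model` are stand-ins invented for this example, and the stub simply reports what it received rather than running a real language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class VideoEmbeddings:
    audio: Vector   # first embedding: the audio track
    visual: Vector  # second embedding: the visual frames


# Hypothetical in-memory store keyed by video ID (assumed for illustration).
EMBEDDING_STORE: Dict[str, VideoEmbeddings] = {
    "demo-video": VideoEmbeddings(audio=[0.1, 0.9], visual=[0.7, 0.2, 0.4]),
}


def answer_query(
    video_id: str,
    query: str,
    model: Callable[[str, Vector], str],
) -> str:
    """Retrieve both embeddings, combine them, and pass them with the query to a model."""
    embeddings = EMBEDDING_STORE[video_id]
    combined = embeddings.audio + embeddings.visual  # simple concatenation (assumption)
    return model(query, combined)


def stub_model(query: str, combined: Vector) -> str:
    """Placeholder for a generative model; only echoes its inputs."""
    return f"answer to {query!r} using {len(combined)} fused features"


reply = answer_query("demo-video", "What problem does this solve?", stub_model)
print(reply)  # answer to 'What problem does this solve?' using 5 fused features
```

Passing the model in as a callable keeps the sketch independent of any particular model or vendor API.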

What this means for video search and AI assistants

Right now, many AI video tools treat audio and video separately, or simply ignore the audio entirely. A system that fuses both means you get answers that reflect the full content of a recording, not just what was visible on screen. That's meaningful for anything from meeting recordings to surveillance footage to instructional videos, where what's said often matters as much as what's shown.

For Nvidia, whose AI infrastructure powers a large portion of the industry's model training and inference, a filing like this signals continued investment in multimodal AI. It fits alongside broader industry efforts to build AI assistants that understand video the way people naturally do: by watching and listening at the same time.

Editorial take

The abstract oversells this with phrases like 'holistic understanding' and 'nuanced analysis,' but the underlying idea is genuinely practical. Combining audio and video embeddings before answering a query is a real improvement over vision-only approaches, and Nvidia is well-positioned to build this into its AI platforms. The claim is broad enough that it may face prior art scrutiny, but the direction is sound.


Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.
