Meta · Filed Nov 17, 2025 · Published Jun 11, 2026 · verified — real USPTO data

Gesture and Voice Can Now Work Together to Control AI Assistants

By Patentlyze Team · Updated Jun 12, 2026

Talking to an AI assistant is useful — but what if you could point at something while you speak, and the AI understood both at once? That's the core idea in Meta's latest patent filing.

FIG. 1A — rendered from the official USPTO publication PDF.

Publication number: US 2026/0162417 A1
Applicant: Meta Platforms Technologies, LLC
Filing date: Nov 17, 2025
Publication date: Jun 11, 2026
Inventors: Paul Anthony Crook, Xiaohu Liu, Francislav P. Penov, Rajen Subba
CPC classification: 345/156
Grant likelihood: Low
Examiner: BOLOTIN, DMITRIY (Art Unit 2623)
Status: Docketed New Case - Ready for Examination (Mar 10, 2026)
Parent application: continuation of 18915864 (filed 2024-10-15)
Document: 21 claims

AR/VR

What Meta's combined gesture-and-voice assistant actually does

Imagine wearing a pair of smart glasses and wanting to ask your AI assistant about a restaurant across the street. Instead of just saying 'what's that place?', you could point at it while you speak, and the assistant would understand both signals together — your gesture and your words — as one combined request.

That's what this Meta patent application describes. The system captures your hand gesture, infers what you most likely mean by it using a model trained on your own gestures, then combines that intent with what you said out loud to carry out the task and return a result.

The key detail is the 'personalized' part. Rather than relying on a one-size-fits-all gesture dictionary, the system learns your particular way of gesturing over time. So if your 'point at something' looks a little different from the average person's, it still gets you.

How Meta's system reads gestures and speech together

The patent describes a pipeline that handles two inputs simultaneously: a gesture-input (a physical hand or body movement captured by the device's sensors) and a speech-input (what you said out loud).

On the gesture side, a personalized gesture-classification model — meaning an AI model fine-tuned to recognize your specific movements — interprets the gesture and assigns it an intent (i.e., what you were trying to do). Think of intent as the machine's best guess at the goal behind your action, like 'select that object' or 'dismiss this notification.'
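To make the idea of a personalized classifier concrete, here is a minimal sketch. The filing does not say which kind of model is used, so this swaps in a simple nearest-centroid classifier as an assumption; the class name, the two-number feature vectors, and the intent labels are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical generic gesture templates: intent -> feature vector.
# The two features (say, arm extension and wrist speed) are invented here.
GENERIC_TEMPLATES = {
    "select_object": [0.9, 0.1],
    "dismiss_notification": [0.2, 0.8],
}


class PersonalizedGestureClassifier:
    """Nearest-centroid classifier whose centroids drift toward one user's examples."""

    def __init__(self, templates, learning_rate=0.5):
        # Copy the templates so each user gets an independent model.
        self.centroids = {intent: list(vec) for intent, vec in templates.items()}
        self.learning_rate = learning_rate

    def classify(self, features):
        # Return the intent whose centroid is closest to the observed gesture.
        return min(
            self.centroids,
            key=lambda intent: math.dist(self.centroids[intent], features),
        )

    def adapt(self, features, intent):
        # Move the centroid for a confirmed intent toward this user's example.
        centroid = self.centroids[intent]
        for i, value in enumerate(features):
            centroid[i] += self.learning_rate * (value - centroid[i])


model = PersonalizedGestureClassifier(GENERIC_TEMPLATES)
relaxed_point = [0.5, 0.5]  # this user's loose, half-extended way of pointing

before = model.classify(relaxed_point)
for _ in range(3):
    # Three confirmed examples of how this user actually points.
    model.adapt([0.5, 0.4], "select_object")
after = model.classify(relaxed_point)

print(before, "->", after)
```

With the generic templates, the relaxed point is misread as a dismissal; after a few confirmed examples from the same user, the same movement is classified as a selection. A production system would presumably use a far richer model and sensor data, but the adapt-to-the-user loop is the part the article is describing.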

Once the intent is determined from the gesture, the system combines it with your spoken words to execute one or more tasks. The results are then sent back to whatever device you're using — presumably glasses, a headset, or a phone — and presented to you.

Gesture-input: captured and classified by a user-specific AI model
Speech-input: processed alongside the gesture to complete the full request
Personalized model: adapts to how each individual user naturally gestures, rather than requiring standardized movements
Task execution: runs based on the combined intent, not just one input channel
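The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not Meta's implementation: the function names, the `Result` type, the dictionary-shaped gesture, and the keyword matching on speech are all hypothetical stand-ins for components the filing describes only at a high level.

```python
from dataclasses import dataclass


@dataclass
class Result:
    task: str
    target: str
    message: str


def classify_gesture(gesture):
    # Stand-in for the personalized gesture-classification model.
    # Here a gesture is a dict that already names its kind and what it points at.
    intents = {"point": "select_object", "swipe_away": "dismiss_notification"}
    return intents.get(gesture["kind"], "unknown")


def resolve_request(gesture, speech):
    """Combine the gesture's intent with the spoken words into one task."""
    intent = classify_gesture(gesture)
    words = speech.lower()
    if intent == "select_object" and "what" in words:
        # The gesture supplies the referent that the speech leaves vague ("that place").
        target = gesture["target"]
        return Result("describe", target, f"Looking up details for {target}")
    if intent == "dismiss_notification":
        return Result("dismiss", "notification", "Notification dismissed")
    # Without a usable gesture, fall back to asking for clarification.
    return Result("clarify", "", "Which one do you mean?")


result = resolve_request(
    {"kind": "point", "target": "restaurant across the street"},
    "What's that place?",
)
print(result.message)
```

The point of the sketch is the division of labor: neither input is enough alone. "What's that place?" has no referent without the pointing gesture, and the pointing gesture has no question attached without the speech.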

What this means for Meta's AR glasses ambitions

For Meta's AR glasses (the Ray-Ban Meta line and whatever follows it), this kind of multimodal input is close to a necessity. A screenless device you wear on your face can't rely on tapping or typing, so combining what you say with what you do physically is the most natural replacement. This filing suggests Meta is building the underlying AI infrastructure to make that feel intuitive rather than clunky.

The personalization angle is also worth noting. Systems that learn your gesture style are more likely to feel natural in everyday use, and feeling natural is the long-standing challenge for gesture-based interfaces. If Meta can make this reliable enough across different users and contexts, it would be a meaningful step toward hands-free computing that doesn't require you to memorize a fixed set of robot-like hand signals.

Editorial take

This is a solid infrastructure filing for Meta's AR ambitions: not a flashy consumer feature announcement, but the kind of foundational AI plumbing that would need to exist before gesture-driven glasses could feel genuinely usable. The personalization angle is the most interesting wrinkle. Rigid, one-size-fits-all recognition is a common weakness of gesture systems, so a model that adapts to how you move is a smarter approach.


Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.
