Gesture and Voice Can Now Work Together to Control AI Assistants
Talking to an AI assistant is useful — but what if you could point at something while you speak, and the AI understood both at once? That's the core idea in Meta's latest patent filing.
What Meta's combined gesture-and-voice assistant actually does
Imagine wearing a pair of smart glasses and wanting to ask your AI assistant about a restaurant across the street. Instead of just saying 'what's that place?', you could point at it while you speak, and the assistant would understand both signals together — your gesture and your words — as one combined request.
That's exactly what this Meta patent describes. The system takes a hand gesture from you, figures out what you probably mean by it using a model trained specifically on your gestures, then combines that with whatever you said out loud to carry out the task and give you a result.
The key detail is the 'personalized' part. Rather than relying on a one-size-fits-all gesture dictionary, the system learns your particular way of gesturing over time. So if your 'point at something' looks a little different from the average person's, it still gets you.
How Meta's system reads gestures and speech together
The patent describes a pipeline that handles two inputs simultaneously: a gesture-input (a physical hand or body movement captured by the device's sensors) and a speech-input (what you said out loud).
On the gesture side, a personalized gesture-classification model — meaning an AI model fine-tuned to recognize your specific movements — interprets the gesture and assigns it an intent (i.e., what you were trying to do). Think of intent as the machine's best guess at the goal behind your action, like 'select that object' or 'dismiss this notification.'
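To make the personalization idea concrete, here is a minimal sketch of one way a gesture classifier could adapt to a single user. It is not taken from the patent: the nearest-template approach, the two-number feature vectors, the intent names, and the `PersonalizedGestureClassifier` class are all assumptions chosen for illustration.

```python
import math

# Hypothetical "average user" templates: intent -> feature vector.
# Real systems would use far richer features from the device's sensors.
GENERIC_TEMPLATES = {
    "select_object": [1.0, 0.0],
    "dismiss": [0.0, 1.0],
}


class PersonalizedGestureClassifier:
    """Nearest-template classifier whose templates drift toward one user's style."""

    def __init__(self, templates, learning_rate=0.5):
        # Copy the generic templates so each user gets an independent model.
        self.templates = {intent: list(vec) for intent, vec in templates.items()}
        self.learning_rate = learning_rate

    def classify(self, features):
        # Pick the intent whose template is closest to the observed gesture.
        return min(
            self.templates,
            key=lambda intent: math.dist(self.templates[intent], features),
        )

    def adapt(self, features, intent):
        # Move the template for `intent` part of the way toward this example.
        lr = self.learning_rate
        old = self.templates[intent]
        self.templates[intent] = [(1 - lr) * t + lr * f for t, f in zip(old, features)]
```

In this toy setup, a user whose pointing gesture looks like `[0.4, 0.6]` is first misread as "dismiss". After one call to `adapt([0.4, 0.6], "select_object")`, the same gesture is classified as "select_object", while the generic templates stay unchanged for everyone else.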
Once the intent is determined from the gesture, the system combines it with your spoken words to execute one or more tasks. The results are then sent back to whatever device you're using — presumably glasses, a headset, or a phone — and presented to you.
- Gesture-input: captured and classified by a user-specific AI model
- Speech-input: processed alongside the gesture to complete the full request
- Personalized model: adapts to how each individual user naturally gestures, rather than requiring standardized movements
- Task execution: runs based on the combined intent, not just one input channel
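The steps above can be sketched as a small function that takes the gesture's intent and the spoken words and turns them into one task. This is a rough illustration under stated assumptions, not the patent's method: the `GestureResult` type, the `target` field, the keyword matching, and the task strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GestureResult:
    intent: str  # e.g. "select_object", as output by the gesture model
    target: str  # hypothetical: whatever the gesture was resolved to point at


def execute(gesture: GestureResult, speech: str) -> str:
    """Combine the gesture intent with transcribed speech into a single task."""
    words = speech.lower()
    if gesture.intent == "select_object" and "what" in words:
        # "What's that place?" plus a pointing gesture: describe the target.
        return f"describe:{gesture.target}"
    if gesture.intent == "select_object" and "navigate" in words:
        return f"navigate_to:{gesture.target}"
    if gesture.intent == "dismiss":
        return "dismiss_notification"
    # Neither channel alone is enough, so ask the user instead of guessing.
    return "ask_for_clarification"
```

Calling `execute(GestureResult("select_object", "restaurant across the street"), "What's that place?")` yields a describe task for the restaurant. Neither input could produce that on its own: the speech never names the place, and the gesture never says what to do with it.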
What this means for Meta's AR glasses ambitions
For Meta's smart glasses (the Ray-Ban Meta line and whatever follows it), this kind of multimodal input is almost a necessity. A device you wear on your face, with little or no screen, can't rely on tapping or typing, so combining what you say with what you do physically is the most natural replacement. This patent suggests Meta is building the underlying AI infrastructure to make that feel intuitive rather than clunky.
The personalization angle is also worth noting. Systems that learn your gesture style are more likely to feel natural in everyday use, and feeling natural has been a long-standing challenge for gesture-based interfaces. If Meta can make this reliable enough across different users and contexts, it would be a meaningful step toward hands-free computing that doesn't require you to memorize a fixed set of robot-like hand signals.
This is a solid infrastructure patent for Meta's AR ambitions: not a flashy consumer feature announcement, but the kind of foundational AI plumbing that would need to exist before gesture-driven glasses could feel genuinely usable. Personalization is the most interesting wrinkle. Gesture systems that demand fixed, standardized movements tend to feel rigid, so a model that adapts to how you move is a smarter approach.
Editorial commentary on a publicly published patent application. Not legal advice.