Microsoft · Filed Nov 27, 2024 · Published May 28, 2026 · verified — real USPTO data

Microsoft Patents a Context-Aware System for Summarizing Mixed Text, Images, and Audio

By Patentlyze Team · Updated Jul 10, 2026

Imagine feeding an AI a meeting recording, a slide deck, and a chat transcript, and getting back a single, coherent summary tailored to whether you're using a phone, a screen reader, or a desktop dashboard. That's the core idea here.

Figure from the official USPTO publication.

Publication number: US 2026/0147833 A1

Applicant: Microsoft Technology Licensing, LLC

Filing date: Nov 27, 2024

Publication date: May 28, 2026

Inventors: Maurice DIESENDRUCK, Vijay MITAL, Harsh SHRIVASTAVA, Pramod K. SHARMA, Shima IMANI

CPC classification: 386/241

Grant likelihood: Medium

Examiner: YANG, NIEN (Art Unit 2484)

Status: Response to Non-Final Office Action Entered and Forwarded to Examiner (Apr 17, 2026)

Document: 20 claims

AI/ML

What Microsoft's mixed-modality summarizer actually does

Think about how much information comes at you in mixed formats: a Teams call with screen sharing, a Word doc with embedded images, or a news story with video clips and captions. Today's AI summarizers mostly handle one type of content at a time. Microsoft's patent application describes a system that processes multiple types of content at once (text, images, audio, video) and generates a single, coherent summary from all of it.

The twist is that the summary isn't one-size-fits-all. The system pays attention to who's asking and what device they're on. A summary delivered to a smartwatch might be text-only and three sentences long. The same content sent to a desktop app might include image thumbnails and a structured outline.

This means your summary of a two-hour product demo could look completely different depending on whether you're checking it on your phone during lunch or reviewing it in detail at your desk — without you having to ask for a different version.

How coresets compress multi-modal embeddings into summaries

The system takes in mixed-modality data, meaning content drawn from several formats at once, such as text, images, audio, and video. It then encodes all of that into a shared mathematical space called a joint embedding space (think of this as a common coordinate system where a sentence about a red car and a photo of a red car end up near each other, because they carry similar meaning).
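To make "near each other" concrete, here is a minimal sketch using cosine similarity, a common way to compare embeddings. The three vectors are hand-written, hypothetical stand-ins; a real system would get them from trained encoders, and the article does not say which similarity measure the application uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
text_red_car = [0.9, 0.1, 0.0]    # a sentence about a red car
image_red_car = [0.8, 0.2, 0.1]   # a photo of a red car
audio_rainfall = [0.0, 0.1, 0.9]  # an unrelated audio clip

same_meaning = cosine_similarity(text_red_car, image_red_car)
different_meaning = cosine_similarity(text_red_car, audio_rainfall)
```

With these made-up numbers the sentence and the photo score close to 1, while the unrelated audio clip scores close to 0, which is the behavior a joint embedding space is meant to produce across modalities.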

From that shared space, the system uses a technique called a coreset — a compressed, representative subset of all the data points (embeddings) that captures the most important information without keeping everything. Coresets are a well-established concept in computational geometry; the key insight here is applying them to mixed-modality content.
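As a rough illustration of what picking a representative subset can look like, the sketch below uses the greedy farthest-point (k-center) heuristic, one well-known way to build a coreset. It is an assumption chosen for brevity, not the construction described in the application, and the function name and sample points are hypothetical.

```python
import math

def greedy_k_center(points, k):
    """Return indices of k points chosen so every point lies near a chosen one.

    Farthest-point heuristic: start from an arbitrary point, then repeatedly
    add the point that is farthest from everything chosen so far.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [0]  # arbitrary starting point
    # Distance from each point to its nearest chosen point.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        farthest = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(farthest)
        nearest = [min(d, math.dist(p, points[farthest]))
                   for d, p in zip(nearest, points)]
    return chosen

# Hypothetical 2-d embeddings: two tight clusters and one outlier.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
coreset_indices = greedy_k_center(embeddings, k=3)
```

On this toy input the heuristic keeps one point from each cluster plus the outlier, so three points stand in for six without losing any region of the space.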

The system then generates a second coreset from the first, this time deliberately reducing the number of modalities per data point. For example, a joint text-image embedding might become text-only when the output device can't render images. This two-stage compression pipeline is driven by:

User-derived constraints — preferences, accessibility needs, role, or context
Output device constraints — screen size, bandwidth, supported formats
Time-varying constraints — what's relevant may change depending on when the summary is consumed
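The second stage can be sketched as a filter over the first coreset. Everything below is a simplified assumption: the function `reduce_modalities`, its parameters, and the item layout are hypothetical, and plain strings stand in for the per-modality embeddings a real system would carry. Time-varying constraints are left out to keep the example short.

```python
def reduce_modalities(first_coreset, device_formats, user_excluded=(), max_items=None):
    """Derive a smaller, lower-modality coreset from an existing one.

    device_formats: modalities the output device can present.
    user_excluded:  modalities the user has opted out of.
    max_items:      optional cap on how many items survive.
    """
    allowed = set(device_formats) - set(user_excluded)
    second_coreset = []
    for item in first_coreset:
        kept = {name: part for name, part in item["parts"].items()
                if name in allowed}
        if kept:  # drop items with nothing the target can present
            second_coreset.append({"id": item["id"], "parts": kept})
    return second_coreset if max_items is None else second_coreset[:max_items]

# Hypothetical first coreset; strings stand in for embeddings.
first_coreset = [
    {"id": "slide-3", "parts": {"text": "pricing summary", "image": "slide-3.png"}},
    {"id": "chart-1", "parts": {"image": "chart-1.png"}},
    {"id": "remark-7", "parts": {"text": "launch date", "audio": "clip-7.wav"}},
]

watch_coreset = reduce_modalities(first_coreset, device_formats={"text"}, max_items=2)
desktop_coreset = reduce_modalities(first_coreset, device_formats={"text", "image"})
```

The same first coreset yields two text-only items for the watch and three text-and-image items for the desktop, which mirrors the idea that one set of content can be shaped differently per device.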

Finally, the system generates the actual summary from the second coreset and routes it to the appropriate output device.

What this means for Copilot and enterprise content tools

For Microsoft, this is directly relevant to Copilot and its suite of productivity tools — Teams, OneNote, Word, and PowerPoint all deal in mixed content daily. A system that can intelligently summarize across modalities while adapting to device and user context would be a meaningful upgrade over current single-format summarizers. Enterprise customers in particular — who often deal with long meetings, dense presentations, and multi-format reports — would benefit most.

For you as a user, the practical upside is summaries that don't require you to specify what format you want. The system figures out what fits your situation. That said, the coreset-plus-constraints architecture, while technically interesting, is not entirely novel: the value here is in the integration and the adaptive constraint layer, not a single jaw-dropping breakthrough.

Editorial take

This is a solid, technically coherent patent application that tackles a real problem: AI summarizers today mostly handle a single format, and the world's content doesn't come that way. The coreset approach is mathematically grounded and the constraint-layering idea is genuinely useful for enterprise tools like Copilot. It's not flashy research, but it's the kind of infrastructure work that quietly makes productivity software feel much more capable.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.
