Microsoft · Filed Feb 16, 2026 · Published Jun 25, 2026 · verified — real USPTO data

Microsoft Patent Shrinks AI Memory Costs With Live Compression

Every word an AI generates requires it to remember everything it has already read. Microsoft's new patent describes a system that automatically compresses that memory as it grows, so the model can keep going without running out of room.

Microsoft Patent: KV Cache Quantization for AI Models — figure from US 2026/0178191 A1
FIG. 1A — rendered from the official USPTO publication PDF.
Publication number US 2026/0178191 A1
Applicant Microsoft Technology Licensing, LLC
Filing date Feb 16, 2026
Publication date Jun 25, 2026
Inventors Ori LASLO, Gilad KIRSHENBOIM
CPC classification 711/154
Grant likelihood Medium
Examiner CENTRAL, DOCKET (Art Unit OPAP)
Status Docketed New Case - Ready for Examination (Mar 23, 2026)
Parent application is a Continuation of 18772712 (filed 2024-07-15)
Document 20 claims

How Microsoft shrinks AI working memory mid-conversation

Imagine reading a very long book and having to keep notes on every sentence before you can answer a question about it. Now imagine your notepad filling up halfway through the book. That's essentially the problem AI chat systems face: the longer the conversation, the more notes they need to hold in fast memory, and fast memory is expensive.

Microsoft's patent describes a system that watches those notes accumulate and, once a batch reaches a set size, automatically compresses the whole batch into a smaller form before moving it to a bigger, cheaper storage area. The original detail is reduced slightly, but the key meaning is preserved, and the model can still read those compressed notes when it needs them.

The system also handles an edge case: if a batch isn't quite full yet, it temporarily pads it with placeholder values so the compression math works out correctly, then discards those placeholders afterward. The result is more efficient memory use without the model needing to stop and wait.

How the block-quantization pipeline batches and compresses values

The patent addresses a specific bottleneck in transformer-based generative AI models: the KV cache (key-value cache). Every time a model processes a token (a word or word-fragment), it stores numerical representations of context in this cache so it doesn't have to recompute them. For long documents or long conversations, that cache can consume enormous amounts of fast GPU memory.

Microsoft's system introduces a pipeline that compresses those cached values in blocks:

  • New value-matrix entries are written to a first memory (fast, on-chip or high-bandwidth memory).
  • Once a configurable number of entries accumulates (the group size parameter), the system applies data quantization (a process that represents full-precision floating-point numbers using fewer bits, trading a tiny amount of precision for a large reduction in storage size).
  • The compressed block is then moved to a second memory (larger, lower-cost storage), freeing up the fast memory for new entries.
  • When the model needs to look back at earlier context, it retrieves the compressed blocks from second memory.

The patent also describes a padding mechanism: if the current batch of entries is smaller than the required group size, the system inserts dummy placeholder values to reach the threshold, runs the compression, then discards the placeholders. This prevents the compression algorithm from failing or producing corrupt results on undersized batches.

What cheaper AI memory means for cloud and on-device models

The KV cache is one of the biggest memory bottlenecks stopping AI models from handling very long conversations or very long documents efficiently. Compressing it in structured blocks, rather than all at once at the end, means the model can process far more context on the same hardware. For cloud providers, that translates directly into lower cost per query.

For on-device AI (think a model running on a phone or a laptop), this kind of technique is even more important. Those devices have strict memory limits, and any system that lets a model stay within those limits for longer conversations without degrading quality is worth watching. This patent suggests Microsoft is thinking carefully about both ends of the deployment spectrum.

Editorial take

This is infrastructure-level work, not a flashy product announcement, but it's the kind of patent that enables every product built on top of a long-context AI model. The padding trick for undersized batches is a small but telling detail: it shows this is a real engineering design meant to handle messy real-world conditions, not a theoretical sketch. If Microsoft ships this in Azure AI or in its Copilot stack, the main beneficiary will be cost, not capability, but cost is often what determines whether a product is viable at scale.

Get one Big Tech patent every Sunday

Plain English, intelligent commentary, no hype. Free.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.