AMD · Filed Dec 23, 2024 · Published Jun 25, 2026 · verified — real USPTO data

AMD Patent Reduces Hardware Storage Costs by Generating AI Memory Data On Demand

Running a large AI model eats memory fast, and one of the biggest culprits is a cache of intermediate calculations the model carries along with every word it generates. AMD's new patent application proposes discarding much of that cache and recomputing it on the fly instead.

Figure: FIG. 1A of US 2026/0178832 A1, rendered from the official USPTO publication PDF.
Publication number: US 2026/0178832 A1
Applicant: ADVANCED MICRO DEVICES, INC.
Filing date: Dec 23, 2024
Publication date: Jun 25, 2026
Inventors: Shaizeen Aga, Akila Subramaniam, Suchita Pati
CPC classification: 704/9
Grant likelihood: Medium
Examiner: CHAVEZ, RODRIGO A (Art Unit 2658)
Status: Docketed New Case - Ready for Examination (Feb 5, 2025)
Document: 20 claims

How AMD decides what to discard and recompute each token

Generative AI models such as large language models process text in layers, and each layer relies on two sets of vectors called keys and values (the KV in KV cache). These are intermediate calculations that summarize the context the model has seen so far. Normally, the keys and values computed for each token at each layer are kept in memory so that later tokens can attend to them without recomputing them.

AMD's patent application describes a different approach: selective recomputation. Instead of holding on to every KV vector after it is first used, the system discards those vectors. When the next token (the next word or word-piece) needs to be generated, the system recomputes only the specific KV vectors that the current layer requires, at the moment that layer needs them.

The operative word is selective. Not every vector gets thrown away; the system decides which ones are worth recalculating and which ones are cheap enough to store. That judgment call is what separates this from simply running the full model twice.

  • Generate a token using KV vectors from all relevant model layers.
  • Discard those vectors rather than holding them in the KV cache.
  • On the next token, recompute only the vectors needed, layer by layer, just before each layer consumes them.

The net effect is a significant reduction in peak memory footprint, trading some additional compute cycles for far less memory pressure.
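The three steps above can be illustrated with a toy sketch. This is not the method claimed in the application: it uses a made-up single-head attention stack, recomputes every key and value rather than a selected subset, and every name in it (`D`, `N_LAYERS`, `make_layers`, `step_cached`, `step_recompute`, `run_demo`) is hypothetical. It only shows that rebuilding one layer's keys and values just before that layer consumes them gives the same output as a conventional cache while holding fewer vectors at once.

```python
import math
import random

D = 4          # hidden size of the toy model (hypothetical)
N_LAYERS = 3   # number of layers in the toy model (hypothetical)

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def make_layers(seed=0):
    """Random query/key/value projections for each layer."""
    rng = random.Random(seed)
    return [{name: rand_matrix(rng, D, D) for name in ("wq", "wk", "wv")}
            for _ in range(N_LAYERS)]

def attend(q, keys, values):
    """Single-head softmax attention of one query over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(D)]

def step_cached(layers, cache, x):
    """Conventional decoding: append this token's K/V to a cache that is kept."""
    for layer, (keys, values) in zip(layers, cache):
        keys.append(matvec(layer["wk"], x))
        values.append(matvec(layer["wv"], x))
        out = attend(matvec(layer["wq"], x), keys, values)
        x = [a + b for a, b in zip(x, out)]  # residual connection
    return x

def step_recompute(layers, inputs):
    """Decode the last token in `inputs`, rebuilding K/V one layer at a time."""
    hidden = [list(x) for x in inputs]
    peak = 0
    for layer in layers:
        # Recompute this layer's keys and values just before they are consumed.
        keys = [matvec(layer["wk"], h) for h in hidden]
        values = [matvec(layer["wv"], h) for h in hidden]
        peak = max(peak, len(keys) + len(values))
        updated = []
        for t, h in enumerate(hidden):
            # Causal attention: token t only sees positions 0..t.
            out = attend(matvec(layer["wq"], h), keys[:t + 1], values[:t + 1])
            updated.append([a + b for a, b in zip(h, out)])
        hidden = updated
        # keys/values are rebound on the next iteration, so only one
        # layer's worth of K/V is alive at any time.
    return hidden[-1], peak

def run_demo(n_tokens=5, seed=1):
    layers = make_layers()
    rng = random.Random(seed)
    tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(n_tokens)]
    cache = [([], []) for _ in layers]
    cached_out = [step_cached(layers, cache, x) for x in tokens][-1]
    cached_vectors = sum(len(k) + len(v) for k, v in cache)
    recomputed_out, peak_vectors = step_recompute(layers, tokens)
    max_diff = max(abs(a - b) for a, b in zip(cached_out, recomputed_out))
    return max_diff, cached_vectors, peak_vectors

if __name__ == "__main__":
    print(run_demo())
```

With 5 tokens and 3 layers, the cached path ends up holding 30 key/value vectors, while the recompute path never holds more than 10 (one layer's worth), and both produce the same final hidden state. The cost is visible in `step_recompute`: it repeats the attention work for every earlier token on each step and still has to keep the token inputs around, which is the compute-for-memory trade the article describes. A selective scheme would sit between these two extremes.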

What this means for running large AI models on tighter hardware

Memory is often the main bottleneck for running large AI models on real hardware. At long context lengths or large batch sizes, the KV cache can become one of the largest consumers of GPU or accelerator memory during inference, and it scales with both model size and the length of the conversation. A technique that reduces that footprint without changing the underlying model weights is directly useful for deploying AI at scale or on smaller devices.
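To make that scaling concrete, here is a back-of-the-envelope estimate using the standard sizing rule for a full KV cache: two tensors (keys and values) per layer, per attention head, per token. The function name `kv_cache_bytes` and the model dimensions are illustrative assumptions, not figures from the patent application.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and every token."""
    # The leading 2 accounts for storing both a key and a value vector.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

if __name__ == "__main__":
    # Hypothetical model: 32 layers, 32 KV heads of dimension 128, 16-bit values.
    for seq_len in (4096, 32768):
        gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
        print(f"{seq_len:>6} tokens -> {gib:.1f} GiB")
```

For this assumed configuration the cache costs 0.5 MiB per token, so a single 4,096-token conversation needs 2 GiB and a 32,768-token conversation needs 16 GiB, before counting the model weights or any additional concurrent users.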

For AMD, this fits into the company's push to compete with Nvidia on AI inference workloads. If AMD accelerators can run the same models while using less memory, that is a practical advantage for data center customers weighing chip options. It also opens the door to running longer context windows on hardware that would otherwise run out of room.

Editorial take

This is infrastructure-level work, not a flashy product announcement, but it addresses a real and well-documented pain point in AI deployment. Memory constraints are one of the main reasons organizations have to buy more expensive or more numerous chips to serve large models. A credible solution here would matter in practice, not just on paper.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.