Qualcomm · Filed Oct 28, 2025 · Published Jun 11, 2026 · verified — real USPTO data

Qualcomm Patents a Way to Keep On-Device AI Running Without Running Out of Memory

By Patentlyze Team · Updated Jun 12, 2026

Every time an AI model generates text, it has to remember what it's already said — and that memory fills up fast. Qualcomm's new patent describes a way to automatically prune the least useful memories before they crowd out the important ones.

FIG. 1A — rendered from the official USPTO publication PDF.

Publication number: US 2026/0161571 A1

Applicant: QUALCOMM Incorporated

Filing date: Oct 28, 2025

Publication date: Jun 11, 2026

Inventors: Junyoung PARK, Dalton James JONES, Matthew James MORSE, Raghavv GOEL, Mingu LEE, Christopher LOTT

CPC classification: 711/133

Grant likelihood: Medium

Examiner: CENTRAL, DOCKET (Art Unit OPAP)

Status: Docketed New Case - Ready for Examination (Nov 20, 2025)

Parent application: Claims priority from provisional application 63730291 (filed Dec 10, 2024)

Document: 33 claims

AI/ML

How Qualcomm's AI memory trimming actually works

Imagine you're reading a very long book and taking notes on a notepad. Eventually the notepad fills up. The smart move isn't to throw away notes at random — it's to cross out the ones that basically say the same thing as everything else, keeping the notes that are truly unique.

That's essentially what Qualcomm's patent describes for AI models. When a generative AI — like the kind that powers chatbots or text autocomplete — processes a long conversation, it stores a running record of each word or phrase it has handled. Over time, that record eats up memory, which is a real problem on phones and chips where memory is limited.

Qualcomm's system periodically checks which stored items look the most similar to the average of everything stored so far. Items that are too ordinary — too close to the average — get dropped first, because they're adding the least new information. The result is a leaner memory that keeps the unusual, high-value context and quietly discards the redundant filler.

How cosine similarity scores decide what gets dropped

At the heart of modern AI text generation is a structure called a KV cache (short for key-value cache). For every word or token the model processes, it stores two pieces of data: a key tensor (a numeric fingerprint that helps the model figure out what's relevant) and a value tensor (the actual content associated with that token). As a conversation grows longer, this cache grows too — eventually straining memory, especially on mobile hardware.
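To make the growth concrete, here is a rough back-of-the-envelope sketch in Python. The model shape (32 layers, 8 key-value heads, 128-dimensional heads, 2 bytes per number) is a hypothetical example chosen for illustration, not a figure from the patent or from any Qualcomm product, and real models also store keys and values per layer and per attention head, which the simplified description above glosses over.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    Each token stores one key vector and one value vector
    per attention head, per layer.
    """
    # The leading 2 accounts for the key plus the value.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model shape, for illustration only.
small = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=1_024)
large = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=32_768)

print(f"{small / 2**20:.0f} MiB at 1,024 tokens")   # 128 MiB
print(f"{large / 2**20:.0f} MiB at 32,768 tokens")  # 4096 MiB
```

Under these assumed numbers the cache grows linearly with conversation length, from 128 MiB to 4 GiB, which is why an eviction policy matters on a memory-limited device.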

Qualcomm's patent proposes a cache eviction policy driven by cosine similarity, a standard measure of how closely two lists of numbers point in the same direction: a score of 1.0 means identical direction, 0.0 means the two are orthogonal (unrelated), and -1.0 means opposite directions. The system computes an average key tensor across everything currently stored, then scores each individual key against that average.
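For readers who want to see the measure itself, a minimal Python sketch follows. It uses plain lists rather than real tensors, and returning 0.0 for an all-zero vector is a convention chosen here, not something the patent specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # about 1.0: same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # 0.0: unrelated
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # about -1.0: opposite
```

Note that the score ignores vector length: `[1.0, 2.0]` and `[2.0, 4.0]` count as the same direction even though one is twice as long.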

The logic: if a key scores very high similarity to the average, it's essentially redundant — it doesn't add much that the average isn't already capturing. Those high-similarity (low-distinctiveness) entries become eviction candidates and get dropped from memory first.

1. Compute the average of all stored key tensors in the cache
2. Score each key by how closely it matches that average (cosine similarity)
3. Evict keys whose scores cross a set threshold: too similar means too redundant
4. Keep the unusual, distinctive keys that carry the most unique context
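
The steps above can be sketched in a few lines of Python. This is an illustration of the idea as described in this article, not the patent's claimed implementation: the function names, the toy two-dimensional keys, the 0.95 threshold, and the choice to drop each evicted key's paired value are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def evict_redundant(keys, values, threshold):
    """Drop entries whose key is at least `threshold` similar to the average key.

    Returns (kept_keys, kept_values, evicted_indices).
    """
    if not keys:
        return [], [], []
    average = mean_vector(keys)                               # step 1
    scores = [cosine_similarity(k, average) for k in keys]    # step 2
    evicted = [i for i, s in enumerate(scores) if s >= threshold]  # step 3
    keep = [i for i, s in enumerate(scores) if s < threshold]      # step 4
    return [keys[i] for i in keep], [values[i] for i in keep], evicted


# Toy cache: three keys point in roughly the same direction, one is distinct.
keys = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]]
values = ["tok0", "tok1", "tok2", "tok3"]

kept_keys, kept_values, evicted = evict_redundant(keys, values, threshold=0.95)
print(evicted)      # [1, 2]
print(kept_values)  # ['tok0', 'tok3']
```

In this toy run the two keys closest to the average (similarity about 0.96) are evicted, while the outlier `[0.0, 1.0]` (about 0.38) survives. The whole policy costs one averaging pass and one dot product per stored key.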

This is a lightweight operation compared to more complex attention-based eviction schemes, making it a practical fit for constrained hardware like a phone's AI accelerator.

What this means for AI on phones and chips

On-device AI is one of Qualcomm's core bets — its Snapdragon chips power a huge portion of Android phones, and the company has been pushing AI processing onto the device itself rather than relying on cloud servers. The KV cache is one of the biggest memory bottlenecks for running large AI models locally, so a smarter eviction policy directly expands what's possible on a given chip.

For you as a user, this is the kind of plumbing work that could mean longer coherent conversations with an on-device AI assistant, or more capable AI features on mid-range phones that don't have as much memory to spare. It won't show up in a spec sheet, but it's the sort of optimization that quietly determines whether AI on your phone feels useful or frustrating.

Editorial take

This is unglamorous but genuinely important infrastructure work. The KV cache memory wall is a well-known problem in the AI field, and Qualcomm is in a better position than almost anyone to care about solving it cheaply: its chips run in billions of devices where memory is tight and cloud offload isn't always an option. The cosine-similarity approach is elegant in its simplicity. It doesn't require training a separate model or running expensive attention lookups to decide what to forget.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.
