IBM · Filed Nov 19, 2024 · Published May 21, 2026 · verified — real USPTO data

IBM Patents a Deduplicated Mixture-of-Experts System for Multi-Model AI Inference

Running multiple AI models simultaneously means loading a lot of redundant components into expensive GPU memory — IBM's patent describes a way to store overlapping parts just once and share them across every model that needs them.

[Figure: FIG. 1A from US 2026/0141275 A1, rendered from the official USPTO publication PDF.]
Publication number: US 2026/0141275 A1
Applicant: International Business Machines Corporation
Filing date: Nov 19, 2024
Publication date: May 21, 2026
Inventors: Umesh Deshpande, Travis Janssen, Swaminathan Sundararaman
CPC classification: 706/10
Grant likelihood: Medium
Examiner: CENTRAL, DOCKET (Art Unit OPAP)
Status: Docketed New Case - Ready for Examination (Dec 13, 2024)
Document: 20 claims

What IBM's shared expert deduplication actually does

Imagine a hospital with five specialists on call. If two departments both need a cardiologist, it's wasteful to hire one for each — you just share the same cardiologist. IBM's patent applies that same logic to AI models running on GPUs.

Today's large AI systems often use a design called Mixture of Experts (MoE), where a model is split into many specialized sub-models ("experts"), and only the relevant ones activate for a given task. When you're running several of these MoE models at the same time — say, for multiple customers — you may end up loading duplicate expert sub-models into GPU memory over and over.

IBM's approach identifies those duplicates, stores each one only once on the GPU, and then lets all the models that need it share that single copy. A central "gate" mechanism figures out which experts to route each incoming request to, regardless of which model originally owned that expert. The result is less GPU memory wasted on copies of the same thing.
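The store-once, share-everywhere idea can be sketched in a few lines of Python. This is a minimal illustration, not the patent's method: the summary above does not say how duplicates are detected, so the SHA-256 content hash is an assumption, and SharedExpertPool, register, and lookup are hypothetical names. Byte strings stand in for expert weights that would really live in GPU memory.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(weights: bytes) -> str:
    """Content hash of serialized weights (an assumed test for 'duplicate')."""
    return hashlib.sha256(weights).hexdigest()


@dataclass
class SharedExpertPool:
    # fingerprint -> the single stored copy of the weights
    store: dict[str, bytes] = field(default_factory=dict)
    # (model_id, expert_name) -> fingerprint of the copy that model refers to
    refs: dict[tuple[str, str], str] = field(default_factory=dict)

    def register(self, model_id: str, expert_name: str, weights: bytes) -> None:
        key = fingerprint(weights)
        # Keep the weights only the first time this content is seen.
        self.store.setdefault(key, weights)
        self.refs[(model_id, expert_name)] = key

    def lookup(self, model_id: str, expert_name: str) -> bytes:
        return self.store[self.refs[(model_id, expert_name)]]


pool = SharedExpertPool()
pool.register("model_a", "summarize", b"\x01\x02\x03")
pool.register("model_a", "translate", b"\x04\x05\x06")
pool.register("model_b", "summarize", b"\x01\x02\x03")  # same bytes as model_a's
pool.register("model_b", "classify", b"\x07\x08\x09")

print(len(pool.refs), "references,", len(pool.store), "stored copies")
```

Four expert references resolve to three stored copies, because both models point at the same "summarize" weights.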

How IBM's gate routes requests across deduplicated experts

The patent describes a shared MoE inference architecture designed to serve multiple clients simultaneously without redundantly loading duplicate expert sub-models into GPU memory.

Here's the setup: you have N clients, each sending a request to their own MoE model. Each MoE model has a set of experts (specialized neural sub-networks trained to handle specific task types) and a gate (a learned model that decides which experts should handle a given input). Normally, if two MoE models share an identical expert — trained on the same data, same weights — both models would load their own copy into the GPU.
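For readers new to MoE, the gate inside a single model can be sketched as top-k selection over per-expert scores. This is the gating scheme commonly used in MoE models generally, not something taken from IBM's application; top_k_gate and the example scores are hypothetical.

```python
import math


def top_k_gate(scores: list[float], k: int) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over only the selected scores, so the weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Four experts; a real gate would compute these scores from the input.
selected = top_k_gate([0.1, 2.0, -1.0, 1.5], k=2)
print(selected)
```

Only the selected experts run for that input, which is why a model's expert count can grow faster than its per-request compute.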

IBM's system adds a deduplication step: before inference, it identifies experts that are common across multiple models. Each duplicative expert is stored only once in GPU memory and referenced by all models that share it. The gate mechanism is extended to operate across all N models simultaneously, selecting from the combined (deduplicated) pool of experts.

  • Gate mechanism: receives all N requests and their model identifications, then selects E experts from the shared pool
  • Router: sends each request to the appropriate subset of those E experts
  • Execution: each expert processes only the tasks it was trained for, then returns responses routed back to the originating client
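The three steps above can be sketched as a toy serving loop. Everything here is an assumption made for illustration: the expert names, the MODEL_EXPERTS table, and a keyword-matching gate that stands in for the learned gate the patent describes. It also gates one request at a time, whereas the patent's gate is described as receiving all N requests together.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical shared pool: each expert exists once, whatever model uses it.
EXPERTS: dict[str, Callable[[str], str]] = {
    "summarize": lambda text: f"summary({text})",
    "translate": lambda text: f"translation({text})",
    "classify": lambda text: f"label({text})",
}

# Which pooled experts each model references (assumed output of deduplication).
MODEL_EXPERTS = {
    "model_a": ["summarize", "translate"],
    "model_b": ["summarize", "classify"],
}


def gate(model_id: str, payload: str) -> list[str]:
    """Stand-in for the learned gate: keyword match, not a trained scorer."""
    allowed = MODEL_EXPERTS[model_id]
    chosen = [e for e in allowed if e in payload]
    return chosen or allowed[:1]  # fall back to the model's first expert


def serve(requests: list[tuple[str, str, str]]) -> dict[str, list[str]]:
    """requests are (client_id, model_id, payload); returns responses per client."""
    # Gate + router: group requests under the pooled expert chosen for them.
    batches: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for client_id, model_id, payload in requests:
        for expert_id in gate(model_id, payload):
            batches[expert_id].append((client_id, payload))

    # Execution: each expert handles its batch; results return to the client.
    responses: dict[str, list[str]] = defaultdict(list)
    for expert_id, batch in batches.items():
        for client_id, payload in batch:
            responses[client_id].append(EXPERTS[expert_id](payload))
    return dict(responses)


result = serve([
    ("client_1", "model_a", "summarize this report"),
    ("client_2", "model_b", "summarize and classify this ticket"),
])
print(result)
```

Both clients' requests land in the same "summarize" batch even though they arrived for different models, which is the sharing the deduplicated pool makes possible.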

What this means for GPU memory costs in AI inference

GPU memory is one of the most expensive and constrained resources in AI inference today. MoE models are increasingly popular because they grow parameter count without a proportional increase in compute per input, since only a few experts run for each request. The catch is that every expert typically still has to sit in GPU memory, so when you're serving multiple tenants or running multiple model variants simultaneously, redundant expert storage quietly eats up that headroom. IBM's deduplication approach attacks that directly.

This matters most in multi-tenant inference environments: cloud AI services, enterprise AI platforms, or any setup where many customers share the same underlying hardware. If you're an infrastructure team running dozens of fine-tuned MoE variants for different business units, this kind of sharing could meaningfully reduce the number of GPUs you need to provision. It's not glamorous, but reducing GPU memory pressure saves real money.
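A back-of-the-envelope calculation shows how the savings scale. All numbers below are hypothetical and chosen only for illustration; the patent summary gives no sizes or sharing ratios, and expert_memory_gib is not a function from any library.

```python
def expert_memory_gib(num_models, experts_per_model, shared_experts, gib_per_expert):
    """Compare expert memory with and without storing shared experts once.

    Assumes `shared_experts` of each model's experts are identical across
    all models and the remaining experts are unique to each model.
    """
    naive = num_models * experts_per_model * gib_per_expert
    unique_per_model = experts_per_model - shared_experts
    deduplicated = (shared_experts + num_models * unique_per_model) * gib_per_expert
    return naive, deduplicated


# Hypothetical fleet: 12 fine-tuned variants, 8 experts each, 6 of them shared.
naive, dedup = expert_memory_gib(
    num_models=12, experts_per_model=8, shared_experts=6, gib_per_expert=2.0
)
print(f"naive: {naive} GiB, deduplicated: {dedup} GiB, saved: {1 - dedup / naive:.0%}")
```

Under these assumptions expert storage drops from 192 GiB to 60 GiB. The real benefit depends entirely on how many experts are actually identical across models, which will be lower when fine-tuning touches the expert weights themselves.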

Editorial take

This is solid, practical infrastructure work aimed squarely at the economics of running MoE models at scale in the cloud. It won't generate headlines, but the problem it solves (GPU memory bloat from duplicate expert weights across concurrent model instances) is real and getting more painful as MoE adoption grows. IBM is staking out IP in a space where the hyperscalers (Google, Microsoft, Amazon) likely face the same pressure.


Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.