IBM Patents a Deduplicated Mixture-of-Experts System for Multi-Model AI Inference
Running multiple AI models simultaneously means loading a lot of redundant components into expensive GPU memory — IBM's patent describes a way to store overlapping parts just once and share them across every model that needs them.
What IBM's shared expert deduplication actually does
Imagine a hospital with five specialists on call. If two departments both need a cardiologist, it's wasteful to hire one for each — you just share the same cardiologist. IBM's patent applies that same logic to AI models running on GPUs.
Today's large AI systems often use a design called Mixture of Experts (MoE), where a model is split into many specialized sub-models ("experts"), and only the relevant ones activate for a given task. When you're running several of these MoE models at the same time — say, for multiple customers — you may end up loading duplicate expert sub-models into GPU memory over and over.
IBM's approach identifies those duplicates, stores each one only once on the GPU, and then lets all the models that need it share that single copy. A central "gate" mechanism figures out which experts to route each incoming request to, regardless of which model originally owned that expert. The result is less GPU memory wasted on copies of the same thing.
How IBM's gate routes requests across deduplicated experts
The patent describes a shared MoE inference architecture designed to serve multiple clients simultaneously without redundantly loading duplicate expert sub-models into GPU memory.
Here's the setup: you have N clients, each sending a request to their own MoE model. Each MoE model has a set of experts (specialized neural sub-networks trained to handle specific task types) and a gate (a learned model that decides which experts should handle a given input). Normally, if two MoE models share an identical expert — trained on the same data, same weights — both models would load their own copy into the GPU.
IBM's system adds a deduplication step: before inference, it identifies experts that are common across multiple models. Each duplicate expert is stored only once in GPU memory and referenced by all models that share it. The gate mechanism is extended to operate across all N models simultaneously, selecting from the combined (deduplicated) pool of experts.
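To make the deduplication step concrete, here is a minimal Python sketch. It assumes duplicates are detected by hashing each expert's raw weights, which is an illustrative choice: the description above does not say how IBM's system identifies identical experts. The names `fingerprint`, `deduplicate`, `model_a`, and `model_b` are hypothetical, and small `array` objects stand in for real GPU tensors.

```python
import hashlib
from array import array


def fingerprint(weights: array) -> str:
    """Hash the element type and raw bytes so only bit-identical experts match.

    Content hashing is an assumption made for this sketch, not a detail
    taken from the patent.
    """
    h = hashlib.sha256()
    h.update(weights.typecode.encode())
    h.update(weights.tobytes())
    return h.hexdigest()


def deduplicate(models):
    """Return (pool, refs).

    pool: fingerprint -> the single stored copy of each unique expert
    refs: model id -> {expert name -> fingerprint in the pool}
    """
    pool = {}
    refs = {}
    for model_id, experts in models.items():
        refs[model_id] = {}
        for name, weights in experts.items():
            key = fingerprint(weights)
            pool.setdefault(key, weights)  # keep only the first copy seen
            refs[model_id][name] = key
    return pool, refs


# Two hypothetical models that share one bit-identical expert.
shared = array("f", [0.1, 0.2, 0.3, 0.4])
models = {
    "model_a": {"expert_0": shared, "expert_1": array("f", [1.0, 2.0, 3.0, 4.0])},
    "model_b": {"expert_0": array("f", shared), "expert_1": array("f", [5.0, 6.0, 7.0, 8.0])},
}

pool, refs = deduplicate(models)
print(len(pool))  # 3 unique experts stored instead of 4
```

Both models keep their own view of `expert_0`, but each view resolves to the same pool entry, so only one copy would need to occupy memory.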
- Gate mechanism: receives all N requests and their model identifications, then selects E experts from the shared pool
- Router: sends each request to the appropriate subset of those E experts
- Execution: each expert processes only the tasks it was trained for, then returns responses routed back to the originating client
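The three stages above can be sketched in a few lines of Python. This is a toy illustration, not the patent's implementation: the gate here is a simple keyword match rather than a learned model, the experts are plain functions rather than neural sub-networks, and every name (`Request`, `POOL`, `MODEL_EXPERTS`, `gate`, `route`, `execute`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    client_id: str
    model_id: str
    payload: str


# Shared, deduplicated pool: expert key -> callable standing in for a sub-network.
POOL = {
    "summarize": lambda text: f"summary({text})",
    "translate": lambda text: f"translation({text})",
    "classify": lambda text: f"label({text})",
}

# Experts each model may use; "summarize" is shared by both models.
MODEL_EXPERTS = {
    "model_a": ["summarize", "translate"],
    "model_b": ["summarize", "classify"],
}


def gate(requests):
    """Pick experts for every request from the shared pool.

    Toy rule (an assumption): choose allowed experts whose name appears in
    the payload, falling back to the model's first expert.
    """
    selections = []
    for req in requests:
        allowed = MODEL_EXPERTS[req.model_id]
        chosen = [e for e in allowed if e in req.payload] or allowed[:1]
        selections.append(chosen)
    return selections


def route(requests, selections):
    """Group requests by the expert that should process them."""
    batches = {}
    for req, chosen in zip(requests, selections):
        for expert_key in chosen:
            batches.setdefault(expert_key, []).append(req)
    return batches


def execute(batches):
    """Run each expert on its batch and collect responses per originating client."""
    responses = {}
    for expert_key, batch in batches.items():
        expert = POOL[expert_key]
        for req in batch:
            responses.setdefault(req.client_id, []).append(expert(req.payload))
    return responses


requests = [
    Request("client_1", "model_a", "summarize this report"),
    Request("client_2", "model_b", "summarize these notes"),
    Request("client_3", "model_b", "classify this ticket"),
]

selections = gate(requests)
batches = route(requests, selections)
responses = execute(batches)
print(responses["client_3"])  # ['label(classify this ticket)']
```

Note that `client_1` and `client_2` belong to different models yet land in the same `summarize` batch, which is the point of sharing one copy of that expert.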
What this means for GPU memory costs in AI inference
GPU memory is one of the most expensive and constrained resources in AI inference today. MoE models are increasingly popular precisely because they scale parameter count without a proportional increase in per-request compute, since only a few experts activate for each input. Every expert still has to sit in memory, though, so when you're serving multiple tenants or running multiple model variants simultaneously, redundant expert storage quietly eats up that memory headroom. IBM's deduplication approach attacks that directly.
This matters most in multi-tenant inference environments: cloud AI services, enterprise AI platforms, or any setup where many customers share the same underlying hardware. If you're an infrastructure team running dozens of fine-tuned MoE variants for different business units, this kind of sharing could meaningfully reduce the number of GPUs you need to provision. It's not glamorous, but reducing GPU memory pressure saves real money.
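For a rough sense of scale, the back-of-the-envelope calculation below compares expert memory with and without sharing. Every number in it is hypothetical and chosen only for illustration; it also assumes that all experts are the same size and that the same subset of experts is identical across every model, neither of which comes from the patent.

```python
def expert_memory_gib(num_models, experts_per_model, shared_per_model, gib_per_expert):
    """Return (naive, deduplicated) memory for expert weights, in GiB.

    Simplifying assumptions: every expert has the same size, and the same
    `shared_per_model` experts are bit-identical across all models.
    """
    naive = num_models * experts_per_model * gib_per_expert
    unique_experts = shared_per_model + num_models * (experts_per_model - shared_per_model)
    return naive, unique_experts * gib_per_expert


# Hypothetical fleet: 12 fine-tuned variants, 8 experts each, 6 of them unchanged
# from a common base model, 1.5 GiB of weights per expert.
naive, dedup = expert_memory_gib(
    num_models=12, experts_per_model=8, shared_per_model=6, gib_per_expert=1.5
)
print(f"{naive:.0f} GiB naive vs {dedup:.0f} GiB deduplicated")
# 144 GiB naive vs 45 GiB deduplicated
```

The savings grow with the number of models and the fraction of experts they have in common. If fine-tuning touches every expert, there is nothing left to share.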
This is solid, practical infrastructure work aimed squarely at the economics of running MoE models at scale in the cloud. It won't generate headlines, but the problem it solves (GPU memory bloat from duplicate expert weights across concurrent model instances) is real and getting more painful as MoE adoption grows. IBM is staking out IP in a space where the hyperscalers (Google, Microsoft, Amazon) are all quietly facing the same pressure.
Editorial commentary on a publicly published patent application. Not legal advice.