Apple · Filed Dec 31, 2025 · Published May 14, 2026

Apple Patents Memory-Efficient Streaming Convolutions for Its Neural Engine

Running a deep neural network on a phone chip means constantly fighting a memory bottleneck. Apple's latest patent describes a way to stream data through multiple network layers without ever needing to hold the whole picture in memory at once.

FIG. 1A from US 2026/0134272 A1, rendered from the official USPTO publication PDF.
Publication number US 2026/0134272 A1
Applicant Apple Inc.
Filing date Dec 31, 2025
Publication date May 14, 2026
Inventors Sayyed Karen Khatamifard, Alexander J. Kirchhoff, Rohit K. Gupta, Jeffrey D. Marker, Thomas G. Anderl, Saman Naderiparizi, Chenfan Sun, Alon Yaakov, Husam Khashiboun, Ramana V. Rachakonda
U.S. classification 712/34
Grant likelihood Medium
Examiner Not yet assigned (docketed at OPAP, the Office of Patent Application Processing)
Status Docketed New Case - Ready for Examination (Feb 5, 2026)
Parent application Continuation of 17/745,032 (filed May 16, 2022)

How Apple's chip streams neural layers without running out of memory

Imagine you're trying to process a giant mosaic, but you only have a small table to work on. Instead of laying out the entire mosaic at once, you work on one tile, pass it along, and immediately start on the next. That's essentially what Apple is patenting here — a smarter way to move data through the layers of a neural network on its custom chips.

Neural networks are usually executed in distinct stages: one layer finishes its job, writes all its results to memory, then the next layer reads them back. That round trip is expensive in both memory and time. Apple's approach keeps only a slice of the data in fast, local memory at any given moment, processing it just enough to feed the next layer, then moving on.
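To put rough numbers on the memory saving, here is a back-of-envelope Python sketch comparing a fully materialized intermediate activation with the small band of rows a 3x3 convolution in the next layer actually needs. The tensor dimensions and data type are illustrative assumptions, not figures from the patent.

```python
# Back-of-envelope comparison: full intermediate tensor vs. a streamed band.
# All sizes here are illustrative assumptions, not values from the patent.

height, width, channels = 512, 512, 64   # hypothetical activation tensor
bytes_per_value = 2                       # fp16

full_tensor_bytes = height * width * channels * bytes_per_value

# A 3x3 convolution in the next layer only needs 3 finished rows at a time.
band_rows = 3
band_bytes = band_rows * width * channels * bytes_per_value

print(f"full intermediate tensor: {full_tensor_bytes / 2**20:.1f} MiB")  # 32.0 MiB
print(f"3-row streamed band:      {band_bytes / 2**20:.3f} MiB")         # 0.188 MiB
```

At these assumed sizes the resident band is well under one percent of the full tensor, which is exactly the kind of slack that lets a small on-chip buffer stand in for a large intermediate result.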

The result is that Apple's Neural Engine can run deeper, more capable AI models on devices like the iPhone or Apple Silicon Macs without requiring a huge amount of on-chip memory — which is both physically limited and power-hungry.

How Apple's data processor tiles tensors across two network layers

The patent describes a neural processor circuit made up of two key components: a neural engine circuit (the actual number-cruncher doing convolutions and other math-heavy operations) and a data processor circuit (a smarter memory manager that decides what data to keep close at hand).

The core idea is layer-fused streaming. Normally, Layer 1 would fully compute its output tensor (think of this as the complete results grid from one processing stage) and park it in memory before Layer 2 even starts reading it. Instead, the data processor stores only a portion of the first layer's input tensor — just enough to produce a partial output — then immediately makes that partial output available as input to the second, higher-level layer.

  • Partial input caching: Only the slice of data needed for the current computation window is kept in local memory.
  • Partial output forwarding: Results from Layer 1 are handed directly to Layer 2 as soon as they're ready, without waiting for the full layer to finish.
  • Hierarchical layer awareness: The system knows which layer is "above" which, allowing it to pipeline the data flow intelligently.

This is sometimes called operator fusion or layer fusion in ML compiler literature — the insight that you don't always need to materialize a full intermediate tensor if you can consume it in chunks.
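To make the chunk-by-chunk consumption concrete, here is a minimal NumPy sketch of the idea under simplifying assumptions: two stacked 3x3 "valid" convolutions on a single-channel image, where the second layer consumes rows of the first layer's output as soon as three of them exist, so only a three-row band of the intermediate tensor is ever held. The function names and shapes are illustrative, not taken from the patent, and this models the scheduling idea rather than Apple's hardware.

```python
# Minimal sketch of layer-fused streaming with two 3x3 convolutions.
# Hypothetical NumPy model of the idea, not Apple's hardware implementation.
import numpy as np

def conv3x3_row(image, kernel, r):
    """Compute output row r of a 'valid' 3x3 convolution (cross-correlation)."""
    h, w = image.shape
    out = np.empty(w - 2)
    for c in range(w - 2):
        out[c] = np.sum(image[r:r + 3, c:c + 3] * kernel)
    return out

def fused_two_layer_conv(image, k1, k2):
    h, w = image.shape
    out2 = np.empty((h - 4, w - 4))   # final output of two stacked valid 3x3 convs
    band = []                          # rolling window of layer-1 output rows
    for r1 in range(h - 2):            # produce layer-1 output one row at a time
        band.append(conv3x3_row(image, k1, r1))
        if len(band) == 3:             # enough rows for one layer-2 output row
            out2[r1 - 2] = conv3x3_row(np.stack(band), k2, 0)
            band.pop(0)                # evict the oldest row; keep only a 3-row band
    return out2

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16))
k1, k2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

# Reference: materialize the full intermediate tensor, then run layer 2 on it.
inter = np.stack([conv3x3_row(img, k1, r) for r in range(14)])
ref = np.stack([conv3x3_row(inter, k2, r) for r in range(12)])

assert np.allclose(fused_two_layer_conv(img, k1, k2), ref)
print("fused streaming matches the fully materialized reference")
```

The rolling three-row `band` plays roughly the role the patent assigns to the data processor circuit's local buffer: just enough of the lower layer's output to keep the higher layer fed, discarded as soon as it has been consumed.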

What this means for Apple's on-device AI ambitions

On-device AI is only as capable as the hardware allows. The biggest constraint on a phone or laptop chip isn't raw compute — it's memory bandwidth and capacity. Every byte you don't have to write to main memory and read back is energy saved and latency cut. Apple's Neural Engine already powers features like Face ID, real-time photo processing, and on-device language models, and this kind of memory optimization is exactly what lets those models get bigger without draining your battery faster.

For you as a user, this is the invisible plumbing behind snappier Siri responses, better computational photography, and the kind of on-device AI Apple has been promising won't need a cloud server. It's not a single feature — it's infrastructure that makes a whole class of future features possible.

Editorial take

This is firmly in the "unglamorous but essential" category of silicon patents. Streaming convolutions and layer fusion are well-studied problems in ML systems research, so Apple isn't inventing a new concept here — it's filing on a specific hardware implementation for its Neural Engine architecture. That specificity is what gives it teeth: controlling how tensor data flows through your own custom silicon is a genuine competitive moat, and it's why Apple's on-device AI consistently punches above its memory spec.

Source: Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.