Microsoft Patent Reuses Past Web Tasks to Lower Automation Costs
Every time an AI agent fills out a web form or clicks through a website, it typically stares at the screen and figures it out from scratch. Microsoft wants to change that by giving the agent a memory of how it handled the same task last time.
What Microsoft's web-task memory system actually does
Imagine you had an assistant who filled out the same expense report every month. The first time, they had to read every field carefully and figure it out. But if they just consulted their notes from last month, they could zip through it in a fraction of the time. That's the idea behind this Microsoft patent.
The system is designed for AI agents that automate tasks on websites: logging into a portal, submitting a form, or navigating a multi-step checkout. Instead of using computer vision to analyze every page every single time, the system checks whether it has completed a similar task before. If it has, it pulls up the old playbook and adapts it for the new request with a single call to a language model.
If something goes wrong and the adapted steps don't work, the agent falls back to its slower, look-and-click approach to rebuild its understanding. It's a practical speed-versus-reliability trade-off baked right into the design.
How the system matches tasks and replays stored actions
The patent describes a web task orchestration service that sits between a client application (say, a business tool) and a target website. When a task comes in written in plain English, the system converts that request into a vector embedding (a mathematical fingerprint that captures the meaning of the task) and searches a library of previously completed tasks for a close match.
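To make the matching step concrete, here is a minimal sketch in Python. It is not the patent's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `find_match`, `store`, and the 0.6 threshold are assumptions chosen purely for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    # A real system would call a learned embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def find_match(request: str, store: list, threshold: float = 0.6):
    """Return the most similar stored task record, or None if nothing is close enough."""
    query = embed(request)
    best, best_score = None, 0.0
    for record in store:
        score = cosine(query, embed(record["task"]))
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None

# Hypothetical task library with two earlier runs (action lists left empty here).
store = [
    {"task": "log in to the vendor portal as alice", "actions": []},
    {"task": "submit the monthly expense report", "actions": []},
]

match = find_match("log in to the vendor portal as bob", store)
miss = find_match("cancel my gym membership", store)
```

In this toy run, the request to log in as bob matches the stored alice login, while the unrelated gym request falls below the threshold and returns no match. The threshold is what decides whether the agent trusts its notes or starts from scratch.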
If a strong match is found, the system retrieves the stored sequence of browser actions from that earlier run and feeds it into a generative language model alongside the new request. The model adapts the old action sequence to fit the new details (different usernames, dates, form values, etc.) in a single pass, rather than generating instructions step by step while staring at screenshots.
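The single-pass adaptation can be sketched roughly like this. Everything here is an assumption for illustration: the prompt wording, the JSON action format, and the `fake_model` function, which returns a canned reply in place of a real language model.

```python
import json
from typing import Callable

def build_adaptation_prompt(new_request: str, record: dict) -> str:
    # Put the old task, its recorded actions, and the new request in one prompt.
    return (
        "A previous task was completed with the browser actions below.\n"
        f"Previous task: {record['task']}\n"
        f"Previous actions (JSON): {json.dumps(record['actions'])}\n"
        f"New task: {new_request}\n"
        "Return the full adapted action list as a JSON array."
    )

def adapt_actions(new_request: str, record: dict,
                  call_model: Callable[[str], str]) -> list:
    prompt = build_adaptation_prompt(new_request, record)
    reply = call_model(prompt)  # one model call for the whole sequence
    return json.loads(reply)

# Hypothetical stored record from an earlier run.
record = {
    "task": "log in to the vendor portal as alice",
    "actions": [
        {"type": "click", "target": "#username"},
        {"type": "type", "target": "#username", "value": "alice"},
        {"type": "click", "target": "#submit"},
    ],
}

calls = []

def fake_model(prompt: str) -> str:
    # Canned reply standing in for a real language model (assumption).
    calls.append(prompt)
    adapted = [dict(action) for action in record["actions"]]
    adapted[1]["value"] = "bob"
    return json.dumps(adapted)

adapted = adapt_actions("log in to the vendor portal as bob", record, fake_model)
```

The point of the sketch is the shape of the call: the whole stored sequence goes in, the whole adapted sequence comes out, and the model is consulted once rather than once per click.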
The key components are:
- Embedding model: converts task descriptions into vector representations so similar tasks can be found mathematically
- Task record store: a library of past completed tasks and their exact action sequences
- Adaptation step: a one-shot language model call that tweaks the stored sequence for the new request
- Vision fallback: if the adapted sequence fails, the system reverts to screenshot-by-screenshot analysis to complete the task and saves the new result for next time
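Put together, the components above suggest a control flow along these lines. This is a sketch under assumptions, not the patent's code: the helpers are passed in as plain functions, the name `run_task` is invented, and using the vision loop when no match is found at all is an inference from the description.

```python
from typing import Callable, Optional

def run_task(
    request: str,
    store: list,
    find_match: Callable[[str, list], Optional[dict]],
    adapt: Callable[[str, dict], list],
    execute: Callable[[list], bool],
    run_vision_loop: Callable[[str], list],
) -> str:
    """Try the fast replay path first; fall back to the vision loop if needed."""
    record = find_match(request, store)
    if record is not None:
        actions = adapt(request, record)  # one-shot adaptation of the old sequence
        if execute(actions):              # True means the replay completed the task
            return "replayed"
    # No match, or the adapted sequence failed: use the slow path.
    actions = run_vision_loop(request)
    # Save the fresh result so the next similar request can take the fast path.
    store.append({"task": request, "actions": actions})
    return "vision"
```

The return value only reports which path was taken. In this sketch a successful replay does not write a new record; whether the real system stores those runs too is not something the description above settles.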
The claim also describes the baseline vision-driven loop: capture a screenshot, generate a prompt, send it to a multi-modal language model, receive the next action, execute it, check whether the task is done, and repeat.
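That loop is simple enough to sketch in a few lines. The function names, the prompt wording, and the `max_steps` safety limit are assumptions added for illustration; the browser and the model are passed in as plain callables so the sketch stays self-contained.

```python
from typing import Callable

def vision_loop(
    task: str,
    capture_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes], dict],
    execute: Callable[[dict], None],
    is_done: Callable[[], bool],
    max_steps: int = 20,  # safety limit, not something the patent text specifies
) -> list:
    """Run the screenshot-by-screenshot loop and return the actions taken."""
    history = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        prompt = f"Task: {task}\nActions so far: {history}\nWhat is the next action?"
        action = ask_model(prompt, screenshot)  # one multi-modal call per step
        execute(action)
        history.append(action)
        if is_done():
            return history
    raise RuntimeError("task not finished within max_steps")
```

Note that the model is called once per step, which is exactly the cost the reuse path tries to avoid. The returned `history` is the kind of action sequence a task record store could keep for next time.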
What this means for AI agents that browse the web for you
AI agents that autonomously browse the web are among the most computationally expensive applications of large language models. Sending a screenshot to a vision model on every single click adds up fast, both in time and in API costs. A system that can recognize "I've done this before" and skip most of that analysis could make web automation agents significantly cheaper and faster to run at scale, which matters a lot if you're a business running thousands of automated workflows.
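A back-of-the-envelope count shows why this matters. The sketch below counts model calls only, not dollars, and rests on loud assumptions: every run of the task has the same number of steps, the first run uses the vision loop, and every later run finds a match and replays successfully with one adaptation call. The figures of 10 steps and 1,000 runs are hypothetical.

```python
def model_calls(steps_per_task: int, runs: int, reuse: bool) -> int:
    """Count model calls for repeated runs of one task (best-case reuse)."""
    if runs == 0:
        return 0
    if not reuse:
        # Vision loop every time: one multi-modal call per step, per run.
        return steps_per_task * runs
    # First run uses the vision loop; each later run needs one adaptation call.
    return steps_per_task + (runs - 1)

without_reuse = model_calls(10, 1000, reuse=False)  # 10000
with_reuse = model_calls(10, 1000, reuse=True)      # 1009
```

Real savings would be smaller than this best case, since some replays will fail and fall back to the vision loop, and the two kinds of call are not priced the same.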
For everyday users, this kind of technology is the infrastructure behind AI assistants that can book appointments, fill out forms, or manage accounts on your behalf. The better these agents get at reusing learned patterns, the more reliably they can handle repetitive web tasks without burning through compute or timing out mid-task.
This is a practical engineering patent, not a moonshot. Microsoft is solving a real cost problem with AI web agents: vision-model calls are expensive, and doing them on every action for every task is wasteful when the task has been done before. The embedding-based matching approach is well-established in other contexts, and applying it here as a caching layer for agent workflows is sensible rather than flashy. This plausibly feeds into Microsoft's Copilot and Power Automate product lines.
Editorial commentary on a publicly published patent application. Not legal advice.