IBM Patents an AI Agent That Turns Plain-English Instructions Into Web Browser Actions
Imagine telling a computer 'submit my expense report in the HR portal' and having it figure out every click and form field on its own. That's the core idea behind IBM's latest patent filing.
What IBM's web-task automation system actually does
You know how tedious it is to walk through a multi-step web form, especially one buried inside enterprise software? IBM is working on a system where you just describe what you want done, and an AI agent figures out all the individual steps required to do it inside a web application.
The system works by studying existing recordings of how people navigate websites, called UI flows. It learns the building blocks of those journeys and builds a kind of map. When you give it a new instruction, a planning agent looks at that map, figures out a route, and hands it off to an execution agent that actually clicks through the steps.
There's also a built-in error checker that reviews the plan before anything is executed. If the plan has a structural problem (like asking for a step that doesn't exist), the system flags it and asks the planner to try again, rather than blindly running bad instructions.
How the planning agent maps words to browser steps
The patent describes a two-agent architecture designed to automate tasks inside web applications using natural-language instructions.
Step one is learning the map. The system ingests recorded UI flows (think: screen-capture walkthroughs of someone completing a task) and parses them into named states (where you are in a web app), actions (what you can do), and parameters (the specific values involved, like a date or a dollar amount). Each element gets a human-readable name that reflects its purpose, not just a technical identifier.
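To make step one concrete, here is a minimal Python sketch of turning recorded flows into named states, actions, and parameters. The patent filing does not publish a data format, so the record layout (`from`, `action`, `to`, `params`), the `Transition` class, and every state and action name below are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: str         # named state where the action starts
    action: str         # human-readable action name
    target: str         # named state the action leads to
    params: tuple = ()  # names of the values the action needs


def build_state_action_space(flows):
    """Collect the unique states and transitions seen across recorded flows."""
    states, transitions = set(), set()
    for flow in flows:
        for step in flow:
            states.update((step["from"], step["to"]))
            transitions.add(
                Transition(
                    step["from"],
                    step["action"],
                    step["to"],
                    tuple(step.get("params", ())),
                )
            )
    return states, transitions


# Hypothetical recording of one person filing an expense report.
flows = [[
    {"from": "hr_portal_home", "action": "open_expense_reports", "to": "expense_list"},
    {"from": "expense_list", "action": "create_report", "to": "expense_form"},
    {"from": "expense_form", "action": "fill_amount", "to": "expense_form",
     "params": ["amount"]},
    {"from": "expense_form", "action": "submit_report", "to": "confirmation"},
]]

states, transitions = build_state_action_space(flows)
```

Using sets means that a second recording of the same journey adds nothing new, while a recording of a different journey extends the map.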
Step two is planning. A planning agent (an AI model that reasons about sequences of steps) takes a user's plain-English utterance and searches the learned map, called the semantic UI state-action space (UI SAS), for a valid route that completes the requested task. Think of the UI SAS as a transit map where every station is a screen and every train line is an action.
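The transit-map picture can be sketched as a small graph search. In the patent the planner is an AI model, so the breadth-first search below is only a stand-in that shows what "finding a valid route" means. The `UI_SAS` dictionary, its state and action names, and the `plan_route` function are all hypothetical.

```python
from collections import deque

# Hypothetical UI SAS: each state maps to the (action, next_state) pairs it offers.
UI_SAS = {
    "hr_portal_home": [("open_expense_reports", "expense_list"),
                       ("open_profile", "profile")],
    "expense_list": [("create_report", "expense_form")],
    "expense_form": [("submit_report", "confirmation")],
    "profile": [],
    "confirmation": [],
}


def plan_route(sas, start, goal):
    """Return the shortest list of actions from start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, next_state in sas.get(state, []):
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, actions + [action]))
    return None  # no route exists in the learned map


plan = plan_route(UI_SAS, "hr_portal_home", "confirmation")
```

A real planner would also have to work out from the user's utterance which state counts as the goal. This sketch takes the goal as given.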
Step three is validation. Before anything runs, a syntactic error parser checks whether the generated action sequence is structurally valid. If it finds problems, it returns feedback to the planning agent so it can produce a corrected plan.
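One way to picture the check-and-retry loop is the Python sketch below. It assumes "structurally valid" means that every action exists in the state where the plan tries to use it. That is an interpretation, not the patent's definition. The `stub_planner` function is a hard-coded stand-in for the AI planner, and all names are hypothetical.

```python
# Hypothetical UI SAS: each state maps to the (action, next_state) pairs it offers.
UI_SAS = {
    "hr_portal_home": [("open_expense_reports", "expense_list")],
    "expense_list": [("create_report", "expense_form")],
    "expense_form": [("submit_report", "confirmation")],
    "confirmation": [],
}


def find_syntax_errors(sas, start, plan):
    """Walk the plan through the map and report the first step that does not fit."""
    state = start
    for index, action in enumerate(plan):
        options = dict(sas.get(state, []))
        if action not in options:
            return [f"step {index}: action '{action}' is not available in state '{state}'"]
        state = options[action]
    return []


def plan_with_feedback(planner, sas, start, utterance, max_attempts=3):
    """Ask the planner for a plan, feeding errors back until one validates."""
    feedback = []
    for _ in range(max_attempts):
        plan = planner(utterance, feedback)
        feedback = find_syntax_errors(sas, start, plan)
        if not feedback:
            return plan
    raise ValueError(f"no valid plan after {max_attempts} attempts: {feedback}")


def stub_planner(utterance, feedback):
    """Stand-in planner: a bad first draft, then a fix once it receives feedback."""
    if not feedback:
        return ["open_expense_reports", "approve_report"]  # second step does not exist here
    return ["open_expense_reports", "create_report", "submit_report"]


validated = plan_with_feedback(
    stub_planner, UI_SAS, "hr_portal_home", "submit my expense report"
)
```

Capping the number of attempts keeps a planner that never converges from looping forever, and nothing reaches the execution stage until the error list comes back empty.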
Step four is execution. A separate execution agent then carries out the validated plan inside the actual web application.
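As a rough illustration of the hand-off, the sketch below replays a validated plan through a driver object. The filing does not say how the execution agent talks to the browser, so `RecordingDriver` is a hypothetical stand-in that only logs what it is asked to do. The plan format and the action names are assumptions too.

```python
class RecordingDriver:
    """Hypothetical stand-in for a browser automation layer; it only logs calls."""

    def __init__(self):
        self.log = []

    def perform(self, action, params):
        # A real driver would click or type here; this one just records the request.
        self.log.append((action, dict(params)))


def execute_plan(driver, plan):
    """Carry out each planned step in order and return how many steps ran."""
    for step in plan:
        driver.perform(step["action"], step.get("params", {}))
    return len(plan)


# Hypothetical validated plan, with a parameter value attached to one step.
plan = [
    {"action": "open_expense_reports"},
    {"action": "fill_amount", "params": {"amount": "42.50"}},
    {"action": "submit_report"},
]

driver = RecordingDriver()
executed = execute_plan(driver, plan)
```

Keeping the executor this thin reflects the split the patent describes: the planning agent decides what to do, and the execution agent only carries it out.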
What this means for AI-driven workplace software
Enterprise software is full of repetitive, multi-step web workflows: filing requests, updating records, pulling reports. Today, automating those tasks usually requires specialized tools (like RPA platforms) or custom code written by developers. A system that learns from existing usage recordings and then responds to plain-English instructions could lower that bar considerably.
For IBM specifically, this fits squarely into its push to sell AI-assisted automation to large organizations through products like watsonx. The patent signals IBM is thinking about how AI agents can handle complete workplace tasks, not just answer questions, which is arguably where the bigger commercial opportunity in enterprise AI sits.
This is solid, incremental work on a real problem: enterprise web automation is painful and expensive. The self-correcting feedback loop between the planner and the error checker is the most interesting design choice here, because it addresses a common failure mode of AI agents: executing a structurally invalid plan inside live software. It's not flashy, but IBM is staking out defensible ground in the agentic-AI space before that market matures.
Editorial commentary on a publicly published patent application. Not legal advice.