Salesforce · Filed Jul 31, 2025 · Published May 21, 2026 · verified — real USPTO data

Salesforce Patents a User-Directed Pipeline for Building Multimodal AI Training Data

Teaching an AI to understand images is only as good as the training data you feed it — and most pipelines for generating that data are opaque black boxes. Salesforce is patenting a system that makes the whole process transparent and user-configurable.

Salesforce Patent: Multimodal LLM Training Pipeline — figure from US 2026/0141697 A1
FIG. 1A — rendered from the official USPTO publication PDF.
Publication number US 2026/0141697 A1
Applicant Salesforce, Inc.
Filing date Jul 31, 2025
Publication date May 21, 2026
Inventors Jieyu Zhang, Le Xue, Jun Wang, Manli Shu, An Yan, Zeyuan Chen, Ran Xu
CPC classification 382/157
Grant likelihood Medium
Examiner CENTRAL, DOCKET (Art Unit OPAP)
Status Docketed New Case - Ready for Examination (Nov 25, 2025)
Parent application Claims priority from a provisional application 63720906 (filed 2024-11-15)
Document 20 claims

How Salesforce turns images into AI training questions

Imagine you want to teach an AI assistant to look at a photo and answer questions about it — things like 'what objects are in the scene?' or 'what is the red object sitting next to?' To do that, you need thousands of image-and-question pairs as training examples. The problem is that generating those pairs automatically, at scale, often introduces hidden biases or errors that are hard to trace.

Salesforce's approach puts a human in the loop at the start. You pick what kinds of questions you want the system to generate — about objects, their attributes, their relationships, or how they're spatially arranged. The system then analyzes each image, builds a structured map of everything in it (called a scene graph), and uses rule-based code scripts to generate question-answer pairs from that map.

Because the rules are explicit code rather than a neural net making fuzzy guesses, you can audit exactly why any given training pair was generated. The resulting dataset is then used to train a multimodal AI — one that can both see and read.

How scene graphs become structured Q&A training pairs

The patent describes a multimodal training data generation pipeline with three main stages.

First, a set of image recognition models analyzes each image in the training corpus. These specialized models handle distinct tasks:

  • Object detection (identifying what things are present)
  • Attribute recognition (color, size, texture)
  • Relationship detection (spatial or semantic links between objects)
  • Segmentation (pixel-level boundaries of each object)

Next, the outputs from those models are compiled into a scene graph — a data structure where each detected object becomes a node and each relationship between objects becomes an edge (think of it like a network diagram of a photo's contents). Scene graphs are a well-established representation in computer vision; the novelty here is using them as the canonical intermediate format before generating training text.

Finally, rule-based code scripts (not a neural net) traverse the scene graph and generate question-answer pairs. Because the logic is deterministic code, the provenance of every training example is fully auditable — you know exactly which rule produced which Q&A pair. A user interface lets practitioners select which annotation types to target, so the resulting dataset can be tuned toward specific downstream tasks. The trained model is then used to build a multimodal AI agent capable of visual content detection across multiple images.

What this means for bias in vision-language AI models

The biggest unsolved problem in training vision-language models isn't model architecture — it's data quality. Most large multimodal datasets (think LAION or similar) are scraped from the web and carry unknown biases baked in at collection time. A pipeline that generates training pairs from explicit rules, rather than crowd-sourced captions or another neural net's guesses, gives Salesforce's enterprise customers something valuable: a paper trail for why the AI behaves the way it does.

For Salesforce, which sells AI tooling to regulated industries like financial services and healthcare, that auditability angle is a real product differentiator. If your AI agent flags a piece of visual content incorrectly, you want to be able to explain exactly what training signal caused that behavior — and this system is designed to let you do exactly that.

Editorial take

This is solid, unglamorous AI infrastructure work — the kind of thing that rarely gets attention but genuinely matters when you're trying to build trustworthy enterprise AI. The scene-graph-to-Q&A approach isn't new to academic research, but Salesforce packaging it as a user-directed, auditable pipeline with a UI for selecting annotation types is a practical enterprise angle worth watching.

Get one Big Tech patent every Sunday

Plain English, intelligent commentary, no hype. Free.

Source. Full patent text and figures from the official USPTO publication PDF.

Editorial commentary on a publicly published patent application. Not legal advice.