Overcoming the data bottleneck: ProVision from Salesforce accelerates multimodal AI training
As companies around the world double down on their AI projects, the availability of high-quality training data has become a major bottleneck. The public internet is largely exhausted as a data source, and large players such as OpenAI and Google are securing exclusive partnerships to expand their proprietary datasets, further restricting access for others.
To address this growing concern, Salesforce has taken a big step in the field of visual training data. The company just launched ProVision, a novel framework that programmatically generates visual instruction data. These datasets are systematically synthesized to enable the training of powerful multimodal language models (MLMs) that can answer questions about images.
The company has already released the ProVision-10M dataset built with this approach and is using it to boost the performance and accuracy of various multimodal AI models.
For data professionals, this framework represents a significant advance. By programmatically generating high-quality visual instruction data, ProVision reduces reliance on limited or inconsistently labeled datasets, a common challenge when training multimodal systems.
In addition, the ability to systematically synthesize datasets ensures greater control, scalability and consistency, enabling faster iteration cycles and reducing the cost of collecting domain-specific data. This work complements ongoing research in the area of synthetic data generation.
Visual instruction data: a key ingredient for multimodal AI
Today, instruction datasets are at the core of AI pre-training and fine-tuning. These specialized datasets help models follow and respond effectively to specific instructions or queries. In the case of multimodal AI, models gain the ability to analyze content such as images by learning from a series of diverse data points accompanied by question-answer pairs – or visual instruction data – that describe them.
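To make this concrete, a single visual instruction data point pairs an image with a question and its answer. Here is a minimal, illustrative record; the field names are hypothetical, not the schema of any particular dataset:

```python
# A minimal, illustrative visual instruction data point.
# Field names are hypothetical; real datasets use varying schemas.
instruction_record = {
    "image": "street_scene_001.jpg",  # path or reference to the training image
    "question": "What color is the car parked next to the pedestrian?",
    "answer": "red",
}
```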
Well, here’s the thing: creating these visual instruction datasets is tedious work. If a company creates the data manually for each training image, it ends up sinking significant time and human resources into the project. If it instead uses proprietary language models for the task, it faces high computational costs and the risk of hallucinations, where the quality and accuracy of the question-answer pairs may fall short.
In addition, using proprietary models is a black-box process: it is difficult to interpret how the data was generated and to precisely control or adjust the outputs.
Enter Salesforce ProVision
To address these gaps, Salesforce’s AI research team developed ProVision, a framework that uses scene graphs in conjunction with human-written programs to systematically synthesize vision-centric instruction data.
At its core, a scene graph is a structured representation of an image’s semantics, where the objects in the content are represented as nodes. Each object’s attributes – such as color or size – are mapped directly to their respective nodes, while the relationships between these objects are represented as directed edges connecting the corresponding nodes. These representations can come from manually annotated datasets such as Visual Genome or can be generated with a scene graph generation pipeline that combines various state-of-the-art vision models covering different aspects of image semantics, from object and attribute detection to depth estimation.
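As a rough illustration, a scene graph can be modeled with plain Python dictionaries. This is a minimal sketch: the node and edge fields below are simplified stand-ins for the much richer annotations found in datasets like Visual Genome.

```python
# A minimal scene graph: objects as nodes (each carrying its attributes),
# relationships as directed (subject, predicate, object) edges.
scene_graph = {
    "objects": {
        "pedestrian": {"attributes": ["walking"]},
        "car": {"attributes": ["red", "parked"]},
        "building": {"attributes": ["tall"]},
    },
    "relations": [
        ("pedestrian", "next to", "car"),  # directed edge: pedestrian -> car
        ("car", "closer to", "building"),  # directed edge: car -> building
    ],
}
```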
Once ready, the scene graphs feed programs written in Python with text templates; together these serve as full-fledged data generators that create question-answer pairs for AI training pipelines.
“Each (data) generator leverages hundreds of predefined templates that systematically integrate these annotations to produce diverse instruction data. These generators are designed to… compare, retrieve, and reason about basic visual concepts of objects, attributes, and relationships based on the detailed information encoded in each scene graph,” the researchers behind the framework wrote in a paper.
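A sketch of how such a generator might work, assuming the simplified scene_graph dictionary from the snippet above; the templates and function names here are illustrative, not ProVision’s actual API:

```python
import random

# Hypothetical templates; ProVision reportedly uses hundreds of predefined ones.
RELATION_TEMPLATE = "What is the relationship between the {subj} and the {obj}?"

def generate_relation_qa(scene_graph):
    """Turn each directed edge of the scene graph into a question-answer pair."""
    for subj, predicate, obj in scene_graph["relations"]:
        yield {
            "question": RELATION_TEMPLATE.format(subj=subj, obj=obj),
            "answer": predicate,
        }

def generate_attribute_qa(scene_graph):
    """Turn object attributes into simple yes-questions."""
    for name, node in scene_graph["objects"].items():
        attr = random.choice(node["attributes"])
        yield {
            "question": f"Is the {name} {attr}?",
            "answer": "yes",
        }

qa_pairs = list(generate_relation_qa(scene_graph)) + list(generate_attribute_qa(scene_graph))
```

Because the questions and answers are derived mechanically from the graph, every pair is grounded in the image’s annotations rather than in a language model’s guess, which is what makes the process interpretable and controllable.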
ProVision-10M dataset for AI training
In its work, Salesforce used both approaches – extending manually annotated scene graphs and generating them from scratch – to set up scene graphs that support 24 single-image data generators and 14 multi-image data generators.
“These data generators allow us to automatically synthesize questions and answers based on the scene graph of an image. For example, given an image of a busy street, ProVision can generate questions such as: ‘What is the relationship between the pedestrian and the car?’ or ‘Which object is closer to the red building, (the) car or the pedestrian?’” lead researchers Jieyu Zhang and Le Xue stated in a blog post.
The first approach’s data generators, which augmented Visual Genome’s scene graphs with depth and segmentation annotations from Depth Anything V2 and SAM-2, produced 1.5 million single-image instruction data points and 4.2 million multi-image instruction data points. The second generated 2.3 million single-image instruction data points and 4.2 million multi-image instruction data points from 120,000 high-resolution images in the DataComp dataset, using models such as YOLO-World, CoCa, LLaVA-1.5 and Osprey.
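Conceptually, the from-scratch pipeline chains specialized vision models into a single scene graph per image. In the hand-wavy sketch below, every function is a dummy stub standing in for the model named in its comment; none of these are real library calls:

```python
def detect_objects(image):
    # Stub for an open-vocabulary detector such as YOLO-World.
    return ["pedestrian", "car"]

def estimate_depth(image, objects):
    # Stub for a monocular depth model such as Depth Anything V2
    # (per-object depth in arbitrary units).
    return {"pedestrian": 4.0, "car": 7.5}

def infer_relations(objects, depth):
    # Derive a simple spatial relation from relative depth.
    a, b = objects
    predicate = "in front of" if depth[a] < depth[b] else "behind"
    return [(a, predicate, b)]

def build_scene_graph(image):
    """Fuse the model outputs into one scene graph for the image."""
    objects = detect_objects(image)
    depth = estimate_depth(image, objects)
    return {"objects": objects, "relations": infer_relations(objects, depth)}
```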
In total, the four splits together form ProVision-10M, a dataset with more than 10 million unique instruction data points. It is now available on Hugging Face and is already proving effective in AI training pipelines.
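For readers who want to inspect it, the dataset can presumably be pulled with the Hugging Face datasets library. The dataset id and split below are assumptions; check the dataset card for the exact name, configs and schema.

```python
from datasets import load_dataset

# Assumed dataset id and split; verify against the Hugging Face dataset card.
ds = load_dataset("Salesforce/ProVision-10M", split="train", streaming=True)

# Stream a single record to inspect the instruction-data format
# without downloading all 10M+ data points.
first = next(iter(ds))
print(first)
```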
In particular, when the company incorporated ProVision-10M into multimodal AI fine-tuning recipes – LLaVA-1.5 for single-image instruction data and Mantis-SigLIP-8B for multi-image instruction data – it saw notable improvements in the models’ average performance compared with fine-tuning without the ProVision data.
“When adopted in the instruction tuning phase, our single-image instruction data achieves up to a 7% improvement on CVBench’s 2D split and 8% on CVBench’s 3D split, as well as a 3% performance improvement on QBench2, RealWorldQA and MMMU. Our multi-image instruction data results in an 8% improvement on Mantis-Eval,” the researchers noted in the paper.
Synthetic data is here to stay
Although there are several tools and platforms, including Nvidia’s new Cosmos World Foundation models, for generating various data modalities (from images to videos) that can be used for multimodal AI training, few have addressed the problem of creating the instruction datasets that accompany this data.
Salesforce addresses this bottleneck with ProVision, giving companies a way to go beyond manual labeling and black-box language models. Generating instruction data programmatically makes the generation process interpretable and controllable, and enables efficient scaling while maintaining factual accuracy.
In the long term, the company hopes researchers will build on this work to improve scene graph generation pipelines and create more data generators covering new types of instruction data, such as those for video.