ByteDance’s UI-TARS can take over your computer and outperforms GPT-4o and Claude

A new AI agent has emerged from TikTok’s parent company to take control of your computer and carry out complex workflows.

Similar to Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step actions.

The PC/MacOS agents are trained on around 50B tokens and are offered in 7B and 72B parameter versions. They achieve state-of-the-art (SOTA) performance on more than 10 GUI benchmarks covering perception, grounding and overall agent capabilities, consistently outperforming OpenAI’s GPT-4o, Claude and Google’s Gemini.

“Through iterative training and reflective optimization, UI-TARS continually learns from its mistakes and adapts to unforeseen situations with minimal human intervention,” researchers from ByteDance and Tsinghua University wrote in a new research paper.

Source: Arxiv

How UI-TARS explains its thinking

UI-TARS works across desktop, mobile and web applications and uses multimodal inputs (text, images, interactions) to understand visual environments.

The interface has two tabs: one on the left shows the model’s step-by-step “thinking,” while a larger one on the right pulls up files, websites and apps and automatically takes action.

For example, in a demo video released today, the model is prompted: “Find round-trip flights from SEA to NYC on the 5th and return flights on the 10th of next month, filtering by price in ascending order.”

In response, UI-TARS navigates to the Delta Airlines website, fills in the “from” and “to” fields, clicks on the relevant dates, sorts and filters by price, and explains each step in its thinking box before taking action.
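To make that pattern concrete, below is a minimal, hypothetical sketch of the perceive-think-act loop such a demo implies; the helper functions, prompt format and action names are assumptions made for illustration, not UI-TARS’s actual API.

```python
# Minimal sketch of a perceive-think-act agent loop (illustrative only).
from dataclasses import dataclass

@dataclass
class Step:
    thought: str   # shown in the left-hand "thinking" panel
    action: str    # e.g. "click(x, y)", "type(text)", "finished"

def run_agent(task: str, max_steps: int = 20) -> list[Step]:
    history: list[Step] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                  # hypothetical helper
        step = query_model(task, screenshot, history)  # hypothetical helper
        history.append(step)
        if step.action == "finished":
            break
        execute(step.action)                           # hypothetical helper
    return history

# Hypothetical stubs so the sketch runs end to end.
def capture_screen() -> bytes: return b""
def query_model(task, screenshot, history) -> Step:
    return Step(thought="Task assumed complete.", action="finished")
def execute(action: str) -> None: pass

if __name__ == "__main__":
    for s in run_agent("Find round-trip flights from SEA to NYC"):
        print(s.thought, "->", s.action)
```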

In another scenario, UI-TARS is instructed to install the autoDocstring extension in VS Code. Here is how it completes the task step by step (a sketch of what such a trace might look like follows the list):

  • First it says: “To proceed with installing the autoDocstring extension in VS Code, I first need to open the VS Code application.”
  • The model acknowledges that the VS Code window did not fully load and notes that “it is necessary to wait for the application to complete initialization before proceeding to install the autoDocstring extension. This ensures that all required functionality is available for the next steps.”
  • It then explains that the Extensions view needs to be accessed in VS Code, which can be done by clicking a tab in the left sidebar.
  • When it encounters a small error, it reasons: “The previous attempt to click the Extensions tab may not have been precise enough, so I’ll try clicking it again to make sure the Extensions view opens successfully.”
  • In the Extensions view, it enters “autoDocstring” and concludes: “Since the installation process has already been initiated, no further action is required at this time. I will wait for the installation to complete and confirm success once the process is complete.”
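For illustration, the walkthrough above could be logged as a structured trace along the following lines; the field names and action strings are assumptions, not the format ByteDance actually uses.

```python
# A hedged sketch of the VS Code walkthrough as a structured thought/action trace.
import json

trace = [
    {"thought": "To install autoDocstring I first need to open VS Code.",
     "action": "click(icon='VS Code')"},
    {"thought": "The window has not fully loaded; wait before continuing.",
     "action": "wait(seconds=2)"},
    {"thought": "Open the Extensions view from the left sidebar.",
     "action": "click(element='Extensions tab')"},
    {"thought": "The previous click may have missed; retry the same target.",
     "action": "click(element='Extensions tab')"},   # reflective retry
    {"thought": "Search for the extension and start the install.",
     "action": "type(text='autoDocstring') + click(element='Install')"},
    {"thought": "Installation already initiated; no further action needed.",
     "action": "finished"},
]

print(json.dumps(trace, indent=2))
```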

Outperforming its competitors

Across various benchmarks, the researchers report that UI-TARS consistently outperformed OpenAI’s GPT-4o; Anthropic’s Claude 3.5 Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For example, UI-TARS 72B scored 82.8% on VisualWebBench – which measures a model’s ability to ground web elements, including web page question answering and optical character recognition – outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%).

It also performed significantly better on WebSRC (which measures understanding of semantic content and layout in web contexts) and ScreenQA-short (which measures comprehension of complex mobile screen layouts and web structure). UI-TARS-7B achieved a leading score of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, outperforming Qwen, Gemini, Claude 3.5 and GPT-4o.

“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers write. “Such perceptual capabilities lay the foundation for agentic tasks where an accurate understanding of the environment is critical to task execution and decision making.”

UI-TARS also showed impressive results on ScreenSpot Pro and ScreenSpot v2, which evaluate a model’s ability to understand and localize elements in GUIs. In addition, the researchers tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, and benchmarked it on OSWorld (which evaluates open-ended computer tasks) and AndroidWorld (which evaluates autonomous agents on 116 programmatic tasks across 20 mobile apps).

Source: Arxiv

Under the hood

To perform step-by-step actions and recognize what it sees, UI-TARS was trained on a large-scale dataset of screenshots with parsed metadata – including element description and type, visual description, bounding boxes (position information), element function and text – drawn from various websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only the elements but also their spatial relationships and the overall layout.
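A rough sketch of what a single per-element annotation of that kind might look like is below; the field names and example values are assumptions for illustration, not the paper’s actual schema.

```python
# Illustrative per-element annotation record (hypothetical schema).
from dataclasses import dataclass, asdict

@dataclass
class ElementAnnotation:
    element_type: str          # e.g. "button", "text_field", "tab"
    description: str           # short functional description
    visual_description: str    # appearance as seen in the screenshot
    bounding_box: tuple        # (x_min, y_min, x_max, y_max) in pixels
    text: str                  # visible label text, if any

example = ElementAnnotation(
    element_type="button",
    description="Submits the flight search form",
    visual_description="Red rounded button in the lower-right of the form",
    bounding_box=(912, 640, 1012, 676),
    text="Search",
)
print(asdict(example))
```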

The model also uses state transition labeling to identify and describe the differences between two consecutive screenshots and determine whether an action, such as a mouse click or keyboard entry, has occurred. With Set-of-Mark (SoM) prompting, distinct marks (letters, numbers) can be overlaid on specific regions of an image.
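As a hedged illustration of the SoM idea, the snippet below overlays numbered marks on made-up bounding boxes using Pillow, so a model can refer to “element 2” instead of raw pixel coordinates; the boxes and styling are invented for the example.

```python
# Illustrative Set-of-Mark overlay: numbered marks drawn over candidate regions.
from PIL import Image, ImageDraw

def overlay_marks(img: Image.Image, boxes: list) -> Image.Image:
    draw = ImageDraw.Draw(img)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 3, y0 + 3), str(i), fill="red")  # the numbered mark
    return img

screenshot = Image.new("RGB", (400, 300), "white")        # stand-in screenshot
marked = overlay_marks(screenshot, [(20, 20, 120, 60), (150, 100, 300, 140)])
marked.save("som_overlay.png")
```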

The model is equipped with both short-term and long-term memory to handle tasks at hand while retaining historical interactions to improve subsequent decision making. The researchers trained the model to reason in both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate). This enables multi-level decision making, “reflective thinking,” milestone detection, and error correction.
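A minimal sketch of how such a two-tier memory might be structured follows; the class and its fields are assumptions for illustration rather than the actual UI-TARS design.

```python
# Illustrative two-tier agent memory: bounded short-term buffer + long-term store.
from collections import deque

class AgentMemory:
    def __init__(self, short_term_size: int = 5):
        self.short_term = deque(maxlen=short_term_size)  # recent observations/actions
        self.long_term: list = []                        # summaries of past tasks

    def remember_step(self, step: dict) -> None:
        self.short_term.append(step)

    def archive_task(self, task: str, outcome: str) -> None:
        self.long_term.append({"task": task, "outcome": outcome})

    def context(self) -> dict:
        # What would be fed back to the model on the next step.
        return {"recent_steps": list(self.short_term), "history": self.long_term}
```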

The researchers emphasized that it is critical for the model to maintain consistent goals and use trial and error to hypothesize, test and evaluate potential actions before completing a task. To support this, they introduced two types of data: error-correction and post-reflection data. For error correction, they identified mistakes and labeled corrective actions; for post-reflection, they simulated recovery steps.
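As an illustration only, those two data types might be represented along the following lines; the fields and values below are invented examples, not samples from the paper’s dataset.

```python
# Illustrative (made-up) samples for the two data types described above.
error_correction_sample = {
    "state": "Extensions view not open after click",
    "erroneous_action": "click(x=12, y=300)",            # imprecise click
    "corrective_action": "click(element='Extensions tab')",
}

post_reflection_sample = {
    "state": "Wrong extension installed",
    "reflection": "The search matched a similarly named extension.",
    "recovery_steps": [
        "uninstall(extension='wrong-extension')",
        "type(text='autoDocstring')",
        "click(element='Install')",
    ],
}

print(error_correction_sample)
print(post_reflection_sample)
```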

“This strategy ensures that the agent not only learns to avoid errors but also dynamically adapts when they occur,” the researchers write.

UI-TARS clearly has impressive capabilities, and it will be interesting to watch its evolving use cases in the increasingly competitive AI agent space. As the researchers note: “Looking forward, while native agents represent a significant advance, the future lies in the integration of active and lifelong learning, where agents autonomously advance their own learning through continuous interactions in the real world.”

Researchers note that Claude Computer Use “performs well in web-based tasks but has significant problems in mobile scenarios, suggesting that Claude’s GUI operability has not translated well to the mobile domain.”

In contrast, “UI-TARS shows excellent performance on both website and mobile.”


