🦿 New robot model can handle unfamiliar objects with natural language

WALL-Y

  • Figure robots equipped with Helix can pick up almost any small household object by following natural language commands.
  • Helix can successfully handle thousands of new items in cluttered environments—from glassware and toys to tools and clothing—without any prior demonstrations or custom programming.
  • The system enables collaboration between two robots that together can solve tasks with objects they have never seen before.

Complete control of the upper body

A new robot model called Helix combines visual perception, language understanding, and learned control to overcome several longstanding challenges in robotics. The model enables robots to handle a wide variety of objects they have never encountered before, guided only by instructions in natural language.

Helix is the first Vision-Language-Action (VLA) model that can control the entire upper body of a humanoid robot at high speed, including the wrists, torso, head, and individual fingers. The system coordinates a 35-dimensional action space at 200 Hz, enabling the precise movements needed for grasping objects.
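To make those numbers concrete, here is a minimal sketch of a fixed-rate control loop emitting a 35-dimensional action vector at 200 Hz. It is illustrative only: `DummyPolicy`, `get_observation`, and `send_action` are hypothetical stand-ins, not Figure's published API.

```python
import time
import numpy as np

ACTION_DIM = 35          # wrists, torso, head, and individual fingers
RATE_HZ = 200
PERIOD = 1.0 / RATE_HZ

class DummyPolicy:
    """Stand-in for the learned policy: maps an observation to an action."""
    def __call__(self, observation: np.ndarray) -> np.ndarray:
        return np.zeros(ACTION_DIM)   # placeholder 35-D action

def control_loop(policy, get_observation, send_action, steps=1000):
    """Run the policy at a fixed 200 Hz tick, sleeping off any slack."""
    next_tick = time.monotonic()
    for _ in range(steps):
        obs = get_observation()
        action = policy(obs)          # shape: (35,)
        send_action(action)
        next_tick += PERIOD
        time.sleep(max(0.0, next_tick - time.monotonic()))

if __name__ == "__main__":
    # Toy wiring: zero observations in, actions printed out.
    control_loop(DummyPolicy(), lambda: np.zeros(8), print, steps=5)
```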

Unlike previous approaches, Helix uses a single set of neural network weights to learn all behaviors, without any task-specific fine-tuning. This lets the robot pick up and place items, operate drawers and refrigerators, and interact with other robots.

Collaboration between robots

One of the most impressive features of Helix is its ability to enable collaboration between multiple robots. In tests, two robots running identical Helix models successfully worked together to put away groceries neither had seen before.

The robots can coordinate their actions through natural language prompts such as "Hand the bag of cookies to the robot on your right" or "Receive the bag of cookies from the robot on your left and place it in the open drawer."
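The "identical models, different prompts" setup can be pictured as one set of trained weights shared by two policy instances, each conditioned on its own instruction. A hypothetical sketch (`HelixPolicy` and the checkpoint dictionary are invented stand-ins; Figure has not released code):

```python
class HelixPolicy:
    """Minimal stand-in: one shared set of weights, a per-robot prompt."""
    def __init__(self, weights):
        self.weights = weights       # shared across both robots
        self.instruction = ""

    def set_instruction(self, text: str):
        self.instruction = text      # only the prompt differs per robot

shared_weights = {"params": ...}     # stand-in for one trained checkpoint

left = HelixPolicy(shared_weights)
right = HelixPolicy(shared_weights)
left.set_instruction("Hand the bag of cookies to the robot on your right")
right.set_instruction("Receive the bag of cookies from the robot on your left "
                      "and place it in the open drawer")

assert left.weights is right.weights   # identical model, different prompts
```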

A dual-system design

Helix uses an architecture with two complementary systems:

  1. System 2: A vision-language model that operates at 7-9 Hz for scene and language understanding, enabling broad generalization across objects and contexts.
  2. System 1: A fast reactive policy that translates the semantic representations from System 2 into precise continuous robot actions at 200 Hz.

This division lets each system operate at its optimal timescale: System 2 can "think slowly" about overall goals, while System 1 "thinks quickly" to execute and adjust actions in real time.
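A rough sketch of this two-timescale split, assuming System 2 refreshes a shared latent vector that System 1 reads on every control tick. All sizes, rates, and names here are illustrative assumptions, not Helix internals:

```python
import threading
import time
import numpy as np

latent = np.zeros(512)        # assumed latent size, for illustration only
lock = threading.Lock()
running = True

def system2_loop():           # slow scene + language understanding, ~8 Hz
    global latent
    while running:
        new_latent = np.random.randn(512)   # stand-in for VLM output
        with lock:
            latent = new_latent
        time.sleep(1 / 8)

def system1_loop():           # fast reactive visuomotor policy, 200 Hz
    while running:
        with lock:
            goal = latent.copy()            # latest semantic goal
        action = np.tanh(goal[:35])         # stand-in for a 35-D action
        # send_action(action) would go here on real hardware
        time.sleep(1 / 200)

threads = [threading.Thread(target=system2_loop),
           threading.Thread(target=system1_loop)]
for t in threads: t.start()
time.sleep(1.0); running = False
for t in threads: t.join()
```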

Training and data usage

The model was trained on approximately 500 hours of high-quality data from various robots and operators, a fraction of what previous VLA systems have required. To generate training pairs with natural language, an auto-labeling vision-language model created instructions in hindsight: given a recorded demonstration, it wrote the command that would have produced the observed behavior.
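The hindsight-labeling idea can be sketched as a small function: show a recorded clip to a vision-language model and ask what instruction would have produced that behavior. The exact prompt wording and the `caption_video` callable below are assumptions for illustration:

```python
def label_demonstration(video_clip, caption_video):
    """Ask a VLM to write the instruction a demonstration answers."""
    prompt = ("What instruction would you have given the robot "
              "to get the action seen in this video?")
    instruction = caption_video(video_clip, prompt)
    return {"observations": video_clip, "instruction": instruction}

# Toy usage with a fake VLM that always returns the same caption.
fake_vlm = lambda clip, prompt: "Pick up the cup and place it in the sink"
sample = label_demonstration("demo_clip_0042", fake_vlm)
```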

Helix is trained entirely end-to-end, mapping raw pixels and text commands directly to continuous actions with a standard regression loss, and requires no task-specific adaptation.
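Schematically, one training step of this kind looks like the following: pixels and tokenized text in, a 35-dimensional action out, optimized against the recorded action with a plain regression loss. The tiny model below is a toy stand-in, not Helix's architecture:

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Toy vision-language-action model: images + tokens -> actions."""
    def __init__(self, action_dim=35):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))
        self.text = nn.Embedding(1000, 256)      # toy tokenizer vocabulary
        self.head = nn.Linear(512, action_dim)

    def forward(self, pixels, tokens):
        v = self.vision(pixels)
        t = self.text(tokens).mean(dim=1)        # pool token embeddings
        return self.head(torch.cat([v, t], dim=-1))

model = TinyVLA()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

pixels = torch.rand(8, 3, 64, 64)                # fake camera batch
tokens = torch.randint(0, 1000, (8, 12))         # fake command tokens
target_actions = torch.rand(8, 35)               # recorded operator actions

pred = model(pixels, tokens)
loss = nn.functional.mse_loss(pred, target_actions)  # standard regression loss
opt.zero_grad(); loss.backward(); opt.step()
```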

Despite the relatively limited amount of data, Helix can successfully handle thousands of new items in cluttered environments—from glassware and toys to tools and clothing—without any prior demonstrations or custom programming.