The Robot Report Staff
2025-09-14 08:30:00
www.therobotreport.com

Agility Robotics has developed a whole-body control foundation model for its Digit humanoid. Source: Agility Robotics
A humanoid robot’s most significant advantage is the ability to perform a wide variety of tasks in highly constrained environments, according to a recent Agility Robotics blog post. The built world was designed with people in mind, and machines that can match our general capability will be the most valuable.
Of course, that doesn’t mean humanoid robots have to look exactly like us, which the company has covered before.
The task at hand is relatively clear, but performing the work poses a variety of challenges. The robot must keep its balance, move about smoothly, and be robust to environmental disturbances, all while reaching, grasping, and manipulating with a high degree of precision. Above all, the robot must never fall.
Agility Robotics gives Digit a ‘motor cortex’
To this end, Agility Robotics has developed a whole-body control foundation model for its Digit humanoid robot to ensure it remains safe and stable while performing a wide variety of tasks. It functions much like the motor cortex in the human brain, taking signals from different parts of the brain — different levels of Agility’s control hierarchy in Digit’s case — to control voluntary movements and fine motor skills.
This model is a relatively small LSTM (long short-term memory) neural network with fewer than 1 million parameters. It is trained in NVIDIA’s Isaac Sim physics simulator, accumulating decades of simulated experience over three or four days of wall-clock time.
Digit’s motor cortex is learned purely in simulation and transfers zero-shot to the real world. It can be prompted with dense free-space position and orientation objectives for the arms and torso to accomplish different goals, including walking around and picking and placing heavy objects. It can learn downstream dexterous manipulation skills on top of the motor cortex, and similarly coordinate complex behaviors using large language models (LLMs).
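Agility has not published the network’s exact architecture or its input and output dimensions. Purely to illustrate the scale involved, here is a minimal PyTorch sketch of a recurrent policy of that size; every dimension below is a hypothetical stand-in:

```python
import torch
import torch.nn as nn

class MotorCortexPolicy(nn.Module):
    """Recurrent whole-body policy in the spirit of the article: a small
    LSTM (well under 1M parameters) mapping proprioception plus free-space
    pose targets for the hands and torso to joint-level actions."""

    def __init__(self, obs_dim=78, target_dim=27, act_dim=23, hidden=256):
        super().__init__()
        # Hypothetical sizes: obs_dim for joint positions/velocities and
        # IMU data; target_dim for position + orientation objectives for
        # both hands and the torso; act_dim for the actuated joints.
        self.lstm = nn.LSTM(obs_dim + target_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs, targets, state=None):
        x = torch.cat([obs, targets], dim=-1)
        out, state = self.lstm(x, state)
        return self.head(out), state

policy = MotorCortexPolicy()
print(sum(p.numel() for p in policy.parameters()))  # ~380K, under 1 million
```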
Agility Robotics can prompt its model using various methods to accomplish a diverse set of tasks.
In March, the company deployed an early version of this work at NVIDIA’s GTC event to show Digit shopping for groceries. For this demo, the policy was prompted by object detections from an open-vocabulary object detector, lifted to 3D, and execution was managed via a state machine planning loop:

Digit demonstrates grocery handling at NVIDIA GTC. Source: Agility Robotics
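The article does not describe the demo’s internals beyond this outline, but the pipeline can be sketched at a high level. In the hypothetical loop below, every interface (detector, camera, policy, planner) is an illustrative stand-in rather than Agility’s actual API:

```python
# Hypothetical detection-to-execution loop: an open-vocabulary detector
# proposes objects, detections are lifted to 3D using depth, and a simple
# state machine sequences pick-and-place prompts to the whole-body policy.
def grocery_loop(detector, camera, policy, planner):
    state, target = "SEARCH", None
    while True:
        rgb, depth = camera.read()
        if state == "SEARCH":
            detections = detector.detect(rgb, queries=["apple", "cereal box"])
            if detections:
                # Lift the 2D detection to a 3D pose via depth + intrinsics.
                target = camera.lift_to_3d(detections[0], depth)
                state = "PICK"
        elif state == "PICK":
            if policy.prompt(hand_pose_target=target.grasp_pose()):
                state = "PLACE"
        elif state == "PLACE":
            if policy.prompt(hand_pose_target=planner.place_pose(target)):
                state = "SEARCH"
```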
This is a significant step in making Digit more robust to disturbances, even while executing complex manipulation plans, said Agility. The company was also able to prompt the motor cortex using a Gemini research preview:
The controller is also robust to picking up very heavy objects:

Source: Agility Robotics
Why is this hard?
To perform useful work, a robot must be able to position and move its end effectors in the world robustly, said Agility Robotics. For fixed-base robots, this is a well-understood problem; simple controllers making use of model-based algorithms like inverse kinematics (IK) or inverse dynamics (ID) have been widely used to great effect for decades.
From the perspective of a user of such a system, one needs only provide the desired pose of the end effector, and the fixed-base robot will quickly move to match that target pose.
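To make that fixed-base interface concrete, here is a minimal damped-least-squares IK solver for a planar two-link arm; the link lengths, damping, and target are arbitrary example values:

```python
import numpy as np

L1, L2 = 0.4, 0.3  # link lengths in meters (arbitrary example values)

def fk(q):
    """Forward kinematics: joint angles -> end-effector position."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def ik_step(q, target, damping=0.05):
    # Damped least squares: dq = J^T (J J^T + lambda^2 I)^-1 * error
    err = target - fk(q)
    J = jacobian(q)
    return q + J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)

q = np.array([0.3, 0.5])
for _ in range(100):
    q = ik_step(q, np.array([0.5, 0.2]))
print(fk(q))  # converges to the commanded position
```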
Agility said it wants to create the same interface for humanoid robots: one where the robot is informed simply by the desired motion of its end effectors, and the robot quickly and efficiently positions itself to achieve those targets.
Unfortunately, this capability is not so trivial on a legged robot, it said. The physics of legged locomotion have two modes: one where a leg is swinging freely through the air, and another where the leg is planted on the ground and applying forces to the rest of the robot. These two sets of physics are separated by the notion of making or breaking “contact,” which happens when the robot places its foot on the ground or lifts it off.
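A toy one-dimensional hopper makes the mode switch concrete: the same body obeys different equations of motion depending on whether the foot is in contact. The constants below are illustrative and are not a model of Digit:

```python
# Flight phase: free fall under gravity. Stance phase: a spring-like leg
# pushes back against compression. The mode switches at touchdown/liftoff.
g, m, k, rest_len = 9.81, 30.0, 4000.0, 0.9  # illustrative constants
z, zdot, dt = 1.2, 0.0, 1e-3                 # body height (m), velocity, step

for _ in range(3000):                        # simulate 3 seconds
    in_contact = z < rest_len                # foot down, leg compressed
    if in_contact:
        zddot = k * (rest_len - z) / m - g   # stance: spring force + gravity
    else:
        zddot = -g                           # flight: gravity only
    zdot += zddot * dt
    z += zdot * dt

print(f"height after 3 s: {z:.2f} m")        # the mass bounces repeatedly
```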
This complexity has a heavy computational cost, and simplifying assumptions are almost always employed to make the control and planning problem easier. One commonly made assumption is that the robot’s legs will remain in contact with the ground while the end effectors are reaching for their targets or manipulating objects in the world — essentially turning a legged platform into a fixed-base one.
This is a powerful heuristic that underpins many of the impressive advances the field of humanoid robotics has recently enjoyed, but it is also a fundamental limitation on the performance of modern robots, wrote Agility Robotics. Preventing the robot from dynamically adjusting its foot placement limits its manipulation workspace and prevents it from intelligently and naturally reacting to disturbance forces it may encounter during its work.
A more ideal interface would be one where the robot simply tracks the hand motions desired by the user while taking steps on its own when necessary, avoiding collisions with the environment, and trying hard not to fall over. The difficulty of generating dynamically feasible whole-body motion plans in real time has historically made this sort of interface intractable for humanoid robots, at least until recently.
Enter reinforcement learning
Deep reinforcement learning (RL) is quickly emerging as the dominant control paradigm for humanoid robots. Rather than explicitly modeling the equations of motion of the hybrid dynamics of the robot in the controller or making simplifying assumptions about the contact state of the whole system, it is possible to train a neural network in a physics simulator to act as a controller for the purpose of tracking a whole-body motion, and then deploy that neural network on hardware.
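The pattern can be shown end to end with a deliberately tiny example: below, a small network is trained with REINFORCE to drive a one-dimensional point mass to a commanded target, then used directly as the controller. Agility’s actual setup (Isaac Sim, Digit’s full dynamics, far larger training budgets) is well beyond this sketch, but the structure (simulate, reward tracking accuracy, update the network) is the same:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=3e-3)
dt = 0.05

for it in range(300):
    pos, vel = torch.zeros(1), torch.zeros(1)
    target = torch.rand(1) * 2 - 1               # new goal each episode
    logps, rewards = [], []
    for t in range(40):
        obs = torch.cat([pos, vel, target])      # observation + prompt
        mean, log_std = policy(obs)
        dist = torch.distributions.Normal(mean, log_std.exp())
        force = dist.sample()
        logps.append(dist.log_prob(force))
        vel = vel + force.detach() * dt          # toy point-mass physics
        pos = pos + vel * dt
        rewards.append(-(pos - target).abs().item())
    # REINFORCE: weight log-probabilities by normalized reward-to-go.
    ret = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
    ret = (ret - ret.mean()) / (ret.std() + 1e-8)
    loss = -(torch.stack(logps) * ret).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# After training, the network itself is the controller:
mean, _ = policy(torch.tensor([0.0, 0.0, 0.5]))
print(f"action toward target 0.5: {mean.item():+.3f}")
```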
Recent results on humanoid whole-body control are impressive, highly dynamic, and require far less expertise and effort to get working on hardware than using more traditional model-based control methods. However, most of this body of work focuses on dynamic whole-body motions such as dancing and fails to achieve the precise tracking required for mobile manipulation.
Agility Robotics is focusing on the ability to apply forces on the environment with both the hands and the feet so that Digit can lift and maneuver heavy objects.
Controllers must cover the workspace
Most existing work training whole-body controllers for humanoid robots with RL (such as GMT or TWIST) uses large, offline datasets of human motion-capture data like AMASS or LAFAN1 to train whole-body control networks. This has led to some incredibly impressive and lifelike results, but these trajectory distributions often don’t cover the entire desired manipulation workspace, observed Agility Robotics.
Inevitably, the hand pose targets in these offline datasets densely cover some well-represented region of the workspace while leaving its extremes and other key regions sparsely populated. However, Agility asserted that the training distribution of hand targets must cover the full workspace uniformly so the robot can reliably perform complex manipulations that use its entire reach.
To ensure that its motor cortex can reach any point in the workspace with high precision and reliability, Agility employs a random sampling scheme: positions and orientations are sampled uniformly from the workspace, and random translational and rotational movement speeds between those points produce time-indexed trajectories for both hands and the torso. The company trains its motor cortex to reach those points using a reward term based on the translational and rotational error between the current hand pose and the target hand pose.
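In code, such a scheme might look like the sketch below; the workspace bounds, speed range, reward weights, and exponential shaping are all hypothetical choices, not Agility’s published design:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

WORKSPACE_LO = np.array([-0.3, -0.8, 0.0])    # illustrative bounds (m)
WORKSPACE_HI = np.array([ 0.9,  0.8, 1.8])

def sample_pose():
    """Uniformly sample a position in the workspace and an orientation."""
    pos = np.random.uniform(WORKSPACE_LO, WORKSPACE_HI)
    quat = R.random().as_quat()               # uniformly random rotation
    return pos, quat

def tracking_reward(hand_pos, hand_quat, tgt_pos, tgt_quat,
                    w_pos=1.0, w_rot=0.5):
    """Reward both translational and rotational pose-tracking accuracy."""
    pos_err = np.linalg.norm(hand_pos - tgt_pos)
    rot_err = (R.from_quat(tgt_quat) * R.from_quat(hand_quat).inv()).magnitude()
    return np.exp(-w_pos * pos_err) + np.exp(-w_rot * rot_err)

# Interpolating between sampled poses at a random speed yields a
# time-indexed trajectory segment for one hand:
(p0, q0), (p1, q1) = sample_pose(), sample_pose()
speed = np.random.uniform(0.1, 1.0)           # m/s, illustrative range
print(f"segment duration: {np.linalg.norm(p1 - p0) / speed:.2f} s")
```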

Agility Robotics is training Digit using a whole-body controller model. Source: Agility Robotics
Position > velocity
In order to train a reinforcement learning policy that can precisely interact with the world in a variety of ways, Agility Robotics said it wants the robot to be able to walk to specific positions in the world and stay there.
However, the vast majority of the literature on applying RL to legged robots trains neural networks to track target root velocities (“match this speed and direction”) rather than root positions (“go to this point”). Velocity-based examples include VMP, AMOR, FALCON, ExBody and ExBody2, HumanPlus, GMT, R2S2, and MHC; position-capable examples include CLONE, OmniH2O, and HOVER.
Training controllers to track velocities rather than positions for locomotion provides a much richer and more forgiving reward signal, so it is often a pragmatic choice, said Agility. In addition, tracking a root position on hardware at test time requires some form of odometry, such as a floating-base pose estimator, which is non-trivial to implement.
However, a locomotion controller conditioned on a target velocity rather than a target position needs a higher-level planner or a human teleoperator to provide constant guidance to correct position drift or account for position error. Ideally, a user would tell the controller where it should be in free space, and the controller would navigate there and naturally return to that position even if shoved or otherwise perturbed.
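A toy comparison makes the difference concrete: under a pure velocity command, a shove leaves a permanent position offset unless something upstream corrects it, while a position-conditioned controller steers back to its goal on its own. The dynamics and gains below are illustrative:

```python
def velocity_conditioned(vel_cmd, pos, dt=0.02):
    return pos + vel_cmd * dt                 # just integrates the command

def position_conditioned(pos_cmd, pos, kp=2.0, dt=0.02):
    return pos + kp * (pos_cmd - pos) * dt    # steers back toward the goal

pos_v = pos_p = 0.0
for step in range(200):
    if step == 100:                           # simulate a shove at t = 2 s
        pos_v += 0.5
        pos_p += 0.5
    pos_v = velocity_conditioned(0.0, pos_v)  # "hold still" velocity command
    pos_p = position_conditioned(0.0, pos_p)  # "stay at x = 0" position command

print(f"residual error: velocity-conditioned {pos_v:.3f} m, "
      f"position-conditioned {pos_p:.3f} m")  # 0.500 m vs ~0.008 m
```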
Prompting in task space is better than configuration space
Another issue is that in prior work, the upper body target setpoints are often parameterized in joint space. This means that, to actually use the whole-body controller, you need either a motion capture suit and a mapping from human configuration space to robot configuration space, or a planner or model for generating upper-body motion plans, said Agility Robotics.
It noted that these are serious limitations because they require the higher-level planning layer, human teleoperator, or LLM to have a model of the robot’s kinematic state. Downstream scripting, teleoperation, and behavior cloning built on top of the controller would be far more convenient if the interface were simply free-space positions and orientations.
This representation also has the advantage of being much closer to the one used by state-of-the-art imitation learning methods, and it means that object locations can be predicted directly by models like Gemini and handed to the policy for execution.
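The contrast between the two interfaces can be sketched as data; both structures below are hypothetical illustrations, not Agility’s actual message format:

```python
import numpy as np

# Configuration-space prompt: one target angle per joint. The caller must
# understand the robot's joint layout, limits, and kinematics.
joint_space_prompt = {
    "left_arm_joint_angles": np.array([0.1, -0.4, 0.3, 1.2]),  # radians
}

# Task-space prompt: a free-space position (m) and orientation (unit
# quaternion) per end effector, in the world frame. A detector or a VLM
# such as Gemini can produce these directly, with no kinematic model.
task_space_prompt = {
    "left_hand": {"pos": np.array([0.6, 0.2, 1.1]),
                  "quat": np.array([0.0, 0.0, 0.0, 1.0])},
    "torso":     {"pos": np.array([0.0, 0.0, 1.0]),
                  "quat": np.array([0.0, 0.0, 0.0, 1.0])},
}
```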
Agility Robotics layers on behaviors
Agility Robotics said it is actively building and deploying intelligent humanoid robots that can perform useful work in human environments. Digit’s “motor cortex” is part of this, the company said. It functions as an “always on” safety layer that enables reactive and intuitive control of its robots.
“We can build complex behaviors on top of this lower-level motor cortex, including learning dexterous mobile manipulation behaviors,” it added. “We consider this to be the first step towards a safe and robust motion foundation model for real-world humanoid robots.”
To learn more about this topic, check out this article from Dr. Alan Fern, co-director of the Dynamic Robotics and Artificial Intelligence Lab at Oregon State University: “The Emerging Humanoid Motor Cortex: An Inventory of RL-Trained Controllers.”
Editor’s notes: This post is syndicated from Agility Robotics’ blog with permission.
Pras Velagapudi, chief technology officer at Agility Robotics, will speak about “Lessons Learned From the First Humanoid Deployments” at RoboBusiness 2025. The premier event for commercial robotics developers and suppliers will be on Oct. 15 and 16 in Santa Clara, Calif. It will feature tracks on humanoids, physical AI, enabling technologies, design and development, field robotics, and business. Registration is now open.