Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids

CoRL 2025
Stanford University

*Equal contribution, Equal advising

Abstract

Simulation-based reinforcement learning (RL) has significantly advanced humanoid locomotion tasks, yet direct real-world RL, whether from scratch or by adapting pretrained policies, remains rare, limiting the full potential of humanoid robots. Real-world learning, despite being crucial for overcoming the sim-to-real gap, faces substantial challenges related to safety, reward design, and learning efficiency. To address these limitations, we propose Robot-Trains-Robot (RTR), a novel framework where a robotic arm teacher actively supports and guides a humanoid robot student. The RTR system provides protection, learning schedules, rewards, perturbations, failure detection, and automatic resets, enabling efficient long-term real-world humanoid training with minimal human intervention. Furthermore, we propose a novel RL pipeline that facilitates and stabilizes sim-to-real transfer by optimizing a single dynamics-encoded latent variable in the real world. We validate our method on two challenging real-world humanoid tasks: fine-tuning a walking policy for precise speed tracking and learning a humanoid swing-up task from scratch, demonstrating the promise of real-world humanoid learning realized by RTR-style systems.


RTR System Setup

The system consists of two groups: robot teachers and robot students. The teachers include a robot arm with an F/T sensor, a mini PC, and an optional treadmill for locomotion tasks; the students include a humanoid robot and a workstation for policy training. The four types of lines represent physical interaction, data transmission, control commands, and neural network parameters, respectively.
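To make the data flow concrete, the sketch below walks through one hypothetical RTR control cycle. The teacher, student, and trainer interfaces (read_ft_sensor, apply_action, load_policy, and so on) are illustrative placeholders rather than the released code; the comments map each step to the four line types in the figure.

def rtr_step(teacher, student, trainer):
    """One hypothetical RTR control cycle. `teacher` (robot arm + F/T sensor +
    mini PC, optionally a treadmill), `student` (the humanoid), and `trainer`
    (the policy-training workstation) are placeholder interfaces, not the
    released code."""
    # Data transmission: F/T readings and humanoid proprioception flow to the workstation.
    wrench = teacher.read_ft_sensor()
    obs = student.read_proprioception()

    # Physical interaction is implicit: the arm supports the humanoid, and that
    # interaction is exactly what the F/T sensor measures.
    reward = trainer.compute_reward(wrench, obs)

    # Control commands: joint targets go to the humanoid; compliance and
    # curriculum commands go to the arm (and treadmill, for locomotion tasks).
    student.apply_action(trainer.policy(obs))
    teacher.follow_and_lower(wrench)

    # Neural-network parameters: updated policy weights are pushed to the student.
    trainer.store_transition(obs, reward)
    if trainer.update_ready():
        student.load_policy(trainer.update_policy())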



XY Direction Following and Z Direction Curriculum


The robot arm can move freely in the XY plane, compliantly following the humanoid's movement via the F/T sensor, and gradually lowers itself in the Z direction to set a learning curriculum for the humanoid.
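A minimal sketch of this behavior is given below, assuming an admittance-style controller driven by the F/T reading and a time-based lowering schedule; the gains, rates, and limits are illustrative placeholders, not values from the paper.

import numpy as np

def compliant_arm_command(wrench, t, xy_gain=0.002, z_rate=1e-4, z_min=-0.05):
    """Hypothetical admittance-style command for the supporting arm.
    The arm follows the humanoid in the XY plane by moving along the measured
    horizontal force, and lowers its end-effector on a fixed schedule to form
    a learning curriculum. All numbers here are illustrative placeholders."""
    fx, fy = wrench[0], wrench[1]          # horizontal components of the F/T reading
    v_xy = xy_gain * np.array([fx, fy])    # comply with the force the humanoid exerts
    dz = max(-z_rate * t, z_min)           # cumulative Z offset: lower over time, clamped
    return v_xy, dz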

Automatic Reset


The system can detect falls and dangerous movements, and automatically resets the robot without human intervention.
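The sketch below shows one possible way such a monitor could be structured: a large vertical load on the arm (the humanoid hanging from its support) or excessive base tilt triggers a scripted reset. The thresholds and the arm/humanoid interfaces are hypothetical placeholders.

def detect_failure(wrench, base_pitch, fz_limit=200.0, pitch_limit=0.8):
    """Hypothetical failure check: a large vertical load on the arm (the humanoid
    hanging from its support) or an excessive base tilt is treated as a fall.
    The thresholds are illustrative placeholders."""
    return abs(wrench[2]) > fz_limit or abs(base_pitch) > pitch_limit

def auto_reset(arm, humanoid):
    """Sketch of a scripted reset: lift the humanoid clear of the ground,
    restore a nominal standing pose, then lower it back down so training can
    resume. `arm` and `humanoid` are placeholder interfaces."""
    arm.move_z(+0.10)
    humanoid.goto_default_pose()
    arm.move_z(-0.10)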


Sim-to-real Fine-tuning Algorithm

We illustrate our sim-to-real fine-tuning process. First, we train a dynamics-aware policy in simulation via domain randomization (DR), encoding environment physics into a latent vector. Next, we optimize a universal latent across the diverse simulation environments to initialize real-world training. Finally, we refine the latent and train a new critic in the real world. Orange denotes the components trained in each of the three stages; blue denotes frozen ones.
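The second stage is the most compact to write down, so we sketch it below: with the stage-1 policy frozen, a single latent is optimized to perform well across all randomized simulation environments, and the result initializes real-world training (where the latent is then refined and a new critic is trained from physical rollouts). The estimate_return callable, the gradient-based update, and the latent size are assumptions for illustration, not the paper's exact procedure.

import torch

LATENT_DIM = 8  # illustrative size; the actual latent dimensionality is not stated here

def optimize_universal_latent(estimate_return, sim_envs, steps=200, lr=1e-2):
    """Stage 2 (sketch): search for a single latent that works well across all
    randomized simulation environments while the policy stays frozen.
    `estimate_return(env, z)` is a user-supplied, differentiable return estimate
    (e.g. a learned value function evaluated on rollouts); treating it as
    differentiable is an assumption made for this sketch."""
    z = torch.zeros(LATENT_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        # Maximize the mean estimated return over environments by minimizing its negative.
        loss = -torch.stack([estimate_return(env, z) for env in sim_envs]).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()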



A video explanation of the algorithm is shown below.



Real-World Experiments

Walking

This experiment evaluates the effectiveness of arm feedback control and latent-vector fine-tuning. We present the linear-velocity tracking rewards during training and evaluation, with the arm schedule shown at the bottom center. All variants are tested under the same conditions: the arm uses the XY Compliant policy with Z fixed at a ∆ position of −0.02 m. Each experiment is run with three random seeds, and the plots show the mean and standard deviation. Unless otherwise specified, the Finetune z, XY Compliant, and Z Schedule settings are used.


Swing-up


We illustrate the swing-up setup and experimental results. (a) The humanoid is suspended from a robot arm and uses its legs to build momentum and maximize the rope angle. (b) We compare helping and perturbing arm schedules against a fixed-arm baseline, and also evaluate helping without a pretrained critic. Each experiment is run with three random seeds, and the plots show the mean and standard deviation. (c) We show the three arm schedules: helping and perturbing occur during the middle phase, with the arm fixed at the beginning and end.
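For reference, the three schedules in (c) can be summarized by a simple phase-based rule; the sketch below uses illustrative phase boundaries and is not the exact schedule from the experiments.

def arm_mode(progress, schedule="helping"):
    """Return the arm behavior at a given training progress in [0, 1].
    The arm is fixed at the beginning and end of training; in the middle
    phase it either helps or perturbs, depending on the chosen schedule.
    The one-third phase boundaries are illustrative placeholders."""
    if schedule == "fixed" or progress < 1/3 or progress > 2/3:
        return "fixed"
    return schedule  # "helping" or "perturbing" during the middle phase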


Team

Stanford University

*Equal contribution, Equal advising

BibTeX

@inproceedings{hu2025robot,
    title={Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids},
    author={Hu, Kaizhe and Shi, Haochen and He, Yao and Wang, Weizhuo and Liu, C. Karen and Song, Shuran},
    booktitle={Conference on Robot Learning (CoRL)},
    year={2025}
}

Acknowledgement

The authors would like to express their sincere gratitude to Yifan Hou for providing valuable help and feedback on the robot arm compliance controller. We thank Sirui Chen, Pei Xu, Lei Kun, Albert Wu, Ruocheng Wang, and Ziang Cao for their input on humanoid reinforcement learning. Finally, we appreciate the helpful discussions with all members of TML and REALab.

This work was supported in part by NSF Awards #2143601, #2037101, and #2132519, and a Sloan Fellowship. We would like to thank Google for providing the UR5 robot hardware. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.