Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids

CoRL 2025
Stanford University

*Equal contribution, Equal advising

Abstract

Simulation-based reinforcement learning (RL) has significantly advanced humanoid locomotion tasks, yet direct real-world RL, whether from scratch or by adapting pretrained policies, remains rare, limiting the full potential of humanoid robots. Real-world learning, despite being crucial for overcoming the sim-to-real gap, faces substantial challenges related to safety, reward design, and learning efficiency. To address these limitations, we propose Robot-Trains-Robot (RTR), a novel framework where a robotic arm teacher actively supports and guides a humanoid robot student. The RTR system provides protection, learning schedules, rewards, perturbations, failure detection, and automatic resets, enabling efficient long-term real-world humanoid training with minimal human intervention. Furthermore, we propose a novel RL pipeline that facilitates and stabilizes sim-to-real transfer by optimizing a single dynamics-encoded latent variable in the real world. We validate our method on two challenging real-world humanoid tasks: fine-tuning a walking policy for precise speed tracking and learning a humanoid swing-up task from scratch, demonstrating the promise of real-world humanoid learning realized by RTR-style systems.


RTR System Setup

The system consists of two groups: robot teachers and robot students. The teachers include a robot arm with an F/T sensor, a mini PC, and an optional treadmill for locomotion tasks; the students include a humanoid robot and a workstation for policy training. The four types of lines represent physical interaction, data transmission, control commands, and neural network parameters, respectively.
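To make the data flow concrete, the sketch below walks through one hypothetical RTR control cycle. The teacher, student, and trainer interfaces (read_ft_sensor, apply_action, load_policy, and so on) are illustrative placeholders rather than the released code; the comments map each step to the four line types in the figure.

def rtr_step(teacher, student, trainer):
    """One hypothetical RTR control cycle. `teacher` (robot arm + F/T sensor +
    mini PC, optionally a treadmill), `student` (the humanoid), and `trainer`
    (the policy-training workstation) are placeholder interfaces, not the
    released code."""
    # Data transmission: F/T readings and humanoid proprioception flow to the workstation.
    wrench = teacher.read_ft_sensor()
    obs = student.read_proprioception()

    # Physical interaction is implicit: the arm supports the humanoid, and that
    # interaction is exactly what the F/T sensor measures.
    reward = trainer.compute_reward(wrench, obs)

    # Control commands: joint targets go to the humanoid; compliance and
    # curriculum commands go to the arm (and treadmill, for locomotion tasks).
    student.apply_action(trainer.policy(obs))
    teacher.follow_and_lower(wrench)

    # Neural-network parameters: updated policy weights are pushed to the student.
    trainer.store_transition(obs, reward)
    if trainer.update_ready():
        student.load_policy(trainer.update_policy())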



XY Direction Following and Z Direction Curriculum


The robot arm can move freely in the XY plane, compliantly following the humanoid's movement via the F/T sensor, and gradually lowers itself in the Z direction to set a learning curriculum for the humanoid.
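A minimal sketch of this behavior is given below, assuming an admittance-style controller driven by the F/T reading and a time-based lowering schedule; the gains, rates, and limits are illustrative placeholders, not values from the paper.

import numpy as np

def compliant_arm_command(wrench, t, xy_gain=0.002, z_rate=1e-4, z_min=-0.05):
    """Hypothetical admittance-style command for the supporting arm.
    The arm follows the humanoid in the XY plane by moving along the measured
    horizontal force, and lowers its end-effector on a fixed schedule to form
    a learning curriculum. All numbers here are illustrative placeholders."""
    fx, fy = wrench[0], wrench[1]          # horizontal components of the F/T reading
    v_xy = xy_gain * np.array([fx, fy])    # comply with the force the humanoid exerts
    dz = max(-z_rate * t, z_min)           # cumulative Z offset: lower over time, clamped
    return v_xy, dz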

Automatic Reset


The system can detect falls and dangerous movements, and automatically resets the robot without human intervention.
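The sketch below shows one possible way such a monitor could be structured: a large vertical load on the arm (the humanoid hanging from its support) or excessive base tilt triggers a scripted reset. The thresholds and the arm/humanoid interfaces are hypothetical placeholders.

def detect_failure(wrench, base_pitch, fz_limit=200.0, pitch_limit=0.8):
    """Hypothetical failure check: a large vertical load on the arm (the humanoid
    hanging from its support) or an excessive base tilt is treated as a fall.
    The thresholds are illustrative placeholders."""
    return abs(wrench[2]) > fz_limit or abs(base_pitch) > pitch_limit

def auto_reset(arm, humanoid):
    """Sketch of a scripted reset: lift the humanoid clear of the ground,
    restore a nominal standing pose, then lower it back down so training can
    resume. `arm` and `humanoid` are placeholder interfaces."""
    arm.move_z(+0.10)
    humanoid.goto_default_pose()
    arm.move_z(-0.10)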


Sim-to-real Fine-tuning Algorithm

We illustrate our sim-to-real fine-tuning process. First, we train a dynamics-aware policy in simulation via domain randomization (DR), encoding environment physics into a latent vector. Next, we optimize a universal latent across the diverse simulation environments to initialize real-world training. Finally, we refine the latent and train a new critic in the real world. Orange denotes the components trained in each of the three stages; blue denotes frozen ones.
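The second stage is the most compact to write down, so we sketch it below: with the stage-1 policy frozen, a single latent is optimized to perform well across all randomized simulation environments, and the result initializes real-world training (where the latent is then refined and a new critic is trained from physical rollouts). The estimate_return callable, the gradient-based update, and the latent size are assumptions for illustration, not the paper's exact procedure.

import torch

LATENT_DIM = 8  # illustrative size; the actual latent dimensionality is not stated here

def optimize_universal_latent(estimate_return, sim_envs, steps=200, lr=1e-2):
    """Stage 2 (sketch): search for a single latent that works well across all
    randomized simulation environments while the policy stays frozen.
    `estimate_return(env, z)` is a user-supplied, differentiable return estimate
    (e.g. a learned value function evaluated on rollouts); treating it as
    differentiable is an assumption made for this sketch."""
    z = torch.zeros(LATENT_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        # Maximize the mean estimated return over environments by minimizing its negative.
        loss = -torch.stack([estimate_return(env, z) for env in sim_envs]).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()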



A video explanation of the algorithm is shown below.



Real-World Experiments

Walking

This experiment evaluates the effectiveness of arm feedback control and latent-vector fine-tuning. We present the linear-velocity tracking rewards during training and evaluation, with the arm schedule shown at the bottom center. All variants are tested under the same conditions: the arm uses the XY Compliant policy with Z fixed at a ∆ position of −0.02 m. Each experiment is run with three random seeds, and the plots show the mean and standard deviation. Unless otherwise specified, the Finetune z, XY Compliant, and Z Schedule settings are used.


Swing-up


We illustrate the swing-up setup and experimental results. (a) The humanoid is suspended from a robot arm and uses its legs to build momentum and maximize the rope angle. (b) We compare helping and perturbing arm schedules against a fixed-arm baseline, and also evaluate helping without a pretrained critic. Each experiment is run with three random seeds, and the plots show the mean and standard deviation. (c) We show the three arm schedules: helping and perturbing occur during the middle phase, with the arm fixed at the beginning and end.
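For reference, the three schedules in (c) can be summarized by a simple phase-based rule; the sketch below uses illustrative phase boundaries and is not the exact schedule from the experiments.

def arm_mode(progress, schedule="helping"):
    """Return the arm behavior at a given training progress in [0, 1].
    The arm is fixed at the beginning and end of training; in the middle
    phase it either helps or perturbs, depending on the chosen schedule.
    The one-third phase boundaries are illustrative placeholders."""
    if schedule == "fixed" or progress < 1/3 or progress > 2/3:
        return "fixed"
    return schedule  # "helping" or "perturbing" during the middle phase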


Team

Stanford University

*Equal contribution, Equal advising

BibTeX

@inproceedings{hu2025robot,
    title={Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids},
    author={Hu, Kaizhe and Shi, Haochen and He, Yao and Wang, Weizhuo and Liu, C. Karen and Song, Shuran},
    booktitle={Conference on Robot Learning (CoRL)},
    year={2025}
}

Acknowledgement

The authors would like to express their sincere gratitude to Yifan Hou for providing valuable help and feedback on the robot arm compliance controller. We thank Sirui Chen, Pei Xu, Lei Kun, Albert Wu, Ruocheng Wang, and Ziang Cao for their input on humanoid reinforcement learning. Finally, we appreciate the helpful discussions with all members of TML and REALab.

This work was supported in part by NSF Awards #2143601, #2037101, and #2132519, and a Sloan Fellowship. We would like to thank Google for providing the UR5 robot hardware. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.