Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning

Essence

Fig. 1: HRP-5P humanoid bipedal locomotion (clockwise) on flat rigid

This study uses deep RL to train a walking policy for the humanoid robot HRP-5P in simulation with terrain randomization, and shows that the learned policy walks robustly on compliant and uneven terrain in the real world.

Motivation

Known: On quadrupedal robots, deep RL has outperformed model-based methods, but walking demonstrations on life-sized humanoids have mostly been limited to flat surfaces. Existing model-based approaches required gait parameter tuning and terrain classification.
Gap: There are few real-world demonstrations of a robust end-to-end RL policy on compliant surfaces and uneven terrain for HRP-5P, a life-sized humanoid with a large mass and bulky legs. Research on terrain adaptation from proprioceptive feedback alone, that is, blind locomotion, is also limited.
Why: Real-world deployment requires a unified controller that adapts automatically to different terrain types without parameter tuning. Robust control on compliant surfaces is particularly challenging because of the humanoid's large inertia.
Approach: A sim-to-real deep RL approach that trains the policy in simulation with a training curriculum over randomized terrain, and proposes clock signal modulation for adaptive gait frequency, which enables aperiodic gaits.

Achievement

Successful sim-to-real transfer: A single policy, trained in simulation with curriculum-based terrain randomization, walks robustly on the real HRP-5P robot across varied terrain (soft cushion, uneven blocks, grass, paved street) without parameter tuning.
Proprioceptive feedback control: The policy adapts to terrain using only joint encoders, the IMU, and motor current sensors, with no vision or terrain classification.
Adaptive gait frequency: Clock signal modulation dynamically adjusts swing and stance durations, improving walking robustness on challenging terrain.
Multiple walking modes: Standing, stepping in place, and forward walking are all implemented in a single policy.
Reproducibility: The code and demo videos are publicly released.
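
The adaptive gait frequency item above can be illustrated with a minimal sketch. It assumes the phase φ advances by one tick per policy step and wraps at the cycle period L, and that the clock control policy adds a clipped offset to each tick. The update rule, the clip bound `MAX_OFFSET`, and the period of 80 ticks are illustrative assumptions, not values confirmed by this summary.

```python
import math

MAX_OFFSET = 0.5  # hypothetical clip bound on the learned phase offset


def clip_offset(delta):
    """Clip the policy's phase offset to [-MAX_OFFSET, MAX_OFFSET]."""
    return max(-MAX_OFFSET, min(MAX_OFFSET, delta))


def clock_signal(phi, period):
    """Return the (sin, cos) clock pair for phase phi in a cycle of length period."""
    angle = 2.0 * math.pi * phi / period
    return math.sin(angle), math.cos(angle)


def advance_phase(phi, period, delta=0.0):
    """Advance phi by one nominal tick plus the clipped offset, wrapping at period."""
    return (phi + 1.0 + clip_offset(delta)) % period


def ticks_per_cycle(period, delta):
    """Count policy ticks needed to complete one cycle under a constant offset."""
    progressed, ticks = 0.0, 0
    while progressed < period:
        progressed += 1.0 + clip_offset(delta)
        ticks += 1
    return ticks


if __name__ == "__main__":
    period = 80  # hypothetical: 80 ticks at 40 Hz is a 2 s nominal cycle
    for delta in (-0.5, 0.0, 0.25):
        print(delta, ticks_per_cycle(period, delta))
```

Under these assumptions, a constant positive offset shortens the gait cycle (64 ticks instead of 80 for an offset of 0.25) and a negative offset lengthens it (160 ticks for -0.5). A learned, time-varying offset would make swing and stance durations aperiodic in the same way.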

How

Fig. 2: Overview of our training framework. (L) We propose to train a feedforward RL agent while exposing it to randomiz

End-to-end policy learning with model-free deep RL
Training in simulation with a randomization curriculum over compliant and uneven terrain
Observation space: proprioceptive measurements (joint positions/velocities, IMU, motor currents), walking mode (3D one-hot), speed reference (1D scalar), clock signal (sin/cos cyclic phase variable)
Action space: 12D joint position commands (6 joints × 2 legs), tracked using fixed motor offsets and a low-gain PD controller
Reward function: Bipedal gait terms + mode command tracking + realistic motion terms (see Table III)
Clock control policy: Learns to modulate the phase variable φ, which dynamically adjusts the effective cycle period L
40 Hz policy execution + 1000 Hz PD control loop
Early termination: The episode ends when the root height falls below 60 cm or a self-collision occurs
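
The interface described in the list above can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the mode names and their order, the ordering of the observation vector, the IMU dimension, and the rule that the PD target is the action plus the fixed offset are all hypothetical. Only the rates, the 12D action, the 3D one-hot mode, and the 60 cm threshold come from the summary.

```python
POLICY_HZ = 40
PD_HZ = 1000
PD_STEPS_PER_ACTION = PD_HZ // POLICY_HZ  # 25 PD updates per policy action
MIN_ROOT_HEIGHT_M = 0.60                  # early-termination threshold (60 cm)
MODES = ("standing", "stepping_in_place", "walking_forward")  # hypothetical names


def one_hot_mode(mode):
    """Encode the walking mode as a 3D one-hot vector."""
    return [1.0 if m == mode else 0.0 for m in MODES]


def build_observation(joint_pos, joint_vel, imu, motor_currents,
                      mode, speed_ref, clock):
    """Concatenate proprioception, mode, speed reference, and clock (assumed order)."""
    return (list(joint_pos) + list(joint_vel) + list(imu) + list(motor_currents)
            + one_hot_mode(mode) + [float(speed_ref)] + list(clock))


def pd_torques(action, offsets, q, qd, kp, kd):
    """PD torque per joint toward the target (action + fixed offset, assumed)."""
    return [kp * (a + o - qi) - kd * qdi
            for a, o, qi, qdi in zip(action, offsets, q, qd)]


def should_terminate(root_height_m, self_collision):
    """End the episode if the root drops below 60 cm or the robot self-collides."""
    return root_height_m < MIN_ROOT_HEIGHT_M or self_collision
```

In a training loop built on this sketch, each policy action would be held for `PD_STEPS_PER_ACTION` PD updates, with `should_terminate` checked after every policy step.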

Originality

A successful real-world implementation of an end-to-end RL policy on compliant surfaces and irregular terrain for the life-sized humanoid HRP-5P (prior work was mostly limited to flat surfaces or simulation)
Introduces adaptive gait frequency through clock signal modulation, realizing aperiodic motion (in contrast to existing fixed-cycle approaches)
Achieves terrain adaptation from proprioceptive feedback alone in a blind locomotion setting (no exteroceptive sensors)
Demonstrates robust sim-to-real transfer through an effective terrain randomization curriculum design

Limitation & Further Study

Lacks a detailed analysis of the sim-to-real gap caused by the physics mismatch between simulation and the real robot
Limited quantitative comparison of how much the clock control policy improves performance (only a systematic evaluation in simulation is presented)
Insufficient failure case analysis for extremely soft surfaces or very high step obstacles
Computational cost and training time are not reported
No discussion of whether the approach generalizes to other humanoid platforms (e.g., TOCABI)
The actual effect of the adaptive clock control policy is not validated on the real robot (validated in simulation only)

Evaluation

Novelty: 4/5, Technical Soundness: 3/5, Significance: 4/5, Clarity: 4/5, Overall: 4/5

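Overall assessment: A meaningful study that demonstrates a real-world implementation of a deep RL approach to challenging-terrain walking on a life-sized humanoid, and clearly shows the effectiveness of sim-to-real transfer and adaptive gait control. The work would be more complete with validation of the clock control policy on the real robot and a stronger failure case analysis.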