Reduced-Order Model-Guided Reinforcement Learning for Demonstration-Free Humanoid Locomotion

Essence

Figure 1: Overview of the ROM-GRL framework. In Stage 1, a 4-DOF ROM policy is trained in Box2D: the policy

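ROM-GRL is a two-stage reinforcement learning framework that trains a full-body humanoid policy from gait templates generated by a 4-DOF reduced-order model (ROM), without any motion capture data. An adversarial discriminator steers the full-body policy toward the distribution of the ROM's 5-dimensional gait features, which yields natural-looking walking.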

Motivation

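Known: Pure reward-based RL needs detailed reward design and can produce unnatural motion, whereas motion-capture-based imitation learning achieves high realism but depends heavily on mocap data.
Gap: Methods that generate natural, stable humanoid locomotion without motion capture data are lacking, and demonstration-free learning is hard to achieve without complex reward design.
Why: Collecting mocap data for real humanoid deployment is difficult and costly, so it matters to obtain natural gaits while reducing the uncertainty of reward design.
Approach: A two-stage framework first trains a lightweight ROM with PPO to generate energy-efficient gait templates, then distills them into a full-body policy using Soft Actor-Critic and an adversarial discriminator.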

Achievement

Figure 3 visualizes pelvis and foot trajectories for the ROM-GRL policy (blue) and the pure-reward baseline (orange),

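Demonstration-free learning: Learns natural locomotion without motion capture data or elaborate reward shaping
Multi-speed validation: Produces stable, symmetric gaits at 1 m/s and 4 m/s
Lower tracking error: Achieves substantially lower tracking error than the pure-reward baseline
Paradigm unification: Narrows the gap between reward-driven and imitation-based methods, enabling versatile humanoid behaviors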
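
This summary does not say how the tracking error is computed. As a minimal sketch, assuming it is the root-mean-square deviation of the measured forward velocity from the commanded speed (the function name and the choice of RMSE are illustrative assumptions, not taken from the paper), the metric could look like this:

```python
import numpy as np


def velocity_tracking_rmse(measured, commanded):
    """RMSE between measured forward velocities and a commanded speed.

    Assumption: the review does not define "tracking error"; RMSE over
    forward velocity is one common choice and is used here for illustration.
    """
    measured = np.asarray(measured, dtype=float)
    return float(np.sqrt(np.mean((measured - commanded) ** 2)))
```

Under this assumption, a rollout would be scored once per commanded speed (for example 1 m/s and 4 m/s), and the ROM-GRL policy and the pure-reward baseline would be compared on the same commands.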

How

Figure 2: Schematic of the planar ROM used to generate reference walking trajectories. The ROM consists of a central

Stage 1: Apply PPO to a 4-DOF planar ROM to generate a compact gait template
Stage 2: Extract 5-dimensional gait features (pelvis/foot trajectories, etc.) from the ROM trajectories
Integrate an adversarial discriminator into Soft Actor-Critic so that the student policy's feature distribution matches the teacher's
Separate high-level gait planning from low-level control through hierarchical decomposition
Use a fully differentiable training pipeline built on physics simulation
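
The discriminator-based feature matching in Stage 2 could be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the discriminator here is a plain logistic regression, the reward form `-log(1 - D(f))` is borrowed from common adversarial imitation practice, and the teacher and student features are synthetic Gaussian placeholders standing in for real ROM and humanoid rollouts. All class and function names are hypothetical.

```python
import numpy as np

FEATURE_DIM = 5  # the summary states a 5-dimensional gait feature


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class GaitDiscriminator:
    """Logistic-regression discriminator: D(f) = P(f came from the ROM teacher)."""

    def __init__(self, dim=FEATURE_DIM, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def prob(self, feats):
        return sigmoid(feats @ self.w + self.b)

    def update(self, teacher_feats, student_feats):
        """One gradient step on binary cross-entropy (teacher=1, student=0)."""
        x = np.vstack([teacher_feats, student_feats])
        y = np.concatenate(
            [np.ones(len(teacher_feats)), np.zeros(len(student_feats))]
        )
        err = self.prob(x) - y  # gradient of BCE with respect to the logits
        self.w -= self.lr * (x.T @ err) / len(x)
        self.b -= self.lr * err.mean()

    def style_reward(self, feats, eps=1e-8):
        """Assumed imitation reward: larger when features look teacher-like."""
        return -np.log(1.0 - self.prob(feats) + eps)


def demo(seed=0, steps=500):
    """Train on synthetic placeholder features (not real gait data)."""
    rng = np.random.default_rng(seed)
    teacher = rng.normal(loc=1.0, scale=0.3, size=(256, FEATURE_DIM))
    student = rng.normal(loc=0.0, scale=0.3, size=(256, FEATURE_DIM))
    disc = GaitDiscriminator()
    for _ in range(steps):
        disc.update(teacher, student)
    return disc, teacher, student
```

In a full training loop, `style_reward` would presumably be added to the task reward seen by the Soft Actor-Critic learner, and the discriminator would be refreshed with new student features as the policy improves. How the two reward terms are weighted is not given in this summary.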

Originality

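Uses a ROM as a substitute for motion capture, achieving demonstration-free learning and natural motion at the same time
Realizes imitation learning principles without mocap by matching gait feature distributions through an adversarial discriminator
A novel distillation scheme that transfers guidance from a lightweight teacher model to a high-dimensional policy
Proposes a hybrid paradigm that combines the strengths of reward design and imitation learning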

Limitation & Further Study

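Because the 4-DOF ROM is a simplification, applicability to complex dynamic motions (e.g., jumping, turning) is not evaluated
Validation covers only two speeds, 1 m/s and 4 m/s, so adaptability across a broader speed range is unclear
Whether the constraint imposed by the 5-dimensional gait features captures all relevant gait characteristics is not verified
Sim-to-real transfer to a physical robot is not evaluated
No analysis of what happens when the ROM dynamics do not match the actual humanoid
Future work: higher-DOF ROMs, extension to diverse gait styles or motions, experiments on physical robots, and improved robustness through domain randomization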

Evaluation

Novelty: 4/5, Technical Soundness: 3/5, Significance: 4/5, Clarity: 4/5, Overall: 4/5

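Overall assessment: ROM-GRL is a novel framework that makes creative use of a reduced-order model to remove the dependence on motion capture while achieving natural, stable humanoid locomotion. It effectively narrows the gap between reward design and imitation learning, but the limited speed range and the lack of validation on a physical robot leave its generalizability in question.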