$Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

Essence

Fig. 2: Model Training and Deployment: First, we pre-train the VLM on the EgoDex [20] dataset to autoregressively predic

Ψ0 proposes a foundation model for humanoid loco-manipulation built on a two-stage training paradigm: a VLM is first pre-trained on human egocentric video, and a flow-based action expert is then post-trained on humanoid robot data.

Motivation

Known: Existing VLA approaches either co-train on large, heterogeneous human and humanoid robot data or rely on large volumes of robot teleoperation data. Recent foundation models (RT, OpenVLA, GR00T, π0, etc.) improve manipulation performance by scaling up data.
Gap: Because human and humanoid kinematics and motion differ fundamentally, modeling two distinct action distributions with a single monolithic policy is inefficient, and existing approaches fall short in both data efficiency and performance.
Why: Collecting large-scale teleoperation data on humanoid robots is expensive, so extracting motion priors from scalable human egocentric video while bridging the embodiment gap is key to giving humanoids complex manipulation skills.
Approach: Pre-train a VLM on human egocentric video with next-action prediction to acquire visual-action representations, then post-train an MM-DiT-based flow-based action expert on humanoid robot data to learn precise control in the robot joint space. Real-time action chunking additionally mitigates the motion jitter caused by inference latency.
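To make "next-action prediction" concrete, here is a minimal sketch of how continuous actions could be turned into discrete tokens for an autoregressive VLM. The summary above does not say which tokenizer Ψ0 uses; uniform binning, the `[-1, 1]` range, and the 256-bin vocabulary are assumptions chosen for illustration, and both function names are hypothetical.

```python
def tokenize_action(action, low=-1.0, high=1.0, num_bins=256):
    """Map each continuous action dimension to a bin index (assumed uniform binning)."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)          # keep the value inside [low, high]
        idx = int((clipped - low) / (high - low) * num_bins)
        tokens.append(min(idx, num_bins - 1))         # the upper edge falls into the last bin
    return tokens


def detokenize_action(tokens, low=-1.0, high=1.0, num_bins=256):
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]
```

With this scheme the round-trip error per dimension is at most half a bin width, which is one reason a separate continuous action expert is attractive for precise joint-space control.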

Achievement

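Data efficiency: With only about 800 hours of human egocentric video and 30 hours of real robot data, the model exceeds the success rate of baselines trained on more than 10x as much data by over 40%.
Two-stage training paradigm: Separating VLM pre-training (learning task-level motion priors) from action expert post-training (learning embodiment-specific dynamics) makes the most of heterogeneous data.
Real-time control: A real-time action chunking mechanism suppresses the motion jitter caused by inference latency and yields smooth whole-body control.
Open ecosystem: The full system is open-sourced, including the data processing and training pipeline, the humanoid foundation model weights, and the real-time action inference engine.
Complex task performance: Validated on a real humanoid robot on long-horizon dexterous loco-manipulation tasks such as pull tray, pour water, wipe table, and grasp and place bottle.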

How

VLM pre-training: Qwen3-VL-2B is trained on the human egocentric videos of the EgoDex dataset with autoregressive next-action token prediction, yielding task-level motion priors and visual representations.
MM-DiT-based action expert: A multi-modal diffusion transformer is used in post-training to predict action chunks in joint space, conditioned on the VLM's vision-language features.
Two-stage post-training: Task-agnostic training on cross-task humanoid data comes first, followed by task-specific fine-tuning on in-domain teleoperated demonstrations.
Real-time action chunking: To mitigate motion jitter from inference latency, action chunking is accounted for at training time, and a lower-body controller is used to achieve smooth whole-body control.
Optimized teleoperation pipeline: A VR teleoperation pipeline based on MANUS gloves is tuned for manipulation, improving lower-body stability.
High-quality data selection: High-quality egocentric human manipulation video is chosen over noisy internet clips or heterogeneous cross-embodiment robot data.
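The "flow-based action expert" item can be illustrated with a generic flow-matching sketch. It assumes the common linear (rectified-flow) interpolation path between noise and the target action; the summary does not state Ψ0's exact formulation, so the path, the Euler sampler, and all function names here are illustrative assumptions. In a real system `velocity_fn` would be the MM-DiT conditioned on VLM features.

```python
def flow_matching_pair(action, noise, t):
    """Training pair on the assumed linear path x_t = (1 - t) * noise + t * action.

    Returns the interpolated point x_t and the target velocity (action - noise).
    """
    x_t = [(1.0 - t) * n + t * a for a, n in zip(action, noise)]
    target_v = [a - n for a, n in zip(action, noise)]
    return x_t, target_v


def flow_matching_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(target_v)


def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1 (action)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

At inference time the expert starts from noise and takes a small number of integration steps, which is what makes per-chunk latency (and therefore real-time chunking) a practical concern.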

Originality

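Rejects co-training on human and humanoid data with a single monolithic policy and instead proposes a decoupled two-stage training paradigm, offering a new perspective on the embodiment gap.
Identifies and validates a critical data recipe that prioritizes data quality and selection over raw data scaling.
Cleanly separates task-level semantics from embodiment-specific dynamics by decoupling VLM pre-training from flow-based action expert post-training.
Presents a practical technique, real-time action chunking, that addresses inference latency at training time.
Achieves a significant efficiency gain, outperforming baselines trained on more than 10x as much data by over 40%.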
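The real-time action chunking idea mentioned above can be sketched as follows. The summary only says that inference latency is handled at training time; the specific mechanism shown here (treating the actions that run during inference as a frozen prefix, masking them out of the loss, and splicing them into the new chunk at deployment) is an assumption for illustration, and all names are hypothetical.

```python
def loss_mask(horizon, delay):
    """Training-time mask: 0.0 for the already-committed prefix, 1.0 for steps to predict."""
    return [0.0 if i < delay else 1.0 for i in range(horizon)]


def committed_prefix(prev_chunk, start, delay):
    """Actions from the previous chunk that execute while the next inference is in flight."""
    return list(prev_chunk[start:start + delay])


def splice_chunk(prefix, predicted_chunk):
    """Freeze the first len(prefix) steps to the committed actions and keep the rest."""
    return list(prefix) + list(predicted_chunk[len(prefix):])
```

Because the new chunk agrees with what the robot is already doing during the latency window, the hand-over between chunks has no discontinuity, which is the source of the jitter this technique targets.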

Limitation & Further Study

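No analysis of whether the human egocentric videos in EgoDex are biased toward particular manipulation tasks.
Little detail on how the 30 hours of real-world humanoid data were collected, and limited discussion of generalization limits.
Scalability to other humanoid morphologies (Boston Dynamics Atlas, Tesla Optimus, etc.) is not sufficiently validated.
No justification for choosing the MM-DiT architecture and no comparative analysis against other flow-based models.
Follow-up work is needed on transfer learning performance for long-horizon tasks across more diverse embodiments, more efficient methods for collecting teleoperation data, and strategies for reducing the sim-to-real gap.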

Evaluation

Novelty: 4/5 Technical Soundness: 3/5 Significance: 4/5 Clarity: 4/5 Overall: 4/5

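Overall assessment: Ψ0 presents a clear two-stage training paradigm for bridging the human-humanoid embodiment gap and highlights the importance of high-quality data selection. By outperforming baselines trained on more than 10x as much data, it makes a significant contribution to humanoid loco-manipulation.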