World Models for Robotic Manipulation: A Survey

Essence

Fig. 2. Representation spectrum of world models. The five families are ordered by increasing structured inductive bias,

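A comprehensive survey of world models for robotic manipulation. It defines a world model as an action-conditioned predictive system, organized around three questions (which future representation is predicted, how prediction is connected to action, and at which stage of the learning pipeline the model is used), and presents five representation families together with a functional taxonomy.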

Motivation

Known: The concept of a world model originates in the forward-model idea from motor control and model-based reinforcement learning, and has recently expanded to video generation, geometric modeling, and physics-informed simulators. Because the term is ambiguous, however, latent dynamics models, action-conditioned video generators, 3D/4D scene predictors, and predictive modules inside VLA systems have often been conflated.
Gap: Existing surveys cover world models only partially (general world model surveys center on autonomous driving), restrict themselves to VLA systems (language-conditioned models), or miss the connection to the reinforcement learning and imitation learning traditions. A unified, manipulation-centered framework for predictive modeling was missing.
Why: Robotic manipulation requires predicting the future before acting, under constraints such as physical interaction, occlusion, and partial observability. Because prediction quality (visual plausibility vs. contact preservation) does not necessarily align with action effectiveness, it is important to make this design space explicit.
Approach: World models are organized along three axes: (1) representation family (image/video prediction, latent dynamics, motion fields and scene flow, 3D/4D structure, physics-informed dynamics), (2) prediction-action connection (integrated vs. explicit planners), and (3) pipeline stage (pretraining, post-training, inference). The survey reviews 34 manipulation datasets and synthesizes evaluation protocols for predictive fidelity, task performance, and simulator reliability.
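The three axes can be read as a small classification schema. The sketch below is only an illustration of that reading, not something the survey provides: the class names, the `filter_by` helper, and the two example entries (`method_a`, `method_b`) are hypothetical, and grouping motion fields with scene flow into a single family is an assumption made so that the count matches the five families the survey names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List, Optional


class Representation(Enum):
    # Axis 1: which future representation is predicted.
    IMAGE_VIDEO = "image/video prediction"
    LATENT_DYNAMICS = "latent dynamics"
    MOTION_FLOW = "motion fields and scene flow"  # grouping is an assumption
    STRUCTURE_3D_4D = "3D/4D structure"
    PHYSICS_INFORMED = "physics-informed dynamics"


class Connection(Enum):
    # Axis 2: how prediction is connected to action.
    INTEGRATED = "integrated prediction-action model"
    EXPLICIT_PLANNER = "explicit predictive planner"


class Stage(Enum):
    # Axis 3: where in the learning pipeline the model is used.
    PRETRAINING = "pretraining"
    POST_TRAINING = "post-training"
    INFERENCE = "inference"


@dataclass(frozen=True)
class WorldModelEntry:
    name: str
    representation: Representation
    connection: Connection
    stages: FrozenSet[Stage]


def filter_by(
    entries: Iterable[WorldModelEntry],
    representation: Optional[Representation] = None,
    connection: Optional[Connection] = None,
    stage: Optional[Stage] = None,
) -> List[WorldModelEntry]:
    """Return entries matching every axis value that was specified."""
    result = []
    for e in entries:
        if representation is not None and e.representation != representation:
            continue
        if connection is not None and e.connection != connection:
            continue
        if stage is not None and stage not in e.stages:
            continue
        result.append(e)
    return result


# Hypothetical entries, for illustration only.
catalog = [
    WorldModelEntry("method_a", Representation.IMAGE_VIDEO,
                    Connection.EXPLICIT_PLANNER, frozenset({Stage.INFERENCE})),
    WorldModelEntry("method_b", Representation.LATENT_DYNAMICS,
                    Connection.INTEGRATED,
                    frozenset({Stage.PRETRAINING, Stage.POST_TRAINING})),
]

planners_at_inference = filter_by(
    catalog, connection=Connection.EXPLICIT_PLANNER, stage=Stage.INFERENCE
)
```

Because the axes are treated as independent, a method is a point in the product of the three enumerations, and any axis can be queried without fixing the others.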

Achievement

Fig. 4. Five functional roles of infrastructure world models for robotic manipulation: synthetic experience generation,

A clear definition of world models for manipulation: an action-conditioned predictive system, distinguished from perception, inverse models, policies, and rewards
Systematic analysis of the five representation families: trade-offs among fidelity, planning horizon, computational cost, and robustness for each family
A functional taxonomy: distinguishes integrated prediction-action models from explicit predictive planners
Characterization of infrastructure roles: synthetic experience generation, candidate filtering, search-based evaluation, learned environment, and outcome verification
Lifecycle integration: role mapping across the pretraining, post-training, and inference adaptation stages
Consolidated evaluation protocols: methodology for assessing predictive fidelity, downstream task performance, and simulator reliability

How

Fig. 5. World models across the robot-learning lifecycle. During pretraining, predictive objectives learn reusable laten

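Gives an operational definition of a world model as an action-conditioned predictive system
Places the representation families on a spectrum for comparison
Classifies the ways prediction connects to action into a functional taxonomy
Clarifies the role of the world model at each stage of the learning pipeline
Reviews and categorizes 34 datasets
Groups evaluation protocols into predictive fidelity, task performance, and simulator reliability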
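To make the operational definition concrete, the following is a minimal sketch of an action-conditioned predictor used by an explicit predictive planner. Everything in it is an assumption for illustration: the 2-D toy state, the additive `toy_world_model`, and the random-shooting planner are stand-ins chosen for brevity, not methods taken from the survey.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]   # toy 2-D end-effector position (hypothetical)
Action = Tuple[float, float]  # toy 2-D displacement command (hypothetical)
Model = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Action-conditioned prediction: next state from current state and action."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(model: Model, state: State, actions: Sequence[Action]) -> State:
    """Predict the final state after applying a sequence of actions."""
    for a in actions:
        state = model(state, a)
    return state


def cost(state: State, goal: State) -> float:
    """Squared distance to the goal."""
    return (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2


def plan_random_shooting(
    model: Model,
    state: State,
    goal: State,
    horizon: int = 5,
    n_candidates: int = 256,
    seed: int = 0,
) -> List[Action]:
    """Explicit planner: sample action sequences, score each with the model,
    and keep the one whose predicted final state is closest to the goal."""
    rng = random.Random(seed)
    # Start from the do-nothing plan so the result is never worse than staying put.
    best: List[Action] = [(0.0, 0.0)] * horizon
    best_cost = cost(rollout(model, state, best), goal)
    for _ in range(n_candidates):
        candidate = [
            (rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2))
            for _ in range(horizon)
        ]
        c = cost(rollout(model, state, candidate), goal)
        if c < best_cost:
            best, best_cost = candidate, c
    return best


start, goal = (0.0, 0.0), (0.5, -0.3)
plan = plan_random_shooting(toy_world_model, start, goal)
predicted_final = rollout(toy_world_model, start, plan)
```

Here the model and the planner are separate components, which corresponds to the explicit-planner side of the taxonomy; an integrated prediction-action model would instead produce predictions and actions from a single network.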

Originality

First unification of the manipulation-centered design space: connects reinforcement learning, imitation learning, video generation, geometry, physics, and VLA in a single framework
Three orthogonal axes: separating representation, prediction-action connection, and pipeline stage clarifies design choices beyond the conventional view (a simple perception-prediction-control pipeline)
Infrastructure perspective: captures the evolution of world models from narrow dynamics predictors into general robot-learning infrastructure
Manipulation-specific considerations: emphasizes problems particular to manipulation, such as contact modeling, hallucination, action alignment, and closed-loop evaluation

Limitation & Further Study

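Ambiguous conceptual boundaries: the boundaries between perception and prediction, and between action and planning, remain debatable, and the operational definition does not resolve every case
Insufficient closed-loop evaluation: many methods are evaluated open-loop, and the relationship to closed-loop performance is not well established
Contact modeling: contact modeling, which is critical for closed-loop manipulation, is not adequately addressed by current world models
Hallucination control: methods for controlling hallucination in video generators and VLA systems are immature
Further study: (1) accelerate the development of contact-aware world models, (2) standardize closed-loop benchmarks, (3) improve the integration of physics priors with learned prediction, (4) connect language-grounded reasoning with geometric prediction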
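The gap between open-loop and closed-loop evaluation can be seen in a toy case. The sketch below is an assumption-laden illustration, not an experiment from the survey: a 1-D system whose true dynamics include a constant disturbance that the model ignores. The same imperfect model produces a large final error when its plan is executed open-loop and a small one when the plan is recomputed at every step.

```python
def true_step(x: float, a: float, disturbance: float = 0.05) -> float:
    """Real system: the action plus an unmodeled constant drift."""
    return x + a + disturbance


def open_loop_error(x0: float, goal: float, horizon: int) -> float:
    """Plan once, assuming the drift-free model x' = x + a, then execute blindly."""
    a = (goal - x0) / horizon  # equal steps that reach the goal under the model
    x = x0
    for _ in range(horizon):
        x = true_step(x, a)
    return abs(x - goal)


def closed_loop_error(x0: float, goal: float, horizon: int) -> float:
    """Replan from the observed state at every step, using the same model."""
    x = x0
    for t in range(horizon):
        a = (goal - x) / (horizon - t)  # spread the remaining gap over remaining steps
        x = true_step(x, a)
    return abs(x - goal)


open_err = open_loop_error(0.0, 1.0, horizon=10)
closed_err = closed_loop_error(0.0, 1.0, horizon=10)
```

In this toy setting the open-loop error grows with the horizon (ten steps of drift accumulate), while the closed-loop error stays at a single step of drift, which is one reason open-loop scores alone can misrepresent how useful a model is for control.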

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 4/5 Clarity: 4/5 Overall: 4/5

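Overall assessment: This survey makes an important contribution by unifying the fragmented world model literature in robotic manipulation. The framework of three orthogonal axes and the clear operational definition can guide design choices in future research, and the review of 34 datasets together with the consolidated evaluation protocols offers practical value. However, manipulation-specific challenges such as insufficient closed-loop evaluation and contact modeling remain unresolved, and the ambiguity of conceptual boundaries is not fully removed. Overall, it is an essential reference for understanding manipulation-centered predictive modeling, but it is more a synthesis than a source of specific technical innovation.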