WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos

Essence

Figure 2. Reconstruction Using the Generative Motion Prior. Given a metric-SLAMed egocentric videos, and the object temp

WHOLE holistically reconstructs hand and object trajectories in world space from egocentric videos by using a joint generative motion prior over hand-object interaction.

Motivation

Known: Existing methods either estimate hand pose and object pose independently or focus on recovering detailed geometry within short temporal windows.
Gap: There is no unified approach that models hand-object interaction jointly while recovering long, coherent trajectories in a global 3D world frame.
Why: Understanding egocentric manipulation is essential for downstream applications such as robot learning from demonstrations and AR/VR, and accurate interaction modeling requires that the hand and the object stay relationally consistent.
Approach: Train a diffusion-based generative motion prior that models the mutual dynamics of hand-object interaction. At test time, run guided generation driven by visual observations and VLM-derived contact cues to produce globally consistent trajectories.

Achievement

Figure 1. Given a metric-SLAMed egocentric video of a person interacting with the scene and the corresponding object tem

Joint Generative Prior: A diffusion-based motion prior reasons jointly over the interdependent motion of hand and object, resolving the inconsistency of separate predictions.
VLM-enhanced Contact Detection: A vision-language model enhanced with spatially grounded visual prompts achieves robust contact localization even in cluttered scenes.
State-of-the-art Performance: Substantially outperforms baseline methods on hand motion estimation, 6D object pose estimation, and interaction reconstruction.
Global 4D Motion Reconstruction: Uses metric SLAM to recover coherent hand-object trajectories over long temporal sequences in the world coordinate frame.
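
To illustrate what placing trajectories in the world coordinate frame involves, the following minimal sketch lifts per-frame camera-space points into world space using per-frame camera poses such as those a metric SLAM system provides. The function name `camera_to_world`, the array shapes, and the camera-to-world pose convention are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def camera_to_world(points_cam, R_wc, t_wc):
    """Lift camera-frame points to the world frame, frame by frame.

    points_cam: (T, N, 3) points expressed in each frame's camera coordinates.
    R_wc:       (T, 3, 3) camera-to-world rotations (assumed convention).
    t_wc:       (T, 3)    camera positions in the world frame.
    """
    # x_world = R_wc @ x_cam + t_wc, applied to every point of every frame.
    return np.einsum("tij,tnj->tni", R_wc, points_cam) + t_wc[:, None, :]

# Usage: one frame, camera rotated 90 degrees about z and shifted by (1, 2, 3).
R = np.array([[[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]]])
t = np.array([[1.0, 2.0, 3.0]])
p_cam = np.array([[[1.0, 0.0, 0.0]]])
p_world = camera_to_world(p_cam, R, t)
```

If the poses are instead stored as world-to-camera transforms, they would need to be inverted before use.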

How

The diffusion model is trained in a gravity-aware local frame to model the conditional distribution p(H, T, C | O, H̄) of hand-object motion.
An approximate hand trajectory H̄ from an off-the-shelf hand estimator is used as the initial condition.
At test time, diffusion steps and guidance steps alternate to refine the estimate iteratively.
2D segmentation masks serve as the visual observations from which the reprojection guidance objective is built.
Spatial prompt engineering is applied to a VLM (vision-language model) to generate contact labels automatically.
Contact labels are modeled as binary indicators C_{1:T} and used as interaction constraints.
Processing runs on fixed-length time windows (T=120) for computational efficiency.
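
The alternation between prior steps and guidance steps described above can be sketched as follows. This is a toy illustration under strong assumptions, not the paper's implementation: a temporal smoothing filter (`toy_denoise`) stands in for the learned diffusion prior, and a quadratic objective on 3D hand positions plus a contact term stands in for the mask reprojection and contact guidance. All function names, the 6-dimensional state (hand xyz, object xyz), and the step size are hypothetical.

```python
import numpy as np

T = 120  # window length quoted in the summary

def toy_denoise(x):
    """Hypothetical stand-in for one step of the learned prior: temporal smoothing."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]

def guidance_grad(x, observed_hand, contact):
    """Gradient of a toy objective:
    0.5 * ||hand - observed_hand||^2 + 0.5 * sum_t contact[t] * ||hand_t - obj_t||^2
    """
    hand, obj = x[:, :3], x[:, 3:]
    grad = np.zeros_like(x)
    grad[:, :3] = hand - observed_hand        # observation term
    diff = (hand - obj) * contact[:, None]    # active only on contact frames
    grad[:, :3] += diff
    grad[:, 3:] -= diff
    return grad

def guided_sample(observed_hand, contact, num_steps=50, step_size=0.1, seed=0):
    """Alternate a prior step with a gradient step on the guidance objective."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((T, 6))  # start from noise: columns are hand xyz, object xyz
    for _ in range(num_steps):
        x = toy_denoise(x)                                             # prior step
        x = x - step_size * guidance_grad(x, observed_hand, contact)   # guidance step
    return x

# Usage: a stationary observed hand with contact on every frame.
observed = np.tile(np.array([0.3, 0.0, 1.0]), (T, 1))
traj = guided_sample(observed, np.ones(T), num_steps=200)
```

In this toy setting the guidance pulls the hand toward the observations and, on contact frames, pulls hand and object together, while the smoothing step keeps the trajectory temporally coherent. A real system would replace both pieces with the learned denoiser and the reprojection objective.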

Originality

A novel formulation that models hand-object interaction with a joint generative prior, in contrast to earlier isolated pose estimation.
A new inference strategy that conditions the generative prior on test-time observations through a guided generation framework.
A visual prompt design that strengthens the VLM's spatial grounding capability and enables automatic contact annotation.
The first systematic application of joint interaction modeling to global 4D world-space trajectory reconstruction.

Limitation & Further Study

An object template is a hard requirement, which is more restrictive than template-free approaches.
Because the method depends on metric-SLAM input, camera localization errors can accumulate.
The fixed-length window of T=120 limits temporal flexibility, making very long or complex interaction sequences hard to handle.
The robustness of VLM-based contact labeling may vary with visual complexity.
Future work: integrating template-free object reconstruction, handling arbitrary-length sequences with sliding windows, and more sophisticated temporal modeling (e.g., Transformer-based prior).
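
The sliding-window idea mentioned as future work could look roughly like the sketch below: split a long sequence into overlapping windows of 120 frames, process each window independently, and average the overlapping outputs. This is an assumption about how such an extension might be structured, not something described in the paper. The names `window_starts` and `process_long_sequence`, the stride of 60, and the `process_window` callback are hypothetical.

```python
import numpy as np

def window_starts(num_frames, window=120, stride=60):
    """Start indices of overlapping windows that together cover every frame."""
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] != num_frames - window:
        starts.append(num_frames - window)  # final window aligned to the sequence end
    return starts

def process_long_sequence(frames, process_window, window=120, stride=60):
    """Run process_window on each window and average the overlapping outputs.

    process_window must return an array with the same shape as its input chunk.
    Plain averaging suits positions; rotations would need a proper blend.
    """
    n = len(frames)
    out = np.zeros(frames.shape, dtype=float)
    weight = np.zeros(n)
    for s in window_starts(n, window, stride):
        chunk = frames[s:s + window]
        out[s:s + len(chunk)] += process_window(chunk)
        weight[s:s + len(chunk)] += 1
    return out / weight.reshape(-1, *([1] * (frames.ndim - 1)))

# Usage: an identity "model" on a 250-frame sequence of 3D positions.
seq = np.arange(250 * 3, dtype=float).reshape(250, 3)
merged = process_long_sequence(seq, lambda chunk: chunk)
```

Averaging is the simplest possible merge. It does not by itself guarantee that consecutive windows agree, so a practical extension would likely also condition each window on the previous one.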

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 4/5 Clarity: 4/5 Overall: 4/5

Overall assessment: WHOLE is an innovative approach that models hand-object interaction with a joint generative prior to recover globally consistent world-space trajectories from egocentric video. It addresses the inconsistency of earlier isolated methods at its root and makes an important contribution to practical applications.