WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control

Authors: Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, Hongyang Li | Date: 2025-12-11 | URL: https://arxiv.org/abs/2512.11047


Essence

Figure 1: Introducing WholeBodyVLA, a humanoid system that operates on Agibot X2 robot and

WholeBodyVLA is a Vision-Language-Action framework that enables end-to-end whole-body loco-manipulation control for humanoid robots in large spaces. It learns from low-cost video through unified latent learning and relies on an LMO RL policy for precise locomotion execution.

Motivation

Achievement

How

Figure 2: Pipeline of WholeBodyVLA. LAM is pretrained on manipulation and manipulation-

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: WholeBodyVLA๋Š” humanoid loco-manipulation์˜ ์˜ค๋žœ ๊ณผ์ œ๋ฅผ action-free ์˜์ƒ ํ•™์Šต๊ณผ ๋งž์ถคํ˜• RL policy๋กœ ์ฐฝ์˜์ ์œผ๋กœ ํ•ด๊ฒฐํ•œ ๊ฐ•๋ ฅํ•œ ๊ธฐ์—ฌ์ด๋‹ค. ์‹ค์ œ ๋กœ๋ด‡์—์„œ์˜ ์ž…์ฆ๊ณผ 21.3% ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ์‹ค์งˆ์  ๊ฐ€์น˜๋ฅผ ์ฆ๋ช…ํ•˜๋‚˜, ๋‹จ์ผ ํ”Œ๋žซํผ ๊ฒ€์ฆ๊ณผ ์ด์‚ฐ ๋ช…๋ น ์ œ์•ฝ์€ ํ–ฅํ›„ ๊ฐœ์„  ๋Œ€์ƒ์ด๋‹ค.
