LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction

Authors: Haoru Xue, Xiaoyu Huang, Dantong Niu, Qiayuan Liao, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, Shankar Sastry | Date: 2025-06-16 | URL: https://arxiv.org/abs/2506.13751


Essence

Figure 1: Overview of our contributions. Top: we create a photorealistic and dynamically accurate

LeVERB proposes a hierarchical framework that encodes vision-language inputs into a latent action space for humanoid whole-body control, and presents the first sim-to-real-ready benchmark, comprising more than 150 tasks.
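As a rough illustration of the hierarchy described above, the sketch below wires a slow high-level encoder (vision-language input to latent) to a fast low-level policy (latent plus proprioception to joint targets). It is a toy under stated assumptions, not the authors' implementation: the class names, `LATENT_DIM`, `NUM_JOINTS`, `HIGH_LEVEL_PERIOD`, and the random linear maps standing in for trained networks are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

LATENT_DIM = 8         # hypothetical size of the latent action vector
NUM_JOINTS = 4         # hypothetical number of actuated joints
HIGH_LEVEL_PERIOD = 5  # hypothetical: low-level steps per high-level update


def make_matrix(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    """Random linear map standing in for a trained network (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m: List[List[float]], v: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class HighLevelEncoder:
    """Toy vision-language module: (image features, text features) -> latent."""

    def __init__(self, image_dim: int, text_dim: int, rng: random.Random):
        self.w = make_matrix(LATENT_DIM, image_dim + text_dim, rng)

    def encode(self, image_feat: Sequence[float], text_feat: Sequence[float]) -> List[float]:
        return matvec(self.w, list(image_feat) + list(text_feat))


class LowLevelPolicy:
    """Toy whole-body controller: (latent, proprioception) -> joint targets."""

    def __init__(self, proprio_dim: int, rng: random.Random):
        self.w = make_matrix(NUM_JOINTS, LATENT_DIM + proprio_dim, rng)

    def act(self, latent: Sequence[float], proprio: Sequence[float]) -> List[float]:
        return matvec(self.w, list(latent) + list(proprio))


def rollout(steps: int, image_feat: Sequence[float], text_feat: Sequence[float],
            proprio: Sequence[float]) -> Tuple[List[List[float]], int]:
    """Run the two-rate loop; return all actions and the number of encoder calls."""
    rng = random.Random(0)
    encoder = HighLevelEncoder(len(image_feat), len(text_feat), rng)
    policy = LowLevelPolicy(len(proprio), rng)
    latent: List[float] = [0.0] * LATENT_DIM
    actions: List[List[float]] = []
    high_level_calls = 0
    for t in range(steps):
        # The latent is refreshed only every HIGH_LEVEL_PERIOD steps and
        # held constant in between, so the low level runs at a higher rate.
        if t % HIGH_LEVEL_PERIOD == 0:
            latent = encoder.encode(image_feat, text_feat)
            high_level_calls += 1
        actions.append(policy.act(latent, proprio))
    return actions, high_level_calls
```

The point of the design is that the low-level policy never sees pixels or text, only the latent vector, so the two levels can run at different rates and be trained separately.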

Motivation

Achievement

How

Figure 3: Details of our data collection and training pipeline. Step 1: we collect a synthetic,

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: LeVERB makes significant progress in vision-language control for humanoid WBC, laying groundwork for the field by presenting the first latent instruction-following framework and a comprehensive sim-to-real benchmark. However, real-world deployment performance needs further improvement, and the approach needs validation through evaluation on a broader range of tasks.
