Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Authors: Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn | Date: 2025-02-26 | URL: https://arxiv.org/abs/2502.19417


Essence

Figure 1: Open-ended instruction following. Hi Robot enables robots to follow multi-stage instructions, adapt to real-ti

Hi Robot is a system that uses a hierarchical vision-language model structure so that a robot can process complex natural-language instructions and real-time feedback and carry out open-ended tasks. It proposes a two-level hierarchy: a high-level VLM interprets the complex prompt and generates atomic commands, and a VLA policy executes them.
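The two-level hierarchy described above can be sketched as a simple control loop. This is only an illustrative sketch, not the authors' implementation: `HierarchicalController`, `toy_high_level`, `toy_low_level`, and the `replan_every` parameter are hypothetical names, the two "models" are stub functions, and the rule for when the high level is re-queried (on new user feedback or every few steps) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical stand-in types; a real system would pass camera images and robot state.
Observation = dict
Action = List[float]


@dataclass
class HierarchicalController:
    # high_level(obs, prompt, feedback) -> atomic language command (stands in for the VLM)
    high_level: Callable[[Observation, str, Optional[str]], str]
    # low_level(obs, command) -> robot action (stands in for the VLA policy)
    low_level: Callable[[Observation, str], Action]
    replan_every: int = 5                 # assumed re-query interval, in control steps
    command: Optional[str] = None         # most recent atomic command
    log: List[str] = field(default_factory=list)

    def step(self, obs: Observation, prompt: str, t: int,
             feedback: Optional[str] = None) -> Action:
        # Re-query the high level when there is no command yet, when the user
        # gives feedback, or when the assumed re-query interval has elapsed.
        if self.command is None or feedback is not None or t % self.replan_every == 0:
            self.command = self.high_level(obs, prompt, feedback)
            self.log.append(self.command)
        # The low level always acts on the current atomic command.
        return self.low_level(obs, self.command)


def toy_high_level(obs: Observation, prompt: str, feedback: Optional[str]) -> str:
    # Stub: a real high-level model would reason over images and language.
    if feedback:
        return f"handle feedback: {feedback}"
    return f"subtask {obs['step'] // 5} of: {prompt}"


def toy_low_level(obs: Observation, command: str) -> Action:
    # Stub: returns a dummy action vector derived from the command and step.
    return [float(len(command)), float(obs["step"])]


def run(n_steps: int = 10):
    ctrl = HierarchicalController(toy_high_level, toy_low_level)
    actions = []
    for t in range(n_steps):
        feedback = "no tomatoes" if t == 7 else None  # simulated user interjection
        actions.append(ctrl.step({"step": t}, "make a sandwich", t, feedback))
    return ctrl.log, actions
```

Calling `run(10)` issues one low-level action per step while the high level is consulted only three times: at the start, at the assumed re-query interval, and when the simulated user feedback arrives. Keeping the two levels behind plain callables makes it easy to swap either stub for a real model without touching the loop.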

Motivation

Achievement

Figure 5: Comparisons to Prior Methods. Hi Robot outperforms GPT-4o and flat VLA on Table Bussing, Sandwich Making, and

How

Figure 2: Overview of hierarchical VLA. The policy consists

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: Hi Robot is an important contribution that substantially improves a robot's ability to follow complex instructions and incorporate real-time feedback, using a hierarchical VLM-VLA structure and synthetic prompt generation. It is validated experimentally on multiple platforms and outperforms prior methods, but the quality of the synthetic data, the limitations of the low-level policy, and the computational cost all need improvement.
