NaVILA: Legged Robot Vision-Language-Action Model for Navigation

Authors: An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, Xiaolong Wang | Date: 2024-12-05 | URL: https://arxiv.org/abs/2412.04453


Essence

Figure 2

Fig. 2: NaVILA is a two-level framework combining high-level visual language understanding with low-level locomotion con

NaVILA is a two-level framework that integrates a Vision-Language-Action model with a locomotion RL policy. It translates human language instructions into low-level joint control for a legged robot, enabling vision-language navigation in complex environments.
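
The two-level split can be illustrated with a minimal sketch. It assumes, purely for illustration and not as something stated in this review, that the high-level model emits short language actions such as "move forward 75 cm" or "turn left 30 degrees", and that the low-level locomotion policy accepts a velocity command. The names `VelocityCommand` and `parse_mid_level_action` and the nominal speeds are hypothetical, and the learned RL policy that turns velocity commands into joint targets is not shown.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class VelocityCommand:
    """Hypothetical command handed to the low-level locomotion policy."""
    linear_m_s: float     # forward speed in m/s
    angular_rad_s: float  # yaw rate in rad/s (positive = turn left)
    duration_s: float     # how long to hold the command


# Assumed nominal speeds for illustration; not values from the paper.
LINEAR_SPEED = 0.5   # m/s
ANGULAR_SPEED = 0.5  # rad/s


def parse_mid_level_action(text: str) -> VelocityCommand:
    """Map a language action from the high-level model to a velocity command."""
    text = text.lower()

    # e.g. "move forward 75 cm": drive straight for distance / speed seconds
    m = re.search(r"forward\s+(\d+(?:\.\d+)?)\s*cm", text)
    if m:
        distance_m = float(m.group(1)) / 100.0
        return VelocityCommand(LINEAR_SPEED, 0.0, distance_m / LINEAR_SPEED)

    # e.g. "turn left 30 degrees": rotate in place for angle / yaw-rate seconds
    m = re.search(r"turn\s+(left|right)\s+(\d+(?:\.\d+)?)\s*deg", text)
    if m:
        sign = 1.0 if m.group(1) == "left" else -1.0
        angle_rad = math.radians(float(m.group(2)))
        return VelocityCommand(0.0, sign * ANGULAR_SPEED, angle_rad / ANGULAR_SPEED)

    # Anything unrecognised (including "stop") becomes a zero-velocity command.
    return VelocityCommand(0.0, 0.0, 0.0)
```

Under these assumptions, the language interface is the only coupling between the two levels, so the same high-level output could in principle drive a different robot by swapping the low-level policy.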

Motivation

Achievement

Figure 1

Fig. 1: Real-world demonstration of NaVILA: Upon receiving human instructions, NaVILA uses a vision-language model to pr

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: NaVILA๋Š” ์–ธ์–ด ๊ธฐ๋ฐ˜ ๊ณ ์ˆ˜์ค€ ์ถ”๋ก ๊ณผ ์ €์ˆ˜์ค€ ๋กœ๋ด‡ ์ œ์–ด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๋ถ„๋ฆฌํ•˜๋Š” ํ˜์‹ ์  ํ”„๋ ˆ์ž„์›Œํฌ๋กœ, ๊ด‘๋ฒ”์œ„ํ•œ ๋ฒค์น˜๋งˆํฌ ๊ฐœ์„ , ์‹ค์„ธ๊ณ„ ๊ฒ€์ฆ, ๋กœ๋ด‡ ๊ฐ„ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ํ†ตํ•ด legged ๋กœ๋ด‡ ๋‚ด๋น„๊ฒŒ์ด์…˜์˜ ์‹ค์งˆ์  ์ง„์ „์„ ์ด๋ฃฌ ์šฐ์ˆ˜ํ•œ ์—ฐ๊ตฌ์ด๋‹ค.
