NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Authors: Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, He Wang | Date: 2024-02-24 | URL: https://arxiv.org/abs/2402.15852


Essence

Fig. 2: The overview of NaVid. The inputs of NaVid consist of the RGB frames from the online video observation {x0, · ·

NaVid is the first attempt to use a large video-based VLM to plan a robot's next action in vision-and-language navigation from RGB camera input alone. Without maps or depth information, it achieves state-of-the-art performance in both simulated and real-world environments.
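The closed loop described above (observe an RGB frame, append it to the video history, ask the model for the next step, act) can be sketched roughly as follows. This is only an illustrative sketch, not NaVid's implementation: the action names, `VideoHistory`, `navigate`, and the `plan_next_step` callback are all hypothetical stand-ins, and the toy planner at the end merely takes the place of the video-based VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical action vocabulary (an assumption; the review does not list NaVid's actions).
ACTIONS = ("FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")

Frame = bytes  # stand-in for one RGB image


@dataclass
class VideoHistory:
    """Stores the RGB frames seen so far; no map, depth, or odometry is kept."""
    frames: List[Frame] = field(default_factory=list)

    def add(self, frame: Frame) -> None:
        self.frames.append(frame)


def navigate(
    instruction: str,
    get_frame: Callable[[], Frame],
    plan_next_step: Callable[[Sequence[Frame], str], str],
    execute: Callable[[str], None],
    max_steps: int = 50,
) -> List[str]:
    """Run an observe -> plan -> act loop and return the planned actions."""
    history = VideoHistory()
    planned: List[str] = []
    for _ in range(max_steps):
        history.add(get_frame())  # newest RGB observation
        action = plan_next_step(history.frames, instruction)
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        planned.append(action)
        if action == "STOP":  # the planner decides the episode is over
            break
        execute(action)
    return planned


def toy_planner(frames: Sequence[Frame], instruction: str) -> str:
    # Placeholder for the video-based VLM: stop once three frames were seen.
    return "STOP" if len(frames) >= 3 else "FORWARD"


executed: List[str] = []
trace = navigate("walk to the sofa", lambda: b"rgb", toy_planner, executed.append)
print(trace)
```

The planner receives only the frame history and the instruction text, which mirrors the RGB-only setting the review describes; a real system would replace `toy_planner` with a call to the model.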

Motivation

Achievement

Fig. 4: (a) Success Rate of NaVid on different steps during

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: NaVid is an innovative study that successfully applies the strong generalization ability of VLMs to VLN, and it is the first practical VLA model to perform real-robot navigation in continuous environments using RGB alone. It elegantly addresses the long-standing Sim-to-Real transfer problem and shows strong cross-dataset generalization.
