WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

Authors: Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen | Date: 2025-03-04 | URL: https://arxiv.org/abs/2503.02247


Essence

Figure 2

Fig. 2: The WMNav framework. After acquiring the RGB-D panoramic image and pose information at step t, the

The paper proposes WMNav, a framework that builds a world model on top of a Vision-Language Model (VLM) for the Object Goal Navigation task. The world model predicts future states and uses memory to improve the navigation policy. An online-maintained memory structure called the Curiosity Value Map, combined with a two-stage action proposal strategy, mitigates VLM hallucination while improving exploration efficiency.
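The interplay between the memory map and the two-stage proposal can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the class and function names, the 0 to 10 score range, the rule of keeping the lower of the old and new score, and the `pick` callback that stands in for a VLM query.

```python
class CuriosityValueMap:
    """Toy grid memory: every cell holds a curiosity score (assumed range 0 to 10)."""

    def __init__(self, size, initial=10.0):
        # Unvisited cells start at the highest curiosity (assumption).
        self.grid = [[initial] * size for _ in range(size)]

    def update(self, cells, score):
        # Assumed merge rule: keep the lower of the old and new score, so a
        # region once judged unpromising is not revived by a later guess.
        for r, c in cells:
            self.grid[r][c] = min(self.grid[r][c], score)

    def mean_score(self, cells):
        return sum(self.grid[r][c] for r, c in cells) / len(cells)


def propose_action(memory, sectors, candidates, pick):
    """Two-stage selection (hypothetical helper).

    Stage 1: choose the view sector with the highest remembered curiosity.
    Stage 2: choose one candidate action inside that sector via `pick`,
             a stand-in for asking the VLM.
    """
    best = max(sectors, key=lambda name: memory.mean_score(sectors[name]))
    return best, pick(candidates[best])


# Usage with made-up scores in place of real VLM outputs.
memory = CuriosityValueMap(size=4)
sectors = {"left": [(0, 0), (1, 0)], "right": [(0, 3), (1, 3)]}
memory.update(sectors["left"], 2.0)
memory.update(sectors["right"], 7.0)
memory.update(sectors["right"], 9.0)  # a higher later score does not overwrite

direction, action = propose_action(
    memory,
    sectors,
    candidates={"left": ["a0"], "right": ["a1", "a2"]},
    pick=lambda options: options[0],
)
print(direction, action)
```

Restricting the second stage to one sector narrows what the model has to choose from, which is one plausible reading of how the strategy limits the effect of hallucinated scores.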

Motivation

Achievement

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ VLM์„ world model๋กœ ํ™œ์šฉํ•˜๋Š” ํ˜์‹ ์ ์ธ ์ ‘๊ทผ์œผ๋กœ zero-shot object navigation์—์„œ ์ƒˆ๋กœ์šด ๋ฐฉํ–ฅ์„ ์ œ์‹œํ•˜๋ฉฐ, Curiosity Value Map ๋ฐ ๋‘ ๋‹จ๊ณ„ ํ–‰๋™ ์ œ์•ˆ ์ „๋žต์ด ํšจ๊ณผ์ ์œผ๋กœ ํƒ์ƒ‰ ํšจ์œจ์„ฑ์„ ๋†’์ธ๋‹ค. ์ฒด๊ณ„์ ์ธ ์„ค๊ณ„์™€ ๊ฐ•๋ ฅํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ๋กœ embodied AI ๋ถ„์•ผ์— ์ค‘์š”ํ•œ ๊ธฐ์—ฌ๋ฅผ ํ•œ๋‹ค.
