Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Authors: Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, Xihui Liu | Date: 2025-12-09 | URL: https://arxiv.org/abs/2512.08186


Essence

Figure 1: The proposed dual-system framework decouples high-level reasoning from low-level con-

DualVLN is the first dual-system foundation model for Vision-Language Navigation that separates high-level reasoning (System 2) from low-level control (System 1). A VLM-based global planner and a Diffusion Transformer-based policy cooperate asynchronously, which enables real-time control and dynamic obstacle avoidance.
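To make the slow-planner / fast-controller split concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a 2D point robot and simulates the asynchrony with a single loop in which the planner runs only every `plan_every` ticks while the controller runs on every tick. `SlowPlanner`, `fast_step`, `run`, and all parameters are hypothetical stand-ins for the VLM planner and the diffusion policy.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class SlowPlanner:
    """Hypothetical stand-in for System 2: returns the next waypoint on each call."""
    waypoints: List[Point]
    calls: int = 0

    def plan(self) -> Point:
        # Keep returning the last waypoint once the list is exhausted.
        goal = self.waypoints[min(self.calls, len(self.waypoints) - 1)]
        self.calls += 1
        return goal


def fast_step(pos: Point, goal: Point, max_step: float = 0.25) -> Point:
    """Hypothetical stand-in for System 1: move toward the goal by at most max_step."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if dist <= max_step:
        return goal
    scale = max_step / dist
    return (pos[0] + dx * scale, pos[1] + dy * scale)


def run(planner: SlowPlanner, start: Point, ticks: int, plan_every: int) -> List[Point]:
    """Run the fast controller every tick; refresh the goal only every plan_every ticks."""
    pos, goal = start, start
    trace: List[Point] = []
    for t in range(ticks):
        if t % plan_every == 0:
            goal = planner.plan()  # slow, infrequent update
        pos = fast_step(pos, goal)  # fast, every tick, reuses the latest goal
        trace.append(pos)
    return trace


if __name__ == "__main__":
    planner = SlowPlanner([(1.0, 0.0), (1.0, 1.0)])
    for p in run(planner, (0.0, 0.0), ticks=10, plan_every=5):
        print(p)
```

The point of the sketch is only the rate decoupling: the controller never waits for the planner and keeps acting on the most recent goal. A real system would run the two components concurrently and condition the policy on images rather than coordinates.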

Motivation

Achievement

How

Figure 2: Overview of DualVLN. System 2 takes as input a sequence of egocentric images and the

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: DualVLN is an innovative approach that systematically combines the reasoning ability of VLMs with the real-time control ability of diffusion policies for Vision-Language Navigation. It demonstrates strong results in both benchmark and real-world experiments and makes a substantial contribution to the practical deployment of robot navigation.
