Multimodal Spatial Language Maps for Robot Navigation and Manipulation

Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard | Date: 2025-06-07 | URL: https://arxiv.org/abs/2506.06862


Essence

Figure 1. AVLMaps provide an open-vocabulary 3D map

The paper proposes spatial language maps (VLMaps, AVLMaps) that fuse features from pretrained multimodal foundation models with a 3D reconstruction of the environment, for use in robot navigation and manipulation. With these maps, multimodal queries such as natural language, images, and audio can be grounded to target locations in space.

Motivation

Achievement

How

Figure 2. The creation and language-conditioned indexing of a VLMap. A VLMap is created by fusing pretrained visual-lang
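
The two steps named in the caption, map creation and language-conditioned indexing, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes toy 3-dimensional vectors in place of real pretrained visual-language features and text embeddings, a 2D grid in place of the 3D map, and hypothetical helper names (`build_map`, `index_map`). Each observation is a world position paired with a feature vector. Features that fall into the same grid cell are averaged, and each cell is then labeled with the text query whose embedding is most similar to it.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_map(observations, cell_size):
    """Average the features that land in each grid cell.

    observations: iterable of ((x, y), feature), where (x, y) is the
    world position (in metres) of a back-projected pixel.
    """
    sums = {}
    counts = defaultdict(int)
    for (x, y), feature in observations:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        if cell not in sums:
            sums[cell] = [0.0] * len(feature)
        sums[cell] = [s + f for s, f in zip(sums[cell], feature)]
        counts[cell] += 1
    return {
        cell: [s / counts[cell] for s in total]
        for cell, total in sums.items()
    }


def index_map(grid, label_embeddings):
    """Label each cell with the most similar text embedding."""
    return {
        cell: max(
            label_embeddings,
            key=lambda name: cosine(feature, label_embeddings[name]),
        )
        for cell, feature in grid.items()
    }


# Toy data: made-up embeddings, for illustration only.
label_embeddings = {
    "sofa": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "other": [0.0, 0.0, 1.0],
}
observations = [
    ((0.1, 0.1), [0.9, 0.1, 0.0]),
    ((0.2, 0.3), [0.8, 0.2, 0.1]),
    ((1.2, 0.1), [0.1, 0.9, 0.0]),
]

grid = build_map(observations, cell_size=0.5)
labels = index_map(grid, label_embeddings)
print(labels)
```

Because the map stores features rather than fixed class labels, the label set can be changed at query time without rebuilding the map, which is what makes the indexing open-vocabulary in this sketch.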

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ multimodal foundation models์„ 3D spatial map์— ์ฐฝ์˜์ ์œผ๋กœ ํ†ตํ•ฉํ•˜์—ฌ ๊ธฐ์กด ๋ฐฉ๋ฒ•์˜ ๊ณต๊ฐ„ ์ •๋ฐ€๋„์™€ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ดํ•ด ํ•œ๊ณ„๋ฅผ ๋™์‹œ์— ํ•ด๊ฒฐํ•œ ์˜๋ฏธ ์žˆ๋Š” ๊ธฐ์—ฌ๋‹ค. Audio modality์˜ ๋„์ž…๊ณผ ๋‹ค์–‘ํ•œ ๋กœ๋ด‡ ํ”Œ๋žซํผ ์ง€์›์œผ๋กœ ์‹ค์šฉ์  ํ™•์žฅ์„ฑ์ด ์šฐ์ˆ˜ํ•˜๋ฉฐ, 50% ์„ฑ๋Šฅ ํ–ฅ์ƒ ๋“ฑ ์ •๋Ÿ‰์  ๊ฒฐ๊ณผ๋„ ๊ฐ•๋ ฅํ•˜๋‹ค.
