L3MVN: Leveraging Large Language Models for Visual Target Navigation

Authors: Bangguo Yu, Hamidreza Kasaei, Ming Cao | Date: 2023-04-11 | URL: https://arxiv.org/abs/2304.05501


Essence

Figure 2

Fig. 2: The architecture of the target navigation framework. The framework takes RGB-D images as input to generate a

The paper proposes a framework that uses a large language model (LLM) for visual target navigation in unknown environments, combining a semantic map with frontier selection. It applies common-sense reasoning to explore efficiently under two paradigms: zero-shot and feed-forward.
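The frontier-selection idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `RELEVANCE` table is a hypothetical stand-in for the common-sense scores an LLM would supply, and the function names, frontier ids, and all numeric values are assumptions made for illustration only. The sketch scores each frontier by the objects observed near it and picks the highest-scoring one.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for LLM common-sense scores: how strongly an observed
# object suggests that the target is nearby. Values are illustrative only and
# do not come from the paper.
RELEVANCE: Dict[Tuple[str, str], float] = {
    ("bed", "pillow"): 0.9,
    ("bed", "nightstand"): 0.8,
    ("bed", "sofa"): 0.3,
    ("bed", "oven"): 0.05,
}

# Assumed fallback score for object pairs missing from the table.
DEFAULT_RELEVANCE = 0.1


def score_frontier(target: str, nearby_objects: List[str]) -> float:
    """Return the mean relevance of objects seen near a frontier (0.0 if none)."""
    if not nearby_objects:
        return 0.0
    scores = [RELEVANCE.get((target, obj), DEFAULT_RELEVANCE) for obj in nearby_objects]
    return sum(scores) / len(scores)


def select_frontier(target: str, frontiers: Dict[str, List[str]]) -> str:
    """Return the id of the frontier whose nearby objects score highest."""
    return max(frontiers, key=lambda fid: score_frontier(target, frontiers[fid]))


# Each frontier id maps to the object labels the semantic map shows near it.
frontiers = {
    "f_kitchen": ["oven"],
    "f_hall": ["sofa"],
    "f_bedroom": ["pillow", "nightstand"],
}
best = select_frontier("bed", frontiers)
print(best)
```

In a real system the lookup table would be replaced by a query to a language model, and the frontiers and nearby object labels would come from the semantic map built from RGB-D input.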

Motivation

Achievement

Figure 1

Fig. 1: Visual target navigation example. The robot explores

How


Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: LLM์˜ ์ƒ์‹์  ์ง€์‹์„ ์˜๋ฏธ์  ํƒ์ƒ‰์— ํ™œ์šฉํ•˜๋Š” ์ฐฝ์˜์ ์ธ ์ ‘๊ทผ์œผ๋กœ ํ•™์Šต ๋น„์šฉ์„ ํฌ๊ฒŒ ์ ˆ๊ฐํ•˜๋ฉด์„œ๋„ ์šฐ์ˆ˜ํ•œ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ๋‹ค. Zero-shot ํ•™์Šต ๋Šฅ๋ ฅ๊ณผ ์‹ค์ œ ๋กœ๋ด‡ ์‹คํ—˜์„ ํ†ตํ•ด ์‹ค์šฉ์„ฑ์„ ์ž…์ฆํ•œ ์˜๋ฏธ ์žˆ๋Š” ์—ฐ๊ตฌ์ด๋‚˜, ์‹ค์‹œ๊ฐ„ ์„ฑ๋Šฅ๊ณผ ๋‹ค์–‘ํ•œ ํ™˜๊ฒฝ์—์„œ์˜ ํ™•์žฅ์„ฑ ๊ฒ€์ฆ์ด ํ•„์š”ํ•˜๋‹ค.
