LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

Authors: Dhruv Shah, Blazej Osinski, Brian Ichter, Sergey Levine | Date: 2022-07-10 | URL: https://arxiv.org/abs/2207.04429


Essence

Figure 1: Embodied instruction following with LM-Nav: Our system takes as input a set of raw observations

LM-Nav is a system that combines three pre-trained models (GPT-3, CLIP, and ViNG) so that a robot can navigate real-world environments from natural-language instructions. It achieves long-horizon navigation in complex outdoor environments without any language annotations on the robot data.

Motivation

Achievement

Figure 4: Qualitative examples of LM-Nav in real-world environments executing textual instructions (left).

How

Figure 2: LM-Nav uses VLM to infer a joint probability distribu-
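
As a rough illustration of how the three models could fit together, the Python sketch below assumes that a language model has already turned the instruction into an ordered list of landmarks, that a vision-language model has scored how likely each node of the robot's topological graph is to show each landmark, and that the navigation model supplies traversal distances between nodes. All names (`shortest_paths`, `plan_landmark_nodes`, `alpha`) and all numbers are hypothetical, and the dynamic program is a simplified stand-in for illustration, not the authors' exact search procedure.

```python
import math


def shortest_paths(n, edges):
    """All-pairs shortest distances (Floyd-Warshall) on an undirected graph."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0.0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def plan_landmark_nodes(start, probs, dist, alpha=0.1):
    """Choose one graph node per landmark, in instruction order.

    Maximizes sum(log probs[i][node_i]) - alpha * total travel distance.
    probs[i][v] is an assumed stand-in for a VLM score that node v shows
    landmark i; alpha (> 0) is a hypothetical distance penalty.
    """
    n = len(dist)
    prev = [-math.inf] * n   # best score with the robot ending at each node
    prev[start] = 0.0
    back = []                # back[i][v]: node visited before v for landmark i
    for row in probs:
        cur = [-math.inf] * n
        arg = [start] * n
        for v in range(n):
            log_p = math.log(row[v]) if row[v] > 0 else -math.inf
            for u in range(n):
                if prev[u] == -math.inf or dist[u][v] == math.inf:
                    continue  # unreachable state or disconnected pair
                score = prev[u] - alpha * dist[u][v] + log_p
                if score > cur[v]:
                    cur[v], arg[v] = score, u
        back.append(arg)
        prev = cur
    # Backtrack from the best final node to recover the chosen sequence.
    v = max(range(n), key=lambda i: prev[i])
    chosen = []
    for arg in reversed(back):
        chosen.append(v)
        v = arg[v]
    chosen.reverse()
    return chosen


# Toy graph: four nodes in a line, unit distance between neighbours.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
dist = shortest_paths(4, edges)
# Made-up scores for two landmarks over the four nodes.
probs = [[0.05, 0.8, 0.1, 0.05],
         [0.05, 0.05, 0.1, 0.8]]
print(plan_landmark_nodes(0, probs, dist))  # prints [1, 3]
```

In this toy setup the first landmark is grounded at node 1 and the second at node 3, since those nodes have the highest scores and the distance penalty is small. Raising `alpha` would trade grounding confidence for shorter routes.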

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall: LM-Nav is an innovative work that removes a major bottleneck in robot learning (language annotation) through a novel combination of large pre-trained models, while still achieving natural-language navigation in real-world environments. Its modular, fine-tuning-free design and real-world validation point to high impact for both academia and industry.
