VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Authors: Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei | Date: 2023-07-12 | URL: https://arxiv.org/abs/2307.05973


Essence

Figure 1: VOXPOSER extracts language-conditioned affordances and constraints from LLMs and grounds

A zero-shot robot manipulation method that uses the affordance-reasoning and code-writing abilities of LLMs to generate 3D value maps, which a model-based planner then uses to synthesize robot trajectories.
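The idea of composing a 3D value map and planning over it can be illustrated with a toy sketch. This is not the authors' implementation: in VoxPoser the maps come from LLM-written code grounded in perception, whereas here the target and obstacle positions are hard-coded, the map is expressed as a cost (lower is better), and the planner is a simple greedy descent. The grid size, the function names `compose_cost_map` and `greedy_plan`, and the weights are all assumptions made for illustration.

```python
import itertools
import math

GRID = 10  # hypothetical voxel grid side length (assumption, not from the paper)


def compose_cost_map(target, obstacle, avoid_radius=2.0, avoid_weight=5.0):
    """Compose an attraction term and an avoidance term into one voxel cost map.

    Cost of a voxel = distance to the target (affordance-like term)
                    + a penalty that grows as the voxel gets closer to the
                      obstacle than avoid_radius (constraint-like term).
    """
    cost = {}
    for voxel in itertools.product(range(GRID), repeat=3):
        attract = math.dist(voxel, target)
        repel = avoid_weight * max(0.0, avoid_radius - math.dist(voxel, obstacle))
        cost[voxel] = attract + repel
    return cost


def greedy_plan(cost, start, max_steps=100):
    """Follow the steepest cost decrease over the 26-connected neighborhood."""
    path = [start]
    current = start
    for _ in range(max_steps):
        neighbors = [
            tuple(c + d for c, d in zip(current, delta))
            for delta in itertools.product((-1, 0, 1), repeat=3)
            if any(delta)  # skip the zero offset (the current voxel itself)
        ]
        neighbors = [n for n in neighbors if n in cost]  # stay inside the grid
        best = min(neighbors, key=cost.get)
        if cost[best] >= cost[current]:
            break  # no neighbor improves the cost: local minimum reached
        current = best
        path.append(current)
    return path


start, target, obstacle = (0, 0, 0), (9, 9, 9), (5, 5, 5)
cost_map = compose_cost_map(target, obstacle)
path = greedy_plan(cost_map, start)
print(path)
```

With the obstacle placed directly on the straight line from start to target, the planned path bends around the penalized region and still ends at the target voxel. Greedy descent can stall in local minima for other layouts, which is one reason a real system would use a stronger planner.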

Motivation

Achievement

Figure 3: Visualization of composed 3D value maps and rollouts in real-world environments. The top row

How

Figure 2: Overview of VOXPOSER. Given the RGB-D observation of the environment and a language in-

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: VoxPoser is an innovative method that, for the first time, effectively connects the high-level reasoning and code-generation abilities of LLMs to 3D robot manipulation. It is a meaningful contribution that demonstrates zero-shot generalization and applicability to real robots. However, improvements are needed in affordance accuracy, long-horizon planning, and computational efficiency.
