VIMA: General Robot Manipulation with Multimodal Prompts

Authors: Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan | Date: 2022-10-06 | URL: https://arxiv.org/abs/2210.03094


Essence

Figure 1: Multimodal prompts for task specification. We observe that many robot manipulation tasks can be expressed as

The paper uses multimodal prompts (interleaved text and images) to cast a wide range of robot manipulation tasks as a single sequence modeling problem, and presents VIMA, a transformer-based robot agent that can process such prompts.
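A minimal sketch of how an interleaved text-and-image prompt can be treated as one sequence, under a toy representation. The class names `TextSegment` and `ImageSegment`, the `flatten_prompt` helper, and the `<img:...>` placeholder tokens are hypothetical illustrations, not taken from the paper or its code release. A real system would replace each placeholder with learned image embeddings rather than a string.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    name: str  # hypothetical identifier standing in for image pixels


Segment = Union[TextSegment, ImageSegment]


def flatten_prompt(segments: List[Segment]) -> List[str]:
    """Flatten an interleaved prompt into a single token sequence.

    Text is split on whitespace; each image becomes one placeholder token.
    """
    tokens: List[str] = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            tokens.extend(seg.text.split())
        else:
            tokens.append(f"<img:{seg.name}>")
    return tokens


# Toy prompt: words and images share one ordered sequence.
prompt = [
    TextSegment("Put the"),
    ImageSegment("red_block"),
    TextSegment("into the"),
    ImageSegment("green_bowl"),
]
tokens = flatten_prompt(prompt)
print(tokens)
```

The point of the sketch is only that text and images occupy positions in the same ordered sequence, which is what lets different task types share one model input format.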

Motivation

Achievement

Figure 4: Scaling model and data. Top: We compare performance of different methods with model sizes ranging from 2M

How

Figure 3: VIMA Architecture. We encode the multimodal prompts with a pre-trained T5 model, and condition the

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ”„๋กฌํ”„ํŠธ๋ฅผ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ๋กœ๋ด‡ ์กฐ์ž‘ ์ž‘์—…์„ ํ†ต์ผ๋œ ํ”„๋ ˆ์ž„์›Œํฌ๋กœ ํ‘œํ˜„ํ•œ ํš๊ธฐ์  ์ ‘๊ทผ๋ฒ•์œผ๋กœ, ์ฒด๊ณ„์ ์ธ ๋ฒค์น˜๋งˆํฌ์™€ ํ•จ๊ป˜ ๋†’์€ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ๋กœ๋ด‡ ํ•™์Šต์˜ task specification ๋ฌธ์ œ์— ๋Œ€ํ•œ ์ฐฝ์˜์  ํ•ด๊ฒฐ์ฑ…์„ ์ œ์‹œํ•˜๋ฉฐ ๊ฐœ๋ฐฉํ˜• ์žฌํ˜„ ์ž๋ฃŒ๋ฅผ ํ†ตํ•ด ์ปค๋ฎค๋‹ˆํ‹ฐ ๊ธฐ์—ฌ๋„ ๋†’๋‹ค.
