Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

Authors: Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, Jonathan Tompson | Date: 2022-11-21 | URL: https://arxiv.org/abs/2211.11736


Essence

Figure 1

Fig. 1: DIAL consists of three steps: (1) Contrastive fine-tuning of a vision-language model (VLM) such as CLIP [39] on

The paper proposes DIAL, a method that fine-tunes a vision-language model (CLIP) to automatically generate natural-language instructions for a large, unannotated robot manipulation dataset, and then uses those instructions to train a language-conditioned policy.
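As a rough illustration of how a fine-tuned VLM could attach instructions to unlabeled episodes, the sketch below ranks candidate instructions by cosine similarity to an episode embedding and keeps the top k. Everything here is an assumption made for illustration: the function names `cosine` and `relabel`, the three-dimensional toy embeddings, and the candidate instructions are hypothetical, and the embeddings are taken as already computed by some VLM. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def relabel(episode_embedding, candidates, k=1):
    """Return the k candidate instructions most similar to the episode.

    `candidates` maps instruction text to a (hypothetical) text embedding.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(episode_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy embeddings, invented for this example only.
candidates = {
    "pick up the can": [0.9, 0.1, 0.0],
    "open the drawer": [0.0, 1.0, 0.1],
    "move the sponge left": [0.1, 0.0, 1.0],
}
episode = [0.8, 0.2, 0.1]

labels = relabel(episode, candidates, k=2)
print(labels)
```

Keeping more than one instruction per episode (k > 1) trades label precision for more diverse language; which setting works best is an empirical question that this sketch does not answer.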

Motivation

Achievement

Figure 5

Fig. 5: Given the same starting scene, DIAL follows the instructions of (a) pick can which is on the right of

How

Figure 1

Fig. 1: DIAL consists of three steps: (1) Contrastive fine-tuning of a vision-language model (VLM) such as CLIP [39] on
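The first step named in the caption, contrastive fine-tuning of a CLIP-like model, can be pictured with a CLIP-style symmetric contrastive loss. The sketch below is a minimal pure-Python version that assumes a precomputed n x n similarity matrix in which matched episode/instruction pairs sit on the diagonal. The name `clip_style_loss` and the temperature default of 0.07 are illustrative assumptions; whether DIAL uses exactly this loss or temperature is not stated in this review.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over an n x n similarity matrix.

    sim[i][j] is the similarity between episode i and instruction j.
    Matched pairs are assumed to lie on the diagonal.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose
    # Episode -> instruction direction: each row should pick its own column.
    ep_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Instruction -> episode direction: each column should pick its own row.
    text_to_ep = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (ep_to_text + text_to_ep)


aligned = [[1.0, 0.0], [0.0, 1.0]]    # matched pairs score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # mismatched pairs score highest

print(clip_style_loss(aligned) < clip_style_loss(shuffled))
```

The loss is small when each episode is most similar to its own instruction and large when the pairing is scrambled, which is the signal that fine-tuning would minimize.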

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: The paper presents a practical, scalable method that uses a VLM as a data annotation tool, and it demonstrates the method's effectiveness through more than 1,300 real-robot evaluations. It is a valuable contribution that could substantially improve the cost efficiency of robot learning.
