Vision-Language Foundation Models as Effective Robot Imitators

Authors: Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, Tao Kong | Date: 2023-11-02 | URL: https://arxiv.org/abs/2311.01378


Essence

Figure 1: Comparison among RoboFlamingo and existing vision-language manipulation solutions.

RoboFlamingo is a framework that builds robot manipulation policies on top of OpenFlamingo, an open-source VLM. It decouples vision-language understanding from decision-making and achieves high performance with minimal fine-tuning.
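The decoupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: `frozen_vlm_features`, `PolicyHead`, `policy_step`, `FEATURE_DIM`, and the 7-dimensional action are hypothetical names and sizes chosen for illustration, and the "backbone" is a deterministic stand-in rather than a real VLM. The point is only the structure: a fixed vision-language encoder produces features, and a small separate head maps a history of those features to an action.

```python
import random
from dataclasses import dataclass, field

FEATURE_DIM = 8  # hypothetical feature size, chosen for illustration
ACTION_DIM = 7   # assumption: 6 pose values plus 1 gripper value


def frozen_vlm_features(image, instruction):
    """Stand-in for a pre-trained VLM backbone (hypothetical).

    It has no trainable parameters and always returns the same feature
    vector for the same (image, instruction) pair.
    """
    seed = sum(image) * 31 + sum(instruction.encode("utf-8"))
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]


@dataclass
class PolicyHead:
    """Small head kept separate from the backbone (hypothetical linear layer)."""
    weights: list = field(
        default_factory=lambda: [[0.0] * FEATURE_DIM for _ in range(ACTION_DIM)]
    )
    bias: list = field(default_factory=lambda: [0.0] * ACTION_DIM)

    def act(self, feature_history):
        # Mean-pool the features over the observation history,
        # then apply a linear map to obtain the action vector.
        n = len(feature_history)
        pooled = [sum(f[i] for f in feature_history) / n for i in range(FEATURE_DIM)]
        return [
            sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(self.weights, self.bias)
        ]


def policy_step(head, history, image, instruction):
    """One control step: encode the observation, then decide on an action."""
    history.append(frozen_vlm_features(image, instruction))
    return head.act(history)


# Usage: a toy 4-pixel "image" and a language instruction.
head = PolicyHead()
history = []
action = policy_step(head, history, [0, 1, 2, 3], "open the drawer")
```

In this sketch only `PolicyHead` holds parameters, which mirrors the idea that the understanding component and the decision component can be treated separately.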

Motivation

Achievement

Figure 3: Ablation studies on the ABCD → D setting.

How

Figure 2: The illustration of the proposed RoboFlamingo framework. The Flamingo backbone models

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: RoboFlamingo presents an effective way to build low-cost yet high-performing robot manipulation policies using an open-source VLM. Its clear design philosophy of separating vision-language understanding from policy learning contributes to the democratization of robotics.
