Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

์ €์ž: Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, Tao Kong | ๋‚ ์งœ: 2023-12-20 | URL: https://arxiv.org/abs/2312.13139 📄 PDF


Essence

Figure 1: Overview of GR-1. GR-1 is first pre-trained on the task of video prediction with a large-

GR-1์€ ๋Œ€๊ทœ๋ชจ ๋น„๋””์˜ค ์ƒ์„ฑ ์‚ฌ์ „ํ•™์Šต์„ ํ™œ์šฉํ•˜์—ฌ ๋ฉ€ํ‹ฐํƒœ์Šคํฌ ์–ธ์–ด-์กฐ๊ฑด๋ถ€ ์‹œ๊ฐ ๋กœ๋ด‡ ์กฐ์ž‘์„ ํ•™์Šตํ•˜๋Š” GPT-์Šคํƒ€์ผ transformer ๋ชจ๋ธ์ด๋‹ค. ๋กœ๋ด‡์€ ์–ธ์–ด ์ง€์‹œ, ๊ด€์ฐฐ ์ด๋ฏธ์ง€, ๋กœ๋ด‡ ์ƒํƒœ๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ๋กœ๋ด‡ ์•ก์…˜๊ณผ ๋ฏธ๋ž˜ ์ด๋ฏธ์ง€๋ฅผ end-to-end ๋ฐฉ์‹์œผ๋กœ ์˜ˆ์ธกํ•œ๋‹ค.

Motivation

Achievement

Figure 3: CALVIN Benchmark Results. We show examples of multi-task learning trained on

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: GR-1์€ ๋Œ€๊ทœ๋ชจ ๋น„๋””์˜ค ์ƒ์„ฑ ์‚ฌ์ „ํ•™์Šต์„ ๋กœ๋ด‡ ์กฐ์ž‘์— ์ ์šฉํ•˜์—ฌ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ๊ณผ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๋ณด์ธ ์˜๋ฏธ ์žˆ๋Š” ์—ฐ๊ตฌ์ด๋‹ค. Unified GPT-์Šคํƒ€์ผ ์•„ํ‚คํ…์ฒ˜์˜ ๋‹จ์ˆœ์„ฑ๊ณผ CALVIN ๋ฒค์น˜๋งˆํฌ์—์„œ์˜ ์šฐ์ˆ˜ํ•œ ์„ฑ๊ณผ, ๊ทธ๋ฆฌ๊ณ  ์‹ค์ œ ๋กœ๋ด‡์—์„œ์˜ ๊ฒ€์ฆ์ด ๊ฐ•์ ์ด๋ฉฐ, ๋กœ๋ด‡ ํ•™์Šต์—์„œ ์ƒ์„ฑ ๋ชจ๋ธ์˜ ๊ฐ€๋Šฅ์„ฑ์„ ์ฒ˜์Œ์œผ๋กœ ์ฒด๊ณ„์ ์œผ๋กœ ์ž…์ฆํ–ˆ๋‹ค๋Š” ์ ์—์„œ ๊ฐ€์น˜ ์žˆ๋‹ค.
