Learning Universal Policies via Text-Guided Video Generation

Authors: Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, Pieter Abbeel | Date: 2023-01-31 | URL: https://arxiv.org/abs/2302.00111


Essence

Figure 1: Text-Conditional Video Generation as Universal Policies. Text-conditional video generations

The paper proposes a method for learning a universal policy that operates across diverse environments by using text-conditional video generation: it generates a sequence of future frames from the current image and a text description of the goal, then extracts actions from those frames with an inverse dynamics model.
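The two-stage pipeline described above (generate future frames, then recover actions from consecutive frames) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `toy_video_generator`, `toy_inverse_dynamics`, `plan_actions`, the two-number "frames", and the instruction-to-goal table are all hypothetical stand-ins assumed here for clarity. In the paper the generator is a learned text-conditional video model and the inverse dynamics model is learned as well.

```python
from typing import Callable, List

Frame = List[float]   # toy stand-in for an image: a short list of state values
Action = List[float]  # toy action: a displacement vector


def toy_video_generator(first_frame: Frame, instruction: str, horizon: int) -> List[Frame]:
    """Hypothetical stand-in for a text-conditional video model.

    Returns `horizon` future frames that move linearly from `first_frame`
    toward a goal looked up from the instruction (an assumed toy mapping).
    """
    goals = {"move right": [1.0, 0.0], "move up": [0.0, 1.0]}
    goal = goals[instruction]
    frames = []
    for k in range(1, horizon + 1):
        alpha = k / horizon
        frames.append([(1 - alpha) * s + alpha * g for s, g in zip(first_frame, goal)])
    return frames


def toy_inverse_dynamics(frame: Frame, next_frame: Frame) -> Action:
    """Hypothetical inverse dynamics: in this toy world the action is simply
    the displacement that takes `frame` to `next_frame`."""
    return [b - a for a, b in zip(frame, next_frame)]


def plan_actions(
    first_frame: Frame,
    instruction: str,
    horizon: int,
    video_generator: Callable[[Frame, str, int], List[Frame]],
    inverse_dynamics: Callable[[Frame, Frame], Action],
) -> List[Action]:
    """Stage 1: generate future frames. Stage 2: label each consecutive
    frame pair with an action using the inverse dynamics model."""
    frames = [first_frame] + video_generator(first_frame, instruction, horizon)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]


actions = plan_actions([0.0, 0.0], "move right", 4,
                       toy_video_generator, toy_inverse_dynamics)
```

In this sketch the policy never sees actions during "video" generation; actions only appear in the second stage, which is the separation the summary describes.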

Motivation

Achievement

Figure 3: Combinatorial Video Generation. Generated videos for unseen language goals at test time.

How

Figure 2: Given an input observation and text instruction, we

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall: With its creative approach of learning universal policies through video generation, this paper elegantly addresses the problems of environment diversity and reward design, and it makes a substantial contribution to reinforcement learning through combinatorial generalization and internet-scale knowledge transfer.
