Structured World Models from Human Videos

Authors: Russell Mendonca, Shikhar Bahl, Deepak Pathak | Date: 2023-08-21 | URL: https://arxiv.org/abs/2308.10901


Essence

Figure 2

Fig. 2: Overview of SWIM. We first pre-train the world model on a large set of human videos. We finetune this on many ro

This paper proposes SWIM, a framework that pre-trains a structured world model on large-scale human video data and fine-tunes it on robot manipulation tasks, so that complex manipulation skills can be learned with less than 30 minutes of real-world interaction.

Motivation

Achievement

Figure 4

Fig. 4: We evaluate SWIM on six different real-world manipulation tasks on two different robot systems (shown on the lef

How

Figure 3

Fig. 3: World Model Training: Images and actions are encoded into

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: Through the creative idea of a morphology-invariant structured action space, the paper successfully connects large-scale human video data to real robot learning. Its extensive experiments demonstrate both sample efficiency and generalization, making a meaningful contribution to robot manipulation learning.
