Humanoid Locomotion as Next Token Prediction

์ €์ž: Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik | ๋‚ ์งœ: 2024-02-29 | URL: https://arxiv.org/abs/2402.19469 📄 PDF


Essence

Figure 2: Humanoid locomotion as next token prediction. We collect a dataset on trajectories from various sources, such

์ด ๋…ผ๋ฌธ์€ ์ธ๊ฐ„ํ˜• ๋กœ๋ด‡์˜ ๋ณดํ–‰ ์ œ์–ด๋ฅผ ์–ธ์–ด ๋ชจ๋ธ๋ง์˜ next token prediction ๋ฌธ์ œ๋กœ ์žฌํ•ด์„ํ•œ ์—ฐ๊ตฌ์ด๋‹ค. causal transformer๋ฅผ ์ด์šฉํ•ด sensorimotor trajectories๋ฅผ ์ž๋™ํšŒ๊ท€์ ์œผ๋กœ ์˜ˆ์ธกํ•˜๋˜, ๋ถˆ์™„์ „ํ•œ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ(์˜ˆ: ์•ก์…˜ ์—†๋Š” ๋น„๋””์˜ค)๋„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„ํ–ˆ๋‹ค.

Motivation

Achievement

Figure 4: Training dataset. To train our model, we construct a dataset of trajectories coming from four different source

• Zero-shot real-world deployment: the learned policy walks successfully over varied terrain in San Francisco without additional training.

• Data efficiency: only 27 hours of walking data are enough to transfer to the real world.

• Command generalization: the policy generalizes to commands not seen during training, such as walking backward.

• Use of incomplete data: heterogeneous sources such as motion capture data and YouTube videos of humans are trained on together.

• Simulation performance: results are comparable to a state-of-the-art reinforcement learning method.

How

Figure 3: A general framework for training with different data sources. Our data modeling allows us to train our

• A sensorimotor trajectory T = (o₁, a₁, o₂, a₂, ..., oₜ, aₜ) is tokenized into K tokens.

• Autoregressive probability model: p(t) = ∏ₖ p(tₖ | tₖ₋₁, ..., t₁), for k = 1, ..., K.

• Training minimizes the negative log-likelihood; under a Gaussian assumption, an MSE loss is used.

• Complete trajectories (neural network policy, model-based controller) and incomplete trajectories (motion capture, YouTube) are unified through mask tokens.

• At test time, actions are executed autoregressively and the sensory predictions are ignored.

• The different data sources are combined either by joint training or by staged pre-training.
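The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the constant `mask_value` placeholder (a real model would more likely use a learned mask embedding), the assumption that observations and actions share one dimension D, and the choice to exclude masked slots from the loss are all assumptions made here for illustration.

```python
import numpy as np

def interleave(obs, act):
    """Interleave per-step arrays of shape (T, D) into (o1, a1, o2, a2, ...)."""
    T, D = obs.shape
    tokens = np.empty((2 * T, D), dtype=obs.dtype)
    tokens[0::2] = obs   # even slots hold observations
    tokens[1::2] = act   # odd slots hold actions
    return tokens

def build_tokens(obs, act=None, mask_value=0.0):
    """Return (tokens, valid); valid[k] is False where a mask token was inserted."""
    T, D = obs.shape
    valid = np.ones(2 * T, dtype=bool)
    if act is None:  # incomplete trajectory, e.g. video without actions
        act = np.full((T, D), mask_value, dtype=obs.dtype)
        valid[1::2] = False
    return interleave(obs, act), valid

def masked_next_token_mse(pred, tokens, valid):
    """MSE between pred[k] and tokens[k + 1], skipping masked targets.

    pred[k] stands for the model's prediction of token k + 1 given tokens[:k + 1].
    """
    err = (pred[:-1] - tokens[1:]) ** 2
    return err[valid[1:]].mean()

def rollout(predict_next, step_env, first_obs, horizon):
    """Act autoregressively; only predicted actions are used.

    predict_next and step_env are hypothetical callables standing in for the
    trained model and the robot (or simulator).
    """
    history = [first_obs]
    for _ in range(horizon):
        action = predict_next(np.stack(history))  # query the model at an action slot
        history.append(action)
        history.append(step_env(action))          # real observation, never a predicted one
    return np.stack(history)
```

For a trajectory with actions, `build_tokens(obs, act)` marks every slot as valid. For one without, `build_tokens(obs)` fills the action slots with the placeholder and the loss only scores the observation targets.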

Originality

• Explicitly reformulates robotic control as next token prediction, systematically carrying over what has worked in NLP.

• Models the sensory and motor modalities together (learning the joint distribution rather than a conditional action distribution).

• The mask-token approach to handling different kinds of modality incompleteness is simple yet effective.

• Systematically integrates data of an entirely different form, such as internet videos, into robot policy learning.

Limitation & Further Study

• It is unclear whether 27 hours of training data is enough for robustness in complex scenarios. Directions for further study: (1) extension to other robot morphologies (quadrupeds, manipulators), (2) evaluation on more complex tasks such as dynamic environments or obstacle avoidance, (3) development of mechanisms for retraining (adaptation) in the real world, (4) analysis of performance when modality incompleteness is severe.

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ์ด ๋…ผ๋ฌธ์€ ์–ธ์–ด ๋ชจ๋ธ๋ง ํŒจ๋Ÿฌ๋‹ค์ž„์„ ๋กœ๋ด‡ ์ œ์–ด์— ํšจ๊ณผ์ ์œผ๋กœ ์ ์šฉํ•œ ๊ฐ•๋ ฅํ•œ ์—ฐ๊ตฌ์ด๋‹ค. ์ œ๋กœ์ƒท ์‹ค์ œ ํ™˜๊ฒฝ ๋ฐฐํฌ, ๋ถˆ์™„์ „ํ•œ ๋ฐ์ดํ„ฐ์˜ ์ฐฝ์˜์  ํ™œ์šฉ, ๋‹ค์–‘ํ•œ ์†Œ์Šค ํ†ตํ•ฉ ๋“ฑ์—์„œ ๋ช…ํ™•ํ•œ ๊ธฐ์—ฌ๋ฅผ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๊ธฐ์ˆ ์ ์œผ๋กœ๋„ ๊ฑด์ „ํ•˜๊ณ  ์‹คํ—˜ ๊ฒฐ๊ณผ๋„ ์„ค๋“๋ ฅ ์žˆ๋‹ค.
