Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

์ €์ž: Lirui Wang, Xinlei Chen, Jialiang Zhao, Kaiming He | ๋‚ ์งœ: 2024.09 | DOI: N/A 📄 PDF


Essence

This paper proposes Heterogeneous Pre-trained Transformers (HPT), which improve the generalization of robot policies by pre-training on large-scale data spanning heterogeneous robot embodiments and tasks. HPT aligns proprioception and vision inputs from robot embodiments with different sensors and actuators into a shared latent space, and learns a task-agnostic, embodiment-agnostic foundation model on top of it.
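To make the idea of a shared latent space concrete, the following is a minimal sketch, not the paper's implementation. It assumes each embodiment has its own projection layers (here random matrices standing in for learned weights) that map proprioception and vision features of different sizes to tokens of one common width, which a shared component then consumes. All names (`LATENT_DIM`, `EmbodimentStem`, `shared_trunk`, the embodiment keys) and all dimensions are hypothetical, and mean-pooling is only a placeholder for the shared transformer.

```python
import random

LATENT_DIM = 8  # hypothetical width of the shared latent space


def make_projection(in_dim, out_dim, rng):
    """Random in_dim x out_dim matrix standing in for a learned, embodiment-specific layer."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vec, weights):
    """Multiply a length-in_dim vector by an in_dim x out_dim matrix."""
    if len(vec) != len(weights):
        raise ValueError("input size does not match this embodiment's projection")
    out_dim = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(out_dim)]


class EmbodimentStem:
    """Per-embodiment input layers: one projection for proprioception, one for vision."""

    def __init__(self, proprio_dim, vision_dim, rng):
        self.proprio_w = make_projection(proprio_dim, LATENT_DIM, rng)
        self.vision_w = make_projection(vision_dim, LATENT_DIM, rng)

    def tokens(self, proprio, vision_feat):
        # Both modalities become tokens of the same width, whatever the robot.
        return [project(proprio, self.proprio_w), project(vision_feat, self.vision_w)]


def shared_trunk(tokens):
    """Placeholder for the shared model: mean-pool the tokens into one latent vector."""
    n = len(tokens)
    return [sum(tok[j] for tok in tokens) / n for j in range(LATENT_DIM)]


rng = random.Random(0)
stems = {
    "arm_7dof": EmbodimentStem(proprio_dim=7, vision_dim=16, rng=rng),
    "hand_24dof": EmbodimentStem(proprio_dim=24, vision_dim=32, rng=rng),
}

arm_latent = shared_trunk(stems["arm_7dof"].tokens([0.1] * 7, [0.5] * 16))
hand_latent = shared_trunk(stems["hand_24dof"].tokens([0.2] * 24, [0.3] * 32))
print(len(arm_latent), len(hand_latent))  # both equal LATENT_DIM
```

The point of the sketch is only the interface: inputs of size 7 and 24 (or 16 and 32) end up as vectors of the same length, so everything downstream of the stems can be shared across robots.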

Motivation

Achievement

Figure 5: Data Scaling. We run scaling HPT experiments along dataset sizes and the number of datasets. Each

Scalability validation: The paper empirically demonstrates scaling laws with respect to dataset size, training epochs, and model size, showing that robot policy learning exhibits scaling behavior similar to that of foundation models.

Performance improvement: It achieves gains of more than 20% over from-scratch baselines on several simulation benchmarks (CALVIN, BRIDGE, Metaworld, and others) and on real-robot dexterous tasks.

Data efficiency: The pre-trained representations substantially reduce the amount of data and training time needed to transfer to new embodiments.

Broad data integration: It effectively combines 52 datasets from heterogeneous embodiment domains, including real-robot data, simulation, and human videos.
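A scaling claim of this kind is commonly checked by fitting a power law, y = a * x^b, to a metric measured at several dataset sizes. The sketch below shows one such fit using least squares in log-log space. It is an illustration under stated assumptions, not the paper's analysis: the helper name `fit_power_law` is hypothetical, and the data points are synthetic values generated from a known power law, not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b by linear regression on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


# SYNTHETIC data for illustration only: generated from y = 2.0 * x**-0.25.
sizes = [10, 100, 1000, 10000]
losses = [2.0 * n ** -0.25 for n in sizes]

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements the points will not lie exactly on a line in log-log space, so the fitted exponent should be read together with the residuals rather than on its own.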

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ์ด ๋…ผ๋ฌธ์€ ๋กœ๋ด‡ ํ•™์Šต์˜ ์ค‘์š”ํ•œ ๊ณผ์ œ์ธ heterogeneous embodiments ๊ฐ„ knowledge transfer๋ฅผ multimodal alignment์™€ ๋Œ€๊ทœ๋ชจ ์‚ฌ์ „ํ•™์Šต์œผ๋กœ ํ•ด๊ฒฐํ•˜๋Š” ์‹ค์งˆ์ ์ด๊ณ  ์ฒด๊ณ„์ ์ธ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. 52๊ฐœ datasets์„ ํ†ตํ•œ ๊ด‘๋ฒ”์œ„ํ•œ ์‹คํ—˜๊ณผ scaling laws์˜ ์ž…์ฆ์€ ๋กœ๋ด‡ ๋„๋ฉ”์ธ์—์„œ์˜ ๊ท€์ค‘ํ•œ ๊ธฐ์—ฌ์ด๋‹ค. ๋‹ค๋งŒ tokenizer ์„ค๊ณ„์˜ ์ผ๋ฐ˜์„ฑ, sim-to-real gap, ํ‘œํ˜„ ๊ณต๊ฐ„์— ๋Œ€ํ•œ ๊นŠ์ด ์žˆ๋Š” ๋ถ„์„ ๋“ฑ์—์„œ ๊ฐœ์„  ์—ฌ์ง€๊ฐ€ ์žˆ๋‹ค.
