Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

์ €์ž: Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu | ๋‚ ์งœ: 2024-12-05 | URL: https://arxiv.org/abs/2412.04445 📄 PDF


Essence

Figure 2

Figure 2. Overview of Moto’s three training stages: (1) The Latent Motion Tokenizer encodes key visual motions between v

์ด ๋…ผ๋ฌธ์€ ๋น„๋””์˜ค์—์„œ ๋น„์ง€๋„ ํ•™์Šต์œผ๋กœ latent motion token์„ ํ•™์Šตํ•˜์—ฌ ๋กœ๋ด‡ ์กฐ์ž‘ ํƒœ์Šคํฌ๋ฅผ ์œ„ํ•œ ์‚ฌ์ „ํ•™์Šต์˜ ์ค‘๊ฐ„ ํ‘œํ˜„์œผ๋กœ ์‚ฌ์šฉํ•˜๊ณ , Moto-GPT๋ฅผ ํ†ตํ•ด motion token์˜ ์ž๋™ํšŒ๊ท€ ์˜ˆ์ธก์œผ๋กœ motion prior๋ฅผ ํ•™์Šตํ•œ ํ›„ co-fine-tuning์œผ๋กœ ์‹ค์ œ ๋กœ๋ด‡ ์ œ์–ด๋กœ ์ „์ดํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค.

Motivation

Achievement

Figure 1

Figure 1. The overview of Moto, which utilizes Latent Motion Tokens as a bridging “language” for autoregressive pretrain

How

Figure 2

Figure 2. Overview of Moto’s three training stages: (1) The Latent Motion Tokenizer encodes key visual motions between v

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ์ด ๋…ผ๋ฌธ์€ latent motion token์„ ํ†ตํ•ด ๋น„๋””์˜ค ์‚ฌ์ „ํ•™์Šต๊ณผ ๋กœ๋ด‡ ์ œ์–ด๋ฅผ ์šฐ์•„ํ•˜๊ฒŒ ์—ฐ๊ฒฐํ•˜๋Š” ์ฐฝ์˜์ ์ธ ์ ‘๊ทผ์„ ์ œ์‹œํ•˜๋ฉฐ, motion prior์˜ ํ•™์Šต๊ณผ ์ „์ด์— ๋Œ€ํ•œ ๋ช…ํ™•ํ•œ ๊ฒ€์ฆ์„ ์ œ๊ณตํ•œ๋‹ค. ๋ฐ์ดํ„ฐ ํšจ์œจ์„ฑ๊ณผ ํ•ด์„ ๊ฐ€๋Šฅ์„ฑ ์ธก๋ฉด์—์„œ ๋กœ๋ด‡ ํ•™์Šต์— ์˜๋ฏธ ์žˆ๋Š” ๊ธฐ์—ฌ๋ฅผ ํ•˜์ง€๋งŒ, ์‹ค์ œ ๋กœ๋ด‡ ํ™˜๊ฒฝ์—์„œ์˜ ๊ด‘๋ฒ”์œ„ํ•œ ๊ฒ€์ฆ๊ณผ ๋‹ค์–‘ํ•œ ์กฐ์ž‘ ๋ณต์žก๋„์— ๋Œ€ํ•œ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ ์ฆ๋ช…์ด ํ•„์š”ํ•˜๋‹ค.
