Motus: A Unified Latent Action World Model

Authors: Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu | Date: 2025-12-15 | URL: https://arxiv.org/abs/2512.13030


Essence

Figure 1. Motus Architecture. Here, a_t ... a_{t+k} are actions, z_t ... z_{t+k} are latent actions, and τ_v and τ_a are the r

Motus is an embodied agent framework that unifies vision-language-action models, world models, inverse dynamics models, and video generation models into a single latent action world model. Its Mixture-of-Transformer architecture and optical-flow-based latent actions enable training on large-scale heterogeneous data.

Motivation

Achievement

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: Motus is an innovative work that consolidates fragmented embodied agent architectures into a unified model and, through optical-flow-based latent actions and systematic multi-stage training, makes large-scale heterogeneous data usable. Backed by strong experimental results, it presents a new paradigm for unified modeling in embodied AI.
