Authors: Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang | Date: 2025-03-13 | URL: https://arxiv.org/abs/2503.10631
Figure 1: (a) Unlike recent diffusion-based VLA methods [12, 13, 14] that attach a separate diffusion
HybridVLA is a unified vision-language-action model that integrates the continuity of diffusion-based action prediction with the reasoning ability of an autoregressive VLM inside a single LLM. Through a collaborative training recipe and an adaptive action ensemble mechanism, it lets the two generation paradigms reinforce each other.
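The summary names an adaptive action ensemble but does not spell out its rule. As a minimal sketch of one plausible reading, the snippet below fuses a diffusion-predicted action with an autoregressive one only when the autoregressive tokens look confident, and otherwise falls back to the diffusion action alone. The function name `ensemble_action`, its parameters, the mean-confidence gate, and the `threshold` default are all assumptions made for illustration; they are not taken from the paper or its code.

```python
from typing import List, Sequence


def ensemble_action(
    diffusion_action: Sequence[float],
    ar_action: Sequence[float],
    ar_token_confidences: Sequence[float],
    threshold: float = 0.9,  # hypothetical value, not from the source
) -> List[float]:
    """Confidence-gated fusion of two action predictions (illustrative only)."""
    if len(diffusion_action) != len(ar_action):
        raise ValueError("both actions must have the same dimensionality")
    if not ar_token_confidences:
        raise ValueError("need at least one autoregressive token confidence")

    # Assumed gate: mean probability of the autoregressive action tokens.
    mean_conf = sum(ar_token_confidences) / len(ar_token_confidences)

    if mean_conf >= threshold:
        # Confident autoregressive prediction: average the two actions.
        return [(d + a) / 2.0 for d, a in zip(diffusion_action, ar_action)]

    # Low confidence: rely on the diffusion prediction alone.
    return list(diffusion_action)


if __name__ == "__main__":
    print(ensemble_action([0.0, 1.0], [1.0, 1.0], [0.95, 0.95]))
    print(ensemble_action([0.0, 1.0], [1.0, 1.0], [0.5, 0.6]))
```

Gating on confidence keeps the continuous diffusion output as the default and brings in the autoregressive prediction only when it is likely to help, which is one simple way two predictors could be made to complement each other.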
Figure 2: HybridVLA Framework. All multimodal inputs are encoded into tokens and subsequently
Overall assessment: HybridVLA elegantly addresses the fundamental limitations of diffusion-based and autoregressive action generation through a unified architecture and collaborative training. Backed by extensive experiments and state-of-the-art results, it is a solid paper that presents substantive progress in robot manipulation.