Authors: Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, Baining Guo | Date: 2024-11-29 | URL: https://arxiv.org/abs/2411.19650
Figure 1. (a) Success rate (%) comparison of our model against RT-1 [7], RT-1-X [48], RT-2-X [48], Octo [62], and OpenVL
CogACT is a VLA model built on a Vision-Language Model that separates cognition from action and substantially improves robot manipulation performance through a specialized diffusion action transformer module.
Figure 2. Overview of our architecture. Our model is componentized into three parts: 1) a vision module encoding informa
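The cognition/action split described in the summary can be illustrated with a toy sampler: a cognition feature (standing in for the VLM output) conditions a denoiser that iteratively refines a noisy action sequence. This is only a minimal sketch under stated assumptions, not the paper's implementation. The function names, the linear beta schedule, the DDIM-style deterministic update, the horizon, and the 7-dimensional action are all hypothetical choices for illustration, and `toy_denoiser` is a fixed rule standing in for a trained transformer.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def toy_denoiser(noisy_actions, cognition, t):
    """Hypothetical stand-in for the action module: returns a noise estimate.

    A real model would be a trained network taking the noisy actions, the
    cognition feature, and the step t; here the cognition feature only
    shifts the estimate so that conditioning has a visible effect.
    """
    bias = sum(cognition) / len(cognition)
    return [[a - bias for a in step] for step in noisy_actions]


def sample_actions(cognition, horizon=4, action_dim=7, num_steps=10, seed=0):
    """Denoise a random action sequence, conditioned on a cognition feature."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    # Start from Gaussian noise over the whole action sequence.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    for t in reversed(range(num_steps)):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = toy_denoiser(x, cognition, t)
        next_x = []
        for x_row, eps_row in zip(x, eps):
            row = []
            for xi, ei in zip(x_row, eps_row):
                # DDIM-style deterministic step: estimate the clean action,
                # then move it to the noise level of the previous step.
                x0 = (xi - math.sqrt(1.0 - ab_t) * ei) / math.sqrt(ab_t)
                row.append(math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * ei)
            next_x.append(row)
        x = next_x
    return x


if __name__ == "__main__":
    actions = sample_actions(cognition=[0.1, 0.2, 0.3])
    print(len(actions), len(actions[0]))
```

The point of the sketch is the interface rather than the numbers: the action module never sees images or text directly, only the cognition feature, so the two parts can be scaled or swapped independently.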
Overall assessment: CogACT is a well-motivated study that achieves a significant advance in robot manipulation performance through an effective synergy between a VLM and a diffusion action transformer. Its componentized architecture and systematic experiments demonstrate high originality and practical value.