CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Authors: Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, Baining Guo | Date: 2024-11-29 | URL: https://arxiv.org/abs/2411.19650


Essence

Figure 1. (a) Success rate (%) comparison of our model against RT-1 [7], RT-1-X [48], RT-2-X [48], Octo [62], and OpenVL

CogACT is a VLA model that builds on a vision-language model but separates cognition from action, using a specialized diffusion action transformer module to substantially improve robotic manipulation performance.

Motivation

Achievement

How

Figure 2. Overview of our architecture. Our model is componentized into three parts: 1) a vision module encoding informa
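
To make the cognition/action split concrete, here is a minimal sketch. It is not the authors' implementation. It assumes the vision-language side yields a single conditioning feature and that the action module produces a short chunk of actions by iterative denoising. The names `cognition_stub`, `target_actions`, `predict_noise`, and `sample_actions`, the 8-dimensional feature, the 4-step horizon, the 7-dimensional action, and the linear noise schedule are all hypothetical. The learned transformer is replaced by a closed-form noise estimate for a toy case in which each cognition feature maps to exactly one action chunk; the loop itself is the standard DDPM ancestral sampler.

```python
import numpy as np


def cognition_stub(image, instruction):
    """Hypothetical stand-in for the vision-language model.

    Assumption: a real system would encode the image and instruction with a
    VLM; here we only derive a deterministic pseudo-random feature vector.
    """
    seed = sum(ord(ch) for ch in instruction) + image.size
    return np.random.default_rng(seed).normal(size=8)


def target_actions(cognition, horizon, action_dim):
    """Toy action chunk that is fully determined by the cognition feature."""
    weights = np.linspace(-1.0, 1.0, horizon * action_dim * cognition.size)
    weights = weights.reshape(horizon * action_dim, cognition.size)
    return np.tanh(weights @ cognition).reshape(horizon, action_dim)


def predict_noise(x_t, t, cognition, alpha_bar):
    """Hypothetical stand-in for the diffusion action transformer.

    When the data distribution is a single point x0, the exact noise in
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps is recoverable
    in closed form. A trained network would approximate this mapping.
    """
    x0 = target_actions(cognition, *x_t.shape)
    return (x_t - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])


def sample_actions(cognition, horizon=4, action_dim=7, steps=50, seed=0):
    """DDPM ancestral sampling of an action chunk, conditioned on cognition."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)   # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    x = rng.normal(size=(horizon, action_dim))  # start from pure noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, cognition, alpha_bar)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.normal(size=x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    feature = cognition_stub(np.zeros((2, 2, 3)), "pick up the cup")
    print(sample_actions(feature).shape)
```

Because the toy noise estimate is exact, `sample_actions(feature)` ends at `target_actions(feature, 4, 7)` regardless of the sampling seed; a trained network would only approximate this. The point of the sketch is the interface: the action module sees the observation only through the cognition feature.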

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: CogACT is a well-motivated study that achieves a significant advance in robotic manipulation performance through an effective synergy between a VLM and a diffusion action transformer. Its componentized architecture and systematic experiments demonstrate high originality and practical value.
