Pure Vision Language Action (VLA) Models: A Comprehensive Survey

Authors: Dapeng Zhang, Jing Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, Qingguo Zhou | Date: 2025-09-23 | URL: https://arxiv.org/abs/2509.19012


Essence

Figure 3

Fig. 3: Vision-Language-Action Taxonomy: From Autoregression-based, Diffusion-based, to Reinforcement-based and

This paper is a comprehensive survey that systematically classifies and analyzes Vision Language Action (VLA) models. It groups VLA approaches into autoregression-based, diffusion-based, reinforcement-based, hybrid, and specialized methods, and synthesizes more than 300 recent studies.

Motivation

Achievement

Figure 1

Fig. 1: Organization and Structure of the VLA Survey.

How

Figure 2

Fig. 2: Illustration of various VLA skeletons.

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ์„œ๋ฒ ์ด๋Š” VLA ๋ถ„์•ผ์˜ ๊ธ‰์†ํ•œ ๋ฐœ์ „ ์†์—์„œ ์ฒ˜์Œ์œผ๋กœ ์ฒด๊ณ„์ ์ธ ๋ถ„๋ฅ˜์ฒด๊ณ„๋ฅผ ์ œ์‹œํ•˜๊ณ  300๊ฐœ ์ด์ƒ์˜ ์—ฐ๊ตฌ๋ฅผ ์ข…ํ•ฉํ•˜์—ฌ ํ˜„ํ™ฉ ๋งตํ•‘์„ ์ œ๊ณตํ•จ์œผ๋กœ์จ, VLA ์—ฐ๊ตฌ์ž์™€ ๋กœ๋ด‡๊ณตํ•™์ž๋“ค์—๊ฒŒ ๋†’์€ ํ•™์ˆ ์  ๊ฐ€์น˜๋ฅผ ์ œ๊ณตํ•œ๋‹ค. ๋‹ค๋งŒ ์‹œ๋ฎฌ๋ ˆ์ด์…˜-ํ˜„์‹ค ๊ฐญ, ํ‰๊ฐ€ ๋ฉ”ํŠธ๋ฆญ ํ‘œ์ค€ํ™”, ์ตœ์‹  ๋ฐฉ๋ฒ•๋ก  ์ˆ˜์šฉ ์ธก๋ฉด์˜ ๊ฐœ์„ ์ด ํ–ฅํ›„ ํ•„์š”ํ•˜๋‹ค.
