A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Authors: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang | Date: 2025-07-02 | URL: https://arxiv.org/abs/2507.01925


Essence

Figure 1 | We present a unified framework of VLA from an action tokenization perspective. Action token refers

This paper is a comprehensive survey that analyzes vision-language-action (VLA) models from the unified perspective of action tokenization. It brings today's diverse VLA models under a single framework and organizes action tokens into eight categories: language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning.
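As a small illustration, the eight action token categories named above can be written down as an enumeration. This is only a sketch for readers who want to tag models by token type; the class name `ActionTokenType` and the helper `parse_token_type` are hypothetical and do not come from the paper.

```python
from enum import Enum


class ActionTokenType(Enum):
    """The eight action token categories listed in the survey summary.

    The enum itself is an illustrative assumption, not an artifact of the paper.
    """
    LANGUAGE_DESCRIPTION = "language description"
    CODE = "code"
    AFFORDANCE = "affordance"
    TRAJECTORY = "trajectory"
    GOAL_STATE = "goal state"
    LATENT_REPRESENTATION = "latent representation"
    RAW_ACTION = "raw action"
    REASONING = "reasoning"


def parse_token_type(name: str) -> ActionTokenType:
    """Look up a category by its human-readable name, ignoring case and padding."""
    return ActionTokenType(name.strip().lower())


if __name__ == "__main__":
    for token_type in ActionTokenType:
        print(token_type.name, "->", token_type.value)
```

An enum keeps the category set closed, so a misspelled category name raises a `ValueError` instead of silently creating a ninth type.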

Motivation

Achievement

Figure 3 | Evolution timeline of foundation models, VLA models, and data sources. The U-shape reflects how

• Unified framework: proposes a framework that analyzes diverse VLA models in a unified way from the action tokenization perspective

• Action token taxonomy: a comprehensive classification and definition of eight action token types (language description, code, affordance, trajectory, goal state, latent representation, raw action, reasoning)

• Detailed analysis of each token type: in-depth analysis of each type's development, main methods, strengths and weaknesses, and application areas

• Identification of future technical trends: presents future directions such as hierarchical architecture, action-based reasoning, integration of reinforcement learning, and evolution toward VLA agents

• Practical guidelines: stresses the need for models, data, and hardware to advance together, and the importance of safety and alignment

How

Figure 2 | Visualization of action tokens in a single embodied task. Given the same vision and language

• Devotes a separate section to each of the eight action token types, analyzing in detail its evolution timeline, key papers, advantages, and limitations

• Systematically classifies and visualizes existing VLA models (CodeAsPolicies, DriveVLM, VoxPoser, HiRobot, CoT-VLA, GO-1, VILA-U, and others) according to the action token taxonomy

• Lays out action token trends, architecture trends, and emerging research directions clearly in an executive summary

• Develops each topic logically through a table of contents and an ordered section structure

Originality

• A new analytical lens based on action tokenization: presents, for the first time, a unified perspective on action tokens that had been lacking

• A correspondence between LLM language tokens and VLA action tokens: draws new insight from the parallel development of the two fields

• An established action token taxonomy: a systematic classification with eight subcategories

• Hierarchical architecture and multi-token synergy: emphasizes the need for strategic combinations of token types rather than a focus on a single type
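To make the idea of combining token types in a hierarchy more concrete, here is a toy sketch: a high-level stage emits language description tokens, and a low-level stage turns each one into a raw action token. Everything here is a hypothetical stand-in. `toy_planner`, `toy_controller`, the " then " splitting rule, and the 7-dimensional placeholder action are assumptions made for illustration, not methods described in the survey.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ActionToken:
    kind: str        # name of the action token type, e.g. "raw action"
    payload: object  # content of the token (text, vector, ...)


def toy_planner(instruction: str) -> List[ActionToken]:
    """Hypothetical high-level stage: split an instruction into sub-task descriptions."""
    steps = [s.strip() for s in instruction.split(" then ")]
    return [ActionToken("language description", s) for s in steps if s]


def toy_controller(token: ActionToken) -> ActionToken:
    """Hypothetical low-level stage: map a sub-task to a placeholder raw action.

    A real system would use a learned policy here; the zero vector of
    length 7 is an arbitrary assumption standing in for a robot command.
    """
    return ActionToken("raw action", [0.0] * 7)


def run_hierarchy(
    instruction: str,
    planner: Callable[[str], List[ActionToken]],
    controller: Callable[[ActionToken], ActionToken],
) -> List[ActionToken]:
    """Chain the two stages: intermediate tokens in, executable tokens out."""
    return [controller(token) for token in planner(instruction)]


if __name__ == "__main__":
    actions = run_hierarchy(
        "pick up the cup then place it on the shelf", toy_planner, toy_controller
    )
    for action in actions:
        print(action.kind, action.payload)
```

The point of the sketch is only the interface: each stage consumes one token type and produces another, so stages can be swapped without changing the rest of the pipeline.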

Limitation & Further Study

• No experimental validation: the survey concentrates on qualitative analysis and lacks benchmark results that quantitatively compare the performance of the action token types

• Limited discussion of data and hardware constraints: a deeper analysis is needed of how the choice of action tokenization affects data and hardware availability

• No real-time performance evaluation: an empirical evaluation is needed that compares each token type's execution efficiency, latency, and failure rate in complex real-world environments

• Further study: performance benchmarking across action token types, experimental validation of hybrid approaches, and a deeper discussion of safety and alignment issues

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 5/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ์„œ๋ฒ ์ด๋Š” VLA ๋ถ„์•ผ์˜ ํ˜„ํ™ฉ์„ action tokenization์ด๋ผ๋Š” ํ†ตํ•ฉ์  ๋ Œ์ฆˆ๋กœ ๋ถ„์„ํ•˜์—ฌ ์ฒด๊ณ„์ ์ด๊ณ  ํฌ๊ด„์ ์ธ ์ดํ•ด๋ฅผ ์ œ๊ณตํ•œ๋‹ค. 8๊ฐ€์ง€ action token type์˜ ๋ถ„๋ฅ˜, ๊ฐ๊ฐ์˜ ์žฅ๋‹จ์  ๋ถ„์„, ๊ทธ๋ฆฌ๊ณ  ๋ฏธ๋ž˜ ๊ธฐ์ˆ  ํŠธ๋ Œ๋“œ์— ๋Œ€ํ•œ ์ธ์‚ฌ์ดํŠธ๋Š” VLA ์—ฐ๊ตฌ์˜ ๋ฐฉํ–ฅ์„ ์ œ์‹œํ•˜๋Š” ๋ฐ ๋งค์šฐ ๊ฐ€์น˜ ์žˆ๋‹ค. ๋‹ค๋งŒ ์ •๋Ÿ‰์ ์ธ ์„ฑ๋Šฅ ๋น„๊ต์™€ ์‹ค์ œ ํ™˜๊ฒฝ์—์„œ์˜ ๊ฒ€์ฆ์ด ๋ถ€์žฌํ•˜๋‹ค๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋ฉฐ, ์ด๋ฅผ ๋ณด์™„ํ•˜๋Š” ํ›„์† ์—ฐ๊ตฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค.
