A Survey on Vision-Language-Action Models for Autonomous Driving

Authors: Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun | Date: 2025-06-30 | URL: https://arxiv.org/abs/2506.24044


Essence

Figure 1. Comparisons of autonomous driving paradigms. (a) End-to-end driving offers direct perception-to-control mappin

This paper is the first comprehensive survey of Vision-Language-Action (VLA) models applied to autonomous driving. It analyzes more than 20 representative models and traces the evolution of the paradigm that unifies visual perception, natural language understanding, and control.

Motivation

Achievement

Figure 2. Overview of the VLA4AD Architecture.

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ VLA4AD ๋ถ„์•ผ์˜ ์ตœ์ดˆ์˜ ์ข…ํ•ฉ ์„œ๋ฒ ์ด๋กœ์„œ ์•„ํ‚คํ…์ฒ˜, ์ง„ํ™” ๊ณผ์ •, ๋ชจ๋ธ ๋น„๊ต๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ •๋ฆฌํ•˜๊ณ  ๊ฐœ๋ฐฉ ๊ณผ์ œ๋ฅผ ๋ช…ํ™•ํžˆ ์ •์˜ํ•จ์œผ๋กœ์จ, ์„ค๋ช…๊ฐ€๋Šฅํ•˜๊ณ  ๊ฒฌ๊ณ ํ•œ ์ž์œจ์ฃผํ–‰ ์‹œ์Šคํ…œ ๊ฐœ๋ฐœ์„ ์œ„ํ•œ ์ค‘์š”ํ•œ ์ฐธ๊ณ  ์ž๋ฃŒ๋ฅผ ์ œ๊ณตํ•œ๋‹ค.
