VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

์ €์ž: Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang | ๋‚ ์งœ: 2025-09-11 | URL: https://arxiv.org/abs/2509.09372 📄 PDF


Essence

VLA-Adapter proposes a new paradigm for training a state-of-the-art Vision-Language-Action model on a lightweight backbone (0.5B parameters) without pretraining on robot data. Its Bridge Attention connects vision-language representations to the action space.
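To make the idea of connecting vision-language representations to the action space concrete, here is a minimal sketch of generic single-head cross-attention, in which action-side queries read from vision-language features. It is an illustration under stated assumptions, not the paper's Bridge Attention: the function names, the absence of learned projections, and the toy inputs are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(action_queries, vl_features):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    Each action query attends over all vision-language feature vectors,
    which serve as both keys and values. No learned projections are
    applied; this is an assumption made to keep the sketch short.
    """
    dim = len(vl_features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in action_queries:
        # Similarity of this query to every vision-language token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in vl_features
        ]
        weights = softmax(scores)
        # Weighted average of the vision-language tokens.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, vl_features))
            for d in range(dim)
        ])
    return outputs


# Toy inputs (illustrative values only, not taken from the paper).
vl_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
action_queries = [[1.0, 0.0], [0.0, 0.0]]
conditioned = cross_attention(action_queries, vl_features)
```

Each row of `conditioned` is a mixture of the vision-language features weighted by their similarity to the corresponding action query. An all-zero query gives uniform weights and therefore returns the plain average of the features.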

Motivation

Achievement

How

Figure 3: The proposed VLA framework. The key components are the effective condition explo-

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: VLA-Adapter๋Š” ๊ฒฝ๋Ÿ‰ ๋ฐฑ๋ณธ์œผ๋กœ๋„ ์ตœ์ฒจ๋‹จ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, VL-A ๋ธŒ๋ฆฟ์ง•์˜ ๋ณธ์งˆ์— ๋Œ€ํ•œ ์ฒด๊ณ„์  ๋ถ„์„์„ ํ†ตํ•ด VLA ์„ค๊ณ„์˜ ์‹ค์งˆ์  ์ง€์นจ์„ ์ œ๊ณตํ•œ๋‹ค. ๋น ๋ฅธ ํ•™์Šต ์‹œ๊ฐ„๊ณผ ๋‚ฎ์€ ๊ณ„์‚ฐ ๋น„์šฉ์œผ๋กœ ๋กœ๋ด‡ ๊ณตํ•™์˜ ์ ‘๊ทผ์„ฑ์„ ํฌ๊ฒŒ ๋†’์ด๋Š” ์ค‘์š”ํ•œ ๊ธฐ์—ฌ์ด๋‹ค.
