X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

์ €์ž: Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, Xianyuan Zhan | ๋‚ ์งœ: 2025-10-11 | URL: https://arxiv.org/abs/2510.10274 📄 PDF


Essence

Figure 1

Figure 1 | X-VLA employs distinctive learnable embeddings, referred to as soft prompt, to effectively

X-VLA is a scalable Vision-Language-Action model that introduces a soft-prompt technique to handle cross-embodiment learning across heterogeneous robot platforms effectively. At 0.9B parameters, it achieves SOTA performance on 6 simulation benchmarks and 3 real robots.
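The soft-prompt idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SoftPromptBank`, the embodiment names `franka` and `widowx`, the prompt length, the embedding size, and the choice to simply prepend the prompt to the token sequence are all assumptions made for illustration. The only point taken from the review is that each embodiment owns its own set of learnable embeddings.

```python
import random


class SoftPromptBank:
    """Hypothetical container: one learnable prompt (a small matrix) per embodiment."""

    def __init__(self, embodiment_ids, prompt_len, dim, seed=0):
        rng = random.Random(seed)
        self.prompt_len = prompt_len
        self.dim = dim
        # Assumed layout: each embodiment owns `prompt_len` vectors of size `dim`,
        # initialized with small random values (they would be trained in practice).
        self.prompts = {
            eid: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(prompt_len)]
            for eid in embodiment_ids
        }

    def prepend(self, embodiment_id, tokens):
        """Return the embodiment's prompt vectors followed by the input token embeddings."""
        for t in tokens:
            if len(t) != self.dim:
                raise ValueError("token embedding has wrong dimension")
        return self.prompts[embodiment_id] + list(tokens)


# Illustrative usage with made-up embodiment names and sizes.
bank = SoftPromptBank(["franka", "widowx"], prompt_len=4, dim=8)
tokens = [[0.0] * 8 for _ in range(10)]  # stand-in for fused vision-language tokens
seq = bank.prepend("franka", tokens)
```

In this sketch the shared backbone would see the same token format for every robot, and only the few prompt vectors differ per embodiment, which is how a soft prompt can absorb hardware-specific differences with few extra parameters.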

Motivation

Achievement

How

Figure 2

Figure 2 | Comparison among four methods in handling heterogeneity in cross-embodiment training.

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: X-VLA๋Š” soft prompt๋ฅผ ํ†ตํ•œ ์šฐ์•„ํ•˜๊ณ  ํšจ์œจ์ ์ธ cross-embodiment ์ฒ˜๋ฆฌ ๋ฐฉ์‹์œผ๋กœ VLA ๋ถ„์•ผ์˜ ์ค‘์š”ํ•œ ์ง„์ „์„ ์ด๋ฃฌ๋‹ค. ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์„ฑ๊ณผ ๊ด‘๋ฒ”์œ„ํ•œ ์‹ค์ฆ ํ‰๊ฐ€๋ฅผ ํ†ตํ•ด ์‹ค์ œ ๋กœ๋ด‡ ์‘์šฉ ๋ถ„์•ผ์—์„œ์˜ ๋†’์€ ์‹ค์šฉ์„ฑ์„ ์ž…์ฆํ•˜๋ฉฐ, flow-matching ๊ธฐ๋ฐ˜ ์•„ํ‚คํ…์ฒ˜์˜ ์•ˆ์ •์„ฑ๊ณผ ํ™•์žฅ์„ฑ์€ ํ–ฅํ›„ generalist ๋กœ๋ด‡ ๋ชจ๋ธ ๊ฐœ๋ฐœ์˜ ์ฃผ์š” ๋ฐฉํ–ฅ์„ ์ œ์‹œํ•œ๋‹ค.
