BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

Authors: Hongyu Wang, Chuyan Xiong, Ruiping Wang, Xilin Chen | Date: 2025-06-09 | URL: https://arxiv.org/abs/2506.07530


Essence

Figure 1

Fig. 1: We introduce BitVLA, the first fully native 1-bit vision-language-action (VLA) model for robotic manipulation, i

The paper proposes BitVLA, a fully 1-bit Vision-Language-Action model for robotic manipulation. It achieves an 11.0x reduction in memory and a 4.4x reduction in latency while maintaining performance comparable to the full-precision baseline model.
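To make the "1-bit" idea concrete, here is a minimal sketch of absmean ternary weight quantization in the style of BitNet b1.58. It rests on assumptions: this review does not spell out BitVLA's quantizer, so the ternary value set {-1, 0, 1}, the absmean scale, and the names `ternary_quantize` and `dequantize` are hypothetical illustrations rather than the authors' implementation. The sketch does not attempt to reproduce the memory or latency figures quoted above.

```python
from typing import List, Tuple


def ternary_quantize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map each weight to -1, 0, or 1 using a per-tensor absmean scale.

    Assumption: BitNet b1.58-style quantization, not confirmed by this review.
    """
    # Scale is the mean absolute value of the weights (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover an approximation of the original weights."""
    return [q * scale for q in quantized]


# Hypothetical toy weights, chosen only for illustration.
q, s = ternary_quantize([0.9, -0.05, -1.2, 0.4])
# q == [1, 0, -1, 1]; s is approximately 0.6375
approx = dequantize(q, s)
```

Only the ternary codes and a single scale per tensor need to be stored, which is where the bulk of the memory saving in this family of methods comes from.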

Motivation

Achievement

How

Figure 2

Fig. 2: Overview of the three-stage training pipeline in BitVLA. We first perform multimodal training with a 1-bit LLM

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: BitVLA is the first successful case of extreme quantization applied to VLA models for robotic manipulation. Through its Quantize-then-Distill training strategy, it achieves an 11x memory reduction and a 4.4x speedup while maintaining performance, offering a practical path toward deployment on edge robots.
