RLinf-VLA: A Unified and Efficient Framework for Reinforcement Learning of Vision-Language-Action Models

Authors: Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Peihong Wang, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang | Date: 2025-10-08 | URL: https://arxiv.org/abs/2510.06710


Essence

Figure 1

RLinf-VLA is a unified and efficient framework for reinforcement learning training of Vision-Language-Action models. It supports a range of VLA architectures, RL algorithms, and simulators, and achieves a 2.27x speedup through GPU allocation optimization.

Motivation

Achievement

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: RLinf-VLA is a comprehensive, unified framework that addresses the fragmentation of VLA reinforcement learning research. Its practical efficiency gains from GPU allocation optimization and its strong experimental results demonstrate its value as key infrastructure for embodied intelligence research.
