RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models

Authors: Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, Marco Pavone | Date: 2025-06-21 | URL: https://arxiv.org/abs/2506.17811


Essence

Figure 1: Inference-Time Scaling Law: We observe that action error consistently decreases as we

The paper proposes scaling through sampling and verification to improve the test-time performance of Vision-Language-Action (VLA) models, and reports an inference-time scaling law: action error follows an exponentiated power law in the number of generated samples.
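To make the sample-then-verify idea concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical 7-dimensional action, Gaussian noise around a known target as a stand-in for policy samples, and an oracle verifier that picks the candidate closest to the target. RoboMonkey itself trains a learned action verifier, and the function name and all parameters below are illustrative assumptions.

```python
import math
import random


def best_of_k_error(k_values, trials=200, dim=7, sigma=0.1, seed=0):
    """Mean oracle-selected RMSE for each k in k_values (toy setting)."""
    rng = random.Random(seed)
    max_k = max(k_values)
    totals = {k: 0.0 for k in k_values}
    for _ in range(trials):
        # Hypothetical ground-truth action (assumption: 7 continuous dimensions).
        target = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Candidate actions: Gaussian noise around the target stands in
        # for repeated sampling from a policy.
        errors = []
        for _ in range(max_k):
            sample = [t + rng.gauss(0.0, sigma) for t in target]
            sq = sum((s - t) ** 2 for s, t in zip(sample, target))
            errors.append(math.sqrt(sq / dim))
        # Oracle verifier (assumption): keep the lowest-error candidate
        # among the first k samples.
        for k in k_values:
            totals[k] += min(errors[:k])
    return {k: totals[k] / trials for k in k_values}


K_VALUES = [1, 2, 4, 8, 16]
mean_errors = best_of_k_error(K_VALUES)
for k in K_VALUES:
    print(f"k={k:2d}  mean RMSE={mean_errors[k]:.4f}")
```

Because each larger k reuses the same candidates plus more, the selected error in this toy can only stay flat or shrink as k grows. Whether it follows the specific exponentiated power law reported in the paper depends on the real policy and verifier, which this sketch does not model.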

Motivation

Achievement

Figure 3: Scaling test-time compute significantly improves the precision and robustness of generalist robot

How

Figure 2: Stage 1: Training the Action Verifier. Given an imitation learning dataset, we sample N

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: A strong study that systematically characterizes the potential of test-time scaling for VLA models and proposes the practical RoboMonkey framework. The discovery of an inference-time scaling law and the meaningful performance gains on real robots make a significant contribution to robot control.
