Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs

์ €์ž: Amir Taherin, Juyi Lin, Arash Akbari, Arman Akbari, Pu Zhao, Weiwei Chen, David Kaeli, Yanzhi Wang | ๋‚ ์งœ: 2025-09-15 | URL: https://arxiv.org/abs/2509.11480 📄 PDF


Essence

Figure 1

Fig. 1. Peak VRAM usage for each evaluated VLA model

The paper systematically evaluates Vision-Language-Action (VLA) models across hardware platforms ranging from edge devices to datacenter GPUs, and shows how accuracy, latency, throughput, and memory usage scale with model architecture and hardware constraints.
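As a rough illustration of the latency and throughput metrics named above, the following is a minimal sketch of a timing harness. It is not the paper's benchmarking code: `benchmark` and `dummy_policy_step` are hypothetical names, and the dummy step is a CPU-only stand-in for one model forward pass. Timing a real GPU model would also need device synchronization before each clock read, and peak VRAM would need a framework-specific query, neither of which is shown here.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(step: Callable[[], object], warmup: int = 3, iters: int = 20) -> Dict[str, float]:
    """Time repeated calls to `step` and report latency and throughput."""
    for _ in range(warmup):
        step()  # warm-up calls are excluded from the timings

    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        latencies.append(time.perf_counter() - start)

    mean_s = statistics.mean(latencies)
    return {
        "mean_latency_ms": mean_s * 1e3,
        "median_latency_ms": statistics.median(latencies) * 1e3,
        # Throughput as calls per second, derived from the mean latency.
        "throughput_hz": 1.0 / mean_s if mean_s > 0 else float("inf"),
    }


def dummy_policy_step() -> int:
    # Hypothetical stand-in for one VLA inference step (not a real model).
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(benchmark(dummy_policy_step))
```

Reporting the median alongside the mean is a design choice here: occasional slow iterations skew the mean, so the two together give a fairer picture when comparing platforms.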

Motivation

Achievement

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ VLA ๋ชจ๋ธ์˜ ํฌ๋กœ์Šค ํ”Œ๋žซํผ ์„ฑ๋Šฅ ํ™•์žฅ์„ ์ฒด๊ณ„์ ์œผ๋กœ ๋ถ„์„ํ•œ ์ค‘์š”ํ•œ ๋ฒค์น˜๋งˆํฌ ์—ฐ๊ตฌ๋กœ, ๋กœ๋ด‡ ๋ฐฐํฌ ์‹œ๋‚˜๋ฆฌ์˜ค์— ๋งž๋Š” ํ•˜๋“œ์›จ์–ด ์„ ํƒ๊ณผ ๋ชจ๋ธ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์‹ค์šฉ์ ์ธ ํ†ต์ฐฐ๋ ฅ์„ ์ œ๊ณตํ•œ๋‹ค. ์—ฃ์ง€ ๋””๋ฐ”์ด์Šค์˜ ๊ฒฝ์Ÿ๋ ฅ์„ ์ž…์ฆํ•จ์œผ๋กœ์จ ๋กœ๋ด‡ ์‹œ์Šคํ…œ ์„ค๊ณ„์— ๋Œ€ํ•œ ์ƒˆ๋กœ์šด ๊ด€์ ์„ ์ œ์‹œํ•œ๋‹ค.

← ๋ชฉ๋ก์œผ๋กœ ๋Œ์•„๊ฐ€๊ธฐ

๐ŸŽง Audio Overview

์ด ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ฅผ ํŒŸ์บ์ŠคํŠธํ˜• ์˜ค๋””์˜ค๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (Gemini ยท ํ‚ค๋Š” ๋ธŒ๋ผ์šฐ์ €์—๋งŒ ์ €์žฅ ยท ์™„์„ฑ๋ณธ์€ ์ด๋ฉ”์ผ๋กœ๋„ ์ „์†ก)
โ–ธ ๊ณ ๊ธ‰: ๊ตฌ์„ฑ ๋ฐฉํ–ฅ(๋Œ€๋ณธ ์ž‘์„ฑ ์ง€์นจ) ์ง์ ‘ ์ˆ˜์ •