GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

Authors: Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, Heming Cui, Zhizheng Zhang, He Wang | Date: 2025-05-06 | URL: https://arxiv.org/abs/2505.03233


Essence

Figure 1: GraspVLA is a grasping foundation model pre-trained exclusively on billion-scale syn-

Built on SynGrasp-1B, a synthetic dataset of one billion frames, the paper presents GraspVLA, a Vision-Language-Action grasping model. Pre-trained on synthetic data alone, it achieves strong zero-shot generalization and few-shot adaptability in the real world.

Motivation

Achievement

How

Figure 3: GraspVLA consists of an autoregressive vision-language backbone and a flow-matching

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ์ด ๋…ผ๋ฌธ์€ ๋กœ๋ด‡ ์กฐ์ž‘ ํ•™์Šต์„ ์œ„ํ•œ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ์˜ ๋Œ€๊ทœ๋ชจ ํ™œ์šฉ ๊ฐ€๋Šฅ์„ฑ์„ ์ตœ์ดˆ๋กœ ์ฒด๊ณ„์ ์œผ๋กœ ์ž…์ฆํ•˜๋ฉฐ, 10์–ต ํ”„๋ ˆ์ž„ ๊ทœ๋ชจ์˜ ๊ณ ํ’ˆ์งˆ ๋ฐ์ดํ„ฐ์…‹๊ณผ ํ˜์‹ ์ ์ธ Progressive Action Generation ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ํ†ตํ•ด ์‹ค์„ธ๊ณ„ ๋ฐฐํฌ ๊ฐ€๋Šฅํ•œ ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์„ ์ œ์‹œํ•œ๋‹ค.
