DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

Authors: Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Num Lui, Yuyao Ye, Yitao Liang, Yaodong Yang, Yuanpei Chen | Date: 2025-02-28 | URL: https://arxiv.org/abs/2502.20900


Essence

Figure 2: Overview of DexGraspVLA. A pre-trained VLM-based high-level planner (purple) decomposes prompts into object-

DexGraspVLA is a hierarchical VLA framework that uses a vision-language model as its high-level planner and learns a diffusion-based low-level action controller. It relies on foundation models to convert language and visual inputs into domain-invariant representations, which is how it achieves generalization in imitation learning.
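The planner/controller split described above can be sketched as a small control loop. This is only an illustration under stated assumptions, not the authors' implementation: the names `Observation`, `Target`, `run_hierarchical_grasp`, and `demo` are hypothetical, the planner and controller are passed in as plain callables, and the stub controller in `demo` stands in for a learned diffusion policy without implementing diffusion.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types, introduced only for this sketch.
Target = Tuple[int, int, int, int]  # assumed encoding of "what to grasp"
Action = List[float]                # assumed layout of one motor command


@dataclass
class Observation:
    image: List[List[float]]   # placeholder for a camera frame
    joint_state: List[float]   # placeholder for proprioception


def run_hierarchical_grasp(
    prompt: str,
    observe: Callable[[], Observation],
    plan: Callable[[str, Observation], Target],
    control: Callable[[Observation, Target], List[Action]],
    is_done: Callable[[Observation], bool],
    max_steps: int = 10,
) -> List[Action]:
    """High-level planner picks a target once; low-level controller runs closed-loop."""
    executed: List[Action] = []
    obs = observe()
    target = plan(prompt, obs)           # high level: language + vision -> target
    for _ in range(max_steps):
        if is_done(obs):
            break
        executed.extend(control(obs, target))  # low level: emits a chunk of actions
        obs = observe()                  # re-observe before the next chunk
    return executed


def demo() -> List[Action]:
    """Runs the loop with stub components (no real model or robot involved)."""
    counter = iter(range(100))

    def observe() -> Observation:
        return Observation(image=[[0.0]], joint_state=[float(next(counter))])

    def plan(prompt: str, obs: Observation) -> Target:
        return (0, 0, 10, 10)            # stub: fixed target region

    def control(obs: Observation, target: Target) -> List[Action]:
        return [[obs.joint_state[0]] * 2]  # stub: one two-dimensional action

    def is_done(obs: Observation) -> bool:
        return obs.joint_state[0] >= 3   # stub: stop at the fourth observation

    return run_hierarchical_grasp("grasp the cup", observe, plan, control, is_done)
```

Passing the planner and controller as callables keeps the loop independent of any particular model, so either level could be swapped out in this sketch without touching the other.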

Motivation

Achievement

Figure 1: We propose DexGraspVLA, a hierarchical VLA

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: DexGraspVLA๋Š” foundation model๊ณผ imitation learning์˜ ์ƒ๋ณด์  ๊ฐ•์ ์„ ๊ณ„์ธต์ ์œผ๋กœ ํ†ตํ•ฉํ•˜์—ฌ cluttered real-world scenario์—์„œ unprecedented 90+% ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•œ ์˜๋ฏธ ์žˆ๋Š” ๊ธฐ์—ฌ์ด๋ฉฐ, ์žฅ๊ธฐ task, adversarial robustness, failure recovery๋ฅผ ๋™์‹œ ๋‹ฌ์„ฑํ•จ์œผ๋กœ์จ ์‹ค์šฉ์  dexterous grasping ๋กœ๋ด‡์˜ ์‹คํ˜„ ๊ฐ€๋Šฅ์„ฑ์„ ํฌ๊ฒŒ ๋†’์˜€๋‹ค.
