Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning

Authors: Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, Feifei Feng | Date: 2024-12-04 | URL: https://arxiv.org/abs/2412.03293


Essence

Figure 1: Our proposed DiffusionVLA model unifies autoregressive and diffusion modeling to enable self-reasoning and rob

DiffusionVLA is a robot foundation model that combines the reasoning ability of autoregressive models with the robust action generation of diffusion models. Its reasoning injection module feeds self-generated reasoning directly into policy learning.
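As a rough illustration of the idea, the sketch below conditions a toy action generator on a reasoning embedding. This review does not say how the reasoning injection module is implemented, so the FiLM-style scale-and-shift used here is an assumption, and the iterative refinement loop is only a simplified stand-in for a real diffusion denoiser. All function names, weights, and dimensions are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def inject_reasoning(features, reasoning_emb, w_gamma, w_beta):
    """Assumed FiLM-style injection: the reasoning embedding produces a
    per-feature scale (gamma) and shift (beta) for the policy features."""
    gamma = matvec(w_gamma, reasoning_emb)
    beta = matvec(w_beta, reasoning_emb)
    return [(1.0 + g) * f + b for f, g, b in zip(features, gamma, beta)]


def toy_denoise(conditioned, w_action, steps=10, seed=0):
    """Toy stand-in for a diffusion action head (not a real diffusion model):
    start from Gaussian noise and step toward a prediction that depends on
    the reasoning-conditioned features."""
    rng = random.Random(seed)
    target = matvec(w_action, conditioned)
    action = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        alpha = 1.0 / (steps - step)  # step size grows to 1.0 on the last step
        action = [a + alpha * (t - a) for a, t in zip(action, target)]
    return action


# Hypothetical inputs: 3 policy features, a 2-dimensional reasoning embedding.
features = [0.5, -1.0, 2.0]
reasoning_emb = [1.0, 0.0]
w_gamma = [[0.1, 0.0], [0.0, 0.2], [-0.5, 0.3]]
w_beta = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
w_action = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

conditioned = inject_reasoning(features, reasoning_emb, w_gamma, w_beta)
action = toy_denoise(conditioned, w_action)
```

The point of the sketch is the data flow: the reasoning output changes the features that the action generator sees, so different reasoning leads to different actions from the same observation.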

Motivation

Achievement

Figure 3: Experimental Results for Factory Sorting. We compared our DiVLA with Diffusion Policy, Octo, TinyVLA, and Open

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: DiffusionVLA creatively combines autoregressive and diffusion models and uses its reasoning injection module to integrate reasoning with action generation, yielding a robot foundation model that is both interpretable and robust in generalization. Real-world experiments on multiple robots and a scalability study demonstrate its practical value, although a deeper analysis of how the modules interact would make the work more complete.
