VITA: Vision-to-Action Flow Matching Policy

Authors: Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani | Date: 2025-07-17 | URL: https://arxiv.org/abs/2507.13231


Essence

VITA is a noise-free flow matching policy that flows directly from visual representations to latent actions. By removing the iterative visual conditioning modules that conventional policies rely on, it substantially improves inference speed and memory efficiency.
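To make the vision-to-action idea concrete, here is a minimal sketch of flow matching in which the source point is a vision latent rather than a Gaussian noise sample. It is an illustration under stated assumptions, not the paper's implementation: it assumes the common straight-line interpolation path, uses plain Python lists as stand-ins for latent vectors, and every name in it (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_integrate`, `oracle_velocity`, and the toy latents) is hypothetical.

```python
# Toy sketch of noise-free flow matching: the flow starts at a vision latent
# (not at Gaussian noise) and ends at an action latent of the same dimension.
# All names and values here are hypothetical and chosen for illustration.

def interpolate(z0, z1, t):
    """Point on the straight-line path from source z0 to target z1, t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def target_velocity(z0, z1):
    """Velocity of the straight-line path, constant in t: z1 - z0."""
    return [b - a for a, b in zip(z0, z1)]

def flow_matching_loss(velocity_fn, z0, z1, t):
    """Mean squared error between predicted and target velocity at time t."""
    zt = interpolate(z0, z1, t)
    pred = velocity_fn(zt, t)
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_integrate(velocity_fn, z0, num_steps=4):
    """Inference: integrate dz/dt = v(z, t) from t=0 to t=1 with Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(z, k * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

# Hypothetical toy latents; a real policy would obtain these from encoders.
vision_latent = [0.2, -1.0, 0.5]   # source of the flow
action_latent = [1.0, 0.0, -0.5]   # target of the flow

def oracle_velocity(z, t):
    """Stand-in for a trained network: returns the exact straight-line velocity."""
    return target_velocity(vision_latent, action_latent)

loss = flow_matching_loss(oracle_velocity, vision_latent, action_latent, t=0.3)
recovered = euler_integrate(oracle_velocity, vision_latent)
```

In this sketch the visual information enters only as the starting point of the ODE, so the velocity function takes no separate conditioning input. A learned velocity network would replace `oracle_velocity`, and the source and target latents must share the same dimensionality for the flow to be defined.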

Motivation

Achievement

Figure 4: Autonomous rollouts of VITA on five challenging real-world tasks, including two bimanual

How

Figure 2: An overview of the VITA architecture: The vision encoder maps observations into a source

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: VITA is a meaningful contribution that cleverly exploits the theoretical flexibility of flow matching to achieve both efficiency and performance in visuomotor policies. Its noise-free framework and flow latent decoding are original technical innovations that considerably improve practicality for robot control.
