CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

์ €์ž: Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He, Fei Wang, Heng Yang | ๋‚ ์งœ: 2026-04-27 | URL: https://arxiv.org/abs/2604.24622 📄 PDF


Essence

Figure 1: Teaser of CF-VLA. Standard flow matching requires multiple iterative steps to recover action structure from un

This paper proposes a coarse-to-fine, two-stage generation framework to address the inefficiency of flow matching-based VLA policies. The first stage transforms Gaussian noise into an action-prior-guided initialization, and the second stage performs a single-step local refinement, reducing inference latency by 75.4% while maintaining performance.
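To make the two-stage idea concrete, here is a minimal sketch that contrasts a multi-step Euler flow matching sampler with a coarse-to-fine sampler. It is not the authors' implementation. The names `euler_flow_sampler`, `coarse_to_fine_sampler`, `CallCounter`, and the toy functions standing in for trained networks are all hypothetical. The additive form of the refinement step is also an assumption, since the summary only says that the second stage is a single-step local refinement.

```python
import numpy as np


class CallCounter:
    """Wraps a function and counts its calls (a rough proxy for network passes)."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def euler_flow_sampler(velocity_fn, noise, num_steps=10):
    """Baseline: integrate a velocity field from t=0 to t=1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)  # one model call per step
    return x


def coarse_to_fine_sampler(coarse_fn, refine_fn, noise):
    """Two-stage sketch: one coarse jump, then one local correction (assumed additive)."""
    x_coarse = coarse_fn(noise)              # stage 1: noise -> prior-guided start
    return x_coarse + refine_fn(x_coarse)    # stage 2: single refinement step


# Toy stand-ins for trained networks (hypothetical, for illustration only).
a_star = np.array([0.5, -0.2, 0.1])          # pretend target action
velocity = CallCounter(lambda x, t: (a_star - x) / (1.0 - t))
coarse = CallCounter(lambda z: a_star + 0.1 * z)   # lands near the target
refine = CallCounter(lambda x: a_star - x)         # removes the residual

noise = np.random.default_rng(0).standard_normal(3)
baseline_action = euler_flow_sampler(velocity, noise, num_steps=10)
two_stage_action = coarse_to_fine_sampler(coarse, refine, noise)
print(velocity.calls, coarse.calls + refine.calls)  # prints: 10 2
```

The call count is only a proxy for cost. The 75.4% latency reduction is a measured figure reported for the paper and is not reproduced by this toy example.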

Motivation

Achievement

How

Figure 3: Geometric view of CF-VLA. Standard flow matching starts

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: CF-VLA๋Š” flow-based VLA ์ •์ฑ…์˜ ๊ตฌ์กฐ์  ๋น„ํšจ์œจ์„ฑ์„ ๋ช…ํ™•ํ•˜๊ฒŒ ํŒŒ์•…ํ•˜๊ณ , coarse-to-fine ๋ถ„ํ•ด๋ฅผ ํ†ตํ•ด ์‹ค์šฉ์ ์ด๊ณ  ํšจ๊ณผ์ ์ธ ํ•ด๊ฒฐ์ฑ…์„ ์ œ์‹œํ•œ๋‹ค. 75.4%์˜ ์ง€์—ฐ์‹œ๊ฐ„ ๊ฐ์†Œ์™€ ์‹ค๋กœ๋ด‡ 83.0% ์„ฑ๊ณต๋ฅ ์€ ๊ฐ•๋ ฅํ•œ ๊ฒฝํ—˜์  ๊ฒ€์ฆ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๋ฐฉ๋ฒ•์˜ ํ”Œ๋Ÿฌ๊ทธ-์•ค-ํ”Œ๋ ˆ์ด ํŠน์„ฑ์œผ๋กœ ์ธํ•ด ๊ด‘๋ฒ”์œ„ํ•œ ์ ์šฉ์„ฑ์„ ๊ฐ€์ง„๋‹ค. ๋‹ค๋งŒ ์ด๋ก ์  ๋ถ„์„๊ณผ ๋” ๊นŠ์€ ํ†ต์ฐฐ์ด ์ถ”๊ฐ€๋˜๋ฉด ๋”์šฑ ์™„์„ฑ๋„ ์žˆ๋Š” ์—ฐ๊ตฌ๊ฐ€ ๋  ๊ฒƒ์ด๋‹ค.
