DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

Authors: Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao | Date: 2025-11-27 | URL: https://arxiv.org/abs/2511.22134


Essence

Figure 1. DUALVLA first constructs a sparse, information-dense embodied reasoning dataset by combining video event predi

DualVLA addresses the loss of action performance (action degeneration) that occurs when reasoning ability is added to a Vision-Language-Action model. It proposes partially decoupling reasoning from action through dual-layer data pruning and a dual-teacher adaptive distillation strategy.
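To make the dual-teacher idea concrete, here is a minimal sketch of domain-routed distillation: each training sample is supervised by whichever teacher matches its data domain. This is an illustration under stated assumptions, not the paper's formulation. The function names, the `"action"` / `"reasoning"` domain labels, the hard routing rule, and the use of a KL-divergence loss are all hypothetical choices made for this example; the review text does not specify how the adaptive strategy is implemented.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dual_teacher_loss(student_logits, action_teacher_logits,
                      reasoning_teacher_logits, domain):
    """Distillation loss for one sample, routed by data domain.

    ASSUMPTION: hard routing on a hypothetical `domain` label. The actual
    adaptive scheme in the paper may differ (for example, soft weighting).
    """
    student = softmax(student_logits)
    if domain == "action":
        teacher = softmax(action_teacher_logits)
    elif domain == "reasoning":
        teacher = softmax(reasoning_teacher_logits)
    else:
        raise ValueError(f"unknown domain: {domain!r}")
    # Pull the student's distribution toward the selected teacher's.
    return kl_divergence(teacher, student)


# Usage: the same student output is scored against a different teacher
# depending on which domain the sample comes from.
student = [2.0, 0.0]
action_teacher = [2.0, 0.0]
reasoning_teacher = [0.0, 2.0]
loss_action = dual_teacher_loss(student, action_teacher, reasoning_teacher, "action")
loss_reasoning = dual_teacher_loss(student, action_teacher, reasoning_teacher, "reasoning")
```

Routing by domain keeps the two supervision signals from competing on the same sample, which is one plausible reading of "partial decoupling"; a soft mixture of both teachers would be an equally reasonable variant.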

Motivation

Achievement

Figure 2. VLMs possess strong reasoning ability but lack action

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ Vision-Language-Action ๋ชจ๋ธ์˜ ์‹ค์งˆ์ ์ธ ๋ฌธ์ œ์ธ action degeneration์„ ๋ช…ํ™•ํžˆ ์ •์˜ํ•˜๊ณ , ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ์ด์ค‘์ธต ํ”„๋ฃจ๋‹๊ณผ ์ด์ค‘ ๊ต์‚ฌ ์ฆ๋ฅ˜ ์ „๋žต์„ ์ œ์‹œํ•จ์œผ๋กœ์จ ์ถ”๋ก  ๋Šฅ๋ ฅ๊ณผ ์กฐ์ž‘ ๋Šฅ๋ ฅ์˜ ๊ท ํ˜•์„ ํšจ๊ณผ์ ์œผ๋กœ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ํŠนํžˆ VLA ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ ๋‹ค์ฐจ์›์  ํ”„๋ ˆ์ž„์›Œํฌ ์ œ์‹œ๋Š” ํ–ฅํ›„ embodied AI ์—ฐ๊ตฌ์˜ ํ‰๊ฐ€ ํ‘œ์ค€์œผ๋กœ์„œ ์ค‘์š”ํ•œ ๊ธฐ์—ฌ๋ฅผ ํ•œ๋‹ค.
