ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

์ €์ž: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang | ๋‚ ์งœ: 2025-07-22 | URL: https://arxiv.org/abs/2507.16815 📄 PDF


Essence

Figure 1: We introduce ThinkAct, a reasoning VLA framework capable of thinking before acting. Through

ThinkAct proposes a dual-system framework for vision-language-action reasoning tasks that connects high-level reasoning to low-level action execution through reinforcement-learning-based visual latent planning. Reasoning plans generated by a multimodal LLM are compressed into a visual plan latent that conditions a downstream action model, which yields long-horizon planning, few-shot adaptation, and self-correction.

Motivation

Achievement

How

Figure 2: Overview of our ThinkAct. (a) Given observation o_t and instruction l, ThinkAct advances action-

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ThinkAct๋Š” ํ–‰๋™ ์ •๋ ฌ ์‹œ๊ฐ ๋ณด์ƒ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ํ˜์‹ ์ ์ธ GRPO ๊ฐ•ํ™”ํ•™์Šต๊ณผ ์‹œ๊ฐ ์ž ์žฌ ๊ณ„ํš ์••์ถ•์„ ํ†ตํ•ด Vision-Language-Action ๋ชจ๋ธ์— ๊ตฌ์กฐํ™”๋œ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ํšจ๊ณผ์ ์œผ๋กœ ๋ถ€์—ฌํ•œ๋‹ค. ์žฅ๊ธฐ ๊ณ„ํš, ์†Œ์ˆ˜์ƒท ์ ์‘, ์ž์ฒด ์ˆ˜์ • ๋Šฅ๋ ฅ์„ ๋™์‹œ์— ๋‹ฌ์„ฑํ•œ ์ ์—์„œ ๊ตฌ์ฒดํ™”๋œ AI ๋ฐ ๋กœ๋ด‡ ์กฐ์ž‘ ๋ถ„์•ผ์— ์˜๋ฏธ ์žˆ๋Š” ๊ธฐ์—ฌ๋ฅผ ํ•œ๋‹ค.
