TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé, Andrey Kolobov, Furong Huang, Jianwei Yang | Date: 2024-12-13 | URL: https://arxiv.org/abs/2412.10345


Essence

Figure 1: An illustration of our method. The first image shows the original robot's observation, while the second

This study improves performance on robot manipulation tasks by using visual trace prompting to strengthen the spatial-temporal awareness of VLA models. The authors collect a dataset of 150K robot manipulation trajectories and develop the TraceVLA model, which demonstrates strong performance in both simulation and real-robot environments.
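To make the idea of a visual trace concrete, here is a minimal sketch, not the authors' implementation. It assumes the past 2D pixel positions of a few tracked points are already available (how they are obtained is outside this sketch), draws them as polylines on a copy of the current frame, and returns the original and the overlaid image as a pair. The names `overlay_trace` and `build_prompt_images`, the list-of-rows image format, and the red trace color are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # RGB
Image = List[List[Pixel]]      # rows of pixels (assumed toy image format)
Point = Tuple[int, int]        # (x, y) pixel coordinate


def line_pixels(p0: Point, p1: Point) -> List[Point]:
    """Integer pixels on the segment p0 -> p1, by linear interpolation."""
    (x0, y0), (x1, y1) = p0, p1
    steps = max(abs(x1 - x0), abs(y1 - y0))
    if steps == 0:
        return [p0]
    return [
        (round(x0 + (x1 - x0) * t / steps), round(y0 + (y1 - y0) * t / steps))
        for t in range(steps + 1)
    ]


def overlay_trace(frame: Image, trace: List[Point],
                  color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy of `frame` with the polyline through `trace` drawn on it."""
    height, width = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # copy: the original observation is kept
    if len(trace) == 1:
        segments = [(trace[0], trace[0])]
    else:
        segments = list(zip(trace, trace[1:]))
    for p0, p1 in segments:
        for x, y in line_pixels(p0, p1):
            if 0 <= x < width and 0 <= y < height:  # drop off-image pixels
                out[y][x] = color
    return out


def build_prompt_images(frame: Image,
                        traces: List[List[Point]]) -> Tuple[Image, Image]:
    """Pair the untouched frame with a version carrying all traces."""
    overlaid = frame
    for trace in traces:
        overlaid = overlay_trace(overlaid, trace)
    return frame, overlaid
```

Calling `build_prompt_images(frame, traces)` leaves `frame` unmodified, so a policy could in principle be given both the raw observation and the trace-annotated one. Whether and how both images are consumed is a modeling decision this sketch does not address.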

Motivation

Achievement

Figure 3: (Left): 7B TraceVLA vs. 7B OpenVLA. (Right): 4B TraceVLA-Phi3 vs. 4B OpenVLA-Phi3.

How

Figure 2: An illustration of visual trace generation. Given a sequence of historical image observations, we first

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: Visual trace prompting is an intuitive yet effective technique that substantively improves the spatial-temporal awareness of VLA models, and extensive experiments (in simulation and on real robots) consistently demonstrate strong performance. As a paper published at ICLR 2025, it makes a substantial practical contribution to the field of robot manipulation.
