FAST: Efficient Action Tokenization for Vision-Language-Action Models

Authors: Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine | Date: 2025-01-16 | URL: https://arxiv.org/abs/2501.09747


Essence

Figure 2

Fig. 2: Left: FAST tokenization enables training of autoregres-

The paper proposes FAST, a discrete cosine transform (DCT) based method for robot action tokenization, which makes it possible to train autoregressive VLAs effectively on high-frequency, high-precision robot control tasks.
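To make the DCT idea concrete, here is a minimal sketch of compressing one dimension of a normalized action chunk with a DCT followed by rounding. It is an illustration under stated assumptions, not the authors' implementation: the function names (`dct_ii`, `idct_ii`, `encode_chunk`, `decode_chunk`), the `scale` parameter, the scale-and-round quantization step, and the synthetic sine-wave chunk are all hypothetical, and any further compression of the integer codes is out of scope here.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct_ii(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def encode_chunk(actions, scale=10.0):
    """Map one action dimension to integer codes.

    ASSUMPTION: `scale` and the round-to-nearest quantization are
    illustrative choices, not values taken from the paper.
    """
    return [round(scale * c) for c in dct_ii(actions)]


def decode_chunk(codes, scale=10.0):
    """Undo the scaling and invert the DCT to recover approximate actions."""
    return idct_ii([q / scale for q in codes])


# Synthetic, already-normalized chunk: 50 steps of a smooth trajectory.
chunk = [math.sin(2 * math.pi * t / 50) for t in range(50)]

codes = encode_chunk(chunk)
recon = decode_chunk(codes)

nonzero = sum(1 for q in codes if q != 0)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
print(f"{nonzero} of {len(codes)} codes are non-zero; max error {max_err:.4f}")
```

For a smooth signal like this one, most high-frequency coefficients round to zero, so the chunk is described by a handful of integers rather than one value per time step. A larger `scale` keeps more coefficients and lowers the reconstruction error at the cost of less compression.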

Motivation

Achievement

Figure 1

Fig. 1: We propose FAST, a simple yet effective approach

How

Figure 4

Fig. 4: Overview of the FAST action tokenization pipeline. Given a normalized chunk of actions, we apply discrete cosine

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: The paper presents an elegant and effective tokenization method that greatly improves the practicality of autoregressive VLAs on high-frequency robot control tasks. The novelty of the DCT-based approach, the extensive experiments, and the 5x faster training at equivalent performance make this an excellent paper that can have an immediate impact on the robot learning community.
