OpenVLA: An Open-Source Vision-Language-Action Model

Authors: Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn | Date: 2024-06-13 | URL: https://arxiv.org/abs/2406.09246


Essence


Figure 1: We present OpenVLA, a 7B-parameter open-source vision-language-action model (VLA), trained

OpenVLA is a 7B-parameter open-source Vision-Language-Action model trained on 970k robot demonstrations. It outperforms closed models while supporting efficient fine-tuning and deployment.

Motivation

Achievement


Figure 3: BridgeData V2 WidowX robot evaluation tasks and results. We evaluate OpenVLA and prior

How


Figure 2: OpenVLA model architecture. Given an image observation and a language instruction, the model

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: OpenVLA surpasses closed, large-scale VLA models with fewer parameters, and it pairs a fully open-source release with efficient fine-tuning methods, making an important contribution to building a foundation-model ecosystem for robotics.
