VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

Authors: Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu | Date: 2024-12-24 | URL: https://arxiv.org/abs/2412.18194


Essence

Figure 1. Overview of VLABench. VLABench is a large-scale language-conditioned manipulation benchmark to evaluate the co

VLABench is a large-scale robot manipulation benchmark designed to evaluate the capabilities of Vision-Language-Action (VLA) models. It provides 100 tasks that require natural language instruction following, commonsense transfer, and long-horizon reasoning.

Motivation

Achievement

How

Figure 3. Task examples in each dimension. The first row showcases examples of primitive tasks from Section 3.1, while t

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: VLABench is the first comprehensive benchmark for evaluating foundation-model-based robot manipulation research. It systematically introduces important dimensions that existing benchmarks overlooked, including natural language instructions, commonsense transfer, and long-horizon reasoning. By clearly exposing the limitations of current SOTA models, it is expected to play an important role in setting the direction of future VLA and embodied AI research.
