EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

Authors: Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei | Date: 2024-06-09 | URL: https://arxiv.org/abs/2406.05756


Essence

Figure 1: Comparison between EmbSpatial-Bench and

To evaluate how well Large Vision-Language Models (LVLMs) understand space in embodied environments, the paper builds EmbSpatial-Bench, a benchmark covering six spatial relations from an egocentric perspective, and presents EmbSpatial-SFT, an instruction-tuning dataset intended to improve that ability.

Motivation

Achievement

Figure 2: Overview of the construction pipeline for EmbSpatial-Bench based on existing annotated 3D environments.

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ embodied AI์˜ ํ•ต์‹ฌ ๋Šฅ๋ ฅ์ธ spatial understanding์„ ์ฒด๊ณ„์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด egocentric ๊ด€์ ์˜ ๋ฒค์น˜๋งˆํฌ๋ฅผ ์ฒ˜์Œ์œผ๋กœ ์ œ์‹œํ•˜๋ฉฐ, 3D ํ™˜๊ฒฝ ๊ธฐ๋ฐ˜์˜ ์ž๋™ ๊ตฌ์ถ• ํŒŒ์ดํ”„๋ผ์ธ๊ณผ ๊ฐœ์„  ๋ฐ์ดํ„ฐ์…‹์„ ํ†ตํ•ด ํ˜„์žฌ LVLM์˜ ๋ช…ํ™•ํ•œ ๋ถ€์กฑํ•จ์„ ๋“œ๋Ÿฌ๋‚ด๊ณ  ๊ฐœ์„  ๋ฐฉํ–ฅ์„ ์ œ์‹œํ•œ๋‹ค๋Š” ์ ์—์„œ embodied AI ์ปค๋ฎค๋‹ˆํ‹ฐ์— ์ค‘์š”ํ•œ ๊ธฐ์—ฌ๋ฅผ ํ•œ๋‹ค.
