An Embodied Generalist Agent in 3D World

Authors: Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang | Date: 2023-11-18 | URL: https://arxiv.org/abs/2311.12871


Essence

Figure 1: The proposed embodied generalist agent LEO. It takes egocentric 2D images, 3D point clouds, and texts as input

LEO is the first embodied generalist agent that takes egocentric 2D images, 3D point clouds, and text as input and can perceive, ground, reason, plan, and act in 3D environments. It uses a unified model architecture and training objective and is trained in two stages: 3D vision-language alignment, followed by 3D vision-language-action instruction tuning.

Motivation

Achievement

How

Figure 2: Our proposed LLM-assisted 3D-language data generation pipeline and data examples. (Top-left) Messages with 3D

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: LEO marks an important milestone in the development of embodied generalist agents for 3D environments, demonstrating that a single unified architecture can handle a diverse range of 3D tasks. The LLM-assisted data generation pipeline is a practical contribution that addresses the real difficulty of collecting 3D data, and the extensive experiments and ablation studies strengthen the credibility of the work.
