MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

Authors: Yiran Qin, Enshen Zhou, Qichang Liu, Zhenfei Yin, Lu Sheng, Ruimao Zhang, Yu Qiao, Jing Shao | Date: 2023-12-12 | URL: https://arxiv.org/abs/2312.07472


Essence

Figure 2. Overview of module interaction in MP5. After receiving the task instruction, MP5 first utilizes Parser to gene

MP5 is a multi-module embodied system built on MLLMs for solving long-horizon, open-ended tasks in Minecraft. It uses an active perception scheme to handle both process dependency and context dependency.

Motivation

Achievement

Figure 1. The process of finishing the task “kill a pig with a stone sword during the daytime near the water with grass

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: MP5 presents a creative approach that handles process-dependent and context-dependent tasks in a unified way through its active perception scheme, and it shows substantive progress in MLLM-based embodied AI. However, its absolute performance figures and its transferability to real-world environments need further validation.
