Magma: A Foundation Model for Multimodal AI Agents

์ €์ž: Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao | ๋‚ ์งœ: 2025-02-18 | URL: https://arxiv.org/abs/2502.13130 📄 PDF


Essence

Figure 1. We introduce Magma, the first foundation model that is capable of interpreting and grounding multimodal inputs

Magma is a multimodal foundation model that can carry out a range of agentic tasks in both digital and physical environments, from UI navigation to robot manipulation. Using the Set-of-Mark (SoM) and Trace-of-Mark (ToM) techniques, it acquires spatial-temporal intelligence and performs language understanding and action prediction together.

Motivation

Achievement

How

Figure 4. Trace-of-Mark supervisions for robot manipulation (left) and human action (right). Same coordinate normalizati

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: Magma๋Š” ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์—์ด์ „ํŠธ ์—ฐ๊ตฌ์—์„œ ์ค‘์š”ํ•œ ์ด์ •ํ‘œ๋ฅผ ์ œ์‹œํ•˜๋Š” ์‹ค์งˆ์ ์ธ ๊ธฐ์ดˆ ๋ชจ๋ธ์ด๋ฉฐ, SoM/ToM์„ ํ†ตํ•œ ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜ ๊ธฐ๋ฒ•์˜ ์šฐ์•„ํ•จ๊ณผ ์‹ค์ฆ์  ์„ฑ๊ณผ(UI ๋ฐ ๋กœ๋ด‡ SOTA)๊ฐ€ ๋†’์€ ์ž„ํŒฉํŠธ๋ฅผ ์‹œ์‚ฌํ•œ๋‹ค. ๊ณต๊ฐœ ๊ณต๊ฐœ์™€ ํ•จ๊ป˜ ์ถ”ํ›„ ์—ฐ๊ตฌ์˜ ๊ธฐ๋ฐ˜์ด ๋  ๊ฐ€๋Šฅ์„ฑ์ด ํฌ๋‹ค.
