Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Authors: Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu | Date: 2025-05-30 | URL: https://arxiv.org/abs/2506.00123


Essence

Figure 1: Overview of VeBrain and VeBrain-600k. Compared to existing MLLMs, VeBrain achieves

VeBrain is a framework that unifies perception, reasoning, and control capabilities in a multimodal large language model (MLLM), reformulating robot control tasks as text-based MLLM tasks in 2D visual space.

Motivation

Achievement

How

Figure 2: Illustration of VeBrain architecture and robotic adapter. In VeBrain, the MLLM is capable

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 4/5 Clarity: 4/5 Overall: 4/5

Overall: VeBrain is an innovative approach that unifies multimodal understanding and robot control as common MLLM tasks in 2D visual space; it demonstrates strong performance across a broad range of benchmarks and robot experiments and represents an important advance in embodied AI.
