DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Authors: Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang | Date: 2024-11-04 | URL: https://arxiv.org/abs/2411.02359


Essence

Figure 1: Left: Dynamic inference of DeeR. For inference, we adaptively activate an appropriate size of MLLM

DeeR-VLA is a dynamic early-exit framework for multimodal large language models (MLLMs). It automatically adjusts how much of the model is activated to suit each situation the robot faces, improving computational efficiency by 5.2-6.5x.
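The early-exit idea can be illustrated with a minimal sketch. It assumes a stack of sequential blocks, a shared action head, and a stopping rule that exits once the actions predicted at two consecutive exits differ by less than a threshold. This review does not spell out the exit criterion, so that rule, along with the names `early_exit_inference`, `make_toy_blocks`, and `threshold`, is a hypothetical choice for illustration and not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def early_exit_inference(
    features: Vector,
    blocks: Sequence[Block],
    action_head: Callable[[Vector], Vector],
    threshold: float,
) -> Tuple[Vector, int]:
    """Run blocks in order and stop once consecutive predictions agree.

    Returns the predicted action and the number of blocks actually used.
    The agreement-based stopping rule is an assumption of this sketch.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    hidden = features
    prev_action = None
    action: Vector = []
    for depth, block in enumerate(blocks, start=1):
        hidden = block(hidden)
        action = action_head(hidden)
        # Exit early when this exit's action is close to the previous one.
        if prev_action is not None and l2_distance(action, prev_action) < threshold:
            return action, depth
        prev_action = action
    # No early exit triggered: the full stack was used.
    return action, len(blocks)


def make_toy_blocks(n: int, target: float) -> List[Block]:
    """Toy blocks: each one moves every value halfway toward `target`."""
    return [lambda h, t=target: [(x + t) / 2 for x in h] for _ in range(n)]


if __name__ == "__main__":
    blocks = make_toy_blocks(6, target=1.0)
    action, depth = early_exit_inference([0.0], blocks, lambda h: list(h), threshold=0.2)
    print(f"exited after {depth} of {len(blocks)} blocks with action {action}")
```

In this toy setup a larger threshold exits sooner and spends less compute, while a threshold of zero always runs the full stack, which mirrors the accuracy versus cost trade-off the framework targets.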

Motivation

Achievement

Figure 3: Results atop OpenFlamingo 3B. Upper: Avg. successful len v.s. avg. LLM GFLOPs. Bottom:

How

Figure 2: Multi-exit MLLM architecture for robot.

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall: DeeR-VLA presents a practical and innovative approach to making MLLMs efficient for robot control. Cutting computational cost by more than 5x while maintaining performance substantially improves the feasibility of real-world robot deployment.
