MEM: Multi-Scale Embodied Memory for Vision Language Action Models

Authors: Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z. Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, Karan Dhabalia, Michael Equi, Quan Vuong, Jost Tobias Springenberg, Sergey Levine, Chelsea Finn, Danny Driess | Date: 2026-03-04 | URL: https://arxiv.org/abs/2603.03596


Essence

Figure 1

Fig. 1: Multi-Scale Embodied Memory (MEM) equips Vision Language Action Models (VLAs) with memory for solving long-horiz

To support long-horizon robot tasks, the paper proposes Multi-Scale Embodied Memory (MEM), which combines a video-based short-term memory with a text-based long-term memory, and uses it to build a Vision Language Action model that can carry out complex manipulation tasks lasting more than 15 minutes.
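The two memory scales named above can be pictured with a small toy data structure. This is only an illustrative sketch, not the paper's implementation: the class name `TwoScaleMemory`, its methods, the window size, and the example strings are all hypothetical, and frames are stood in for by plain strings. It assumes the short-term side keeps a bounded window of recent observations while the long-term side accumulates text notes.

```python
from collections import deque


class TwoScaleMemory:
    """Toy two-scale memory (hypothetical; not the MEM architecture itself)."""

    def __init__(self, max_frames=4):
        # Short-term side: bounded window, oldest frames are dropped automatically.
        self.frames = deque(maxlen=max_frames)
        # Long-term side: compact text notes that persist for the whole episode.
        self.notes = []

    def observe(self, frame):
        """Add the newest observation to the short-term window."""
        self.frames.append(frame)

    def summarize(self, text):
        """Append a text note to the long-term memory."""
        self.notes.append(text)

    def context(self):
        """Return what a policy would be conditioned on at this step."""
        return {
            "recent_frames": list(self.frames),
            "text_memory": " ".join(self.notes),
        }


# Made-up example data, for illustration only.
memory = TwoScaleMemory(max_frames=3)
for step in range(5):
    memory.observe(f"frame_{step}")
memory.summarize("picked up the sponge")
memory.summarize("wiped the counter")
ctx = memory.context()
```

In this sketch the short-term window forgets old frames by construction, while the text notes stay small enough to keep for an entire episode; how the paper actually encodes, compresses, or updates either memory is not specified in this review.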

Motivation

Achievement

How

Figure 2

Figure 2 shows an overview of our MEM system. Our goal

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper creatively proposes a multi-scale memory architecture for long-horizon robot tasks and is the first to successfully carry out complex manipulation tasks lasting more than 15 minutes, an important contribution that greatly improves the practicality of real-world robot automation.
