MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos

์ €์ž: Rutav Shah, Shuijing Liu, Qi Wang, Zhenyu Jiang, Sateesh Kumar, Mingyo Seo, Roberto Martรญn-Martรญn, Yuke Zhu | ๋‚ ์งœ: 2025-09-11 | URL: https://arxiv.org/abs/2509.09769 📄 PDF


Essence

Fig. 1: Overview. MIMICDROID enables few-shot learning for humanoid manipulation by training solely on human play

MimicDroid trains only on videos of humans freely interacting with their surroundings (human play videos). This lets a humanoid robot perform new manipulation tasks efficiently through In-Context Learning (ICL).
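In in-context learning, the policy's weights stay frozen at test time. A new task is specified only by the demonstrations placed in the policy's context. The toy below is a minimal sketch of that property. It assumes observations and actions are plain float vectors and uses softmax attention over dot-product similarity in place of a learned model. `in_context_action` and its parameters are hypothetical and are not MimicDroid's architecture.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation: observations and actions are plain float vectors.
Vector = Sequence[float]
Demo = List[Tuple[Vector, Vector]]  # one demonstration as (observation, action) pairs


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def in_context_action(context: List[Demo], query_obs: Vector, temperature: float = 1.0) -> List[float]:
    """Predict an action for query_obs from context demonstrations alone.

    Nothing is trained or updated here: the output depends only on the frozen
    similarity rule and on which demonstrations are in the context. The softmax
    attention over dot products is a toy stand-in for a learned policy.
    """
    pairs = [pair for demo in context for pair in demo]
    if not pairs:
        raise ValueError("context must contain at least one (observation, action) pair")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scores = [_dot(obs, query_obs) / temperature for obs, _ in pairs]
    peak = max(scores)  # subtract the max before exp() for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    action_dim = len(pairs[0][1])
    # Attention-weighted average of the context actions, one dimension at a time.
    return [
        sum(w * act[i] for w, (_, act) in zip(weights, pairs)) / total
        for i in range(action_dim)
    ]


if __name__ == "__main__":
    # Same function, two different contexts: behaviour changes without any weight update.
    push_right = [([1.0, 0.0], [0.5, 0.0]), ([0.0, 1.0], [0.0, -0.5])]
    push_left = [([1.0, 0.0], [-0.5, 0.0]), ([0.0, 1.0], [0.0, 0.5])]
    query = [10.0, 0.0]
    print(in_context_action([push_right], query))
    print(in_context_action([push_left], query))
```

The function itself never changes, so swapping in a different set of demonstrations changes the predicted action. A learned ICL policy replaces this fixed similarity rule with a trained network, but it is still steered only by what sits in its context.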

Motivation

Achievement

Fig. 4: Overview of our simulation benchmark. We introduce a simulation benchmark to evaluate few-shot learning for

How

Fig. 2: Method Overview. MIMICDROID performs meta-training for in-context learning (Meta-ICL) by constructing context-
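Meta-training for in-context learning generally builds many small episodes. Each episode pairs a few context examples with a held-out target, so the model learns to infer the task from its context. The sketch below shows one hypothetical way to assemble such episodes from segmented play data. It assumes each segment has a feature vector and takes a target's k nearest neighbours in that space as its context. The feature representation, the nearest-neighbour rule, and the names `context_for_target` and `sample_episode` are all assumptions for illustration, not the selection procedure used by MimicDroid.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical: each play segment is summarised by a fixed-length feature vector.
Feature = Sequence[float]


def _sq_dist(a: Feature, b: Feature) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def context_for_target(features: List[Feature], target_idx: int, k: int) -> List[int]:
    """Indices of the k segments closest to the target, excluding the target itself."""
    if not 0 < k < len(features):
        raise ValueError("need 0 < k < number of segments")
    candidates = [i for i in range(len(features)) if i != target_idx]
    candidates.sort(key=lambda i: _sq_dist(features[i], features[target_idx]))
    return candidates[:k]


def sample_episode(features: List[Feature], k: int, rng: random.Random) -> Tuple[List[int], int]:
    """Sample one meta-training episode as (context indices, target index).

    During training the model would see the context segments (observations and
    actions) and be supervised on the target segment's actions.
    """
    target_idx = rng.randrange(len(features))
    return context_for_target(features, target_idx, k), target_idx


if __name__ == "__main__":
    # Toy 1-D features: segments 0, 1, 4 form one behaviour cluster, 2 and 3 another.
    feats = [[0.0], [0.1], [5.0], [5.2], [0.05]]
    print(context_for_target(feats, target_idx=0, k=2))
    print(sample_episode(feats, k=2, rng=random.Random(0)))
```

Similar segments are chosen so that the target is predictable from the context. That gives the model a reason to attend to its context rather than ignore it. With randomly chosen context, the model would have little incentive to use what it is shown.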

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: MimicDroid๋Š” human play videos๋ผ๋Š” ํ˜„์‹ค์ ์ด๊ณ  ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ๋ฐ์ดํ„ฐ ์†Œ์Šค๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํœด๋จธ๋…ธ์ด๋“œ ๋กœ๋ด‡์˜ In-Context Learning ๊ธฐ๋ฐ˜ ์กฐ์ž‘์„ ์‹คํ˜„ํ•œ ํ˜์‹ ์ ์ธ ์—ฐ๊ตฌ์ด๋ฉฐ, ๋ช…ํ™•ํ•œ ๋ฐฉ๋ฒ•๋ก , ๊ฐ•๋ ฅํ•œ ์‹ค์ฆ์  ๊ฒฐ๊ณผ, ๊ทธ๋ฆฌ๊ณ  ๊ณต๊ฐœ ๋ฒค์น˜๋งˆํฌ๋ฅผ ํ†ตํ•ด ๋กœ๋ด‡ ํ•™์Šต ๋ถ„์•ผ์— ์‹ค์งˆ์ ์ธ ๊ธฐ์—ฌ๋ฅผ ํ•œ๋‹ค.
