HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

Authors: Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos | Date: 2026 | DOI: 10.48550/ARXIV.2605.24934


Essence

HumanEgo is a framework for learning robot manipulation policies from human egocentric video. It bridges the embodiment gap with Interaction-Centric Tokens (ICT) and combines a flow matching policy with dense auxiliary objectives, reaching a 92.5% success rate from only 30 minutes of human video.
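For readers unfamiliar with flow matching, the sketch below shows the generic straight-path (rectified-flow style) training objective and an Euler sampler. It is an illustration only, not the authors' implementation: this review does not state the paper's exact formulation, and the names `velocity_fn`, `cond`, `flow_matching_loss`, and `sample_action` are hypothetical. Here `cond` stands in for whatever conditioning the policy receives (for example, the interaction-centric tokens).

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path, constant in t: action - noise."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(velocity_fn, action, cond, rng):
    """One-sample estimate of the flow matching regression loss.

    velocity_fn(x_t, t, cond) is a hypothetical learned velocity field.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    pred = velocity_fn(x_t, t, cond)
    target = target_velocity(noise, action)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(action)


def sample_action(velocity_fn, cond, dim, rng, steps=10):
    """Generate an action by Euler-integrating the velocity field from noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, the loss would be averaged over many (action, conditioning) pairs and minimized with respect to the parameters of `velocity_fn`. At inference time only `sample_action` is needed.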

Motivation

Achievement

Figure 4

Fig. 4: Overall Real-World Evaluation. Real-world success rate (%) for each method across

How

Figure 2

Fig. 2: System overview of HumanEgo. Arm inpainting and visual keypoints bridge the visual gap;

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 5/5 | Clarity: 4/5 | Overall: 4/5

Overall: HumanEgo offers a clear solution to the problem of learning robot policies from human egocentric video. Its combination of an innovative representation, Interaction-Centric Tokens, with dense auxiliary supervision is technically sound, and the 92.5% success rate from 30 minutes of video, together with zero-shot transfer, is of considerable practical significance. However, the reliance on Aria sensors and the limited range of evaluated tasks raise questions about how well the approach generalizes.
