Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

Authors: Sicheng Xie, Haidong Cao, Zejia Weng, Zhen Xing, Haoran Chen, Shiwei Shen, Jiaqi Leng, Zuxuan Wu, Yu-Gang Jiang | Date: 2025-02-23 | URL: https://arxiv.org/abs/2502.16587


Essence

Figure 1: HUMAN2ROBOT: A human-video-conditioned

The paper presents H&R, a dataset of precisely aligned human-robot video pairs collected via VR teleoperation, together with the Human2Robot framework built on it. Using a Video Prediction Model, the framework learns robot actions from human motion at the frame level and generalizes to unseen tasks.

Motivation

Achievement

Figure 2: Dataset Overview. (L) The ratio of four basic task

How

Figure 3: Architecture overview of HUMAN2ROBOT. Our approach consists of two training stages. In the first stage, we tra

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: An influential study that addresses a fundamental limitation of learning robot behavior from humans by combining precise data collection via VR teleoperation with a conditional video generation paradigm. However, the unresolved embodiment gap and the limited scope of the evaluation somewhat constrain its practical applicability.
