Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

Authors: Wandong Sun, Yongbo Su, Leoric Huang, Alex Zhang, Dwyane Wei, Mu San, Daniel Tian, Ellie Cao, Finn Yan, Ethan Xie, Zongwu Xie | Date: 2026-02-06 | URL: https://arxiv.org/abs/2602.06382


Essence

Figure 1

Fig. 1: Overview. Our end-to-end vision-based humanoid locomotion policy enables robust traversal across diverse challen

To learn end-to-end humanoid robot locomotion from raw depth images, the paper presents a framework that combines realistic depth sensor simulation, vision-aware behavior distillation, and terrain-specific multi-critic/multi-discriminator learning.

Motivation

Achievement

How

Figure 4

Fig. 4: Method Overview. Our framework consists of two stages: (1)

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper systematically addresses two fundamental challenges in vision-based humanoid locomotion: the sim-to-real gap and unified learning across diverse terrains. It presents a creative framework that combines realistic sensor modeling, behavior distillation, and terrain-specific learning. Extensive validation on two real robot platforms, ranging from extreme obstacles to fine-grained tasks, gives the work high academic and practical value.
