Global-Local Attention Decomposition for Terrain Encoding in Humanoid Perceptive Locomotion

์ €์ž: Shengcheng Fu, Yang Zhang, Zhanxiang Cao, Liyun Yan, Yizhi Chen, Yunpeng Yin, Yue Gao | ๋‚ ์งœ: 2026 | DOI: 10.48550/ARXIV.2606.00637 📄 PDF


Essence

Fig. 2.

๋ณธ ๋…ผ๋ฌธ์€ ์ธ๊ฐ„ํ˜• ๋กœ๋ด‡์˜ ์ง€ํ˜• ์ธ์‹ ๋ณดํ–‰์„ ์œ„ํ•ด Global-Local Attention Decomposition (GLAD)์ด๋ผ๋Š” ์ƒˆ๋กœ์šด terrain encoder๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๊ด‘๋ฒ”์œ„ํ•œ ์ง€ํ˜• ๋งฅ๋ฝ ์ดํ•ด์™€ ์ •ํ™•ํ•œ ๋ฐœํŒ ์„ ํƒ์ด๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ์ง€๊ฐ ๋ชฉํ‘œ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋ถ„๋ฆฌํ•จ์œผ๋กœ์จ sparse-foothold terrain์—์„œ์˜ ์•ˆ์ •์ ์ธ ๋ณดํ–‰์„ ๋‹ฌ์„ฑํ•œ๋‹ค.

Motivation

Achievement

Fig. 1. Real-world locomotion results on the Unitree G1 humanoid robot.

• GLAD enables stable locomotion on sparse-foothold terrain (stepping stones, staircases, and gaps of 70 cm or more).

• Without an explicit navigation planner, and given only simple velocity commands, the policy shows emergent terrain-responsive behaviors such as narrow-path following and obstacle avoidance.

• Achieves zero-shot sim-to-real transfer on the Unitree G1 humanoid robot, using an elevation map built from onboard LiDAR.
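
For readers unfamiliar with the input representation, the following is a minimal sketch of how LiDAR returns might be rasterized into a robot-centric elevation map. It is not taken from the paper: the function name `build_elevation_map`, the grid size, the resolution, and the max-height-per-cell rule are all assumptions chosen for illustration. Cells that receive no returns are left as `None`.

```python
import math

def build_elevation_map(points, robot_xy, robot_yaw, size=5, resolution=0.1):
    """Rasterize world-frame (x, y, z) points into a size x size height grid
    centered on the robot and aligned with its heading (hypothetical helper)."""
    grid = [[None] * size for _ in range(size)]
    half = size * resolution / 2.0
    cos_y, sin_y = math.cos(-robot_yaw), math.sin(-robot_yaw)
    for x, y, z in points:
        dx, dy = x - robot_xy[0], y - robot_xy[1]
        # Rotate the offset into the robot frame.
        rx = cos_y * dx - sin_y * dy
        ry = sin_y * dx + cos_y * dy
        if not (-half <= rx < half and -half <= ry < half):
            continue  # outside the mapped window
        row = min(size - 1, int((rx + half) / resolution))
        col = min(size - 1, int((ry + half) / resolution))
        # Assumed rule: keep the highest return seen in each cell.
        if grid[row][col] is None or z > grid[row][col]:
            grid[row][col] = z
    return grid

points = [(0.0, 0.0, 0.3), (0.0, 0.0, 0.5), (-0.2, 0.2, 0.1), (10.0, 10.0, 1.0)]
elevation = build_elevation_map(points, robot_xy=(0.0, 0.0), robot_yaw=0.0)
observed = sum(cell is not None for row in elevation for cell in row)
print(observed, "of", len(elevation) ** 2, "cells observed")
```

The unobserved cells are where an elevation-map pipeline has to guess, which is the occlusion issue raised under Limitation & Further Study.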

How

Fig. 2.

โ€ข CNN์„ ํ†ตํ•œ robot-centric elevation map์˜ spatially aligned local feature ์ถ”์ถœ

โ€ข Global attention branch: attention pooling์œผ๋กœ ์ฃผ๋ณ€ ์ง€ํ˜• ๋งฅ๋ฝ ์š”์•ฝ

โ€ข Local attention branch: state-conditioned sparsification๊ณผ MHA๋กœ foothold ๊ด€๋ จ geometry ์ธ์ฝ”๋”ฉ

โ€ข ๋‘ branch์˜ ๊ฒฐ๊ณผ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ locomotion policy์— ์ œ๊ณต

โ€ข Reinforcement learning ํ”„๋ ˆ์ž„์›Œํฌ ๋‚ด์—์„œ end-to-end ํ•™์Šต
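
The data flow in the steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the CNN is omitted and `features` stands in for its per-cell outputs, the learned projections are replaced by two fixed query vectors, the local branch uses a single attention head where the paper uses MHA, and top-k selection is only one plausible reading of "state-conditioned sparsification". All function names are hypothetical.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def global_branch(features, pool_query):
    """Attention pooling: one query scores every cell, giving a map-wide summary."""
    scale = math.sqrt(len(pool_query))
    weights = softmax([dot(pool_query, f) / scale for f in features])
    return weighted_sum(weights, features)

def local_branch(features, state_query, k):
    """Assumed sparsification: keep the k cells scoring highest against a
    state-derived query, then attend over those cells only."""
    scale = math.sqrt(len(state_query))
    scores = [dot(state_query, f) / scale for f in features]
    keep = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    return weighted_sum(weights, [features[i] for i in keep]), keep

def glad_like_encode(features, pool_query, state_query, k):
    """Concatenate the global and local summaries into one terrain latent."""
    global_summary = global_branch(features, pool_query)
    local_summary, kept = local_branch(features, state_query, k)
    return global_summary + local_summary, kept

random.seed(0)
dim, cells = 4, 16  # hypothetical feature width and number of map cells
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(cells)]
pool_query = [random.gauss(0, 1) for _ in range(dim)]
state_query = [random.gauss(0, 1) for _ in range(dim)]
latent, kept = glad_like_encode(features, pool_query, state_query, k=5)
print(len(latent), sorted(kept))
```

In a trained system the queries and projections would be learned jointly with the policy, so the global query can specialize in context while the state query specializes in foothold cells.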

Originality

• The architecture explicitly separates global and local attention, which distinguishes it from prior approaches that mix both objectives in a single attention mechanism.

• Introduces state-conditioned local feature sparsification as a mechanism for improving efficiency.

• A coarse-to-fine encoder design that handles broad context and fine-grained geometry at the same time.

• Presents emergent navigation behavior as a natural outcome of RL training.
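
To make the efficiency argument for sparsification concrete, here is a back-of-the-envelope count of query-key dot products in one self-attention layer, before and after pruning. The grid size (32 x 32 cells) and the number of kept cells (64) are hypothetical values picked for the arithmetic, not figures from the paper, and `self_attention_scores` is an illustrative helper.

```python
def self_attention_scores(n_tokens, n_heads=1):
    """Query-key dot products in one self-attention layer: every token
    attends to every token, once per head."""
    return n_heads * n_tokens * n_tokens

grid_cells = 32 * 32  # hypothetical dense map: one token per cell
kept_cells = 64       # hypothetical number of cells kept after sparsification

dense = self_attention_scores(grid_cells)
sparse = self_attention_scores(kept_cells)
print(dense, sparse, dense // sparse)  # 1048576 4096 256
```

Because the cost grows quadratically in the number of tokens, keeping 1/16 of the cells cuts the score count by a factor of 256 in this example.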

Limitation & Further Study

โ€ข ๋…ผ๋ฌธ์€ terrain-specific curriculum์ด๋‚˜ expert policy์—์˜ ์˜์กด๋„๋Š” ๊ฐ์†Œ์‹œํ‚ค์ง€๋งŒ, elevation map ๊ธฐ๋ฐ˜ ์ ‘๊ทผ์ด ๊ฐ€์ง€๋Š” ๋ณธ์งˆ์  ํ•œ๊ณ„(์˜ˆ: occluded region์˜ ์ฒ˜๋ฆฌ)์— ๋Œ€ํ•œ ๋…ผ์˜ ๋ถ€์กฑ.

โ€ข Zero-shot sim-to-real transfer์˜ ์„ฑ๊ณต์ด ๋ณด๊ณ ๋˜์—ˆ์œผ๋‚˜, ์‹คํŒจ ์‚ฌ๋ก€๋‚˜ ํ•œ๊ณ„ ํ™˜๊ฒฝ์— ๋Œ€ํ•œ ๋ถ„์„ ๋ฏธํก.

โ€ข Global๊ณผ local attention branch์˜ ์ƒ๋Œ€์  ๊ธฐ์—ฌ๋„ ๋ถ„์„์ด๋‚˜ ๊ฐ branch์˜ ํ•„์ˆ˜์„ฑ์— ๋Œ€ํ•œ ablation study ๊ฒฐ๊ณผ๊ฐ€ ์ œ์‹œ๋˜์ง€ ์•Š์Œ.

โ€ข ๋‹ค๋ฅธ terrain encoder (AME, AME-2 ๋“ฑ)๊ณผ์˜ ์ง์ ‘์ ์ธ ์ •๋Ÿ‰์  ๋น„๊ต ํ‰๊ฐ€ ๋ถ€์žฌ.

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ ์ธ๊ฐ„ํ˜• ๋กœ๋ด‡์˜ sparse-foothold ๋ณดํ–‰์„ ์œ„ํ•ด attention mechanism์˜ ์—ญํ• ์„ ๋ช…์‹œ์ ์œผ๋กœ ๋ถ„๋ฆฌํ•˜๋Š” GLAD๋ฅผ ์ œ์•ˆํ•˜๋ฉฐ, ์ด๋ก ์  ๋™๊ธฐ๋ถ€์—ฌ๊ฐ€ ๋ช…ํ™•ํ•˜๊ณ  ์‹ค์ œ ๋กœ๋ด‡ ๋ฐฐํฌ์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ๋‹ค๋Š” ์ ์—์„œ ์˜๋ฏธ ์žˆ๋Š” ๊ธฐ์—ฌ๋ฅผ ํ•œ๋‹ค. ๋‹ค๋งŒ, ๋” ์ฒ ์ €ํ•œ ablation study์™€ ๊ธฐ์กด ๋ฐฉ๋ฒ•๊ณผ์˜ ์ •๋Ÿ‰์  ๋น„๊ต๊ฐ€ ๋ณด์ถฉ๋˜๋ฉด ๋”์šฑ ๊ฐ•๋ ฅํ•œ ๋…ผ๋ฌธ์ด ๋  ๊ฒƒ์ด๋‹ค.
