Learning Versatile Humanoid Manipulation with Touch Dreaming

Authors: Yaru Niu, Zhenlong Fang, Binghong Chen, Shuai Zhou, Revanth Senthilkumaran, Hao Zhang, Bingqing Chen, Chen Qiu, H. Eric Tseng, Jonathan Francis, Ding Zhao | Date: 2026-04-14 | URL: https://arxiv.org/abs/2604.13015


Essence

Figure 1

Fig. 1: Our system enables versatile, contact-rich, and dexterous humanoid manipulation. A: long-horizon, multi-stage ma

For contact-rich manipulation with humanoid robots, the paper proposes VR teleoperation-based data collection together with the Humanoid Transformer with Touch Dreaming (HTD), which treats touch sensing as a core modality.

Motivation

Achievement

How

Figure 4

Fig. 4: HTD model architecture. HTD is a modular encoder–decoder Transformer. Left: modality tokenizers encode multi-vie
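This review does not spell out the Touch Dreaming objective; it only notes that the method predicts in latent space. As a rough illustration of what a latent-space prediction loss added to an action loss can look like, here is a minimal sketch. It is not the authors' implementation: the function names, the cosine-distance choice, and the `dream_weight` value are all hypothetical.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def latent_prediction_loss(pred_latents, target_latents):
    """Average cosine distance over a batch of latent vectors.

    pred_latents:   hypothetical predictions of future touch latents.
    target_latents: hypothetical embeddings of the touch actually observed
                    later; assumed to be treated as fixed targets.
    """
    distances = [cosine_distance(p, t) for p, t in zip(pred_latents, target_latents)]
    return sum(distances) / len(distances)


def total_loss(action_loss, dream_loss, dream_weight=0.1):
    """Combine the action loss with the auxiliary prediction term.

    dream_weight is an illustrative value, not one reported in the paper.
    """
    return action_loss + dream_weight * dream_loss


if __name__ == "__main__":
    predicted = [[1.0, 0.0], [0.0, 1.0]]
    observed = [[1.0, 0.0], [1.0, 0.0]]
    dream = latent_prediction_loss(predicted, observed)
    print(total_loss(action_loss=0.5, dream_loss=dream))
```

The point of the sketch is only that the auxiliary term compares predicted and observed touch in an embedding space rather than on raw sensor readings. The paper's actual targets, distance function, and weighting may differ.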

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper strongly demonstrates the feasibility of contact-rich humanoid manipulation through the Touch Dreaming technique, which treats touch as a core modality, and an integrated real-world data collection system. It achieves a 90.9% performance improvement across five diverse real-world tasks and is a high-quality study that clearly shows the effectiveness of latent-space prediction.
