Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations

Authors: Irmak Guzey, Haozhi Qi, Julen Urain, Changhao Wang, Jessica Yin, Krishna Bodduluri, Mike Lambeta, Lerrel Pinto, Akshara Rai, Jitendra Malik, Tingfan Wu, Akash Sharma, Homanga Bharadhwaj | Date: 2025-11-20 | URL: https://arxiv.org/abs/2511.16661


Essence

Fig. 1: AINA is a framework for learning multi-fingered policies from in-the-wild human data collected with smart glasse

The paper proposes AINA, a framework that learns multi-fingered robot manipulation policies solely from in-the-wild human videos collected with Aria Gen 2 smart glasses. It produces 3D point-based policies that can be deployed directly, without any robot data or simulation.

Motivation

Achievement

Figure 3

Fig. 3: Comparison of AINA's capabilities with some prior human-to-robot learning frameworks. In-The-Wild indicates whet

How

Figure 4

Fig. 4: Illustration of our overall AINA framework. On the left, we show how the data is processed: the human hand pose

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper makes creative use of the advanced sensing capabilities of smart glasses to present a practical and scalable approach to learning multi-fingered robot manipulation policies from human video alone. With strong empirical results and a clear methodology, it makes substantial progress in human-to-robot imitation learning and marks an important step toward large-scale practical deployment of robot manipulation.
