Vision in Action: Learning Active Perception from Human Demonstrations

Authors: Haoyu Xiong, Xiaomeng Xu, Jimmy Wu, Yifan Hou, Jeannette Bohg, Shuran Song | Date: 2025-06-18 | URL: https://arxiv.org/abs/2506.15666


Essence

Figure 1: Vision in Action (ViA) uses an active head

ViA is a system that improves the performance of a bimanual manipulation robot by learning human active perception strategies directly from demonstrations, collected through a 6-DoF robot neck and a VR teleoperation interface.

Motivation

Achievement

Figure 5: Policy Learning Camera Setup Comparison Results. We report stage-wise success rates across

How

Figure 2: VR Teleoperation Comparison. [Left] Traditional RGB streaming suffers from motion-to-photon

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: ViA is an innovative system that effectively integrates active perception, VR teleoperation, and bimanual manipulation. Its use of an intermediate 3D representation to address latency and its shared observation space concept are particularly creative, and it achieves substantial performance gains on complex real-world tasks involving visual occlusion.
