Thinking in 360°: Humanoid Visual Search in the Wild

Authors: Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li | Date: 2025-11-25 | URL: https://arxiv.org/abs/2511.20351


Essence

Figure 1. We pose a fundamental question: can an AI agent actively search for objects or paths in a 3D world like a huma

The paper proposes an embodied visual search agent that, like a human, rotates its head within a 360° panoramic environment to actively search for objects or find paths. It also builds the H*Bench benchmark, which goes beyond indoor scenes to cover complex real-world environments such as subway stations, shopping malls, and streets.
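To make the idea of searching by head rotation concrete, here is a minimal sketch of how a viewing direction inside a 360° panorama could be tracked. It is not taken from the paper: the `HeadPose` class, the `rotate` and `yaw_error` helpers, the degree units, and the pitch limit of ±90° are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadPose:
    """Viewing direction in degrees (hypothetical representation)."""
    yaw: float = 0.0    # horizontal angle, wraps around the full panorama
    pitch: float = 0.0  # vertical angle, clamped to an assumed [-90, 90] range


def rotate(pose: HeadPose, d_yaw: float, d_pitch: float) -> HeadPose:
    """Apply one relative head rotation and return the new pose."""
    yaw = (pose.yaw + d_yaw) % 360.0                      # wrap horizontally
    pitch = max(-90.0, min(90.0, pose.pitch + d_pitch))   # clamp vertically
    return HeadPose(yaw, pitch)


def yaw_error(pose: HeadPose, target_yaw: float) -> float:
    """Smallest absolute horizontal angle between the view and a target."""
    diff = abs(pose.yaw - target_yaw) % 360.0
    return min(diff, 360.0 - diff)
```

Under these assumptions, turning right by 20° from a yaw of 350° wraps around to 10°, and a target at 350° is then 20° away rather than 340°, which is why the angular distance is taken on the circle.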

Motivation

Achievement

Figure 4. Comparison of In-task (train and test on the same task family) and Cross-task (train on one task family and te

How

Figure 2. Pipeline Illustration. Stage 1 (SFT) provides the foundational ability to map perspective images to plausible

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: By defining humanoid visual search as a new embodied AI problem and presenting the realistic, challenging H*Bench benchmark, the paper lays a foundation for systematically evaluating the spatial reasoning ability of MLLM-based agents. It shows performance gains from SFT and RL while also clearly identifying the large challenges that remain, which makes it a high-value study.
