From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Authors: Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, Jianye Hao | Date: 2025-05-13 | URL: https://arxiv.org/abs/2505.08548


Essence

Figure 1: Overview of FSD. FSD unlocks visual aids reasoning and generation through Spatial Relationship

FSD is a model that augments a Vision-Language Model with the generation of intermediate representations (visual aids) through spatial relationship reasoning, substantially improving zero-shot generalization in robotic manipulation.

Motivation

Achievement

How

Figure 3: Inspired by the process of human reasoning, FSD uses a spatial relationship graph as an anchor to derive

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: FSD๋Š” spatial reasoning์„ ํ†ตํ•œ visual aids ์ƒ์„ฑ์œผ๋กœ ๋กœ๋ด‡ ์กฐ์ž‘์˜ ์ผ๋ฐ˜ํ™” ๋ฌธ์ œ๋ฅผ ์ฐฝ์˜์ ์œผ๋กœ ํ•ด๊ฒฐํ•˜๋ฉฐ, ๋‹ค์–‘ํ•œ ๋ฒค์น˜๋งˆํฌ์™€ ์‹ค์ œ ๋กœ๋ด‡ ํ™˜๊ฒฝ์—์„œ ๊ฒ€์ฆ๋œ ์šฐ์ˆ˜ํ•œ ์„ฑ๊ณผ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. ICLR 2026 ๋ฐœํ‘œ ๋…ผ๋ฌธ์œผ๋กœ์„œ embodied AI์˜ ์ค‘์š”ํ•œ ์ง„์ „์„ ์ œ์‹œํ•œ๋‹ค.
