CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

Authors: Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsiani, Deva Ramanan | Date: 2025-12-16 | URL: https://arxiv.org/abs/2512.14696


Essence

Figure 2: CRISP pipeline. Given a casual RGB video (left), CRISP reconstructs scene geometry

The paper proposes a real-to-sim pipeline that produces a physically simulatable human-scene reconstruction from monocular video by recovering scene geometry as planar primitives and estimating human motion.

Motivation

Achievement

Figure 4: Qualitative comparison. We compare VideoMimic with CRISP (ours) on six sequences

How

Figure 3: Planar fitting. Given per-frame pointmaps from a visual SLAM system, we (1) produce
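
The caption above is truncated, so the exact fitting procedure is not recoverable from this page. As a generic illustration of fitting a single planar primitive to a noisy point cloud, here is a minimal RANSAC sketch in pure Python. It is not CRISP's actual method. The function names, the 2 cm inlier threshold, and the iteration count are all hypothetical choices made for this example.

```python
import math
import random


def plane_from_points(p0, p1, p2):
    """Return (unit normal n, offset d) with n.x + d = 0, or None if the points are collinear."""
    u = [p1[i] - p0[i] for i in range(3)]
    v = [p2[i] - p0[i] for i in range(3)]
    # The cross product u x v is perpendicular to the plane through the three points.
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = math.sqrt(sum(c * c for c in n))
    if norm < 1e-9:
        return None  # degenerate (collinear) sample
    n = [c / norm for c in n]
    d = -sum(n[i] * p0[i] for i in range(3))
    return n, d


def point_plane_distance(p, n, d):
    """Unsigned distance from point p to the plane n.x + d = 0 (n must be unit length)."""
    return abs(sum(n[i] * p[i] for i in range(3)) + d)


def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Illustrative RANSAC: keep the 3-point plane hypothesis with the most inliers.

    threshold, iterations, and seed are assumed values for this sketch only.
    """
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [p for p in points if point_plane_distance(p, n, d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

Calling `ransac_plane(points)` on a list of `(x, y, z)` tuples returns the best plane as `(normal, offset)` together with its inlier points. To extract several planes from one scene, one could remove the inliers and repeat. That extension is likewise an assumption of this sketch, not something stated in the source.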

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: CRISP is a simple yet effective real-to-sim pipeline built on planar primitives. It addresses a fundamental problem of prior human-scene reconstruction (simulation incompatibility) through physics-based validation, and its substantial empirical improvement and in-the-wild generalization make a practical contribution to embodied AI.
