Authors: Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsiani, Deva Ramanan | Date: 2025-12-16 | URL: https://arxiv.org/abs/2512.14696
Figure 2: CRISP pipeline. Given a casual RGB video (left), CRISP reconstructs scene geometry
The paper proposes a real-to-sim pipeline that performs physically simulatable human-scene reconstruction from a single video, by recovering scene geometry as planar primitives and estimating human motion.
Figure 4: Qualitative comparison. We compare VideoMimic with CRISP (ours) on six sequences
Figure 3: Planar fitting. Given per-frame pointmaps from a visual SLAM system, we (1) produce
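To make the idea of fitting a planar primitive to pointmap points concrete, here is a minimal sketch using generic RANSAC plane fitting. It is an illustrative stand-in, not the procedure from the paper: the caption above is truncated, so the actual steps are not known here. The function names (`plane_from_points`, `ransac_plane`), the thresholds, and the synthetic data are all hypothetical.

```python
import random


def plane_from_points(p, q, r):
    """Return (unit_normal, d) for the plane n.x + d = 0 through p, q, r.

    Returns None when the three points are (nearly) collinear.
    """
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    # Cross product of the two edge vectors gives the plane normal.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (nx * nx + ny * ny + nz * nz) ** 0.5
    if norm < 1e-12:
        return None
    n = (nx / norm, ny / norm, nz / norm)
    d = -(n[0] * p[0] + n[1] * p[1] + n[2] * p[2])
    return n, d


def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Generic RANSAC: keep the candidate plane with the most inliers.

    `threshold` is the maximum point-to-plane distance for an inlier,
    in the same units as the points (an assumed value, not from the paper).
    """
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [
            pt for pt in points
            if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] + d) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic stand-in for pointmap points: a noisy floor near z = 0 plus clutter.
rng = random.Random(1)
floor = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-0.005, 0.005))
         for _ in range(200)]
clutter = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0.2, 1.0))
           for _ in range(40)]
plane, inliers = ransac_plane(floor + clutter)
normal, offset = plane
print("normal:", normal, "offset:", offset, "inliers:", len(inliers))
```

In this toy setup the recovered normal should point close to the z-axis, and the inlier set should consist of the floor points. A real pipeline would repeat such fitting per surface and then bound each plane to a finite primitive, which this sketch does not attempt.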
Overall assessment: CRISP is a simple yet effective real-to-sim pipeline built on planar primitives. It addresses a fundamental problem of prior human-scene reconstruction, namely simulation incompatibility, through physics-based validation, and it makes a practical contribution to embodied AI through substantial empirical improvement and in-the-wild generalization.