V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Authors: Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, , , Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas | Date: 2025-06-11 | URL: https://arxiv.org/abs/2506.09985


Essence

Figure 1 V-JEPA 2 Overview. Leveraging 1M hours of internet-scale video and 1M images, we pretrain the V-JEPA 2

V-JEPA 2 is a self-supervised video model pretrained on more than 1 million hours of internet-scale video; it supports video understanding, prediction, and robot planning.

Motivation

Achievement

How

Figure 2 Multistage training. (Left) We first pretrain the V-JEPA 2 video encoder on internet-scale image and

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: V-JEPA 2๋Š” ์ธํ„ฐ๋„ท ๊ทœ๋ชจ ์ž๊ธฐ์ง€๋„ํ•™์Šต๊ณผ ์ตœ์†Œํ•œ์˜ ๋กœ๋ด‡ ์ƒํ˜ธ์ž‘์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ๋น„๋””์˜ค ์ดํ•ด, ์˜ˆ์ธก, ์‹ค์ œ ๋กœ๋ด‡ ๊ณ„ํš์„ ๋ชจ๋‘ ๋‹ฌ์„ฑํ•œ ํš๊ธฐ์  ์—ฐ๊ตฌ๋กœ, ์„ธ๊ณ„ ๋ชจ๋ธ ๊ธฐ๋ฐ˜ ์ผ๋ฐ˜ ์—์ด์ „ํŠธ ๊ฐœ๋ฐœ์˜ ์ƒˆ๋กœ์šด ๋ฐฉํ–ฅ์„ ์ œ์‹œํ•œ๋‹ค.
