Unified Video Action Model

Authors: Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song | Date: 2025-02-28 | URL: https://arxiv.org/abs/2503.00200


Essence

Figure 1

Fig. 1: Unified Video Action Model. (a) UVA features a joint video-action latent representation and decoupled video-acti

UVA is a model that learns video generation and action prediction jointly. A shared latent representation combined with decoupled diffusion heads lets it achieve high accuracy and fast inference at the same time.

Motivation

Achievement


How

Figure 2

Fig. 2: Network Architecture. Given historical observations {O_{t-h+1}, ..., O_t} and corresponding action chunks {A_{t-h},
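
The Fig. 1 caption describes a joint video-action latent with decoupled decoding. The toy sketch below illustrates only that structural idea: one shared encoder feeds two independent heads, so an action can be decoded without running the video head. It is not the authors' implementation. The class name `ToyUnifiedModel`, the dimensions, and the random linear maps standing in for the real encoder and diffusion heads are all assumptions made for illustration.

```python
import random

random.seed(0)

# Hypothetical sizes, chosen only for illustration.
LATENT_DIM = 8
ACTION_DIM = 2
FRAME_DIM = 16


def make_linear(in_dim, out_dim):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    return [[random.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, x):
    """Matrix-vector product, standing in for a learned network."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyUnifiedModel:
    """Toy stand-in: one shared encoder, two independent decoding heads."""

    def __init__(self, obs_dim):
        self.encoder = make_linear(obs_dim, LATENT_DIM)
        # Linear placeholders; the reviewed model uses diffusion heads here.
        self.action_head = make_linear(LATENT_DIM, ACTION_DIM)
        self.video_head = make_linear(LATENT_DIM, FRAME_DIM)
        self.video_head_calls = 0  # counts how often video decoding runs

    def encode(self, observation):
        """Map an observation to the shared latent."""
        return apply_linear(self.encoder, observation)

    def predict_action(self, observation):
        """Decode an action from the shared latent; the video head is not used."""
        return apply_linear(self.action_head, self.encode(observation))

    def predict_video(self, observation):
        """Decode a (flattened, toy) future frame from the same latent."""
        self.video_head_calls += 1
        return apply_linear(self.video_head, self.encode(observation))


model = ToyUnifiedModel(obs_dim=4)
obs = [0.1, -0.2, 0.3, 0.4]
action = model.predict_action(obs)
frame = model.predict_video(obs)
print(len(action), len(frame))
```

Because `predict_action` never touches `video_head`, a policy loop that only needs actions pays for the encoder and the action head alone. That is the intuition behind the fast-inference claim in the summary.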

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: UVA effectively resolves the long-standing trade-off between video learning and action learning through a unified latent representation and decoupled decoding. Its masked training makes the model usable for multiple purposes, which substantially improves the practicality of robot learning frameworks.
