Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Authors: Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh | Date: 2024-02-12 | URL: https://arxiv.org/abs/2402.07865


Essence

Figure 1

Figure 1. Prismatic VLMs. Through rigorous experiments ex-*

The paper systematically explores the design space of Visually-Conditioned Language Models (VLMs) and analyzes how key design decisions affect model performance. It presents a standardized evaluation suite, optimized training code, and Prismatic VLMs, a family of models that outperform InstructBLIP and LLaVa v1.5.

Motivation

Achievement

How

Figure 2

Figure 2. Exploring VLM Design Axes. We explore four key design axes for developing VLMs: 1) optimization procedure, 2)

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 4/5 Clarity: 4/5 Overall: 4/5

Overall: This paper is the first comprehensive study to systematically explore the VLM design space. By presenting a standardized evaluation framework, optimized training code, and strong-performing models, it lays a foundation for VLM development. The released resources and clear insights are an important contribution that could substantially accelerate follow-up research.

← ๋ชฉ๋ก์œผ๋กœ ๋Œ์•„๊ฐ€๊ธฐ

๐ŸŽง Audio Overview

์ด ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ฅผ ํŒŸ์บ์ŠคํŠธํ˜• ์˜ค๋””์˜ค๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (Gemini ยท ํ‚ค๋Š” ๋ธŒ๋ผ์šฐ์ €์—๋งŒ ์ €์žฅ ยท ์™„์„ฑ๋ณธ์€ ์ด๋ฉ”์ผ๋กœ๋„ ์ „์†ก)
โ–ธ ๊ณ ๊ธ‰: ๊ตฌ์„ฑ ๋ฐฉํ–ฅ(๋Œ€๋ณธ ์ž‘์„ฑ ์ง€์นจ) ์ง์ ‘ ์ˆ˜์ •