Authors: Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut | Date: 2023-05-29 | URL: https://arxiv.org/abs/2305.18565
Figure 1: [Left] Comparing PaLI-X against PaLI on image-captioning and VQA benchmarks. [Right]
PaLI-X is a multilingual vision-language model that scales its vision and language components in a balanced way. It sets a new state of the art on more than 25 benchmarks and shows new capabilities such as complex counting and multilingual object detection.
Figure 4: Visual input for videos: each frame is independently processed by ViT; patch embeddings
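Overall assessment: PaLI-X is a significant piece of work. By scaling a very large vision-language model in a balanced way, it reaches state-of-the-art performance across a wide range of tasks and exhibits new emergent capabilities. That said, the paper would be stronger with a discussion of the practical deployment constraints imposed by the model's size and a deeper analysis of the mechanisms behind emergence.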