Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild

Authors: Derek Ming Siang Tan, , Boyang Liu, Alok Raj, Qi Xuan Ang, Weiheng Dai, Tanishq Duhan, Jimmy Chiun, Yuhong Cao, Florian Shkurti, Guillaume Sartoretti | Date: 2025-05-16 | URL: https://arxiv.org/abs/2505.11350


Essence

Search-TTA is a multimodal test-time adaptation framework that uses satellite imagery and in-situ sensor measurements to refine the predictions of a VLM (Vision Language Model) in real time, improving outdoor robotic visual search performance by up to 30%.
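As a rough illustration of the test-time adaptation idea, the sketch below updates a model's parameters from a few in-situ measurements so that the predicted search map changes everywhere, not only at the visited cells. Everything here is an assumption made for illustration: a toy logistic model stands in for the VLM, the per-cell `features` array stands in for satellite-image embeddings, and the binary cross-entropy loss, the learning rate, and the function names `predict_map` and `adapt_step` are hypothetical and are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_map(features, w):
    """Per-cell target probability from a toy logistic model.

    features: (H, W, D) array of per-cell features (stand-in for image embeddings).
    w:        (D,) weight vector (stand-in for the adaptable model parameters).
    """
    return sigmoid(features @ w)


def adapt_step(features, w, visited, found, lr=0.5):
    """One gradient step on binary cross-entropy over the visited cells.

    visited: list of (row, col) cells the robot has measured.
    found:   list of bools, True if a target was detected in that cell.
    """
    rows, cols = zip(*visited)
    x = features[list(rows), list(cols)]   # (N, D) features of visited cells
    y = np.asarray(found, dtype=float)     # (N,) measurement labels
    p = sigmoid(x @ w)                     # current predictions at those cells
    grad = x.T @ (p - y) / len(y)          # gradient of the mean BCE w.r.t. w
    return w - lr * grad


# Hypothetical usage on random features (no real data involved).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 3))
w = np.zeros(3)
before = predict_map(features, w)
w = adapt_step(features, w, visited=[(0, 0), (1, 2)], found=[True, False])
after = predict_map(features, w)
```

After the step, the cell where a target was found scores higher than the cell where none was found, and unvisited cells with similar features shift in the same direction, which is the intuition behind adapting the model rather than only editing the map.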

Motivation

Achievement

Figure 5: Multimodal Alignment

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: Search-TTA is an innovative framework that corrects VLM errors online during outdoor visual search. Together with the large-scale AVS-Bench dataset, it demonstrates multimodal adaptation and the feasibility of real-world deployment. The work would be more complete with full field validation and a theoretical analysis.
