Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Authors: Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang | Date: 2023-03-09 | URL: https://arxiv.org/abs/2303.05499


Essence

Figure 3

Fig. 3: The framework of Grounding DINO. We present the overall framework, a feature

Grounding DINO combines the Transformer-based detector DINO with grounded pre-training to produce an open-set object detector that can detect arbitrary objects specified by language input (category names or referring expressions). The key idea is to fuse the language and vision modalities tightly at three stages: the feature enhancer, language-guided query selection, and the cross-modality decoder.
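To make the middle stage concrete, here is a minimal sketch of language-guided query selection. It assumes that each image token is scored by its highest dot-product similarity to any text token and that the top `num_query` tokens are kept as decoder queries. This review does not spell out the mechanism, so the function name `select_queries`, the toy feature vectors, and the scoring rule should all be read as illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence


def select_queries(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    num_query: int,
) -> List[int]:
    """Return indices of the image tokens that best match any text token.

    Hypothetical helper: scores each image token by its maximum
    dot-product similarity over all text tokens, then keeps the
    num_query highest-scoring tokens.
    """
    scores = []
    for img in image_feats:
        # Similarity of this image token to every text token.
        sims = [sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        # Keep only the best-matching text token's similarity.
        scores.append(max(sims))
    # Rank image-token indices by descending score and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_query]


# Toy example: 4 image tokens and 2 text tokens, each of dimension 2.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.9, 0.1]]
selected = select_queries(image_feats, text_feats, num_query=2)
print(selected)  # [0, 2]
```

In a real detector the features would be batched tensors and the selection would run as a matrix product followed by a top-k operation; the pure-Python loops here only illustrate the selection logic.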

Motivation

Achievement

How

Figure 2

Fig. 2: Extending closed-set detectors to open-set scenarios.

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: By exploiting the structural advantage of a Transformer-based detector to implement tight language-vision fusion at all three stages, Grounding DINO established a new SOTA in open-set object detection. Its comprehensive benchmark evaluation and practical application examples demonstrate high research value.
