SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning

Authors: Zhishen Yang, Raj Dabre, Hideki Tanaka, Naoaki Okazaki | Date: 2023 | DOI: 10.48550/ARXIV.2306.03491


Essence

Figure 1

Figure 1: An example figure (Zhang et al. 2019) with its cap-

This paper redefines the problem of generating captions for scientific figures in scholarly papers. It recasts the existing figure-to-caption generation task as a multimodal summarization task and builds the SciCap+ dataset, which includes mention-paragraphs and OCR tokens, to demonstrate the importance of using background knowledge.
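
To make the multimodal summarization framing concrete, here is a minimal sketch of how one figure and its textual knowledge sources might be represented and combined. The class name `FigureExample`, its field names, the `build_text_context` helper, and the ` [SEP] ` separator are all hypothetical choices for illustration. They are not the SciCap+ schema or the authors' pipeline, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FigureExample:
    """One hypothetical SciCap+-style record (field names are illustrative)."""
    figure_path: str        # path to the figure image
    mention_paragraph: str  # paragraph in the paper that mentions the figure
    ocr_tokens: List[str] = field(default_factory=list)  # text recognized inside the figure
    caption: str = ""       # target caption, i.e. the "summary" to generate


def build_text_context(example: FigureExample, sep: str = " [SEP] ") -> str:
    """Join the textual knowledge sources into one string.

    Plain concatenation is an assumption made for illustration only.
    """
    ocr_text = " ".join(example.ocr_tokens)
    return example.mention_paragraph + sep + ocr_text


# Invented sample values, not taken from the dataset.
example = FigureExample(
    figure_path="figures/example.png",
    mention_paragraph="Figure 3 shows accuracy as the training set grows.",
    ocr_tokens=["accuracy", "training", "size"],
    caption="Accuracy versus training set size.",
)
context = build_text_context(example)
print(context)
```

In this framing the figure image, the mention-paragraph, and the OCR tokens together are the input to be summarized, and the caption is the target summary.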

Motivation

Achievement

Figure 2

Figure 2: The overall workflow of the data augmentation for creating SciCap+ dataset. For each figure in SciCap+, we ext

Effect of mention-paragraphs and OCR: Adding the mention-paragraph substantially improves automatic evaluation scores such as BLEU, METEOR, and CIDEr over the figure-only baseline.

Value of multimodal knowledge: Knowledge embedded in different modalities was found to substantially improve caption generation performance.

Challenge analysis through human evaluation: Model-generated captions were about as informative as human-written ones, and even humans who could consult the mention-paragraph still found it difficult to write the ground-truth caption.

Dataset contribution: The authors released SciCap+, a large-scale dataset (414k figures) that includes multimodal information.

How

Figure 2

Figure 2: The overall workflow of the data augmentation for creating SciCap+ dataset. For each figure in SciCap+, we ext

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper is a meaningful study that recasts scientific figure captioning as a multimodal summarization problem and systematically demonstrates the importance of background knowledge through the SciCap+ dataset. By combining automatic and human evaluation, it makes the difficulty of the problem clear, and the release of a large-scale dataset should contribute substantially to follow-up research. However, the limited range of figure types and the use of a single baseline leave room for improvement.

Related Papers Worth Reading

Foundational work
The SciCap paper (708), which established the basic baseline and problem setup for figure caption generation in scientific papers, is the foundation for 709, its knowledge-augmented version.
Foundational work
709 designs a knowledge-augmented dataset for caption generation and so provides groundwork for domain-specific caption generation research, such as the patent-focused work in 605.
Alternative approach
564 takes an approach in which multiple LLMs collaborate to generate captions for scientific figures, and is worth reading in comparison with the multimodal context used in 709.
Follow-up work
709 attempts to extend the SciCap dataset and caption generation with multimodal context, making it a follow-up development of 708.
Follow-up work
Paper 338, which explores text-based figure references and automatic inference of figure captions, takes the multimodal and contextual enrichment introduced in 709 one step further.
Application
Paper 605 applies the scientific image-text alignment addressed in 709 to a real engineering domain: automatic caption generation for patent drawings.
Application
853 analyzes how actual paper authors use AI-generated figure captions, showing what impact models like those in 709 have in the real world.
Application
PaperBanana (601) is an automatic graphics generation tool for writing AI papers, and shows a practical application of figure captioning and multimodal datasets.