Improving health question answering with reliable and time-aware evidence retrieval

Authors: Juraj Vladika, Florian Matthes (Technical University of Munich) | Date: 2024 | DOI: 10.48550/arXiv.2404.08359


Essence

Figure 1: The question-answering system used in our

This paper systematically analyzes how the quality and quantity of retrieved evidence affect QA performance in open-domain health question answering systems. Using 20 million biomedical papers from PubMed as the knowledge base, it experimentally evaluates how retrieval choices such as the number of documents, publication year, and citation count affect final QA performance.

Motivation

Achievement

• Optimizing the number of retrieved documents: reducing the number of documents improves performance by up to 10%.

• Time-aware retrieval: prioritizing recently published and highly cited documents improves QA performance.

• Large-scale evidence corpus: open-domain health QA is evaluated against 20 million PubMed papers.

• Qualitative analysis: identifies practical problems such as evidence disagreement and suggests directions for future research.

How

• Experiments are run on three health/biomedical question datasets (questions paired with yes/no answers)

• The full PubMed corpus is indexed as the knowledge base

• The number of retrieved documents (1 to 100) and the number of extracted sentences are treated as variables

• Documents are filtered and reranked by publication year and citation count

• Precision, recall, and macro F1 are used as evaluation metrics

• The reader module is held fixed and only the retrieval settings are varied
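The filtering and top-k steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Doc` fields, the `select_evidence` function, and its thresholds are hypothetical names introduced here, and the retriever score is assumed to be already computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    pmid: str        # hypothetical document identifier
    score: float     # similarity score from the retriever (assumed given)
    year: int        # publication year
    citations: int   # citation count


def select_evidence(
    docs: List[Doc],
    k: int = 10,
    min_year: Optional[int] = None,
    min_citations: Optional[int] = None,
) -> List[Doc]:
    """Drop documents below the year/citation thresholds, then keep the k best by score."""
    kept = [
        d for d in docs
        if (min_year is None or d.year >= min_year)
        and (min_citations is None or d.citations >= min_citations)
    ]
    # Rank the surviving documents by retriever score, highest first.
    kept.sort(key=lambda d: d.score, reverse=True)
    return kept[:k]


docs = [
    Doc("a", 0.9, 2005, 3),
    Doc("b", 0.8, 2020, 50),
    Doc("c", 0.7, 2018, 10),
    Doc("d", 0.6, 2022, 0),
]
recent = select_evidence(docs, k=2, min_year=2015)
print([d.pmid for d in recent])
```

Keeping the reader fixed and changing only `k`, `min_year`, or `min_citations` mirrors the experimental design described above, where retrieval settings are the only thing that varies.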

Originality

• First systematic exploration of the temporal aspect (publication year) for biomedical questions

• An experimental approach that searches for the optimal number of retrieved documents instead of fixing it

• An integrated analysis of evidence quality indicators such as citation count

• Uses the largest document collection, drawing on all 20 million PubMed papers
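Searching for the best number of retrieved documents, rather than fixing it, can be sketched as a sweep scored by macro F1. The code below is only an illustration under stated assumptions: `macro_f1` follows the standard definition for yes/no labels, while `best_k` and the `predict_with_k` callback (a stand-in for running the fixed reader on the top-k evidence) are hypothetical names, and the toy predictions are made up for the example.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = ("yes", "no")) -> float:
    """Unweighted mean of per-label F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def best_k(gold: Sequence[str],
           predict_with_k: Callable[[int], List[str]],
           candidate_ks: Iterable[int]) -> Tuple[int, Dict[int, float]]:
    """Score each candidate k with macro F1 and return the best k plus all scores."""
    scores = {k: macro_f1(gold, predict_with_k(k)) for k in candidate_ks}
    return max(scores, key=scores.get), scores


# Toy data: made-up predictions standing in for reader outputs at each k.
gold = ["yes", "no", "yes", "no"]
toy_predictions = {
    1: ["yes", "yes", "yes", "yes"],
    5: ["yes", "no", "yes", "no"],
    20: ["yes", "yes", "yes", "no"],
}
k_star, all_scores = best_k(gold, lambda k: toy_predictions[k], [1, 5, 20])
print(k_star, all_scores)
```

In practice the sweep would be run on a held-out split so that the chosen k is not tuned on the same questions used for the reported score.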

Limitation & Further Study

• Only three datasets are used, which limits how far the results generalize.

• Because gold evidence is provided in closed-domain QA, results may diverge from real open-domain performance.

• Evidence disagreement is only analyzed qualitatively; no method for resolving it is proposed.

• There is no sensitivity analysis of how the choice of reader module (a specific architecture) affects the results.

• Future work needs to develop mechanisms for generating user-friendly explanations.

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper is a study of high practical value that systematically evaluates the effect of retrieval strategies in health QA, showing that prioritizing recent and highly cited documents can improve QA performance by up to 10%. A deeper discussion of how well the results generalize and of how to resolve evidence disagreement would make the work more complete.

Related Papers

Foundational work
PubMedQA is a representative dataset for health question answering grounded in biomedical papers, and it forms the data basis for the evidence retrieval strategy study in paper 424.
Foundational work
The paper Large Language Models are Zero Shot Hypothesis Proposers provides a theoretical basis for the ability of LLMs to generate scientific queries and hypotheses, laying groundwork for improving QA that integrates reliability and temporal information.
Foundational work
An empirical study of how answer reliability changes with the evidence retrieval strategy in LLM-based QA systems; it provides data and results that underpin the design of the automated scientific claim verification framework in paper 500.
Alternative approach
PaperQA proposes a RAG-based technique for strengthening LLM evidence retrieval in scientific question answering, and can be compared with this paper's method, which is specialized for health QA.
Alternative approach
A related study that evaluates the performance of large language models on medical literature classification and screening.
Alternative approach
DEFAME addresses multimodal, evidence-based fact-checking of health and science questions, extending a problem similar to that of paper 424 to the multimodal setting.
Alternative approach
A similar study on information retrieval and reasoning in the biomedical domain that combines RAG with knowledge graphs.
Alternative approach
A similar study that examines the ability of LLMs to perform scholarly evaluation.
Follow-up work
Paper 500 proposes LLM-based RAG evidence retrieval with automatic identification of refuting and supporting evidence, extending paper 424's work on improving the reliability of scientific Q&A from the perspective of recent LLMs.
Application
The Sciclaimhunt paper presents a dataset for evidence-based scientific claim verification, which is useful for evaluating and applying the PubMed-based health question answering system emphasized in this paper.