BioMedLM: A 2.7B Parameter Language Model Trained on Biomedical Text

์ €์ž: Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong | ๋‚ ์งœ: 2024 | DOI: arXiv:2403.18421


Essence

Figure 1: Train and Validation Loss after 100k Batches

๋ณธ ๋…ผ๋ฌธ์€ ์ƒ์˜ํ•™ ๋ถ„์•ผ ํŠนํ™” 2.7B ํŒŒ๋ผ๋ฏธํ„ฐ GPT ์Šคํƒ€์ผ ์–ธ์–ด๋ชจ๋ธ์ธ BioMedLM์„ ์ œ์‹œํ•œ๋‹ค. PubMed ์ถ”์ƒ ๋ฐ ์ „๋ฌธ ๋…ผ๋ฌธ์œผ๋กœ ํ•™์Šต๋œ ์ด ๋ชจ๋ธ์€ ๋Œ€๊ทœ๋ชจ ์ผ๋ฐ˜ ๋ชจ๋ธ(GPT-4, Med-PaLM 2)๊ณผ ๊ฒฝ์Ÿ ๊ฐ€๋Šฅํ•œ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•˜๋ฉด์„œ๋„ ๊ฐœ์ธ์ •๋ณด ๋ณดํ˜ธ, ๋น„์šฉ ํšจ์œจ์„ฑ, ํˆฌ๋ช…์„ฑ์„ ๊ฐ–์ถ˜ ๋Œ€์•ˆ์„ ์ œ๊ณตํ•œ๋‹ค.

Motivation

Achievement

How

Originality

Limitation & Further Study

Evaluation

Novelty: 3/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: BioMedLM์€ ์‹ค์šฉ์ ์ด๊ณ  ์ค‘์š”ํ•œ ๊ธฐ์—ฌ๋ฅผ ํ•˜๋Š” ์ž˜ ์‹คํ–‰๋œ ์—ฐ๊ตฌ์ด๋‹ค. ํˆฌ๋ช…์„ฑ, ๊ฐœ์ธ์ •๋ณด ๋ณดํ˜ธ, ๊ฒฝ์ œ์„ฑ์„ ๊ฐ–์ถ˜ ์ค‘์†Œ ๊ทœ๋ชจ ์ƒ์˜ํ•™ ํŠนํ™” ๋ชจ๋ธ์„ ์ œ์‹œํ•˜์—ฌ ์˜๋ฃŒ ๊ธฐ๊ด€์˜ ์‹ค์ œ ์ˆ˜์š”๋ฅผ ํ•ด๊ฒฐํ•œ๋‹ค. ํ‰๊ฐ€๊ฐ€ ์ฒด๊ณ„์ ์ด๊ณ  ๊ฒฐ๊ณผ๊ฐ€ ์„ค๋“๋ ฅ ์žˆ์œผ๋‚˜, ์•„ํ‚คํ…์ฒ˜ ํ˜์‹ ์€ ์ œํ•œ์ ์ด๊ณ  ์ตœ์‹  ๊ธฐ๋ฒ•๋“ค์ด ๋ฏธ์ ์šฉ๋˜์—ˆ๋‹ค. ๋„๋ฉ”์ธ ํŠนํ™”์˜ ์‹ค์งˆ์  ๊ฐ€์น˜๋ฅผ ๋ช…ํ™•ํžˆ ๋ณด์—ฌ์ฃผ๋Š” ์ข‹์€ ์‹ค์ฆ ์—ฐ๊ตฌ์ด๋‹ค.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

Foundational work
SciBERT is likewise a language model built on scientific text, and it is a major precedent for BioMedLM's domain-specific LLM approach.
Foundational work
BioBERT is an earlier specialized biomedical pretrained model that laid the groundwork in basic language representation on which BioMedLM's development builds.
Foundational work
A survey that comprehensively summarizes the architectures and applications of large language models specialized for the biological and clinical domains.
Foundational work
167 (BioMedLM) explains the principles for building LLMs specialized for the biomedical domain, which provides a basis for MEIsensor (3164) to use domain-relevant data and specialized representation learning.
Alternative approach
In contrast to BioMedLM, explores how to use LLMs in real clinical QA and collaboration scenarios.
Alternative approach
The "Foundation models in bioinformatics" paper is a review that compares the overall performance and broader impact of biology-specialized large models such as BioMedLM.
Application
The MedBioLM paper presents a model that, like BioMedLM, is optimized for QA performance, and it shows how such a model is applied to real biomedical question answering.
← ๋ชฉ๋ก์œผ๋กœ ๋Œ์•„๊ฐ€๊ธฐ

๐ŸŽง Audio Overview

์ด ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ฅผ ํŒŸ์บ์ŠคํŠธํ˜• ์˜ค๋””์˜ค๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (Gemini ยท ํ‚ค๋Š” ๋ธŒ๋ผ์šฐ์ €์—๋งŒ ์ €์žฅ ยท ์™„์„ฑ๋ณธ์€ ์ด๋ฉ”์ผ๋กœ๋„ ์ „์†ก)
โ–ธ ๊ณ ๊ธ‰: ๊ตฌ์„ฑ ๋ฐฉํ–ฅ(๋Œ€๋ณธ ์ž‘์„ฑ ์ง€์นจ) ์ง์ ‘ ์ˆ˜์ •