BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

Authors: | Date: 2026-04-23 | URL: https://arxiv.org/abs/2604.21508


Essence

Figure 1

Figure 1. Overview of protein-ligand bioactivity extraction framework BIOMINER and benchmark BIOVISTA. (a) The whole

The paper proposes BIOMINER, a multimodal system that automatically extracts protein-ligand bioactivity data from the literature. It explicitly separates bioactivity semantic interpretation from ligand structure reconstruction, builds the BIOVISTA benchmark of 16,457 entries, and achieves an F1 of 0.32.

Motivation

Achievement

• BIOVISTA benchmark: 16,457 expert-curated bioactivity entries and 8,735 unique chemical structures from 500 papers; supports 6 evaluation tasks.
• Baseline performance: F1 of 0.32 on bioactivity triplet extraction.
• Transfer learning gains: a pretraining database built from 82,262 data points improves RMSE by 3.9% on PDBbind v2016 and CSAR-HiQ.
• HITL workflow: 1,592 NLRP3 bioactivity data points collected within 26 hours (2x ChEMBL), a 38.6% EF1% improvement for QSAR models, and 16 novel scaffold candidates identified.
• Structure annotation speedup: 5.59x faster than manual work on the PoseBusters dataset, with 96.25% accuracy (manual: 90.5%).
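The F1 and EF1% figures above can be read against the following minimal sketch. It assumes a triplet is a (protein, ligand, bioactivity) tuple scored by exact set match, and that EF1% is the usual enrichment factor at the top 1% of a ranked list. The matching rule and the helper names `triplet_f1` and `enrichment_factor` are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Iterable, Sequence, Tuple

# Assumed layout: (protein, ligand, bioactivity). The paper may match more loosely.
Triplet = Tuple[str, str, str]


def triplet_f1(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> float:
    """F1 over exact-match triplets (hypothetical scoring rule)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def enrichment_factor(scores: Sequence[float], is_active: Sequence[bool],
                      top_fraction: float = 0.01) -> float:
    """Hit rate in the top-ranked fraction divided by the overall hit rate."""
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * top_fraction)))
    hits_top = sum(1 for _, active in ranked[:n_top] if active)
    total_actives = sum(1 for active in is_active if active)
    if total_actives == 0:
        return 0.0
    return (hits_top / n_top) / (total_actives / len(ranked))


# Toy data, not taken from the paper.
gold = [("NLRP3", "CCO", "IC50=10nM"), ("NLRP3", "CCN", "IC50=5nM"),
        ("NLRP3", "CCC", "Ki=2uM")]
pred = [("NLRP3", "CCO", "IC50=10nM"), ("NLRP3", "CCCl", "IC50=7nM")]
f1 = triplet_f1(pred, gold)

scores = list(range(100))
actives = [i >= 90 for i in range(100)]
ef1 = enrichment_factor(scores, actives)
```

Under exact matching, one wrong field in a triplet costs both precision and recall, which is one reason triplet-level F1 can sit well below per-field accuracy.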

How

• Document parsing agent: extracts information from multimodal sources
• Bioactivity agent: extracts bioactivity values, types, and units through semantic reasoning
• Chemical structure agent: under the CSG-VSR paradigm, the MLLM reasons over chemistry-grounded visual representations, and domain chemistry tools then construct the molecules
• Integration agent: merges and validates the extracted data
• Markush structure enumeration: iterative refinement between automated tools and MLLM reasoning
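The division of labor among the agents can be sketched as two independent outputs joined at the end. The sketch below assumes the bioactivity agent and the chemical structure agent both key their results by the compound label used in the paper (for example "1a"), and that the integration step is a simple join on that label. The record types, field names, and the `integrate` helper are hypothetical and do not reflect BIOMINER's actual interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BioactivityRecord:
    """Output of the (hypothetical) bioactivity agent."""
    compound_id: str      # label as written in the paper, e.g. "1a"
    protein: str
    activity_type: str    # e.g. "IC50"
    value: float
    unit: str


@dataclass
class Triplet:
    """Merged protein-ligand-bioactivity entry."""
    protein: str
    smiles: str
    activity_type: str
    value: float
    unit: str


def integrate(records: List[BioactivityRecord],
              structures: Dict[str, str]) -> List[Triplet]:
    """Join bioactivity records with recovered structures by compound label.

    Records whose structure was not recovered are dropped; a real system
    would more likely flag them for human review.
    """
    merged = []
    for rec in records:
        smiles = structures.get(rec.compound_id)
        if smiles is None:
            continue
        merged.append(Triplet(rec.protein, smiles, rec.activity_type,
                              rec.value, rec.unit))
    return merged


# Toy inputs, not taken from the paper.
records = [
    BioactivityRecord("1a", "NLRP3", "IC50", 12.0, "nM"),
    BioactivityRecord("2b", "NLRP3", "IC50", 340.0, "nM"),
]
structures = {"1a": "CCO"}  # structure agent failed to recover "2b"
triplets = integrate(records, structures)
```

Keeping the two agents independent means an error in structure recovery does not corrupt the extracted activity values, and each side can be evaluated separately.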

Originality

• A design that explicitly separates bioactivity semantic extraction from chemical structure reconstruction
• The CSG-VSR mechanism, which automatically enumerates complex Markush structures into individual molecules (a previously unsolved problem)
• Accurate symbolic representation generation by combining MLLMs with domain-specific tools
• A multimodal, agent-based decomposition architecture
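To make "enumerating a Markush structure into individual molecules" concrete, here is a minimal sketch of the combinatorial step only. It assumes the scaffold is already available as a SMILES-like template with `{R1}`-style placeholders and that each R-group has a known list of substituents. This is plain string substitution for illustration: it is not the CSG-VSR method, it performs no chemical validation, and `enumerate_markush` is a hypothetical helper name.

```python
from itertools import product
from typing import Dict, List


def enumerate_markush(scaffold: str, r_groups: Dict[str, List[str]]) -> List[str]:
    """Expand a templated scaffold into one string per R-group combination."""
    names = sorted(r_groups)  # fixed order so the output is deterministic
    molecules = []
    for combo in product(*(r_groups[name] for name in names)):
        mol = scaffold
        for name, fragment in zip(names, combo):
            mol = mol.replace("{" + name + "}", fragment)
        molecules.append(mol)
    return molecules


# Toy scaffold and substituents, not taken from the paper.
scaffold = "c1ccc({R1})cc1C(=O)N{R2}"
r_groups = {"R1": ["F", "Cl"], "R2": ["C", "CC"]}
molecules = enumerate_markush(scaffold, r_groups)
```

The number of enumerated molecules is the product of the substituent list sizes, so a table of R-groups in a paper can expand into many individual ligands from a single drawn scaffold.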

Limitation & Further Study

• Performance limits: an F1 of 0.32 is still low for production use, so a HITL workflow is required.
• Benchmark scale: 16,457 entries is limited compared with major databases (ChEMBL holds millions), and only 500 papers are covered.
• Markush handling limits: accuracy is unverified for complex R-group structures and special chemical conditions.
• Generalization uncertainty: the system is specialized to PDBbind-based papers, and its applicability to other fields is unclear.
• Cost-efficiency: operational complexity, including MLLM API call costs and dependence on chemistry tools.

Further study: collecting a larger benchmark, improving F1, extending to other scientific domains, and improving end-to-end performance.

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: BIOMINER clearly defines the previously undefined problem of multimodal bioactivity data extraction and creatively addresses technical challenges such as Markush enumeration through CSG-VSR. The BIOVISTA benchmark is a valuable asset for future research, and the three application scenarios (pretraining, HITL, structure annotation) demonstrate practical value. However, the absolute performance (F1 0.32) and the limited benchmark scale constrain broad industrial adoption, so continued improvement is needed.

Related papers worth reading

Foundational work
Large language model, data, and multimodal processing theory from bioinformatics forms the basis of BioMiner's system and benchmarking structure.
Alternative approach
Paper 131 covers LLM-based automation of proteomics research and is worth comparing with 3043 in how data is extracted from tools and text.
Alternative approach
The ProtAgents paper covers AI multi-agent work on protein discovery and offers an agent-based alternative to BioMiner's protein-ligand information extraction problem.
Follow-up work
Paper 524 covers information extraction and structure reconstruction from visual scientific literature, technically extending the protein-ligand bioactivity data mining method of 3043.
Follow-up work
3043 presents a multimodal system for protein information extraction and is an extended application of the LLM-based chemical information automation proposed by ChemMiner (2186).
Follow-up work
BioMiner's literature-based extraction of molecular and protein information is extended toward molecule matching based on retrieval-augmented foundation models.
Follow-up work
BioPipelines provides an end-to-end pipeline for the design and analysis processes that use the protein-ligand bioactivity data BioMiner extracts from the literature.
Follow-up work
BioMiner mines protein images and functional information through a multimodal system, which can be applied in practice to generative-model-based analysis.
Application
Through cases of scientific image analysis that combine multimodal information extraction with tools, it shows how aiscivision can be used in real domains such as biology and medicine.
Application
The BioMiner paper covers multi-modal protein-ligand data extraction and serves as an example of applying M2UMol's modality knowledge transfer approach to real biological data.