Sequence modeling and design from molecular to genome scale with Evo

Authors: Eric Nguyen, Michael Poli, Matthew G. Durrant, Armin W. Thomas, Brian Kang | Date: 2024 | DOI: 10.1101/2024.02.27.582234


Essence

Figure 1

Figure 1 | Pretraining a genomic foundation model across prokaryotic life. (A) A model of genome se-

Evo is a 7-billion-parameter genomic foundation model that predicts and generates DNA sequences at single-nucleotide resolution over a long context of 131 kb, and it performs a range of biological tasks from the molecular scale to the genome scale.
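
As a rough illustration of what single-nucleotide resolution over a long context means, the sketch below maps each base to one token and splits a sequence into context-sized windows. The vocabulary, the token IDs, and the helper names `tokenize` and `windows` are hypothetical, and reading "131 kb" as 131,072 tokens is an assumption; this is not Evo's actual tokenizer.

```python
# Hypothetical single-nucleotide tokenizer; the vocabulary and IDs are
# illustrative assumptions, not Evo's actual tokenizer.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
CONTEXT_LENGTH = 131_072  # assumed reading of "131 kb": 2**17 tokens


def tokenize(seq: str) -> list[int]:
    """Map each nucleotide to one token ID (unknown symbols become N)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def windows(tokens: list[int], size: int = CONTEXT_LENGTH) -> list[list[int]]:
    """Split a token list into consecutive, non-overlapping context windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


tokens = tokenize("ACGTNacgt")
print(tokens)                   # one token per nucleotide
print(len(windows(tokens, 4)))  # number of windows of size 4
```

With one token per base, a window of this size covers roughly 131 kb of sequence, so a longer genome has to be split into several windows.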

Motivation

Achievement

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: Evo is an innovative genomic foundation model that uses the StripedHyena architecture and single-nucleotide resolution to perform prediction and generation over long genomic contexts. It reaches SOTA performance in zero-shot function prediction and demonstrates that multi-component biological systems can be designed, making a significant contribution to synthetic biology.

Related papers

Foundational work
Scanpy (699) is the standard platform for processing and analyzing large-scale single-cell gene expression data, and it underpins the biological sequence prediction and generation in 749.
Foundational work
345 discusses multimodal foundation models and the concept of molecular grammar, providing the conceptual basis for the Evo model proposed in 749.
Foundational work
382 is a foundational model for ESM-based genome design and prediction, and it provides a theoretical basis for the diverse biological tasks performed by the Evo model in 749.
Foundational work
Paper 749 comprehensively covers multimodal design and structure-sequence co-generation frameworks, so it serves as theoretical background for the de novo protein design based on a multimodal diffusion model in 3112.
Alternative approach
Paper 619 applies physics-informed neural networks and deep learning to large-scale life-science data, showing a direction that contrasts with the sequence modeling approach.
Alternative approach
An alternative approach that predicts gene regulation with a Transformer-based genomic foundation model.
Alternative approach
291 presents a contrastive-learning-based method for predicting drug-disease interactions instead of a large-scale foundation model.
Alternative approach
A study that proposes a similar unified genomic model predicting multiple biological modalities from DNA sequence.
Alternative approach
Addresses molecular and genomic sequence design with a different modeling strategy (e.g., cross-domain sequence modeling), which makes it suitable for comparison.
Alternative approach
Presents sequence design and modeling methodology spanning molecules to whole genomes with a different scope and strategy, which makes it complementary.
Alternative approach
Covers a range of language-model-based approaches to gene and molecular sequence modeling and design, and offers an alternative perspective on RNA optimization.
Alternative approach
Paper 749 discusses single-to-multimodal design, so it can be read alongside 3109 to compare the strengths and weaknesses of the sequence-based approach presented there.
Alternative approach
Paper 3171 applies deep-learning-based gene regulation prediction and perturbation modeling at multiple scales, offering another approach to a similar topic.
Alternative approach
749 addresses sequence-based prediction and design of genome and protein function, and it can be viewed as complementary to the encrypted language-model-based enzyme discovery framework in 3275.
Alternative approach
Covers sequence modeling and design from the molecular to the genome scale, and can be compared with practical cases that broaden the scope of sequence design, such as IDR-Prop2Seq.
Follow-up work
505 demonstrates LLM-based gene regulatory network inference as an application, illustrating a practical use of the sequence modeling techniques provided in 749.
Follow-up work
Foundation models in bioinformatics (344) is an example of the theoretical and practical development of the genomic foundation models that Evo (749) addresses, and it provides background for predicting a variety of biological outputs.
Follow-up work
Paper 749 spans sequence-based and structure-based protein and genome design and prediction, so it can be connected to the applicability to real protein engineering campaigns shown in 3104.
Application
Through practical applications in LLM-based sequence modeling and design, it broadens the scope of discussion of this review paper.
Application
The Evo model in 749 uses large-scale, genome-scale sequence modeling to show empirically that the model in 856, which captures hierarchical structural information, is applicable.
← ๋ชฉ๋ก์œผ๋กœ ๋Œ์•„๊ฐ€๊ธฐ

๐ŸŽง Audio Overview

์ด ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ฅผ ํŒŸ์บ์ŠคํŠธํ˜• ์˜ค๋””์˜ค๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (Gemini ยท ํ‚ค๋Š” ๋ธŒ๋ผ์šฐ์ €์—๋งŒ ์ €์žฅ ยท ์™„์„ฑ๋ณธ์€ ์ด๋ฉ”์ผ๋กœ๋„ ์ „์†ก)
โ–ธ ๊ณ ๊ธ‰: ๊ตฌ์„ฑ ๋ฐฉํ–ฅ(๋Œ€๋ณธ ์ž‘์„ฑ ์ง€์นจ) ์ง์ ‘ ์ˆ˜์ •