Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design

Authors: Xingyu Su, Xiner Li, Masatoshi Uehara, Sunwoo Kim, Yulai Zhao | Date: 2025 | DOI: 10.48550/arXiv.2507.00445


Essence

The paper proposes VIDD (Value-guided Iterative Distillation for Diffusion models), a new framework for stably fine-tuning diffusion models to optimize non-differentiable reward functions in biomolecular design. It addresses the instability and mode collapse of existing reinforcement-learning-based methods through off-policy learning and minimization of the forward KL divergence.

Motivation

Achievement

Figure 1

Figure 1: Overview of VIDD. The method iterates over off-policy roll-in, value-function-based reward-weighted roll-out, and a forward-KL-based model update.

  1. Improved stability: off-policy data collection and the forward KL objective improve training stability relative to on-policy methods and reduce the risk of mode collapse.
  2. Better sample efficiency: higher sample efficiency than existing RL methods (PPO, DDPO), converging with fewer reward evaluations.
  3. Broad task coverage: strong performance demonstrated across diverse biomolecular design tasks, including protein design (secondary-structure matching, PD-L1/IFNAR2 binder design), small-molecule design, and regulatory DNA design.
  4. Non-differentiable reward optimization: handles arbitrary non-differentiable reward functions, such as rewards based on physical simulation or scientific knowledge.
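To make the non-differentiable reward point concrete, the sketch below treats the reward as a black box that is only evaluated, never differentiated. The GC-content reward, the candidate sequences, the `exp(r / alpha)` weighting, and the names `gc_reward`, `reward_weights`, and `alpha` are illustrative assumptions, not details taken from the paper.

```python
import math

def gc_reward(seq: str) -> float:
    """Hypothetical black-box reward: fraction of G/C bases in a DNA string."""
    return sum(base in "GC" for base in seq) / len(seq)

def reward_weights(rewards, alpha=0.25):
    """Weights proportional to exp(r / alpha), computed in a numerically stable way."""
    top = max(rewards)
    exps = [math.exp((r - top) / alpha) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed toy candidates; the reward is only queried, no gradient is needed.
candidates = ["ATAT", "ATGC", "GCGC"]
rewards = [gc_reward(s) for s in candidates]
weights = reward_weights(rewards, alpha=0.25)

for seq, r, w in zip(candidates, rewards, weights):
    print(f"{seq}: reward={r:.2f} weight={w:.3f}")
```

A smaller `alpha` concentrates the weights on the highest-reward candidates, while a larger `alpha` keeps them closer to uniform.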

How

Figure 1

Figure 1: Algorithmic structure of VIDD and its three key steps
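The three steps (off-policy roll-in, reward-weighted roll-out, forward-KL update) can be sketched on a toy categorical model over four items. This is a minimal illustration under stated assumptions, not the authors' implementation: the mixture roll-in, the `exp(r / alpha)` tilt, the damped update, and every name (`distillation_step`, `mix`, `step`, `alpha`) are hypothetical, and exact distributions stand in for sampled trajectories.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(dist, rewards):
    return sum(p * r for p, r in zip(dist, rewards))

def distillation_step(model, pretrained, rewards, alpha=1.0, mix=0.5, step=0.5):
    # 1) Off-policy roll-in: use a mixture of the current and pretrained models,
    #    rather than the current model alone (assumed form of the roll-in).
    roll_in = [mix * m + (1 - mix) * p for m, p in zip(model, pretrained)]
    # 2) Reward-weighted roll-out: tilt the roll-in distribution by exp(r / alpha).
    teacher = normalize([q * math.exp(r / alpha) for q, r in zip(roll_in, rewards)])
    # 3) Forward-KL update: for an unconstrained categorical model, the minimizer
    #    of KL(teacher || model) is the teacher itself; take a damped step toward it.
    return [(1 - step) * m + step * t for m, t in zip(model, teacher)]

rewards = [0.0, 1.0, 2.0, 3.0]   # black-box reward values, one per item
pretrained = [0.25] * 4          # uniform pretrained model
model = list(pretrained)

history = [expected_reward(model, rewards)]
for _ in range(5):
    model = distillation_step(model, pretrained, rewards)
    history.append(expected_reward(model, rewards))

print([round(h, 3) for h in history])
```

In this toy setting the expected reward rises above its starting value while the model stays a valid distribution that keeps nonzero mass on every item.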

Algorithm structure:

$$\mathcal{L} = \mathrm{KL}(p_{out} \,\|\, p_\omega)$$

This is a forward KL objective, which encourages mode-covering behavior and thereby preserves diversity.
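The mode-covering behavior of the forward KL can be checked numerically on a small discrete example. The distributions below are made-up illustrations (a bimodal target and two candidate models), not values from the paper, and the helper `kl` is a plain textbook definition.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed bimodal target (plays the role of p_out) over four states.
p_out = [0.48, 0.02, 0.02, 0.48]
# Candidate A: concentrates on a single mode.
q_narrow = [0.94, 0.02, 0.02, 0.02]
# Candidate B: spreads mass across both modes.
q_broad = [0.25, 0.25, 0.25, 0.25]

# Forward KL puts the target first: KL(p_out || q).
forward = {"narrow": kl(p_out, q_narrow), "broad": kl(p_out, q_broad)}
# Reverse KL puts the model first: KL(q || p_out).
reverse = {"narrow": kl(q_narrow, p_out), "broad": kl(q_broad, p_out)}

print("forward KL:", forward)
print("reverse KL:", reverse)
```

Here the forward KL is lower for the broad candidate, because missing a target mode is penalized heavily, whereas the reverse KL is lower for the single-mode candidate. That asymmetry is the usual argument for why a forward KL objective is less prone to mode collapse.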

Originality

Limitation & Further Study

Evaluation

Overall assessment: This paper is a strong contribution that elegantly tackles the practical challenge of non-differentiable reward optimization in biomolecular design through off-policy learning and forward-KL-based policy distillation. Its broad empirical validation across protein and molecular design, together with the gains in stability and sample efficiency over existing methods, adds to its value. That said, the paper would be stronger still with more theoretical analysis and a scalability study on large-scale problems.

Related papers worth reading

Foundational work
For the scientific applications of LLMs and generative AI used in 446, the survey in 004 provides broad theoretical background.
Foundational work
The reward-based diffusion fine-tuning approach of 446 provides the theoretical starting point for the test-time iterative reward optimization framework introduced in 682.
Foundational work
The reward-guided fine-tuning and distribution-expansion method of 446 underpins the verifier-based flow optimization framework of 867.
Foundational work
A paper on advancing reward-guided diffusion models (Iterative Distillation); it explains a fundamental connection to the Clean-Sample Markov chain sampling strategy.
Foundational work
A systematic analysis of reward-guided fine-tuning methodology for diffusion models; it provides the theoretical foundation needed for the detailed reinforcement-learning-based reward implementation in CAGenMol.
Foundational work
Presents a general strategy for reward-guided fine-tuning of diffusion models and serves as the theoretical basis for the reward-guided sampling design of MP2D.
Alternative approach
Paper 555 proposes a GAN-based approach to molecular graph generation, a useful contrast with diffusion-model-based design.
Alternative approach
Both papers address controlling diffusion models with non-differentiable reward functions, but 446 proposes a fine-tuning-based method while 269 proposes an inference-time guidance technique.
Alternative approach
The Iterative Distillation for Reward-Guided Fine-Tuning of Diff paper explores diffusion-model generalization through reward-guided fine-tuning, which makes it an alternative to this paper's application of SAM to robot policy optimization.
Alternative approach
The paper that attempts reward-based optimization with diffusion models in biomolecular design shares its goal with LLM-based chemical synthesis automation but differs in method.
Alternative approach
The Reward-Guided Iterative Refinement in Diffusion Models paper presents another deep learning framework for reward-based diffusion model optimization.
Alternative approach
The Inference-Time Alignment in Diffusion Models paper shows another experimental approach to diffusion model optimization using reward signals.
Alternative approach
Both papers address reward-based alignment of diffusion models but propose different optimization schemes and experimental protocols, so a comparative analysis is useful.
Alternative approach
Reward-Guided Discrete Diffusion is recent work that solves the similar problem of reward-function-driven diffusion fine-tuning with a different formulation.
Alternative approach
A study that presents a different optimization-based method for RNA secondary structure design.
Follow-up work
The Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models paper additionally covers ways to improve RL models through reward-based tuning and robustness-enhancing approaches.
Follow-up work
The CAGenMol paper extends the reward-guided fine-tuning framework to biomolecular and materials design with a condition-aware, goal-directed diffusion language model.
Follow-up work
Uses a reward-guided fine-tuning approach to further demonstrate the potential for structural diversification and constraint handling in generative models, particularly diffusion-based ones.
Application
The SAMPLE platform is a case where reward-guided diffusion models are applied in practice to make sample-space exploration more efficient in automated protein engineering.
Application
The diffusion-model reward optimization of 446 offers a practical way to improve the real-world applicability of recent diffusion-based biomolecular structure prediction models such as AlphaFold3.
Counterpoint/critique
The Hallucinations can improve large language models in drug discovery paper argues that "instability" is not always negative, giving a balanced view of the limitations and interpretation of reward-guided fine-tuning.