SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie | Date: 2025 | DOI: 10.48550/arXiv.2501.17161


Essence

Figure 1

Figure 1: A comparative study of RL and SFT in the V-IRL visual navigation environment. The OOD curves show performance on the same task with a different textual action space.

This paper is a systematic study comparing the generalization ability of supervised fine-tuning (SFT) and reinforcement learning (RL) in the post-training stage of foundation models. Its key finding is that RL generalizes well on both rule-based reasoning and visual tasks, whereas SFT leans toward memorizing the training data.

Motivation

Achievement

Figures 4 & 5 (merged view)

Figure: Success rate (%) trends of RL and SFT on GeneralPoints and V-IRL. RL maintains consistent performance gains on out-of-distribution (OOD) data.

  1. Superior rule-based generalization: RL successfully transfers the rules it was trained on to unseen rule variants, whereas SFT shows a large performance drop on out-of-distribution tasks.
  2. Visual-domain generalization: RL also generalizes consistently across variations in visual input such as color and spatial layout, and reaches state-of-the-art performance on the V-IRL benchmark (+33.8 percentage points: 44.0% → 77.8%).
  3. Improved visual recognition: the paper identifies how RL training with an outcome-based reward function improves the model's underlying visual recognition capability.
  4. Supporting role of SFT: SFT acts as a "format teacher" that stabilizes the output format, which is what allows RL to realize its performance gains.
  5. Inference-time compute scaling: the paper shows that scaling inference-time compute by raising the maximum number of verification steps is a key factor in RL generalization.
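The rule-variant split in point 1 can be made concrete with a small sketch. It assumes a GeneralPoints-style card task in which four cards must be combined into an expression equal to 24, and it treats the face-card rule as the thing that changes between training and OOD evaluation. The rule names, the target value, and the helper functions (`card_values`, `outcome_reward`) are illustrative assumptions, not the authors' implementation.

```python
import ast
import operator

# Hypothetical card rules: one could be used for training, the other held out as OOD.
RULES = {
    "faces_as_10": {"A": 1, "J": 10, "Q": 10, "K": 10},
    "faces_as_11_12_13": {"A": 1, "J": 11, "Q": 12, "K": 13},
}

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def card_values(cards, rule):
    """Map card labels such as '5' or 'K' to sorted numbers under the given rule."""
    faces = RULES[rule]
    return sorted(faces[c] if c in faces else int(c) for c in cards)


def _eval(node, used):
    """Evaluate a +, -, *, / expression tree, recording every integer it uses."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return _OPS[type(node.op)](left, right)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return node.value
    raise ValueError("unsupported expression")


def outcome_reward(expression, cards, rule, target=24):
    """Outcome-based reward: 1.0 only if each card is used once and the target is hit."""
    used = []
    try:
        value = _eval(ast.parse(expression, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
    correct = sorted(used) == card_values(cards, rule) and abs(value - target) < 1e-9
    return 1.0 if correct else 0.0
```

Under these assumptions, the hand Q, 2, A, A is solved by `(10 + 1 + 1) * 2` when face cards count as 10, but the same answer earns nothing under the other rule, where the hand calls for something like `12 * 2 * 1 * 1`. A policy that only memorized the first mapping would fail on the second.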

How

See Figures 2 & 3

Figures 2-3: Formulation of sequential revision with a verifier, with an example of state-action transitions.
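A minimal sketch of what sequential revision with a verifier could look like, assuming the state is plain text that grows by appending each answer and the verifier's feedback. The `Policy` and `Verifier` type aliases, the `sequential_revision` function, the toy policy and verifier, and the prompt format are all hypothetical and chosen only for illustration; they are not the authors' code.

```python
from typing import Callable, List, Tuple

Policy = Callable[[str], str]                  # state text -> proposed answer
Verifier = Callable[[str], Tuple[float, str]]  # answer -> (reward, feedback text)


def sequential_revision(
    prompt: str, policy: Policy, verifier: Verifier, max_steps: int
) -> Tuple[List[Tuple[str, float]], str]:
    """Run up to max_steps answer/verify rounds, stopping at the first success."""
    state = prompt
    trajectory: List[Tuple[str, float]] = []
    for _ in range(max_steps):
        answer = policy(state)               # action taken in the current state
        reward, feedback = verifier(answer)  # outcome-based reward plus feedback
        trajectory.append((answer, reward))
        if reward > 0:
            break
        # Transition: the next state is the old state plus the answer and feedback.
        state = f"{state}\nAnswer: {answer}\nFeedback: {feedback}"
    return trajectory, state


def toy_policy(state: str) -> str:
    """Stand-in for a model: pick the next guess based on how much feedback was seen."""
    guesses = ["7", "30", "24"]
    return guesses[min(state.count("Feedback:"), len(guesses) - 1)]


def toy_verifier(answer: str) -> Tuple[float, str]:
    """Stand-in verifier: reward 1.0 for the target value 24, otherwise -1.0."""
    value = int(answer)
    if value == 24:
        return 1.0, "correct"
    return -1.0, "too low" if value < 24 else "too high"


if __name__ == "__main__":
    for steps in (1, 3):
        rollout, _ = sequential_revision("Reach 24.", toy_policy, toy_verifier, steps)
        print(steps, rollout)
```

In this toy setup a budget of one verification step ends in failure, while a budget of three lets the policy revise its answer until the verifier accepts it, which is the kind of effect the maximum-verification-steps setting is meant to capture.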

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4.5/5 | Significance: 4.5/5 | Clarity: 4/5 | Overall: 4.2/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ ๋Œ€๊ทœ๋ชจ ๊ธฐ์ดˆ ๋ชจ๋ธ ํ›ˆ๋ จ์—์„œ ๊ด‘๋ฒ”์œ„ํ•˜๊ฒŒ ์‚ฌ์šฉ๋˜๋Š” ๋‘ ์ฃผ์š” ๊ธฐ๋ฒ•์˜ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ์ฒด๊ณ„์ ์œผ๋กœ ๋น„๊ตํ•œ ์ค‘์š”ํ•œ ์‹ค์ฆ ์—ฐ๊ตฌ๋กœ, "RL์€ ์ผ๋ฐ˜ํ™”, SFT๋Š” ์•”๊ธฐ"๋ผ๋Š” ๋ช…ํ™•ํ•œ ๊ตฌ๋ถ„์„ ํ†ตํ•ด ํ–ฅํ›„ ๋ชจ๋ธ ๊ฐœ๋ฐœ ์ „๋žต์— ์‹ค์งˆ์  ์ง€์นจ์„ ์ œ๊ณตํ•œ๋‹ค. ๋‹ค๋งŒ ์ž‘์—… ๋ฒ”์œ„์™€ ๋ชจ๋ธ ๋‹ค์–‘์„ฑ ์ธก๋ฉด์—์„œ์˜ ํ™•์žฅ์ด ํ•„์š”ํ•˜๋ฉฐ, SFT-RL ์ƒํ˜ธ์ž‘์šฉ์˜ ์ตœ์ ํ™” ๋ฉ”์ปค๋‹ˆ์ฆ˜์— ๋Œ€ํ•œ ๋” ๊นŠ์€ ๋ถ„์„์ด ์š”๊ตฌ๋œ๋‹ค.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
Kimi k1.5 ๋…ผ๋ฌธ์€ RL ๊ธฐ๋ฐ˜ LLM ์ผ๋ฐ˜ํ™” ๋ฐ ์„ฑ๋Šฅ ํ™•์žฅ ์‹คํ—˜์„ ํ†ตํ•ด RL๊ณผ SFT์˜ ๊ทผ๋ณธ์  ์ฐจ์ด๋ฅผ ๋’ท๋ฐ›์นจํ•œ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
LLM์˜ RL์„ ํ†ตํ•œ ์ถ”๋ก  ์œ ๋„ ๋ฐ ๊ฐ•ํ™”ํ•™์Šต์˜ ์ผ๋ฐ˜ํ™” ํšจ๊ณผ๋ฅผ ์‹คํ—˜์ ์œผ๋กœ ๋ถ„์„ํ•˜์—ฌ RL๊ณผ SFT์˜ ๋น„๊ต๊ตฌ๋„๋ฅผ ๋ณด์™„ํ•จ.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
SFT์™€ RL์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ์‹คํ—˜์ ์œผ๋กœ ๋น„๊ตยท๋ถ„์„ํ•˜์—ฌ ๋ณธ ๋…ผ๋ฌธ์˜ ์ž๊ธฐ๊ฒ€์ฆ ์‹ฌ์ธต ํ•™์Šต ๊ตฌ์กฐ์™€ ์—ฐ๊ฒฐ๋ฉ๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
SFT Memorizes, RL Generalizes ๋…ผ๋ฌธ์€ RL ๊ธฐ๋ฐ˜ LLM reasoning ํ•™์Šต์—์„œ ์ผ๋ฐ˜ํ™”์™€ ์ปค๋ฆฌํ˜๋Ÿผ ์„ค๊ณ„์˜ ์ด๋ก ์ ๋ฐ”ํƒ•์„ ์ œ๊ณตํ•œ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
470 'Large language models can self-improve' ๋…ผ๋ฌธ์€ SFT/RL ์™ธ์—๋„ ์ž์ฒด ์ƒ์„ฑ๋œ ํ”ผ๋“œ๋ฐฑ๊ณผ ์ž๊ธฐ๊ฐœ์„  ๋ฃจํ”„๋ฅผ ํ†ตํ•œ LLM ์ผ๋ฐ˜ํ™” ํ–ฅ์ƒ ์ „๋žต์„ ๋‹ค๋ฃจ์–ด ๋Œ€์กฐ์ ์œผ๋กœ ์ฐธ๊ณ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
LLM์˜ ํƒ์ƒ‰์  ํ–‰๋™๊ณผ ๋ถˆํ™•์‹คํ•œ ํ™˜๊ฒฝ์—์„œ์˜ ์˜์‚ฌ๊ฒฐ์ • ๋Šฅ๋ ฅ์„ ๋ถ„์„ํ•˜๋Š” ์œ ์‚ฌํ•œ ์—ฐ๊ตฌ์ด๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
Executable Code Actions ๋…ผ๋ฌธ์€ RL ๊ธฐ๋ฐ˜ ์—์ด์ „ํŠธ์˜ ์ฝ”๋“œ ์‹คํ–‰ยทํ‰๊ฐ€ ๊ธฐ๋ฒ•์„ ํ†ตํ•ด ๊ฐ•ํ™”ํ•™์Šต์˜ ์ผ๋ฐ˜ํ™” ์‹ค์ฆ์  ์‚ฌ๋ก€๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
์ „๋žต์  ๋„๊ตฌ ์‚ฌ์šฉ ๋ฐ RL ๊ธฐ๋ฐ˜ ์ ์‘ ์„ฑ๋Šฅ ํ–ฅ์ƒ ์‚ฌ๋ก€๋ฅผ ํ†ตํ•ด SFT์™€ RL ์„ฑ๋Šฅ์ฐจ๊ฐ€ ์‹ค์งˆ์ ์œผ๋กœ ๋„๊ตฌ์‚ฌ์šฉ ๋งฅ๋ฝ์—์„œ ์ ์šฉ๋จ์„ ๋ณด์—ฌ์คŒ.
์‘์šฉ ์‚ฌ๋ก€
RL์˜ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์— ๋Œ€ํ•œ ๋ฐœ๊ฒฌ์ด Foundation Model Surrogates์˜ ๋Šฅ๋™ํ•™์Šต ํ”„๋ ˆ์ž„์›Œํฌ, ํŠนํžˆ ํŠธ๋žœ์Šคํฌ๋จธ์— RL ์ ์šฉ ๊ฐ€๋Šฅ์„ฑ ๋…ผ์˜๋กœ ํ™•์žฅ๋œ๋‹ค.
๋ฐ˜๋ก /๋น„ํŒ
๊ธฐ์กด SFT(์ง€๋„ํ•™์Šต)์™€ ๊ฐ•ํ™”ํ•™์Šต์„ ํ†ตํ•œ LLM generalization ์ฐจ์ด๋ฅผ ์‹คํ—˜์ ์œผ๋กœ ๋น„๊ตํ•ด, 449๋ฒˆ์˜ ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ ์ถ”๋ก  ๊ฐ•ํ™” ํšจ๊ณผ์™€ ๋Œ€๋น„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.