Mind the gap: Examining the self-improvement capabilities of large language models

Authors: Yuda Song, Hanlin Zhang, Carson Eisenach, Sham M. Kakade, Dean Foster, Udaya Ghai | Date: 2025 | DOI: arXiv:2412.02674


Essence

Figure 1: With an appropriate verification method (e.g., CoT-S), the relative generation-verification gap increases monotonically with pretraining compute (flops).

This paper systematically analyzes the self-improvement mechanism of large language models (LLMs). Using a key metric, the generation-verification gap (GV-Gap), it characterizes the limits and the potential of a language model's ability to improve its performance by verifying its own outputs.

Motivation

Achievement

Figure 2: Visualization of the core definitions of the self-improvement framework, using rejection sampling as the example.

  1. Scaling of the generation-verification gap: with certain verification methods (notably Chain-of-Thought-Score), the relative GV-Gap increases monotonically with the model's pretraining compute (flops). This suggests that larger models are better at verifying their own generations.
  2. Cross-verification analysis: when different models are used for generation and verification, the GV-Gap consistently increases with the verifier's capability and decreases with the generator's capability.
  3. Limits of iterative self-improvement: after a few rounds of iterative self-improvement the GV-Gap converges to 0, and the rate of saturation is independent of model capacity. Effective diversity degrades over the iterations.
  4. Properties of verification mechanisms: the same verification method induces consistent trends across different models, but different verification mechanisms show substantial non-overlap with one another. The GV-Gap is not necessarily positively correlated with generation accuracy.

How

Figure 3: GV-Gaps in cross-improvement. Within each row (a fixed generator), the gap increases as verifier capability increases.

Formalization of the self-improvement framework:

```

gap(f, g) := J(f[w(û_g)]) - J(f)

```

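Here f is the generator, g is the verifier, û_g is the score that g assigns to a generation, J measures performance, and w is a function that converts verification scores into weights. The relative gap is this gap normalized by the maximum possible improvement.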
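As a rough illustration of the definition above, the following Python sketch estimates the gap on a finite batch of sampled responses. It is a minimal sketch under stated assumptions, not the paper's implementation: utilities are assumed to lie in [0, 1] (so the maximum possible improvement is 1 - J(f)), uniform weights over the samples stand in for f, and w is a hard score threshold, which corresponds to rejection sampling. The function names, the threshold value, and the toy numbers are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def expected_utility(utilities: Sequence[float],
                     weights: Optional[Sequence[float]] = None) -> float:
    """Estimate J as the weighted mean utility over a batch of sampled responses."""
    if weights is None:
        weights = [1.0] * len(utilities)  # assumption: uniform weights stand in for f
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have positive total mass")
    return sum(u * w for u, w in zip(utilities, weights)) / total


def threshold_weights(tau: float) -> Callable[[Sequence[float]], List[float]]:
    """Hypothetical choice of w: keep a response iff its verifier score is >= tau."""
    def w(scores: Sequence[float]) -> List[float]:
        return [1.0 if s >= tau else 0.0 for s in scores]
    return w


def gv_gap(utilities: Sequence[float], scores: Sequence[float],
           w: Callable[[Sequence[float]], Sequence[float]]) -> float:
    """gap(f, g) = J(f[w(u_g)]) - J(f), estimated on the sampled batch."""
    return expected_utility(utilities, w(scores)) - expected_utility(utilities)


def relative_gv_gap(utilities: Sequence[float], scores: Sequence[float],
                    w: Callable[[Sequence[float]], Sequence[float]],
                    max_utility: float = 1.0) -> float:
    """Gap divided by the headroom max_utility - J(f) (assumed normalization)."""
    headroom = max_utility - expected_utility(utilities)
    if headroom <= 0:
        return 0.0  # the generator is already at the maximum; nothing to gain
    return gv_gap(utilities, scores, w) / headroom


# Toy batch: ground-truth correctness of 8 generations and the verifier's scores.
utilities = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6, 0.3, 0.4]
w = threshold_weights(0.5)

print(gv_gap(utilities, scores, w))
print(relative_gv_gap(utilities, scores, w))
```

On this toy batch the filter keeps four responses, three of which are correct, so accuracy rises from 0.375 to 0.75: a gap of 0.375 and a relative gap of 0.6.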

Experimental setup:

Key findings:

Originality

Limitation & Further Study

Limitations:

Directions for future work:

Evaluation

Novelty: 4.5/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4.5/5 | Overall: 4.2/5

Overall assessment: This paper is a meaningful study that defines a key metric for LLM self-improvement and, through extensive empirical analysis, is the first to characterize its scaling behavior. The generation-verification gap is expected to become an important criterion for designing future self-improvement algorithms, although the results need to be shown to generalize more broadly and the underlying mechanisms need deeper analysis.

Related papers worth reading

Foundational work
The paper Large language models can self-improve provides the theoretical grounding and experimental cases for the concept of LLM self-improvement, laying the foundation for 538's quantitative analysis of self-improvement capability.
Foundational work
An in-depth interpretation of AI safety and the limits of self-verification capability provides the theoretical basis for the discussion of the generation-verification gap (GV-gap).
Foundational work
A study that provides the theoretical foundation of Bayesian optimal experimental design.
Alternative approach
The iterative refinement method, which improves performance incrementally through self-feedback, can be compared with the generation-verification gap in terms of how each is benchmarked.
Alternative approach
Allows a comparative analysis of the limits of, and remedies for, approaches in which an LLM checks its own output step by step (zero-shot step-by-step self-checking).
Alternative approach
Complements the quantitative analysis of RBF++ through a systematic examination of the limits of LLM self-improvement and of reasoning boundaries.
Follow-up work
Paper 538 systematically measures the limits of LLM self-improvement capability and the effectiveness of self-reflection techniques, benchmarking and critiquing the claims of 470.
Follow-up work
Proposes practical ways to address the self-verification limits presented in 538 by training LLMs for automatic self-debugging and error correction.
Application
The Mind the gap paper empirically analyzes the limits of LLM self-improvement capability and the actual effect of self-improvement, testing the real-world impact of the ImPlicit Self-ImprovemenT framework.
Application
Through its analysis of LLM self-improvement and verification capabilities, shows what discussions of interpretability and safety mean for real-world LLM use.
Application
The discussion of LLM self-improvement mechanisms and their limits can be applied to evaluating the reliability and practical usability of knowledge-guided LLMs in materials science.
Application
Research on LLM self-improvement and verification capability connects to real cases of generalization bias in scientific summarization.
Counterpoint/critique
The discussion of self-improvement and reliability in materials-science-specific LLMs complements the account of the limits of self-verification in LLMs generally and of the difficulty of applying them in practice.
Counterpoint/critique
Presents the limits of, and directions for, continual LLM self-improvement in scientific information extraction and analysis, and discusses the practical problems of large-scale data mining.