Large language models can self-improve

์ €์ž: Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han | ๋‚ ์งœ: 2022 | DOI: N/A 📄 PDF


Essence

Figure 1: ๋ฐฉ๋ฒ•์˜ ๊ฐœ์š”. Chain-of-Thought ์˜ˆ์‹œ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์–ธ์–ด๋ชจ๋ธ์ด ์—ฌ๋Ÿฌ ๊ฐœ์˜ CoT ์ถ”๋ก  ๊ฒฝ๋กœ๋ฅผ ์ƒ์„ฑํ•˜๊ณ , ๋‹ค์ˆ˜๊ฒฐ ํˆฌํ‘œ(Majority Voting)๋กœ ๊ณ ์‹ ๋ขฐ๋„ ๋‹ต๋ณ€์„ ์„ ํƒํ•œ ํ›„, ์ด๋ฅผ ํŒŒ์ธํŠœ๋‹ ๋ฐ์ดํ„ฐ๋กœ ํ™œ์šฉ

๊ทธ๋ฆผ 1: ๋ฐฉ๋ฒ•์˜ ๊ฐœ์š”. Chain-of-Thought ์˜ˆ์‹œ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์–ธ์–ด๋ชจ๋ธ์ด ์—ฌ๋Ÿฌ ๊ฐœ์˜ CoT ์ถ”๋ก  ๊ฒฝ๋กœ๋ฅผ ์ƒ์„ฑํ•˜๊ณ , ๋‹ค์ˆ˜๊ฒฐ ํˆฌํ‘œ(Majority Voting)๋กœ ๊ณ ์‹ ๋ขฐ๋„ ๋‹ต๋ณ€์„ ์„ ํƒํ•œ ํ›„, ์ด๋ฅผ ํŒŒ์ธํŠœ๋‹ ๋ฐ์ดํ„ฐ๋กœ ํ™œ์šฉ

๋Œ€๊ทœ๋ชจ ์–ธ์–ด๋ชจ๋ธ(LLM)์ด ๋ ˆ์ด๋ธ” ์—†๋Š” ๋ฐ์ดํ„ฐ๋งŒ์œผ๋กœ ์ž๊ธฐ ์ƒ์„ฑ ๊ณ ์‹ ๋ขฐ๋„ ์ถ”๋ก (reasoning) ๊ฒฝ๋กœ๋ฅผ ํ†ตํ•ด ์ž๊ฐ€ ๊ฐœ์„ (self-improve)ํ•  ์ˆ˜ ์žˆ์Œ์„ ์ž…์ฆํ•œ ๋…ผ๋ฌธ์ด๋‹ค. Chain-of-Thought ํ”„๋กฌํŒ…๊ณผ ์ž๊ธฐ ์ผ๊ด€์„ฑ(self-consistency)์„ ํ™œ์šฉํ•˜์—ฌ ๊ฐ๋… ์‹ ํ˜ธ ์—†์ด ๋ชจ๋ธ์˜ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค.

Motivation

Achievement

  1. ๋„๋ฉ”์ธ ๋‚ด ์„ฑ๋Šฅ ๋Œ€ํญ ํ–ฅ์ƒ: 540B ํŒŒ๋ผ๋ฏธํ„ฐ PaLM ๋ชจ๋ธ์—์„œ GSM8K 74.4%โ†’82.1%, DROP 78.2%โ†’83.0%, OpenBookQA 90.0%โ†’94.4%, ANLI-A3 63.4%โ†’67.9%๋กœ ๊ฐœ์„ . ๊ฐ๋… ์‹ ํ˜ธ ์—†์ด ์ตœ์‹  ์ˆ˜์ค€(state-of-the-art) ์„ฑ๋Šฅ ๋‹ฌ์„ฑ
  2. ๋„๋ฉ”์ธ ์™ธ ์ผ๋ฐ˜ํ™”(Out-of-Domain Generalization): AQUA, StrategyQA, MNLI ๋“ฑ ํ›ˆ๋ จ ๋ถ„ํฌ์™€ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์—์„œ๋„ ์„ฑ๋Šฅ ๊ฐœ์„ . ์ž๊ฐ€ ๊ฐœ์„ ์ด ํŠน์ • ๋ฐ์ดํ„ฐ์…‹์— ๊ณผ์ ํ•ฉ(overfitting)๋˜์ง€ ์•Š๊ณ  ์ผ๋ฐ˜์  ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ด์„ ์ž…์ฆ
  3. ์ƒํƒœ ์ถ”์  ๋Šฅ๋ ฅ: ์‹ ๋ขฐ๋„ ์ฒ™๋„(confidence measure)๊ฐ€ ์‹ค์ œ ์ •ํ™•๋„์™€ ๋†’์€ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ณด์ž„(Figure 2). ๋ชจ๋ธ์ด ์ž์‹ ์˜ ์‹ ๋ขฐ๋„๋ฅผ ์ •ํ™•ํžˆ ํ‰๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌ

How

Figure 3: PaLM-540B์—์„œ ๋‹ค์ค‘ ๊ฒฝ๋กœ ์ƒ˜ํ”Œ๋ง์„ ์‚ฌ์šฉํ•œ GSM8K ํ…Œ์ŠคํŠธ ์ง‘ํ•ฉ์—์„œ์˜ ์ •ํ™•๋„ ๊ฒฐ๊ณผ

๊ทธ๋ฆผ 3: PaLM-540B์—์„œ ๋‹ค์ค‘ ๊ฒฝ๋กœ ์ƒ˜ํ”Œ๋ง์„ ์‚ฌ์šฉํ•œ GSM8K ํ…Œ์ŠคํŠธ ์ง‘ํ•ฉ์—์„œ์˜ ์ •ํ™•๋„ ๊ฒฐ๊ณผ

Originality

Limitation & Further Study

Evaluation

์ดํ‰: ์ด ๋…ผ๋ฌธ์€ ๋ ˆ์ด๋ธ” ์—†๋Š” ๋ฐ์ดํ„ฐ๋กœ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด๋ชจ๋ธ์ด ์ž๊ฐ€ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ช…ํ™•ํžˆ ์ž…์ฆํ•œ ์ค‘์š”ํ•œ ์—ฐ๊ตฌ๋‹ค. Chain-of-Thought์™€ ์ž๊ธฐ ์ผ๊ด€์„ฑ์„ ์ฐฝ์˜์ ์œผ๋กœ ์กฐํ•ฉํ•˜์—ฌ ๊ฐ•๋ ฅํ•œ ์ž๋™ ๊ฐ๋… ์‹ ํ˜ธ๋ฅผ ์–ป์—ˆ์œผ๋ฉฐ, ๋„๋ฉ”์ธ ๋‚ด์™ธ ๋‹ค์ˆ˜ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ƒํƒœ ์ถ”์  ์ˆ˜์ค€์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ๋‹ค. ๋‹ค๋งŒ ์‹ ๋ขฐ๋„ ํ‰๊ฐ€์˜ ์ •๊ต์„ฑ, ์˜ค๋ฅ˜ ์ฆํญ ์œ„ํ—˜, ๊ณ„์‚ฐ ๋น„์šฉ ๋“ฑ์˜ ํ•œ๊ณ„๊ฐ€ ์žˆ์œผ๋‚˜, ๊ฐ๋… ์‹ ํ˜ธ ์˜์กด์„ฑ์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์—์„œ ์‹ค๋ฌด์  ๊ฐ€์น˜๊ฐ€ ๋งค์šฐ ๋†’๋‹ค.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
LLM์ด ์•”๋ฌต์ ์œผ๋กœ self-improvement๋ฅผ ํ•™์Šตํ•˜๋„๋ก ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•ด, ์ž๊ธฐ ๊ฐœ์„  ๋…ผ๋ฌธ์˜ ๊ธฐ๋ฐ˜์ด ๋œ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
470์€ LLM์˜ ์ž๊ธฐ ๊ฐœ์„  ๊ฐ€๋Šฅ์„ฑ์— ๋Œ€ํ•œ ์ด๋ก ์ /์‹คํ—˜์  ๋…ผ์˜๋กœ 746์˜ ๊ธฐ์ดˆ์  ๋ฐฐ๊ฒฝ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
470์˜ LLM ์ž๊ธฐ๊ฐœ์„  ๋Šฅ๋ ฅ ๊ด€๋ จ ์—ฐ๊ตฌ๋Š” 180์—์„œ ํ‰๊ฐ€ํ•˜๋Š” ๋Šฅ๋™์  ์ •๋ณด์ˆ˜์ง‘ ๋ฐ ์ „๋žต ์ ์‘์˜ ์ด๋ก ์  ๊ธฐ๋ฐ˜์ด ๋œ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
LLM์˜ ์ž๊ธฐ๊ฐœ์„ ๊ณผ self-improvement ๊ตฌ์กฐ์— ๋Œ€ํ•œ ์ด๋ก ์  ๋ถ„์„์„ ๋ฐ”ํƒ•์œผ๋กœ, 447๋ฒˆ ๋…ผ๋ฌธ์˜ ๋ฐ˜๋ณต์  ๊ฐ•ํ™”ํ•™์Šต ๊ตฌ์กฐ๋ฅผ ๋’ท๋ฐ›์นจํ•ฉ๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
LLM์˜ ์ž๊ธฐ ๊ฐœ์„ ๊ณผ ๋ฐ˜๋ณต์  ์˜ค๋ฅ˜ ์ˆ˜์ • ๋ฉ”์ปค๋‹ˆ์ฆ˜์˜ ๊ธฐ๋ณธ ์›๋ฆฌ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ, 598๋ฒˆ ๋…ผ๋ฌธ์˜ ๊ฐ•ํ™”ํ•™์Šต์  self-correction ํ”„๋กœ์„ธ์Šค์˜ ๊ธฐ์ดˆ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
LLM์˜ ์ž๊ธฐ๊ฐœ์„ ๊ณผ ๋ฐ˜์‚ฌ(reflection) ๋ฐฉ๋ฒ•๋ก ์˜ ํ•œ๊ณ„๋ฅผ ๋ถ„์„ํ•ด, DLPO์—์„œ ๋‹ค๋ฃจ๋Š” ํ”„๋กฌํ”„ํŠธ ์ตœ์ ํ™” ๊ฐœ์„ ์˜ ์ด๋ก ์  ๋ฐฐ๊ฒฝ์„ ์ œ๊ณตํ•œ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
Large language models can self-improve ๋…ผ๋ฌธ์€ LLM์˜ ์ธ๊ฐ„์ˆ˜์ค€ ์ถ”๋ก  ๋ฐ ์ ์‘๋ ฅ์˜ ๋ฐœ์ „์ƒ์„ ์กฐ๋ช…ํ•˜๋ฉฐ, ํŠœ๋ง ํ…Œ์ŠคํŠธ ํ†ต๊ณผ์™€ ๊ฐ™์€ ์„ฑ์ทจ์˜ ์ด๋ก ์  ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
Large language models can self-improve ๋…ผ๋ฌธ์€ LLM ์ž๊ธฐ๊ฐœ์„  ๊ฐœ๋…์˜ ์ด๋ก ์  ๊ทผ๊ฑฐ์™€ ์‹คํ—˜์  ์‚ฌ๋ก€๋ฅผ ์ œ๊ณตํ•ด 538์˜ ์ž๊ธฐ๊ฐœ์„  ๋Šฅ๋ ฅ ๊ณ„๋Ÿ‰ํ™” ๋ถ„์„์— ํ† ๋Œ€๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
353์˜ LLM์„ ํ™œ์šฉํ•œ ์ž๋™ํ™” ๋ฐ ์ž๊ธฐ๊ฐœ์„  AI ์—์ด์ „ํŠธ ์„ค๋ฌธ์€ 470 ์—ฐ๊ตฌ์˜ ๊ธฐ์ˆ โ€ง์ด๋ก ์  ํ† ๋Œ€๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
Self-Refine ๋…ผ๋ฌธ์€ LLM์˜ iterative self-feedback์„ ํ†ตํ•œ ๋‹ต๋ณ€ ๊ฐœ์„  ๊ตฌ์กฐ๋ฅผ ์ œ์‹œํ•˜๋ฉฐ, ์ž๊ธฐ ๊ฐœ์„ ๊ณผ ์ž๊ธฐ ์ ๊ฒ€์˜ ๋™ํ–ฅ์„ ๊ฐ™์ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
Large language models can self-improve ๋…ผ๋ฌธ์€ LLM์˜ ์ž๊ธฐ ๊ฐœ์„  ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋…ผ๋ฌธ์œผ๋กœ, Selfcheck์˜ ๋‹จ๊ณ„๋ณ„ ์ž๊ธฐ๊ฒ€์ฆ๊ณผ ๋‹ฌ๋ฆฌ ์žฅ๊ธฐ์  ์ž๊ธฐํ•™์Šต ์ธก๋ฉด์„ ๋…ผ์˜ํ•œ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
470 'Large language models can self-improve' ๋…ผ๋ฌธ์€ SFT/RL ์™ธ์—๋„ ์ž์ฒด ์ƒ์„ฑ๋œ ํ”ผ๋“œ๋ฐฑ๊ณผ ์ž๊ธฐ๊ฐœ์„  ๋ฃจํ”„๋ฅผ ํ†ตํ•œ LLM ์ผ๋ฐ˜ํ™” ํ–ฅ์ƒ ์ „๋žต์„ ๋‹ค๋ฃจ์–ด ๋Œ€์กฐ์ ์œผ๋กœ ์ฐธ๊ณ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
747์€ zero-shot self-checking ๋ฐ ์ž๊ธฐ ์ผ๊ด€์„ฑ ๊ธฐ๋ฐ˜ reasoning ๊ฐ•ํ™” ์ „๋žต์„ ํ†ตํ•ด LLM์˜ self-improve๋ฅผ ์‹ค์ œ๋กœ ๊ฒ€์ฆํ•œ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
470 ๋…ผ๋ฌธ์€ LLM์˜ ์ž๊ธฐ ๊ฐœ์„ ๋Šฅ๋ ฅ(์ž๊ธฐ ์ˆ˜์ •, self-improvement)์˜ ์ฒด๊ณ„์  ์‹ค์ฆ์„ ์ œ๊ณตํ•˜์—ฌ, 314์— ์ œ์•ˆ๋œ PIT(self-improvement ํ”„๋ ˆ์ž„)์˜ ํšจ๊ณผ๋ฅผ ์‹คํ—˜์ ์œผ๋กœ ํ™•์žฅํ•œ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
447์€ ์ž๊ธฐ ์ธ์„ผํ‹ฐ๋ธŒ๋ฅผ ํ†ตํ•œ LLM ์ž๊ธฐ๊ฐœ์„  ๋ฐฉ์‹์œผ๋กœ, 470์˜ ์ž๊ธฐ ์ผ๊ด€์„ฑ๊ณผ ์‹ ๋ขฐ๋„ ์ถ”๋ก  ๊ธฐ๋ฐ˜ ์ž๊ธฐ๊ฐœ์„  ์‹คํ—˜์˜ ํ›„์† ๋ฐœ์ „์ž…๋‹ˆ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
538 ๋…ผ๋ฌธ์€ LLM ์ž๊ธฐ ๊ฐœ์„  ๋Šฅ๋ ฅ์˜ ํ•œ๊ณ„์™€ ์ž๊ธฐ ๋ฐ˜์˜ ๊ธฐ๋ฒ•๋“ค์˜ ํšจ๊ณผ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ธก์ •ํ•˜์—ฌ, 470์˜ ์ฃผ์žฅ์„ ๋ฒค์น˜๋งˆํฌ/๋น„ํŒํ•œ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
470์€ LLM์˜ ์ž๊ธฐ๊ฐœ์„  ํ•™์Šต๊ณผ ์ž๊ธฐ๋ฐ˜์„ฑ์˜ ๊ธฐ๋ฒ•์„ ๋‹ค๋ฃจ์–ด, 845์—์„œ ์ œ์•ˆํ•œ RISE ํ”„๋ ˆ์ž„์›Œํฌ์˜ ํ™•์žฅ ๋…ผ์˜๋ฅผ ๋ณด์™„ํ•œ๋‹ค.
๋ฐ˜๋ก /๋น„ํŒ
Large language models can self-improve ๋…ผ๋ฌธ์€ LLM์˜ ์ž๊ธฐ๊ฐœ์„  ๊ฐ€๋Šฅ์„ฑ์„ ์‹คํ—˜์ ์œผ๋กœ ์ฃผ์žฅํ•˜๋ฉฐ, self-correction ํ•œ๊ณ„๋ผ๋Š” ๋ณธ ๋…ผ๋ฌธ๊ณผ ๋…ผ์ ์ด ๋Œ€์กฐ๋ฉ๋‹ˆ๋‹ค.
๋ฐ˜๋ก /๋น„ํŒ
LLM์˜ self-improvement(์ž๊ธฐ๊ฐœ์„ ) ๋Šฅ๋ ฅ์ด ์‹ค์ œ๋กœ ๊ฐ€๋Šฅํ•œ์ง€์— ๋Œ€ํ•ด ์‹คํ—˜์ ์œผ๋กœ ๋…ผ์˜, ์ถ”๋ก  ๊ฒฝ๊ณ„ ํ”„๋ ˆ์ž„์›Œํฌ์— ๋น„ํŒ์  ์ž…์žฅ ์ œ์‹œ.
← ๋ชฉ๋ก์œผ๋กœ ๋Œ์•„๊ฐ€๊ธฐ

๐ŸŽง Audio Overview

์ด ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ฅผ ํŒŸ์บ์ŠคํŠธํ˜• ์˜ค๋””์˜ค๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (Gemini ยท ํ‚ค๋Š” ๋ธŒ๋ผ์šฐ์ €์—๋งŒ ์ €์žฅ ยท ์™„์„ฑ๋ณธ์€ ์ด๋ฉ”์ผ๋กœ๋„ ์ „์†ก)
โ–ธ ๊ณ ๊ธ‰: ๊ตฌ์„ฑ ๋ฐฉํ–ฅ(๋Œ€๋ณธ ์ž‘์„ฑ ์ง€์นจ) ์ง์ ‘ ์ˆ˜์ •