Large Language Models Cannot Self-Correct Reasoning Yet

Authors: Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu | Date: 2023 | DOI: 10.48550/arXiv.2310.01798


Essence

The paper shows empirically that large language models (LLMs) cannot automatically correct their own reasoning errors without external feedback, and that performance in fact degrades after self-correction.

Motivation

Achievement

Figure 1

Analysis of how answers change after two rounds of self-correction: the share of answers in each category (No Change, Correct → Incorrect, Incorrect → Correct).
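The breakdown described in the caption can be reproduced with a small tally. The following is a minimal sketch, not the authors' evaluation code: the function names, the record layout `(initial, final, gold)`, and the toy data are all hypothetical. It also assumes a fourth bucket, `incorrect_to_incorrect`, for answers that change but stay wrong, which the caption above does not list.

```python
from collections import Counter


def classify_change(initial, final, gold):
    """Label one example by how self-correction changed its answer."""
    if initial == final:
        return "no_change"
    if initial == gold:
        return "correct_to_incorrect"
    if final == gold:
        return "incorrect_to_correct"
    # Assumed extra bucket: the answer changed but is still wrong.
    return "incorrect_to_incorrect"


def change_proportions(records):
    """Return the share of examples in each change category."""
    counts = Counter(classify_change(*record) for record in records)
    total = len(records)
    return {category: n / total for category, n in counts.items()}


# Hypothetical toy data: (initial answer, answer after self-correction, gold label).
toy_records = [
    ("18", "18", "18"),  # unchanged
    ("18", "20", "18"),  # a correct answer was overturned
    ("7", "9", "9"),     # a wrong answer was fixed
    ("5", "6", "4"),     # changed, still wrong
]
proportions = change_proportions(toy_records)
```

Self-correction only helps when the Incorrect → Correct share exceeds the Correct → Incorrect share, which is the comparison this tally makes explicit.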

  1. The oracle-label problem: on GSM8K, CommonSenseQA, and HotpotQA, self-correction guided by oracle labels yields substantial gains (7-15%), but intrinsic self-correction without external feedback degrades performance across all models and datasets (GPT-3.5: GSM8K 75.9% → 75.1%, CommonSenseQA 75.8% → 38.1%; GPT-4: GSM8K 95.5% → 91.5%, CommonSenseQA 82.0% → 79.5%).
  2. Limits of multi-agent debate: Multi-Agent Debate, in which several LLM instances critique one another's answers, does not outperform self-consistency when both are given the same number of responses.
  3. The prompt-design problem: some of the improvements reported in prior work stem from using a sub-optimal prompt to generate the initial response. Folding the feedback into the initial instruction gives better results than applying self-correction.
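Item 2 compares debate against self-consistency at an equal response budget. As a rough illustration of that baseline, the sketch below implements majority voting over sampled answers and a helper that counts how many responses a debate consumes. The names `majority_vote` and `response_budget`, the 3-agent, 2-round configuration, and the sample answers are illustrative assumptions, not values taken from this review.

```python
from collections import Counter


def majority_vote(answers):
    """Self-consistency: return the most frequent sampled answer.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps first-encountered order among equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def response_budget(num_agents, num_rounds):
    """Number of model responses a debate with this shape consumes."""
    return num_agents * num_rounds


# Assumed debate shape: 3 agents x 2 rounds = 6 responses.
budget = response_budget(num_agents=3, num_rounds=2)

# A fair comparison gives self-consistency the same number of samples.
# These final answers are made up for illustration.
sampled_answers = ["18", "20", "18", "18", "21", "20"]
assert len(sampled_answers) == budget
consensus = majority_vote(sampled_answers)
```

Fixing the budget first matters because a debate that spends more model calls than the voting baseline could look better simply by sampling more.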

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 5/5 Clarity: 5/5 Overall: 4.5/5

Overall assessment: By systematically critiquing earlier optimistic claims about LLM self-correction and demonstrating its practical limits, the paper raises the evaluation bar for the field and makes an important contribution toward resetting the direction of future research. Its evaluation under realistic conditions, without external feedback, is of particular practical value.

Papers worth reading alongside

Alternative approach
The CRITIC paper proposes that LLMs can self-correct only when they make use of tools, which allows an empirical comparison of the differences between settings with external feedback and with built-in tools.
Alternative approach
Paper 683 analyzes the conditions and feasibility of LLM self-correction broadly through reasoning-based reward modeling, making it a good counterweight to the critical conclusions of paper 471.
Alternative approach
Rather than addressing the self-correction failures of LLM agents, this paper presents an approach to designing automated research systems.
Follow-up work
The Selfcheck paper experimentally presents the performance limits of step-by-step self-checking and self-verification in LLMs, along with ideas for improvement, adding depth to the discussion of whether LLMs can self-correct.
Follow-up work
Analyzes the evolution and staged development of LLM autonomy, and also takes a structural view of self-correction failures.
Counterargument/Critique
A study that empirically shows the limits and risks of self-improvement (self-correction) and continual-pretraining approaches; it can be contrasted with the actual gains from domain adaptation.
Counterargument/Critique
The "Large language models can self-improve" paper argues experimentally that LLMs can improve themselves, a position that contrasts with this paper's focus on the limits of self-correction.
Counterargument/Critique
The Self-Refine paper shows that iterative refinement driven by an LLM's own feedback can be effective, casting a critical light on the claimed limits of self-correction.
Counterargument/Critique
The "Large Language Models Cannot Self-Correct Reasoning Yet" paper critically analyzes the practical limits of LLM self-correction and can serve as a reference when discussing the weaknesses of self-correction frameworks and the need to shore them up.
Counterargument/Critique
Paper 471 points out the limits of LLMs in correcting their own errors and can be contrasted with the results of paper 747.
Counterargument/Critique
Offers a critical view of the limits of LLM self-correction and self-verification and of methods such as XoT, which highlights the need for the Wrong-of-Thought framework.
Counterargument/Critique
Paper 471 points out that LLMs cannot yet correct their reasoning errors on their own, giving grounds to critically revisit the multiple reliability evaluation results of paper 736.
Counterargument/Critique
Systematically criticizes the serious limits of LLM self-verification and reasoning at the logical-reasoning stage, and points to both the limits of and the need for RBF++.
Counterargument/Critique
A paper that critically analyzes the limits of LLM self-correction; it helps in understanding both the theoretical background and the problems of the PAG framework.