Towards reasoning era: A survey of long chain-of-thought for reasoning large language models

Authors: Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen | Date: 2025 | DOI: N/A


Essence

Figure 2

Distinguishing Long CoT from Short CoT: the three key characteristics of Deep Reasoning, Extensive Exploration, and Feasible Reflection

The success of reasoning large language models (RLLMs) such as OpenAI-o1 and DeepSeek-R1 is attributed to their long chain-of-thought (Long CoT) characteristics. This paper provides the first comprehensive analysis of how Long CoT differs from traditional Short CoT, of its key characteristics, and of the phenomena associated with it.

Motivation

Achievement

Figure 1

Evolution of selected Long CoT work over the past three years: the three characteristics (deep reasoning, feasible reflection, extensive exploration) are shown as color-coded branches

Figure 3

Taxonomy of Long CoT: deep reasoning formation (natural language, structured language, latent space), deep reasoning learning (imitation learning, self-learning), feasible reflection (overall feedback, process feedback), extensive exploration (exploration scaling, internal/external exploration)

  1. Systematic distinction: formally defines Long CoT and formalizes how it differs from Short CoT.
    • Short CoT: $\text{CoT}_S = R(\{n_i\}_{i=1}^{k} \mid (k \leq B_s) \land (j=1 \Leftrightarrow \forall i \leq k, n_i \to n_{i+j}) \land (\forall i \neq j \leq k, n_i \neq n_j))$
    • Long CoT extends the boundary to $B_l \gg B_s$ and relaxes the depth constraint
  2. Definition of three key characteristics:
    • Deep Reasoning: the ability to carry out rigorous logical analysis across complex structures
    • Extensive Exploration: generating parallel uncertain nodes and moving from known logic to unknown logic
    • Feasible Reflection: feedback on and refinement of logical connections
  3. Systematic analysis of hot phenomena: explains the mechanisms behind the emergence of overthinking, inference-time scaling, the "Aha Moment", and similar effects
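
The Short CoT definition in item 1 can be read as three checks on a reasoning trace: the number of nodes stays within the boundary $B_s$, each node leads only to its immediate successor, and no node is visited twice. The following is a minimal sketch of that reading, not the paper's implementation. The function name `is_short_cot`, the `boundary` parameter, and the representation of a trace as a node list plus index-pair edges are all assumptions made for illustration.

```python
from typing import Hashable, Sequence, Tuple


def is_short_cot(
    nodes: Sequence[Hashable],
    edges: Sequence[Tuple[int, int]],
    boundary: int,
) -> bool:
    """Check the three Short CoT constraints on a reasoning trace.

    nodes    -- reasoning nodes n_1..n_k in the order they were produced
    edges    -- (i, j) index pairs meaning nodes[i] leads to nodes[j]
    boundary -- hypothetical stand-in for the reasoning boundary B_s
    """
    k = len(nodes)

    # Constraint 1: k <= B_s (the chain stays within the short boundary).
    within_boundary = k <= boundary

    # Constraint 2: transitions exist exactly between consecutive nodes,
    # so there is no branching and no jump back to an earlier step.
    expected_edges = {(i, i + 1) for i in range(k - 1)}
    sequential = set(edges) == expected_edges

    # Constraint 3: all nodes are distinct (no node is revisited).
    no_revisit = len(set(nodes)) == k

    return within_boundary and sequential and no_revisit


if __name__ == "__main__":
    linear = (["a", "b", "c"], [(0, 1), (1, 2)])
    branching = (["a", "b", "c"], [(0, 1), (0, 2)])  # parallel exploration
    revisiting = (["a", "b", "a"], [(0, 1), (1, 2)])  # reflection on "a"

    print(is_short_cot(*linear, boundary=5))      # True
    print(is_short_cot(*branching, boundary=5))   # False
    print(is_short_cot(*revisiting, boundary=5))  # False
```

Under this reading, a trace that branches (exploration) or returns to an earlier node (reflection) fails the check, which is consistent with the summary's description of Long CoT as relaxing these constraints.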

How

Figure 5

Three main formats of deep reasoning: natural language (CoT, MathPrompter), structured language (PoT, CoC), latent space (Quiet-STaR, PlanningTokens)

Deep Reasoning Formation:

Deep Reasoning Learning:

Feasible Reflection:

Extensive Exploration:

Originality

Limitation & Further Study

Evaluation

Overall assessment: This paper is an important comprehensive survey and the first to systematically analyze Long CoT, the central technique behind RLLMs. Its clear taxonomy and rich set of examples offer a map for follow-up research. Its theoretical depth and its explanations of some phenomena, however, leave room for further development.

Related papers worth reading

Foundational work
The paper Generative language modeling for automated theorem proving provides the theoretical foundation for the core discussion in paper 833 on chain-of-thought and LLM-based mathematical proof generation.
Foundational work
The Towards reasoning era paper synthesizes the key characteristics and phenomena of long chain-of-thought LLM reasoning, shedding theoretical light on the value and limits of informal-to-formal proof translation.
Foundational work
Paper 155 analyzes in depth what drives the production of high-quality research ideas, which relates to the novelty of the long chain-of-thought reasoning covered in paper 833.
Foundational work
A comprehensive review of how long chain-of-thought reasoning ability is evaluated; it provides a theoretical basis for frameworks that measure reasoning boundaries.
Different approach
Both papers focus on analyzing the multi-step chain-of-thought reasoning process of LLMs, but paper 242 concentrates on self-correction driven by interaction with tools.
Different approach
The long chain-of-thought survey systematically analyzes LLM reasoning ability and offers a perspective complementary to the survey on hypothesis discovery and rule learning.
Different approach
Paper 346 covers the application of a subset of foundation models to scientific problem solving, showing how chain-of-thought reasoning can be extended or contrasted.
Follow-up work
The Draft, sketch, and prove paper proposes an informal-to-formal translation approach for long chain-of-thought style reasoning, and is closely related to paper 833 in connecting Long CoT to real proof systems.
Follow-up work
Paper 785 applies a multimodal chain-of-thought training strategy to large models and evaluates it, showing the practical impact of Long CoT in the move into the reasoning era.
Follow-up work
While paper 833 specifically analyzes Long CoT-based reasoning models, paper 746 implements chain-of-thought improvement through self-iteration, showing the practical effectiveness of the Long CoT paradigm.
Follow-up work
Presents a new method for describing and evaluating long chain-of-thought reasoning phenomena using an in-depth, tree-structured question framework.
Follow-up work
A survey of long chain-of-thought that broadly covers the theoretical background and latest trends in the RL-based reasoning pioneered by DeepSeek-R1.