Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

Authors: Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri | Date: 2025 | DOI: 10.48550/arXiv.2506.06632


Essence

Figure 2

Task decomposition in E2H Reasoner: as training progresses, sampling shifts gradually from Trivial → Easy → Medium → Hard tasks.

This paper presents E2H Reasoner, a method that improves the reasoning ability of large language models (LLMs) by combining reinforcement learning (RL) with curriculum learning. Tasks are decomposed by difficulty, and a probabilistic scheduler moves training progressively from easy tasks to hard ones, which makes it possible to learn reasoning problems that plain RL alone cannot solve.

Motivation

Achievement

Figure 1

E2H outperforms the base model in pass@k evaluation: (a) Countdown, (b) Blocksworld, (c) a reasoning example from LLaMA 3.2 3B.

  1. Empirical results: achieves state-of-the-art (SOTA) performance on five reasoning tasks (Blocksworld, Countdown, MATH, AQuA, GSM8K). In particular, the model learns to solve problems that the base model cannot solve zero-shot, and reaches high pass@k values on them.
  2. Theoretical guarantees: proves convergence of E2H Reasoner within the Approximate Policy Iteration framework, and shows that curriculum learning over appropriately decomposed tasks can converge with fewer samples than direct learning (a finite-sample complexity bound is derived).
  3. Generalization: curriculum learning strengthens the model's ability to generalize not only to hard in-distribution problems but also to out-of-distribution (OOD) tasks.
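The pass@k metric mentioned in item 1 can be illustrated with a short sketch. This review does not say how the paper computes pass@k, so the code below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The function name `pass_at_k` and its arguments are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled answers, c of which are correct.

    Assumption: this is the standard unbiased estimator, not necessarily
    the exact procedure used in the reviewed paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are wrong, every size-k subset contains
    # at least one correct answer.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct answer,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 5 correct, k = 1: half of the single draws succeed.
    print(pass_at_k(10, 5, 1))
```

Averaging this value over all problems in a benchmark gives the benchmark-level pass@k.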

How

Figure 3

Cosine-based scheduling mechanism (task weights are adjusted dynamically through a Gaussian sampler).
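As a rough illustration of a cosine-based schedule over difficulty levels, the sketch below interpolates sampling weights from an easy-heavy distribution to a hard-heavy one along a half-cosine curve. The weight formula, the function names (`cosine_progress`, `task_probabilities`, `sample_level`), and the linear easy-to-hard weighting are all assumptions made for illustration. The review does not give the paper's actual formula, and the Gaussian sampler is not reproduced here.

```python
import math
import random

# Difficulty levels named in the review, ordered from easiest to hardest.
LEVELS = ["trivial", "easy", "medium", "hard"]


def cosine_progress(step, total_steps):
    """Map a training step to a value in [0, 1] along a half-cosine curve."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * t))


def task_probabilities(step, total_steps):
    """Return one sampling probability per level (hypothetical weighting).

    At progress 0 the weights are [4, 3, 2, 1] (easy-heavy); at progress 1
    they are [1, 2, 3, 4] (hard-heavy); in between they are blended.
    """
    p = cosine_progress(step, total_steps)
    k = len(LEVELS)
    weights = [(1.0 - p) * (k - i) + p * (i + 1) for i in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_level(step, total_steps, rng):
    """Draw the difficulty level for the next training task."""
    probs = task_probabilities(step, total_steps)
    return rng.choices(LEVELS, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        probs = task_probabilities(step, 100)
        print(step, [round(x, 3) for x in probs], sample_level(step, 100, rng))
```

Because every level keeps a nonzero probability throughout, easy tasks are never dropped entirely in this sketch. That is a design choice of the illustration, not a claim about the paper.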

Key elements of the methodology:

Originality

Limitation & Further Study

Evaluation

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ LLM ์ถ”๋ก  ํ•™์Šต์„ ์œ„ํ•ด ์ปค๋ฆฌํ˜๋Ÿผ ํ•™์Šต๊ณผ ๊ฐ•ํ™”ํ•™์Šต์„ ๊ฒฐํ•ฉํ•œ ์‹ค์งˆ์ ์œผ๋กœ ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•˜๋ฉฐ, ์ด๋ก ์  ์ˆ˜๋ ด ๋ณด์žฅ๊ณผ ์‹ค์ฆ์  ์šฐ์ˆ˜์„ฑ์„ ๋™์‹œ์— ์ œ๊ณตํ•œ๋‹ค. ๋‹ค๋งŒ ๋‚œ์ด๋„ ๋ถ„ํ•ด์˜ ์ž๋™ํ™”, ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ ๊ฒ€์ฆ, ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๋ฏผ๊ฐ๋„ ๋ถ„์„ ๋“ฑ์˜ ๋ณด์™„์ด ์žˆ์œผ๋ฉด ์˜ํ–ฅ๋ ฅ์ด ๋”์šฑ ์ฆ๋Œ€๋  ๊ฒƒ์œผ๋กœ ํŒ๋‹จ๋œ๋‹ค.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

Foundational work
Provides the theoretical basis for combining reasoning with acting, and for applying the ReAct approach, in curriculum RL.
Foundational work
The SFT Memorizes, RL Generalizes paper provides a theoretical basis for generalization and curriculum design in RL-based training of LLM reasoning.
Alternative approach
746 achieves gradual improvement in reasoning performance through self-feedback. It differs from the difficulty-based curriculum learning of 249, but tackles a similar problem by another route.
Alternative approach
449 proposes a range of approaches for extending LLMs with RL, and can be compared with the curriculum-plus-RL combination of 249.
Alternative approach
Proposes another design for an RL-based iterative thinking (think-loop) framework in automated scientific inquiry.
Alternative approach
Presents an alternative approach to improving LLM reasoning ability through reinforcement learning.