Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Authors: Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani | Date: 2025 | DOI: 10.48550/arXiv.2503.09516


Essence

The paper proposes Search-R1, a framework in which a large language model (LLM) learns through reinforcement learning (RL) to call and use a search engine on its own during reasoning, achieving up to a 41% performance improvement over conventional RAG.

Motivation

Achievement

Figure 1: Training with a search engine under PPO and GRPO. During rollout, the LLM interacts with the search engine over multiple turns.
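The multi-turn rollout described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `<search>`, `<information>`, and `<answer>` tag names, the `rollout` function, the `max_turns` limit, and the stub `fake_generate` / `fake_search` callables are all assumptions introduced here for illustration.

```python
import re
from typing import Callable, List

# Hypothetical tag conventions, assumed for this sketch.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(question: str,
            generate: Callable[[str], str],
            search: Callable[[str], List[str]],
            max_turns: int = 4) -> dict:
    """Interleave LLM generation with search calls until an answer appears."""
    context = question
    # (text, is_retrieved) pairs, kept so retrieved text can be told apart later.
    segments = []
    for _ in range(max_turns):
        chunk = generate(context)
        segments.append((chunk, False))
        context += chunk
        answer = ANSWER_RE.search(chunk)
        if answer:
            return {"answer": answer.group(1).strip(), "segments": segments}
        query = SEARCH_RE.search(chunk)
        if query is None:
            break  # neither a search call nor an answer: stop the episode
        passages = search(query.group(1).strip())
        info = "<information>" + " ".join(passages) + "</information>"
        segments.append((info, True))
        context += info
    return {"answer": None, "segments": segments}


# Stub model and search engine, used only to exercise the loop.
def fake_generate(context: str) -> str:
    if "<information>" in context:
        return "<answer>Paris</answer>"
    return "<search>capital of France</search>"


def fake_search(query: str) -> List[str]:
    return ["Paris is the capital of France."]


result = rollout("What is the capital of France?", fake_generate, fake_search)
print(result["answer"])
```

Recording which segments came from the search engine is a design choice made here so that a later training step could exclude them from the loss.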

  1. Performance: an average relative improvement of 41% over conventional RAG with Qwen2.5-7B and 20% with Qwen2.5-3B (evaluated on 7 QA datasets)
  2. Stable training: masking the loss on retrieved tokens stabilizes RL optimization
  3. Interpretability: empirical insights into the choice of RL method, differences between LLMs, and response-length dynamics
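The retrieved-token loss masking in item 2 can be illustrated with a small sketch. It assumes a per-token loss list and a boolean flag per token marking whether the token came from the search engine; the function name `masked_mean_loss` and the numbers are hypothetical and only show the effect of the mask.

```python
from typing import List


def masked_mean_loss(token_losses: List[float], is_retrieved: List[bool]) -> float:
    """Average the per-token loss over LLM-generated tokens only."""
    if len(token_losses) != len(is_retrieved):
        raise ValueError("token_losses and is_retrieved must have the same length")
    kept = [loss for loss, retrieved in zip(token_losses, is_retrieved) if not retrieved]
    if not kept:
        return 0.0  # nothing to optimize if every token was retrieved
    return sum(kept) / len(kept)


# Illustrative values: the two retrieved tokens carry large losses.
losses = [0.5, 1.5, 9.0, 9.0, 1.0]
retrieved_flags = [False, False, True, True, False]

masked = masked_mean_loss(losses, retrieved_flags)
unmasked = sum(losses) / len(losses)
print(masked, unmasked)
```

In this toy case the retrieved tokens would dominate the unmasked average even though the policy did not generate them, which is the kind of distortion the mask is meant to avoid.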

How

Figure 2: Convergence comparison of PPO and GRPO

RL objective function (with search engine integration):

Key techniques:

```

J_PPO(θ) = min(π_θ/π_old · A, clip(π_θ/π_old, 1-ε, 1+ε) · A)

```
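To make the clipped objective concrete, here is a minimal per-token sketch in Python. The function name `clipped_surrogate` and the default `eps = 0.2` are assumptions chosen for illustration (the review does not state the clipping range used); `ratio` stands for π_θ/π_old and `advantage` for A.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)

# Negative advantage, ratio below 1 - eps: the clipped (more pessimistic) term is used.
clipped_penalty = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Negative advantage, ratio above 1 + eps: the unclipped term is already the minimum.
unclipped_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)

print(capped_gain, clipped_penalty, unclipped_penalty)
```

Taking the minimum of the two terms means the objective never rewards moving the policy ratio further outside the interval [1-ε, 1+ε].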

Figure 3: Study of loss masking on retrieved tokens

Originality

Limitation & Further Study

Future work:

Evaluation

Overall: Search-R1 is a practical framework that systematically integrates search engine calls into RL optimization. Its strengths are strong experimental results and detailed implementation notes, but the theoretical depth and computational efficiency call for further analysis.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
๋Œ€๊ทœ๋ชจ ๊ณผํ•™ LLM ๋ฐ ๋‹ค์–‘ํ•œ ํ™œ์šฉ ์‚ฌ๋ก€๋ฅผ ํฌ๊ด„์ ์œผ๋กœ ์กฐ์‚ฌํ•œ 004 ๋…ผ๋ฌธ์€ 740์˜ RL-RAG ์ ‘๊ทผ๋ฒ•์ด ๊ธฐ์กด RAG ๋ฐ LLM ๊ธฐ๋ฐ˜ QA ๋ฌธํ—Œ์—์„œ ์œ„์น˜ํ•˜๋Š” ์ง€์ ์„ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
๋Œ€๊ทœ๋ชจ ๋„๊ตฌ ๋งˆ์Šคํ„ฐ๋ฆฌ(16,000+ ํˆด)์— LLM์ด ๋„๋‹ฌํ•˜๋Š” ํ”„๋ ˆ์ž„์›Œํฌ๋กœ, Search-R1์˜ ๊ฒ€์ƒ‰ ๋ฐ ๋„๊ตฌ ์ƒํ˜ธ์ž‘์šฉ ์„ฑ๋Šฅ๊ณผ ๋น„๊ต ๊ฐ€๋Šฅ.
๋‹ค๋ฅธ ์ ‘๊ทผ
์—์ด์ „ํŠธ๊ฐ€ ๊ฒ€์ƒ‰ ์—”์ง„๊ณผ ์—ฐ๊ณ„ํ•ด ๊ณผํ•™์  ์•„์ด๋””์–ด๋ฅผ ๋”์šฑ ํšจ์œจ์ ์œผ๋กœ ๋ฐœ๊ฒฌํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ์ œ๊ณตํ•˜์—ฌ, ์ง‘๋‹จ์  ์—์ด์ „ํŠธ ํ˜‘์—…์˜ ํ™•์žฅ ์‚ฌ๋ก€๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
LLM์˜ ๊ฒ€์ƒ‰ ๊ธฐ๋ฐ˜ ์ถ”๋ก  ๋Šฅ๋ ฅ ๊ฐ•ํ™”์— RL ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•œ ReSearch ๋…ผ๋ฌธ๊ณผ ๋™์ผ ๋ฌธ์ œ์˜ ์ƒํ˜ธ๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•จ.
๋‹ค๋ฅธ ์ ‘๊ทผ
WebThinker ๋…ผ๋ฌธ์€ ๊ฒ€์ƒ‰ ๊ณผ์ •์„ ์‹ฌํ™”ํ•˜๋Š” RAG ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋กœ, Search-R1๊ณผ ์œ ์‚ฌ ๊ณผ์ œ์— ๋Œ€ํ•ด ๋‹ค๋ฅธ ๊ฐ•ํ™”ํ•™์Šต ์ „๋žต์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
871 'WebAgent-R1' ๋…ผ๋ฌธ์€ ์›น ์—์ด์ „ํŠธ RL ํ•™์Šต์˜ ์ตœ์‹  ๊ธฐ๋ฒ•๊ณผ ํ‰๊ฐ€ ๊ธฐ์ค€์„ ์ œ๊ณตํ•˜์—ฌ 740 ๋…ผ๋ฌธ์ด ๋‹ค๋ฃจ๋Š” Search-RL ๋ฐฉ์‹์˜ ์‹ค์šฉ ์ ์šฉ์„ฑ๊ณผ ํ•œ๊ณ„๋ฅผ ๋ณด๋‹ค ์ž…์ฒด์ ์œผ๋กœ ์ดํ•ดํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
740๋ฒˆ ๋…ผ๋ฌธ์€ ์›น ๊ธฐ๋ฐ˜ ์ •๋ณดํƒ์ƒ‰๊ณผ ์ถ”๋ก ์„ ์œ„ํ•œ LLM ํ•™์Šต ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•ด 447๋ฒˆ์˜ ํƒ์ƒ‰์  Reasoning ์—์ด์ „ํŠธ ๊ฐœ๋…์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค.