DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Authors: Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, 姜成康, Zhaohui Wang, Yubin Guo, Yuqing Wen, 茅嘉阳, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu | Date: 2026-04-16 | DOI: 10.48550/arxiv.2604.14683


Essence

Figure 1: Comparison of deep research benchmarks. Given raw

DR³-Eval is a realistic and reproducible benchmark for evaluating Deep Research Agents. It combines user-provided multimodal files with a static sandbox corpus to evaluate report generation tasks.

Motivation

Achievement

Figure 2: Overview of the DR3-Eval framework. (1) Data construction synthesizes search paths from

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: DR³-Eval is an innovative benchmark that achieves both realism and reproducibility in Deep Research Agent evaluation. It addresses the limitations of existing benchmarks through reverse construction, multimodal file support, and a multidimensional evaluation framework. Extensive experiments expose retrieval robustness and hallucination control as the core weaknesses of current LLM-based DRAs, which points to directions for future improvement.

Related papers worth reading

Foundational work
Provides the theoretical foundation for AI agent evaluation methodology.
Foundational work
Provides the methodological basis for evaluating report generation tasks performed by AI agents.
Alternative approach
Presents a different benchmark or evaluation framework for evaluating Deep Research Agents.
Alternative approach
Presents an alternative approach to evaluating how well LLM-based agents carry out research tasks.
Alternative approach
Presents a similar approach to recursive planning and dynamic structure integration in document writing.
Alternative approach
Related work on automating software development with LLM-based multi-agent systems.
Alternative approach
A study that approaches text transformation for the science-policy interface with a different methodology.
Alternative approach
Shares the similar goal of developing evaluation benchmarks for research agents or AI-based information retrieval systems.
Alternative approach
Related work on benchmark design and reproducibility for evaluating the performance of AI systems.
Alternative approach
Related work on evaluating AI-agent-based information processing and report generation tasks.
Alternative approach
Presents an alternative benchmark for evaluating LLM-based research agents.
Alternative approach
A similar study on evaluation methodology for Deep Research or other complex AI tasks.
Alternative approach
A benchmark for evaluating the reasoning ability of LLMs that shares a similar methodology and goals.
Follow-up work
A study that extends the concept of execution-grounded evaluation by applying it to a specific domain.