SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Authors: Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He (King's College London) | Date: 2025 | DOI: 10.48550/arXiv.2504.00255


Essence

Figure 1

๋ณธ ๋…ผ๋ฌธ์€ ์—ฐ๊ตฌ ๋…ผ๋ฌธ์—์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๋ช…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๋Š” LLM์˜ ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด SciReplicate-Bench๋ผ๋Š” ๋ฒค์น˜๋งˆํฌ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ดํ•ด์™€ ์ฝ”๋”ฉ ์ „๋ฌธ์„ฑ์ด๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ํ•ต์‹ฌ ์—ญ๋Ÿ‰์ด ํ•„์š”ํ•œ ๋ณตํ•ฉ์ ์ธ ๊ณผ์ œ์ด๋‹ค.

Motivation

Achievement

Figure 2: A grouped bar chart illustrating the frequency of tool usage by different models.

How

Figure 1

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ LLM์˜ ๊ณผํ•™์  ์žฌํ˜„์„ฑ ํ‰๊ฐ€๋ผ๋Š” ์ค‘์š”ํ•˜๊ณ  ๋ฏธ๊ฐœ์ฒ™๋œ ์˜์—ญ์— ์ฒซ ๋ฒˆ์งธ ์ „์šฉ ๋ฒค์น˜๋งˆํฌ๋ฅผ ์ œ๊ณตํ•œ๋‹ค. SciReplicate-Bench์™€ reasoning graph accuracy ๋ฉ”ํŠธ๋ฆญ์€ ํ•™์ˆ ์ ์œผ๋กœ ๊ฐ€์น˜ ์žˆ์œผ๋ฉฐ, ์‹คํ–‰ ๊ธฐ๋ฐ˜์˜ ๊ฐ๊ด€์  ํ‰๊ฐ€๋กœ ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ–ˆ๋‹ค. ๋‹ค๋งŒ ๋ฒค์น˜๋งˆํฌ ๊ทœ๋ชจ ํ™•๋Œ€์™€ overthinking ํ˜„์ƒ์˜ ์‹ฌ์ธต ๋ถ„์„์ด ํ–ฅํ›„ ํ•„์š”ํ•˜๋‹ค.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
AI ์—์ด์ „ํŠธ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์žฌํ˜„์„ฑ๊ณผ ์‹คํ—˜์ž๋™ํ™” ํ‰๊ฐ€์— ์ดˆ์ ์„ ๋‘” ๋ฒค์น˜๋งˆํฌ์™€์˜ ๋น„๊ต๋ฅผ ํ†ตํ•ด ํ‰๊ฐ€๋ฐฉ์‹ ์ง„ํ™”๋ฅผ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
Why AI cannot do good science without humans ๋…ผ๋ฌธ์€ AI๊ฐ€ ์—ฐ๊ตฌ ์žฌํ˜„์„ฑ ์ž๋™ํ™”์—์„œ ๊ฒช๋Š” ์ธ๊ฐ„์  ํ•œ๊ณ„๋ฅผ ๋…ผ์˜ํ•˜์—ฌ, SciReplicate์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์žฌํ˜„ ํ‰๊ฐ€์™€ ์—ฐ๊ฒฐ๋œ๋‹ค.
Alternative approach
The paper "Evaluating large language models trained on code" takes an alternative approach to measuring how well LLMs solve new code generation tasks for paper-based algorithms.
Alternative approach
SciCode presents a benchmark dataset for code implementation ability in scientific research, so it can be compared directly with SciReplicate-Bench.
Alternative approach
SWE-bench evaluates LLMs' code generation and comprehension on real-world software implementation and issue resolution tasks, sharing a code-based evaluation axis with SciReplicate-Bench.
Alternative approach
The paper "Dynamic multi-agent orchestration and retrieval" focuses on multi-agent automation of complex AI research tasks and offers an approach complementary to code-based reproducibility evaluation.
Alternative approach
The Autoreproduce paper also covers AI-based automation of experiment reproduction and benchmark construction, so it can be compared with SciReplicate-Bench on policy and technical grounds from a similar perspective.
Alternative approach
Covers agent-based academic experimentation and the evaluation of reinforcement learning ability, allowing comparison with real-world application examples and performance of web-environment agents.
Alternative approach
A similar benchmark study that evaluates language models' code reasoning and algorithmic problem-solving abilities.
Alternative approach
A study of a similar approach that automatically generates code from papers.
Follow-up work
An AI-scientist benchmark for omics-data-driven tasks, whose biomedical text mining and BioBERT use cases bear directly on the evaluation of general-purpose models.
Follow-up work
Exp-bench evaluates whether AI can carry out complete end-to-end research experiments, making it an extension of SciReplicate-Bench's algorithm reproduction evaluation.
Follow-up work
SciReplicate-Bench turns algorithmic reproduction experiments into a benchmark, making it a useful reference for objectively evaluating and extending systems that automate urban causal inference research.
Application
The stepwise, prompt-based evidence verification method in the paper "Towards LLM-based Fact Verification on News Claims" could also inspire approaches to evaluating the reproduction of algorithms from papers.
Application
Like 617 (Phi-4), which covers STEM and experimental evaluation of LLMs, 731 (SciReplicate-Bench) is a concrete case of applying LLMs to the verification of experimental replicability.