EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants

Authors: Franck Cappello, Sandeep Madireddy, Robert Underwood, Neil Getty, Nicholas Lee-Ping Chia, Nesar Ramachandra, Josh Nguyen, Murat Keceli, Tanwi Mallick, Zilinghan Li, Marieme Ngom, Chenhui Zhang, Angel Yanguas-Gil, Evan Antoniuk, Bhavya Kailkhura, Minyang Tian, Yufeng Du, Yuan-Sen Ting, Azton Wells, Bogdan Nicolae, Avinash Maurya, M. Mustafa Rafique, Eliu Huerta, Bo Li, Ian Foster, Rick Stevens | Date: 2025 | DOI: 10.48550/ARXIV.2502.20309


Essence

Figure 1

Fig. 1. The AGIL approach to generate scalable MCQ benchmarks. The current version of the AI4S benchmark contains only m

This paper presents EAIRA, a comprehensive evaluation methodology developed at Argonne National Laboratory. The methodology combines four evaluation techniques (Multiple Choice Questions, Open Response, Lab-Style Experiments, and Field-Style Experiments) to systematically assess the capabilities of LLMs as scientific research assistants.

Motivation

Achievement

Performance analysis of multiple LLMs: Compares the capabilities of major models such as GPT-4o, Gemini, and Claude across a range of scientific domains.
Establishment of the EAIRA methodology: Develops a comprehensive methodology that integrates four evaluation techniques, enabling a combined assessment of an LLM's scientific knowledge, reasoning ability, and reliability.
Innovative evaluation techniques: Introduces Lab-style and Field-style experiments, new evaluation techniques applied at scale for the first time, to assess LLM performance in real research settings.
Development of a multi-domain benchmark (AI4S): Builds an integrated benchmark specialized for science that combines the knowledge of domain experts with the capabilities of LLM judges.
Adaptable framework design: Designs the methodology so that it can evolve continuously in response to rapidly changing LLM technology.

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper presents a comprehensive and innovative methodology for evaluating LLMs as scientific research assistants. Its attempt to overcome the limitations of existing benchmarks by introducing Lab-style and Field-style experiments at scale is especially valuable. However, the methodology has so far been developed for only a subset of scientific domains, and the representativeness of evaluations that rely on voluntary participation remains an open issue. Overall, the paper makes an important contribution to LLM evaluation and is expected to serve as a foundation for assessing the trustworthiness of scientific AI.

Related Papers

Foundational work
AAAR-1.0 addresses the potential of AI to assist research and serves as a theoretical starting point for the development of EAIRA's evaluation framework.
Foundational work
Provides a systematic survey of hypothesis generation and evaluation methodologies in scientific discovery, forming a theoretical basis for the EAIRA evaluation framework.
Alternative approach
Paper 3379 proposes a methodology for AI-based evaluation of scientific literature and of connectivity, and can be compared with the evaluation framework of the system in paper 593.
Alternative approach
HypoBench is a framework for rigorously benchmarking the scientific hypothesis generation and verification capabilities of LLMs; like EAIRA, it approaches the scientific evaluation of AI from multiple perspectives.
Alternative approach
ScienceAgentBench provides evaluation criteria for LLM-based science agents, so comparing it with EAIRA lets readers explore a range of evaluation frameworks for scientific automation.
Follow-up work
Through cases of benchmarking end-to-end automation across AI research, presents a framework comparable to EAIRA's experimental approach.
Application case
The Virtual Lab paper presents a case in which an AI Scientist was used in a real scientific inquiry process (for example, nanobody design), which connects to EAIRA as an example of applying its evaluation methods in practice.