Blade: Benchmarking language model agents for data-driven science

Authors: Ken Gu, Ruoxi Shang, Ren Jiang, Keying Kuang, Ren Lin | Date: 2024 | DOI: arXiv:2408.09667


Essence

Figure 1: Overview of BLADE. We gathered research questions and datasets from existing research papers,

BLADE is a benchmark for evaluating language-model-based agents that perform data-driven science. Using ground-truth analyses collected from expert data scientists for 12 datasets and research questions, it automatically evaluates the multifaceted analysis approaches that agents take.

Motivation

Achievement

Figure 4: Average precision (top row) and coverage@10 (bottom row) percentages averaged across datasets in

• BLADE benchmark construction: a first-of-its-kind benchmark made up of 12 datasets, 188 multiple-choice questions, and 536 ground-truth analysis decisions.
• Automatic evaluation framework: value/graph-based matching and LM-based matching methods for matching analyses expressed in different forms.
• Comprehensive evaluation results: a systematic analysis of the strengths and weaknesses of various LMs and a ReAct agent, showing that LMs are adequate for basic analyses but are severely limited in conceptual variable formulation (coverage of 13% or below) and variable operationalization (coverage of 27% or below).
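The precision and coverage figures quoted in this section can be read as set-overlap metrics between the decisions an agent makes and the ground-truth decisions. The following is a minimal sketch under stated assumptions, not the benchmark's own implementation: decisions are assumed to be already matched and reduced to string identifiers, coverage@k is assumed to mean the share of ground-truth decisions hit by at least one of the first k runs, and the function names `precision` and `coverage_at_k` are hypothetical.

```python
from typing import Sequence, Set


def precision(agent_decisions: Set[str], ground_truth: Set[str]) -> float:
    """Share of the agent's decisions that appear in the ground truth."""
    if not agent_decisions:
        return 0.0
    return len(agent_decisions & ground_truth) / len(agent_decisions)


def coverage_at_k(runs: Sequence[Set[str]], ground_truth: Set[str], k: int) -> float:
    """Share of ground-truth decisions matched by at least one of the first k runs.

    Assumption: the first k runs are used as-is; a real evaluation might
    instead average over sampled subsets of runs.
    """
    if not ground_truth:
        return 0.0
    covered = set().union(*runs[:k]) & ground_truth
    return len(covered) / len(ground_truth)


# Illustrative data only (not taken from the paper).
ground_truth = {"a", "b", "c", "d"}
runs = [{"a", "x"}, {"a", "b"}, {"y"}]

first_run_precision = precision(runs[0], ground_truth)
coverage_two_runs = coverage_at_k(runs, ground_truth, k=2)
```

Under this reading, precision rewards an agent for avoiding unjustified decisions within a single run, while coverage@k rewards it for reaching many different justified decisions across repeated runs.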

How

• Collects comprehensive ground truth that reflects multiple valid analysis approaches, using crowd-sourced expert annotations
• Keeps the evaluation criteria sound by validating alternative decisions for each research question and by including unjustifiable decisions
• Designs a structured representation so that conceptual variables, data transformations, and statistical models can each be expressed and evaluated separately
• Uses a multi-layer matching strategy that combines value-based matching (variable names, numeric values), graph-based matching (structure of data transformations), and LM-based matching (semantic equivalence)
• Measures actual agent performance with a ReAct agent and provides a baseline
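The multi-layer matching strategy listed above can be pictured as a cascade that tries the cheapest check first. The sketch below is illustrative only and rests on assumptions: the `Transform` record, the three helper functions, and the `semantic_judge` callable (a stand-in for an LM call) are all hypothetical names, and the order of the layers is a design choice made for this example rather than something the review states.

```python
import math
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Tuple

Edge = Tuple[str, str]  # (source column, derived column) in a transformation graph


@dataclass(frozen=True)
class Transform:
    """Hypothetical record of one data-transformation decision."""
    description: str
    values: Tuple[float, ...]              # the derived column it produces
    edges: FrozenSet[Edge] = frozenset()   # dataflow edges leading to that column


def value_match(a: Transform, b: Transform, tol: float = 1e-9) -> bool:
    """Value layer: the two transforms produce numerically equal columns."""
    return len(a.values) == len(b.values) and all(
        math.isclose(x, y, rel_tol=tol, abs_tol=tol)
        for x, y in zip(a.values, b.values)
    )


def graph_match(a: Transform, b: Transform) -> bool:
    """Graph layer: the two transforms share the same non-empty dataflow structure."""
    return bool(a.edges) and a.edges == b.edges


def match(
    a: Transform,
    b: Transform,
    semantic_judge: Optional[Callable[[str, str], bool]] = None,
) -> Optional[str]:
    """Return the name of the first layer that matches, or None."""
    if value_match(a, b):
        return "value"
    if graph_match(a, b):
        return "graph"
    # Assumption: semantic_judge wraps an LM call; it is optional here.
    if semantic_judge is not None and semantic_judge(a.description, b.description):
        return "lm"
    return None


# Illustrative data only: same derived values under different column names.
gt = Transform("log of income", (0.0, 1.0, 2.0), frozenset({("income", "log_income")}))
agent = Transform("log-transform income", (0.0, 1.0, 2.0), frozenset({("income", "inc_log")}))
matched_layer = match(agent, gt)
```

Putting the value check first means two transformations that yield the same column are treated as equivalent even when their names or code differ, and the LM-based judge is only consulted when the structural checks fail.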

Originality

• New evaluation perspective: unlike existing benchmarks built around simple tasks, it takes the complex decision-making of open-ended scientific analysis as the object of evaluation
• Comprehensive ground-truth design: systematically collects alternative decisions and even negative examples from crowd-sourced analyses, so that a range of justifiable approaches is accepted
• Multi-layer automatic evaluation mechanism: flexibly compares heterogeneous analysis decisions, from the code level (value/graph matching) to the conceptual level (conceptual variable matching)
• Use of real scientific data: uses actual data and questions from published research papers rather than textbook or synthetic data

Limitation & Further Study

• Scale constraints on ground-truth collection: only 12 datasets are included, so the benchmark's generalizability and coverage may be limited
• Subjectivity of expert annotation: deciding which analysis decisions are "justifiable" still depends on expert judgment
• Limits of the evaluation metrics: the evaluation centers on average precision and coverage@10, which may not give deeper insight into the quality of an agent's analysis or its scientific validity
• Limited variety in the LM agents evaluated: no evaluation of agent architectures other than the ReAct agent, or of newer LMs (GPT-4o, Claude 3, and so on)
• Limits on testing data-semantics understanding: the benchmark may not sufficiently test agents' understanding of domain-specific data

Directions for follow-up work:
• Extend the benchmark to more domains and datasets
• Analyze the interpretability and scientific validity of the agent's analysis process in more depth
• Add evaluations of a wider range of agent architectures and more recent LMs
• Verify the real scientific value of agents through human-in-the-loop evaluation

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: BLADE is the first comprehensive and principled benchmark for evaluating LM agents in data-driven science, and its significance lies in handling the complexity of open-ended analysis tasks with a multi-layer automatic evaluation method. Its solid ground truth, built from real paper data and crowd-sourced expert annotations, together with its fine-grained decision-level evaluation, makes an important contribution to understanding agents' actual analytical capabilities. That said, the limited scale of 12 datasets and the absence of agent architectures other than ReAct are areas that need improvement in future work.

Related papers worth reading alongside

Foundational work
The BLADE paper is a benchmark of scientific decision-making by LLM agents and is conceptually connected to the peer-review agent setup simulated in AgentReview.
Alternative approach
A benchmark of language model agents for data-driven scientific inquiry that proposes evaluation criteria and a problem setup different from those of MLAgentBench.
Alternative approach
Allows comparison of real research problem-solving ability, such as multi-agent orchestration of data science tasks.
Alternative approach
InfiAgent-DABench also covers a data analysis benchmark for agents, so its evaluation design and limitations can be compared with those of BLADE.
Alternative approach
BLADE is likewise a benchmark that evaluates data-driven discovery ability with LLM agents, so comparing its evaluation method and limitations with those of DiscoveryBench is useful.
Alternative approach
A similar benchmark study that evaluates the analytical ability of LLM agents in data-driven scientific discovery.
Alternative approach
A related study that explores multiple analysis paths taken by LLM agents for automating scientific data analysis.
Alternative approach
A related benchmark that evaluates the scientific discovery ability of AI agents through web browsing and information gathering.
Alternative approach
A similar paper that presents a benchmark for evaluating the data analysis and reasoning abilities of LLMs in scientific research.
Alternative approach
A related study on evaluating the performance of LLM-based data science agents using real datasets.
Application
A reference case that extends benchmark construction to real scientific discovery problems through scientific knowledge evaluation and multi-level testing.
Application
Shows how the reliability metrics proposed in the paper Towards a Science of AI Agent Reliability can be applied to evaluating real data-driven scientific analysis agents (such as BLADE).