StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Authors: Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu | Date: 2025-03-05 | DOI: 10.48550/arXiv.2403.07714


Essence

Figure 1

Comparison of the performance reported in ToolBench with reproduced performance: re-running the same setup a few months later shows a substantial performance drop.

A stable benchmark is essential for evaluating how well large language models (LLMs) use tools, but results on the existing ToolBench are hard to reproduce because the real-time APIs it depends on are unstable. This paper proposes StableToolBench, which addresses the problem with a virtual API server and a stable evaluation system.

Motivation

Achievement

Figure 3

Changes in ToolBench API status: 44.4% successful, 14.8% not connectable, 25.9% parsing errors, and so on.

  1. Stable benchmark construction: a virtual API server (caching + simulator) and an improved evaluation system provide an evaluation environment that is robust to API changes.
  2. Demonstrated performance stability: in Figure 4, the new evaluation metric stays consistent even as the API failure rate rises (the original approach loses 5-25% performance when 10-50% of APIs fail).
  3. Improved evaluation system: GPT-3.5 as judge often cannot reach a verdict (the "Unsure" entries in Table 1); replacing it with GPT-4 improves stability.

How

Figure 2

ToolBench's Pass Rate evaluation: instability caused by arbitrary decisions in the "Unsure" state.

Virtual API Server
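
The "caching + simulator" idea noted above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `VirtualAPIServer`, the lookup order (cache, then real API, then simulator), and the stand-in functions `flaky_api` and `fake_simulator` are all assumptions made for illustration. In the paper the simulator is LLM-based; here it is a plain function so the example runs on its own.

```python
import hashlib
import json
from typing import Any, Callable, Dict

ApiFn = Callable[[str, Dict[str, Any]], Any]


class VirtualAPIServer:
    """Hypothetical cache-first wrapper around an unreliable API."""

    def __init__(self, real_api: ApiFn, simulator: ApiFn) -> None:
        self.cache: Dict[str, Any] = {}
        self.real_api = real_api
        self.simulator = simulator

    @staticmethod
    def _key(api_name: str, args: Dict[str, Any]) -> str:
        # Stable key: identical name and arguments always hash the same way.
        payload = json.dumps({"api": api_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, api_name: str, args: Dict[str, Any]) -> Any:
        key = self._key(api_name, args)
        if key in self.cache:
            return self.cache[key]  # cache hit: same answer as before
        try:
            response = self.real_api(api_name, args)
        except Exception:
            # Assumed fallback: the real API failed, so ask the simulator.
            response = self.simulator(api_name, args)
        self.cache[key] = response  # store so later runs are reproducible
        return response


def flaky_api(api_name: str, args: Dict[str, Any]) -> Any:
    # Stand-in for a real API that has gone offline.
    raise ConnectionError("API unavailable")


def fake_simulator(api_name: str, args: Dict[str, Any]) -> Any:
    # Stand-in for the LLM-based simulator.
    return {"simulated": True, "api": api_name, "args": args}


server = VirtualAPIServer(flaky_api, fake_simulator)
first = server.call("weather", {"city": "Seoul"})
second = server.call("weather", {"city": "Seoul"})  # served from the cache
```

The point of caching the simulated response is that a repeated call returns the same result regardless of whether the underlying API is still reachable, which is what makes later benchmark runs comparable.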

Stable Evaluation System
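
To see why arbitrary handling of "Unsure" verdicts destabilizes a pass-rate metric, consider the following toy sketch. It is not ToolBench's actual evaluation code: the coin-flip resolution of "unsure" labels, the `pass_rate` function, and the label counts are assumptions chosen only to show the effect.

```python
import random
from typing import List


def pass_rate(labels: List[str], rng: random.Random) -> float:
    """Fraction of passed tasks; 'unsure' is resolved by a coin flip (assumption)."""
    passed = 0
    for label in labels:
        if label == "pass":
            passed += 1
        elif label == "unsure" and rng.random() < 0.5:
            passed += 1
    return passed / len(labels)


# Made-up judge verdicts: 40 pass, 30 fail, 30 unsure.
labels = ["pass"] * 40 + ["fail"] * 30 + ["unsure"] * 30

# Re-evaluating the same verdicts with different seeds gives different scores.
rates = [pass_rate(labels, random.Random(seed)) for seed in range(20)]
spread = max(rates) - min(rates)
print(f"min={min(rates):.2f} max={max(rates):.2f} spread={spread:.2f}")
```

With no "unsure" labels the score is fully determined by the verdicts; the more verdicts fall into "unsure", the wider the range of scores the same model output can receive. A judge that returns fewer "Unsure" verdicts narrows that range.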

Originality

Limitation & Further Study

Evaluation

Overall assessment: StableToolBench offers a practical and effective answer to the reproducibility crisis in existing large-scale tool-learning benchmarks. Addressing API instability and the weaknesses of the evaluation system at the same time is especially valuable, but verifying the reliability of the LLM-based simulator and guaranteeing long-term stability still need further work.

Related Papers Worth Reading

Foundational work
Provides a theoretical basis for how LLMs explore tool use (reinforcement learning for tool use), which can inform the design of 769's tool-capability benchmark.
Foundational work
StableToolBench's benchmark environment provides a basis for ensuring the reliability and reproducibility of executable-code-based agent evaluation (CodeAct).
Foundational work
The paper LLM With Tools: A Survey gives general theoretical background on tool-based use of LLMs and strongly influences the research direction of StableToolBench.
Foundational work
Sets up an evaluation environment for multi-step scientific tool-use agents, providing a basis for comparative evaluation against ToolBench and StableToolBench.
Alternative approach
Within an auto-research framework, provides a benchmark for automating LLM experiments that resembles the tool-use problems ToolBench addresses.
Follow-up work
StableToolBench is an LLM evaluation suite for code generation and debugging, making it possible to verify the usefulness of LLM-based software engineering in research settings.
Follow-up work
The Cocoa paper proposes a framework for joint planning and execution by humans and AI agents, and shows a further application to the problem of tool-use stability.
Application example
StableToolBench shows how its approach to reducing ToolBench's execution and evaluation instability can connect with CodeAct-style agent evaluation and real-world verification environments.