YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Authors: Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro, Nazneen Rajani | Date: 2026 | DOI: 10.48550/ARXIV.2604.01212


Essence


Figure 1 Overview of YC-Bench. The agent interacts with the environment through CLI commands (blue) and receives structu

YC-Bench is a benchmark for evaluating the long-term planning and consistent execution abilities of LLM agents. It provides a POMDP-based environment in which the agent runs a simulated startup over hundreds of turns spanning one year. The environment is adversarial and dynamic, with unreliable clients and rising payroll costs, and it demands compound decision-making across employee management, contract selection, and cash-flow management.
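To make the partially observable setup concrete, here is a minimal toy sketch of a turn loop with hidden client reliability and growing payroll. It is not YC-Bench's actual environment or API: the function `run_year`, the 12 monthly steps (the real benchmark runs hundreds of turns), the payroll figure, the growth rate, the payout, and the 60% reliability rate are all assumptions chosen for illustration. Only the $200K starting capital comes from this review.

```python
import random

def run_year(policy, seed=0):
    """Toy 12-step startup simulation (hypothetical; not YC-Bench's API)."""
    rng = random.Random(seed)
    cash = 200_000.0      # starting capital, as cited in this review
    payroll = 30_000.0    # assumed monthly payroll
    growth = 0.03         # assumed monthly payroll increase
    for month in range(1, 13):
        # Full state: each offer carries a hidden flag for whether the client pays.
        offers = [
            {"client": i, "payout": 60_000.0, "reliable": rng.random() < 0.6}
            for i in range(3)
        ]
        # Observation: the agent sees the offers without the hidden flag.
        visible = [{"client": o["client"], "payout": o["payout"]} for o in offers]
        choice = policy(visible, {"cash": cash, "payroll": payroll})
        # An unreliable client never pays, so the work yields no revenue.
        if choice is not None and offers[choice]["reliable"]:
            cash += offers[choice]["payout"]
        cash -= payroll
        payroll *= 1 + growth
        if cash < 0:
            return {"bankrupt": True, "month": month, "cash": cash}
    return {"bankrupt": False, "month": 12, "cash": cash}

print(run_year(lambda offers, state: None))  # never takes a contract
print(run_year(lambda offers, state: 0))     # always takes the first offer
```

With these assumed numbers, a policy that never accepts a contract runs out of cash in month 7, which shows why contract selection and payroll growth have to be managed together.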

Motivation

Achievement


Figure 2 Out of the 12 models that we benchmark on YC-Bench, 5 models are profitable and only 3 turn a substantial profi

Benchmark results across 12 models: the evaluation covers recent frontier models such as GPT-5.4, Claude Opus 4.6, GLM-5, Gemini, and Grok, along with open-source models. Claude Opus 4.6 reaches the highest final funds, averaging $1.27M, while GLM-5 reaches $1.21M at 11x lower inference cost. Failure analysis: only 3 of the 12 models consistently succeed in exceeding the initial capital of $200K. Key predictor: scratchpad use is the strongest predictor of success. Failure mode analysis: failure to identify adversarial clients accounts for 47% of bankruptcies, and other distinct failure modes such as over-parallelization are identified.
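The headline figures are easier to compare as multiples of the starting capital. The short sketch below only does arithmetic on the numbers quoted in this review ($200K initial capital, $1.27M and $1.21M average final funds); the helper name `return_multiple` is a hypothetical one introduced here.

```python
INITIAL_CAPITAL = 200_000  # starting funds cited in this review

# Average final funds as quoted in this review.
final_funds = {
    "Claude Opus 4.6": 1_270_000,
    "GLM-5": 1_210_000,
}

def return_multiple(final, initial=INITIAL_CAPITAL):
    """Final funds expressed as a multiple of the starting capital."""
    return final / initial

for model, funds in final_funds.items():
    print(f"{model}: {return_multiple(funds):.2f}x")

# Share of the top result that GLM-5 reaches.
ratio = final_funds["GLM-5"] / final_funds["Claude Opus 4.6"]
print(f"GLM-5 / Claude Opus 4.6: {ratio:.1%}")
```

This gives roughly 6.35x and 6.05x, so GLM-5 lands at about 95% of the top result while, per the review, costing 11x less at inference.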

How


Figure 3 We observe that better models are able to build client trust over time by strategically selecting clients. What

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: YC-Bench is a well-designed benchmark for evaluating the long-term planning and consistent execution abilities of LLM agents. It clearly recognizes the limitations of existing benchmarks and presents an innovative environment that overcomes them. Its broad evaluation of 12 frontier models, systematic failure mode analysis, and open-source release make an important contribution to the community. However, it still needs work on simulation generalization, justification of the context window settings, and support for more sophisticated memory mechanisms.

Related Papers Worth Reading

Foundational work
Surveys the structure and evaluation methods of LLM-based social simulation, providing a theoretical foundation for the design and evaluation of YC-Bench.
Foundational work
HIAGENT provides a theoretical foundation for research on long-term planning and agent consistency through a subgoal-centered, hierarchical approach to working memory management.
Foundational work
Presents long-horizon, large-scale planning and social interaction as an agent benchmark; its vending-machine business simulation environment and its purpose are very similar.
Alternative approach
3398 is a benchmark focused on long-term scientific planning and agent evaluation, and it also considers how realistic the workflows are.
Alternative approach
Addresses the same problem, an agent benchmark for evaluating long-term coherence, with a different environment and approach.