SEVerA: Verified Synthesis of Self-Evolving Agents

์ €์ž: Debangshu Banerjee, Changming Xu, Gagandeep Singh | ๋‚ ์งœ: 2026.03 | DOI: ๋ฏธ์ œ๊ณต 📄 PDF


Essence

Figure 1

Fig. 1. Overview of SEVerA, which operates in three key steps. (1) Search: Given the task information, library

๋ณธ ๋…ผ๋ฌธ์€ ์ž๊ฐ€ ์ง„ํ™”ํ•˜๋Š” LLM ์—์ด์ „ํŠธ๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ํ•ฉ์„ฑํ•˜๊ธฐ ์œ„ํ•ด Formally Guarded Generative Models (FGGM)์„ ๋„์ž…ํ•˜๊ณ , ํ˜•์‹ ๊ฒ€์ฆ๊ณผ ๊ทธ๋ž˜๋””์–ธํŠธ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๋ฅผ ๊ฒฐํ•ฉํ•œ SEVerA ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์‹œํ•œ๋‹ค. Search-Verify-Learn ์„ธ ๋‹จ๊ณ„๋ฅผ ํ†ตํ•ด ํ˜•์‹์  ์ •ํ™•์„ฑ ๋ณด์žฅ๊ณผ ์ž‘์—… ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋™์‹œ์— ๋‹ฌ์„ฑํ•œ๋‹ค.

Motivation

Achievement

Figure 2

Fig. 2. Auto-synthesized guarded GM with

0 ์ œ์•ฝ ์œ„๋ฐ˜ ๋‹ฌ์„ฑ\n- HumanEvalDafny: 97.0% ๊ฒ€์ฆ ์„ฑ๊ณต๋ฅ  (์ตœ๊ณ  ๊ธฐ์ค€ 86.9% ๋Œ€๋น„)\n- GSM-Symbolic: 66.0% ์ •ํ™•๋„ (์ตœ๊ณ  constrained-decoding 44.7% ๋Œ€๋น„)\n- ฯ„2-bench ํ•ญ๊ณต์‚ฌ ๋„๋ฉ”์ธ: 52.6% ํ†ต๊ณผ์œจ (Qwen3-8B ๊ธฐ์ค€, Claude Sonnet 4.5 ๊ธฐ๋ฐ˜ Agent-C๋„ ์ดˆ๊ณผ)\nํ˜•์‹ ์ •ํ™•์„ฑ ์ฆ๋ช…: Theorem 5.4โ€”๋ฐ˜ํ™˜ ์—์ด์ „ํŠธ๋Š” ๋ชจ๋“  ์ž…๋ ฅ๊ณผ ํŒŒ๋ผ๋ฏธํ„ฐ์—์„œ ํ–‰๋™ ๋ช…์„ธ ๋งŒ์กฑ\n์„ฑ๋Šฅ ๊ฐœ์„ : Theorem 5.5โ€”์ œ์•ฝ ์—†๋Š” ๋ชจ๋ธ ๋Œ€๋น„ ์ž‘์—… ์†์‹ค ์—†์Œ ๋ณด์žฅ, ๋ฏธ์ค€์ˆ˜ ์‹œ ์—„๊ฒฉํ•œ ๊ฐœ์„  ์ฆ๋ช…

How

Figure 1

Fig. 1. Overview of SEVerA, which operates in three key steps. (1) Search: Given the task information, library

โ€ข first-order logic ๊ณ„์•ฝ์„ LLM ์ถœ๋ ฅ ๋ฌธ์ž์—ด์—๋งŒ ์ ์šฉํ•˜์—ฌ ๊ฐœ๋ฐฉํ˜•/ํ์‡„ํ˜• ๋ชจ๋ธ ๋ชจ๋‘ ์ง€์›\nโ€ข ๊ฒ€์ฆ๊ธฐ(verifier)์˜ ํ˜•์‹ ์ฆ๋ช…๊ณผ ๊ทธ๋ž˜๋””์–ธํŠธ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”์˜ ๋ถ„๋ฆฌ๋กœ ๊ธฐ์กด GRPO, ๋ฏธ์„ธ์กฐ์ • ๋„๊ตฌ ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅ\nโ€ข ํ˜•์‹์  ์ œ์•ฝ์ด ํ”„๋กœ๊ทธ๋žจ ํƒ์ƒ‰ ๊ณต๊ฐ„์„ ์ •์ œ(prune)ํ•˜์—ฌ ๋” ๋†’์€ ํ’ˆ์งˆ ์—์ด์ „ํŠธ๋กœ ์•ˆ๋‚ดํ•˜๋Š” ๊ธ์ •์  ๋ถ€์ž‘์šฉ ํ™œ์šฉ

Originality

โ€ข FGGM์˜ novel ๊ฐœ๋…: ์ƒ์„ฑ ๋ชจ๋ธ ํ˜ธ์ถœ ์ˆ˜์ค€์—์„œ first-order logic ๊ธฐ๋ฐ˜ ํ˜•์‹ ๊ณ„์•ฝ ๋„์ž…์œผ๋กœ ๊ธฐ์กด constrained decoding์˜ ๊ตฌ๋ฌธ-์ค‘์‹ฌ ์ œ์•ฝ์„ ์˜๋ฏธ๋ก ์  ์ˆ˜์ค€์œผ๋กœ ํ™•์žฅ\nโ€ข ํ˜•์‹ ๊ฒ€์ฆ๊ณผ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅ ํ•™์Šต์˜ ๋ช…์‹œ์  ๋ถ„๋ฆฌ ์„ค๊ณ„: Verify ๋‹จ๊ณ„๊ฐ€ ๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•œ ์ •ํ™•์„ฑ์„ ๋ณด์žฅํ•˜๋ฏ€๋กœ Learn์—์„œ ์ œ์•ฝ ์—†๋Š” ์ตœ์ ํ™”๋กœ ํ™˜์› ๊ฐ€๋Šฅ\nโ€ข ์ž๊ฐ€ ์ง„ํ™” ์—์ด์ „ํŠธ์— ํ˜•์‹ ๋ณด์žฅ์„ ์ฒ˜์Œ์œผ๋กœ ๋ถ€์—ฌํ•œ ์ข…ํ•ฉ ํ”„๋ ˆ์ž„์›Œํฌ

Limitation & Further Study

์ด๋ก ์  ํ•œ๊ณ„: Theorem 5.5์˜ "์ถฉ๋ถ„ ์กฐ๊ฑด"์ด ์‹ค์ œ ์ ์šฉ ๊ฐ€๋Šฅ์„ฑ์ด ์ œํ•œ์ ์ผ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์–ด๋–ค ์ƒํ™ฉ์—์„œ ์ด ์กฐ๊ฑด์ด ๋งŒ์กฑ๋˜๋Š”์ง€์— ๋Œ€ํ•œ ์‹ค์ฆ์  ๋ถ„์„์ด ๋ถ€์กฑ\n์‹คํ—˜ ๋ฒ”์œ„: ๊ธฐ์กด program repair, code generation ๊ฐ™์€ ๊ด‘๋ฒ”์œ„ํ•œ LLM ์—์ด์ „ํŠธ ์ž‘์—…์— ๋Œ€ํ•œ ํ‰๊ฐ€ ๋ถ€์žฌ, ๊ณ„์‚ฐ ๋น„์šฉ(ํ˜•์‹ ๊ฒ€์ฆ ์˜ค๋ฒ„ํ—ค๋“œ) ๋ถ„์„ ๋ฏธํก\n์ƒ์„ฑ ๋ชจ๋ธ ์ œ์•ฝ: FGGM์˜ rejection sampler๋Š” ๊ณ„์•ฝ ๋งŒ์กฑ ์ƒ˜ํ”Œ์ด ๋งค์šฐ ํฌ๊ท€ํ•œ ๊ฒฝ์šฐ ์ƒ˜ํ”Œ๋ง ๋น„ํšจ์œจ ๊ฐ€๋Šฅ์„ฑ\nํ›„์† ์—ฐ๊ตฌ: (1) ๋” ๊ด‘๋ฒ”์œ„ํ•œ agent ๋„๋ฉ”์ธ ํ‰๊ฐ€, (2) ๋ณตํ•ฉ ํ˜•์‹ ๋ช…์„ธ์— ๋Œ€ํ•œ ๊ฒ€์ฆ ํ™•์žฅ์„ฑ ๋ถ„์„, (3) ์ œ์•ฝ-์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„ ๋ฉ”์ปค๋‹ˆ์ฆ˜์˜ ์ด๋ก ์  ์‹ฌํ™”

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 4/5 Clarity: 4/5 Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ ์ž๊ฐ€ ์ง„ํ™” LLM ์—์ด์ „ํŠธ์˜ ํ˜•์‹์  ์•ˆ์ „์„ฑ์ด๋ผ๋Š” ๊ธด๊ธ‰ํ•œ ๋ฌธ์ œ๋ฅผ ์ฒ˜์Œ์œผ๋กœ ์ฒด๊ณ„์ ์œผ๋กœ ๋‹ค๋ฃจ๋ฉฐ, FGGM๊ณผ SEVerA๋Š” ๊ธฐ์ˆ ์ ์œผ๋กœ ๊ฑด์ „ํ•˜๊ณ  ์‹ค์ œ ์„ฑ๋Šฅ์—์„œ๋„ ์šฐ์ˆ˜ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ธ๋‹ค. ๋‹ค๋งŒ ์ด๋ก ์  ์ถฉ๋ถ„ ์กฐ๊ฑด์˜ ์‹ค์šฉ์„ฑ ๋ช…ํ™•ํ™”์™€ ํ‰๊ฐ€ ๋„๋ฉ”์ธ ํ™•๋Œ€๊ฐ€ ํ•„์š”ํ•˜๋‹ค.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
์‹ ๊ฒฝ ์ •๋ฆฌ ์ฆ๋ช… ๊ธฐ์ˆ ์€ SEVerA์˜ ํ˜•์‹์  ๊ณ„์•ฝ ๊ธฐ๋ฐ˜ ๊ฒ€์ฆ์—์„œ LLM ์ถœ๋ ฅ์„ ํ˜•์‹์ ์œผ๋กœ ๋ณด์ฆํ•˜๋Š” ์ด๋ก ์  ๊ธฐ๋ฐ˜์„ ์ œ๊ณตํ•œ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
DeepSeek-R1์˜ RL ๊ธฐ๋ฐ˜ ์ž๊ธฐ ๊ฐœ์„  ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ SEVerA๊ฐ€ ํ˜•์‹์ ์œผ๋กœ ๊ฒ€์ฆํ•˜๋ ค๋Š” ์ž๊ธฐ ์ง„ํ™” ์—์ด์ „ํŠธ์˜ ํ•ต์‹ฌ ํ•™์Šต ํŒจ๋Ÿฌ๋‹ค์ž„์ด๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
๋ฌผ๋ฆฌ/์ˆ˜ํ•™ ์‹œ์Šคํ…œ์˜ ์ด๋ก  ๋ฐœ๊ฒฌ์„ ์œ„ํ•œ LLM ๊ธฐ๋ฐ˜ ์ˆ˜์‹ ์ถ”๋ก  ๋ฐ ๊ฒ€์ฆ ์ „๋žต์ด SEVerA์˜ ํ˜•์‹์  ์•ˆ์ „์„ฑ ํ”„๋ ˆ์ž„์›Œํฌ์— ๊ธฐ๋ฐ˜์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
์—์ด์ „ํŠธ ์„ค๊ณ„์™€ ์ง„ํ™”์— ๋Œ€ํ•œ ๋‡Œ ์˜๊ฐ์„ ๋ฐ›์€ ์•„ํ‚คํ…์ฒ˜ ๋…ผ์˜๊ฐ€ SEVerA์˜ ์ž๊ธฐ ์ง„ํ™” ์—์ด์ „ํŠธ ํ•ฉ์„ฑ ๊ธฐ๋ฐ˜์ด ๋œ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
SEVerA๋Š” ์ž๊ธฐ-์ง„ํ™” ๊ฐ€๋Šฅ ์—์ด์ „ํŠธ์˜ ์„ค๊ณ„ยท๊ฒ€์ฆ ๊ตฌ์กฐ๋ฅผ ๋‹ค๋ฃจ๊ณ , Self-Revising Discovery Systems์˜ '๊ณผํ•™์  ์–ดํœ˜ ์ž๊ธฐ ์ˆ˜์ •์„ ๋ช…์‹œํ™”'ํ•œ ์ ‘๊ทผ๋ฒ•๊ณผ ์ง์ ‘์ ์œผ๋กœ ๋งž๋‹ฟ์•„ ์žˆ์Šต๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
824 ๋…ผ๋ฌธ์€ ๊ณผํ•™ ์—ฐ๊ตฌ ์ž๋™ํ™” ๋ถ„์•ผ์—์„œ ์‹ ๋ขฐ์„ฑ๊ณผ ์•ˆ์ „์„ฑ์˜ ๊ฐœ๋…์  ๊ธฐ๋ฐ˜์„ ๋‹ค๋ฃจ์–ด, SEVerA์˜ ํ˜•์‹์  ๋ณด์ฆ ํ”„๋ ˆ์ž„์›Œํฌ ์„ฑ์ทจ์™€ ์—ฐ๊ณ„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
466 ๋…ผ๋ฌธ์€ ์ง„ํ™”์  ์ตœ์ ํ™”๊ณผ์ •์„ LLM ๊ธฐ๋ฐ˜์œผ๋กœ ํƒ๊ตฌํ•˜๋ฏ€๋กœ, ์ž๊ธฐ ์ง„ํ™”ํ˜• ์—์ด์ „ํŠธ ํ•ฉ์„ฑ ๋ฐ ๊ฒ€์ฆ ๋ฌธ์ œ์˜ ๋Œ€์•ˆ์  ๋ฐฉํ–ฅ์„ฑ์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
AGI๋ฅผ ํ–ฅํ•œ LLM์˜ ๋Šฅ๋ ฅ๊ณผ ํ•œ๊ณ„๋ฅผ ๋‹ค๊ฐ๋„๋กœ ํ‰๊ฐ€ํ•˜๋Š” ์œ ์‚ฌํ•œ ์—ฐ๊ตฌ์ด๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
ShinkaEvolve ๋…ผ๋ฌธ์€ ์ž๊ธฐ์ง„ํ™” ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ์—์ด์ „ํŠธ ํ•ฉ์„ฑ์— ์ดˆ์ ์„ ๋งž์ถ˜ ๋Œ€์•ˆ์ ์ธ ์ ‘๊ทผ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
์—์ด์ „ํŠธ ๊ฒ€์ฆ-์•ˆ์ „์„ฑ ์ธก๋ฉด์—์„œ Reinforcement ๊ธฐ๋ฐ˜ ์ž์œจ์  ๊ฒ€์ฆ(์ž๊ธฐ๊ฒ€์ฆ/๊ณ„์•ฝ ์œ„๋ฐ˜ ๊ฐ์ง€)์ด๋ผ๋Š” ๋ณ„๋„์˜ ์ ‘๊ทผ๋ฒ•์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
LLM์˜ ๊ณต์ •์„ฑ๊ณผ ํŽธํ–ฅ ์™„ํ™”๋ฅผ ์œ„ํ•œ ๋ชจ๋ธ-๋ฌด๊ด€์  ๋ฐฉ๋ฒ•๋ก ์„ ํƒ๊ตฌํ•˜๋Š” ๊ด€๋ จ ๋…ผ๋ฌธ์ด๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
Neural-POD๋Š” PDE ํ•ด์„์—์„œ ๋ชจ๋“ˆ์‹ ๊ตฌ์กฐ์™€ ๊ฒ€์ฆ์„ ๊ฐ•์กฐํ•˜๋ฉฐ, SEVerA์™€ ๋‹ฌ๋ฆฌ ๊ณผํ•™์  ์‹œ๋ฎฌ๋ ˆ์ด์…˜์— ์ง‘์ค‘ํ•œ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
SEVerA๋Š” ํ˜•์‹์  ์•ˆ์ „์„ฑ ๋ณด์ฆ์„ ํ†ตํ•ด ์—์ด์ „ํŠธ ์‹ ๋ขฐ์„ฑ์„ ๊ธฐ์ˆ ์ ์œผ๋กœ ๋ณด์žฅํ•˜๋ ค๋Š” ์ ‘๊ทผ์œผ๋กœ, 822์˜ ์‹ ๋ขฐ์„ฑ ๋ฉ”ํŠธ๋ฆญ ๊ธฐ๋ฐ˜ ํ‰๊ฐ€์™€ ์ƒํ˜ธ ๋ณด์™„๋œ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
SEVerA ๋…ผ๋ฌธ์€ ์ž์ฒด ํ™•์žฅ, ์ž๊ธฐ์ง„ํ™” ์—์ด์ „ํŠธ ํ‰๊ฐ€๋กœ AGI ์‚ฌํšŒ์  ์˜ํ–ฅ ๋ฐ ํ‰๊ฐ€ ํ”„๋ ˆ์ž„์›Œํฌ์— ์‹ค์งˆ์  ์‚ฌ๋ก€๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
SEVerA๋Š” ์ž๊ธฐ ์ง„ํ™” ์—์ด์ „ํŠธ์— ํ˜•์‹์  ์•ˆ์ „์„ฑ ๋ณด์ฆ์„ ์ถ”๊ฐ€ํ•œ ํ”„๋ ˆ์ž„์›Œํฌ๋กœ, DeepSeek-R1์˜ RL ๊ธฐ๋ฐ˜ ์ž๊ธฐ ๊ฐœ์„ ์„ ์•ˆ์ „ํ•˜๊ฒŒ ๊ฒ€์ฆํ•˜๋Š” ํ™•์žฅ์ด๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
SEVerA๋Š” ์ž๊ธฐ ์ง„ํ™” ์—์ด์ „ํŠธ์— ํ˜•์‹์  ์•ˆ์ „์„ฑ ๋ณด์ฆ์„ ์ œ๊ณตํ•˜์—ฌ, AI ์—์ด์ „ํŠธ ์‹ ๋ขฐ์„ฑ ๊ณผํ•™์ด ์ œ์‹œํ•˜๋Š” ์ผ๊ด€์„ฑยท์•ˆ์ „์„ฑ ์š”๊ตฌ์‚ฌํ•ญ์„ ๊ธฐ์ˆ ์ ์œผ๋กœ ๊ตฌํ˜„ํ•œ๋‹ค.
← ๋ชฉ๋ก์œผ๋กœ ๋Œ์•„๊ฐ€๊ธฐ

๐ŸŽง Audio Overview

์ด ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ฅผ ํŒŸ์บ์ŠคํŠธํ˜• ์˜ค๋””์˜ค๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (Gemini ยท ํ‚ค๋Š” ๋ธŒ๋ผ์šฐ์ €์—๋งŒ ์ €์žฅ ยท ์™„์„ฑ๋ณธ์€ ์ด๋ฉ”์ผ๋กœ๋„ ์ „์†ก)
โ–ธ ๊ณ ๊ธ‰: ๊ตฌ์„ฑ ๋ฐฉํ–ฅ(๋Œ€๋ณธ ์ž‘์„ฑ ์ง€์นจ) ์ง์ ‘ ์ˆ˜์ •