Evaluating large language models trained on code

์ €์ž: Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondรฉ de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder | ๋‚ ์งœ: 2021 | URL: https://arxiv.org/abs/2107.03374 📄 PDF


Essence

Figure 1. Pass rates of our models on the HumanEval dataset as a

์ด ๋…ผ๋ฌธ์€ GitHub์—์„œ ์ˆ˜์ง‘ํ•œ ๊ณต๊ฐœ ์ฝ”๋“œ๋กœ ํŒŒ์ธํŠœ๋‹ํ•œ GPT ๋ชจ๋ธ์ธ Codex๋ฅผ ์†Œ๊ฐœํ•˜๊ณ , ๋…์ŠคํŠธ๋ง์œผ๋กœ๋ถ€ํ„ฐ Python ํ•จ์ˆ˜๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•œ๋‹ค. ์ƒˆ๋กœ์šด ๋ฒค์น˜๋งˆํฌ์ธ HumanEval์„ ํ†ตํ•ด ๋‹จ์ผ ์ƒ˜ํ”Œ๋กœ๋Š” 28.8%์˜ ํ•ด๊ฒฐ๋ฅ ์„ ๋ณด์ด๋ฉฐ, 100๊ฐœ ์ƒ˜ํ”Œ ์ƒ์„ฑ ์‹œ 77.5%๊นŒ์ง€ ๋‹ฌ์„ฑํ•จ์„ ๋ณด์—ฌ์ค€๋‹ค.

Motivation

Achievement

New evaluation methodology: introduces the pass@k metric and the HumanEval benchmark, setting a standard for evaluating functional correctness.
Model performance: Codex solves 28.8% of problems with a single sample (GPT-3 solves 0%, GPT-J 11.4%) and improves as more samples are drawn, reaching 70.2% with 100 samples per problem.
Fine-tuning effect: Codex-S, further fine-tuned on standalone functions, raises the single-sample rate to 37.7% and reaches 77.5% with 100 samples.
Practicality analysis: selecting the sample with the highest mean log-probability solves 44.5% of problems, which the paper discusses as a route to deployment.
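The review names the pass@k metric without giving its formula. The following is a minimal sketch, assuming the commonly cited unbiased estimator: for each problem, draw n samples, count the c that pass the unit tests, and compute 1 - C(n-c, k) / C(n, k), then average over problems. The function names `pass_at_k` and `mean_pass_at_k` are illustrative, not taken from the paper or any library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets made up only of failing
    # samples; it is 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for two problems, not results from the paper.
    print(mean_pass_at_k([(10, 5), (10, 0)], k=1))
```

Drawing more samples than k (n > k) and using this estimator gives a lower-variance figure than generating exactly k samples and checking whether any of them passes.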

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 5/5 Clarity: 4/5 Overall: 5/5

์ดํ‰: ์ด ๋…ผ๋ฌธ์€ ์ฝ”๋“œ ์ƒ์„ฑ ๋ชจ๋ธ์˜ ํ‰๊ฐ€ ๋ฐฉ๋ฒ•๋ก ์„ ํ˜์‹ ํ•˜๊ณ , ๊ณต๊ฐœ ๋ฒค์น˜๋งˆํฌ์™€ ํ•จ๊ป˜ ์‹ค์šฉ์ ์œผ๋กœ ๊ฐ•๋ ฅํ•œ Codex ๋ชจ๋ธ์„ ์ œ์‹œํ•œ๋‹ค. pass@k ๋ฉ”ํŠธ๋ฆญ๊ณผ HumanEval ๋ฐ์ดํ„ฐ์…‹์€ ํ›„์† ์—ฐ๊ตฌ์˜ ํ‘œ์ค€์ด ๋˜์—ˆ์œผ๋ฉฐ, GitHub Copilot์œผ๋กœ ์‹ค์ œ ๋ฐฐํฌ๋˜์–ด ์—…๊ณ„์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์ณค๋‹ค. ๋‹ค์ค‘ ์ƒ˜ํ”Œ ์ „๋žต๊ณผ ํœด๋ฆฌ์Šคํ‹ฑ ์„ ๋ณ„์˜ ํšจ๊ณผ์„ฑ์€ ์‹ค์šฉ์  ๊ฐ€์น˜๊ฐ€ ๋†’๋‹ค.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
Evaluating large language models trained on code (Codex)๋Š” ์ฝ”๋“œ ์ƒ์„ฑ ํŠนํ™” LLM ๋ฐœ์ „์˜ ์ดˆ์„์„ ์ œ๊ณตํ•˜๋ฉฐ, Code Llama ๋ฐ ํ›„์† ์˜คํ”ˆ์†Œ์Šค ํ‰๊ฐ€์˜ ๊ธฐ๋ฐ˜์ด ๋œ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
์ฝ”๋“œ LLM์˜ ๋Œ€๊ทœ๋ชจ ์ž์ฒด ์ฝ”๋“œ ํ‰๊ฐ€ ๋ฐ ๋””๋ฒ„๊น… ํ›ˆ๋ จ์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ฒค์น˜๋งˆํ‚น ์—ฐ๊ตฌ(3380)๊ฐ€ self-debugging ๊ธฐ๋ฒ•์˜ ํ‰๊ฐ€ํ† ๋Œ€๋ฅผ ์ด๋ฃน๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
Codex ๋…ผ๋ฌธ์€ ์ฝ”๋“œ ํŠนํ™” LLM์˜ ์ฒซ ๋Œ€ํ‘œ์  ๋ชจ๋ธ๋กœ, Deepseek-coder์˜ ์˜คํ”ˆ์†Œ์Šค ์„ฑ๋Šฅ ๊ฐœ์„ ๊ณผ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๋Š” ์ดˆ๊ธฐ ๊ธฐ์ค€์ ์ด๋‹ค.
๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
์ฝ”๋“œ๋กœ ํ›ˆ๋ จ๋œ ๋Œ€ํ˜• ์–ธ์–ด๋ชจ๋ธ ํ‰๊ฐ€์— ์ง‘์ค‘ํ•œ ์—ฐ๊ตฌ๋กœ, Seed-coder์™€ ๊ฐ™์€ ์ฝ”๋“œ ์ค‘์‹ฌ ํŒŒ์ดํ”„๋ผ์ธ ๊ฐœ๋ฐœ์˜ ์ด๋ก ์  ํ† ๋Œ€๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
๋‘ ๋…ผ๋ฌธ ๋ชจ๋‘ ์ฝ”๋“œ ์ƒ์„ฑ LLM์˜ ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ HumanEval ๋ฒค์น˜๋งˆํฌ์™€ Codex ๋ชจ๋ธ์„ ์†Œ๊ฐœํ•˜๋Š” ๋™์ผํ•œ ์—ฐ๊ตฌ๋ฅผ ๋‹ค๋ฃจ๊ณ  ์žˆ์–ด ํ•จ๊ป˜ ์ฝ์–ด์•ผ ํ•œ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜ ์ฝ”๋“œ ์ž๋™์™„์„ฑ๊ณผ AI ์ฝ”๋”ฉ ๋„๊ตฌ์˜ ์„ฑ๋Šฅ๊ณผ ์˜ํ–ฅ์— ๋Œ€ํ•œ ํฌ๊ด„์  ํ‰๊ฐ€ ๋…ผ๋ฌธ์œผ๋กœ, ์ฝ”๋”ฉ ์ƒ์‚ฐ์„ฑ ์ž๋™ํ™”์˜ ๋‹ค์–‘ํ•œ ๊ด€์ ์„ ๋ณด์—ฌ์ค€๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
3380์€ ์ฝ”๋“œ์— ํŠนํ™”๋œ LLM ๊ณ„์—ด์˜ ํ‰๊ฐ€ ์—ฐ๊ตฌ๋กœ, 205์—์„œ ์ œ์•ˆํ•˜๋Š” ๊ฐœ๋ฐœ ๋ณด์กฐ ์—์ด์ „ํŠธ ํ”„๋ ˆ์ž„์›Œํฌ์™€ ์„ฑ๋Šฅยทํ•œ๊ณ„ ๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
Deepseek-coder ๋…ผ๋ฌธ์€ Codex์™€ GPT-3.5๋ฅผ ๋„˜์–ด์„œ๋Š” ์˜คํ”ˆ์†Œ์Šค ์ฝ”๋“œ ์ „๋ฌธ LLM์˜ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๋‹ค์–‘ํ•œ ์ฝ”๋“œ ์ž‘์—…์—์„œ์˜ LLM ๋ฐœ์ „ ๋™ํ–ฅ์„ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
LLM์„ ์ฝ”๋“œ์— ํŠนํ™”ํ•ด ํ•™์Šตํ•˜์—ฌ ์ฝ”๋“œ ๋””๋ฒ„๊น… ๋ถ„์•ผ์—์„œ ๋ชจ๋ธ๋ณ„ ์„ฑ๋Šฅ ๋น„๊ต๋ฅผ ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ํ‰๊ฐ€ ํ”„๋กœํ† ์ฝœ ์ฐธ์กฐ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
Code Llama ๋“ฑ ๊ณต๊ฐœ ์†Œ์Šค ์ฝ”๋“œ ๊ธฐ๋ฐ˜ LLM๋“ค๊ณผ Codex๋ฅผ ์‹œ์Šคํ…œ ๋ฐ ์„ฑ๋Šฅ ์ธก๋ฉด์—์„œ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
StarCoder2๋Š” Codex ์ดํ›„์˜ ์˜คํ”ˆ์†Œ์Šค ์ฝ”๋“œ LLM ๋ฐœ์ „์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ์ฝ”๋“œ LLM์˜ ์„ธ๋Œ€์  ์ง„ํ™”๋ฅผ ์ดํ•ดํ•˜๋Š” ๋ฐ ํ•„์ˆ˜์ ์ด๋‹ค.
ํ›„์† ์—ฐ๊ตฌ
SWE-bench๋Š” HumanEval ์ดํ›„ ์‹ค์ œ ์†Œํ”„ํŠธ์›จ์–ด ์—”์ง€๋‹ˆ์–ด๋ง ๋Šฅ๋ ฅ์„ ๋” ํ˜„์‹ค์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๋ฒค์น˜๋งˆํฌ๋กœ, ์ฝ”๋“œ LLM ํ‰๊ฐ€์˜ ํ•œ๊ณ„๋ฅผ ํ™•์žฅํ•œ๋‹ค.
์‘์šฉ ์‚ฌ๋ก€
๋Œ€๊ทœ๋ชจ ์ฝ”๋“œ ํ•™์Šต LLM์˜ ์ ์šฉ์„ฑ๊ณผ ์‹ค์ œ ๊ณผํ•™ ์ž๋™ํ™” ์˜์—ญ์—์„œ์˜ ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํ‚น์„ ํ†ตํ•ด StarCoder์˜ utility๋ฅผ ์ž…์ฆํ•ฉ๋‹ˆ๋‹ค.
์‘์šฉ ์‚ฌ๋ก€
From LLMs to LLM-based Agents for Software Engineering ๋…ผ๋ฌธ์€ ์ฝ”๋“œ LLM์„ ์‹ค์ œ ์†Œํ”„ํŠธ์›จ์–ด ์—”์ง€๋‹ˆ์–ด๋ง ๋ถ„์•ผ์— ์ ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ๊ณผ ํ•œ๊ณ„๋ฅผ ํƒ๊ตฌํ•œ๋‹ค.