Enabling language models to implicitly learn self-improvement

Authors: Ziqi Wang, Le Hou, Tianjian Lu, Yuexin Wu, Yunxuan Li | Date: 2023 | DOI: N/A


Essence

Figure 1: The pipeline of PIT and prompting methods (Self-Refine). Upper: PIT utilizes inputs and

This paper proposes the PIT (ImPlicit Self-ImprovemenT) framework, which lets an LLM learn its improvement goal automatically from human preference data, without an explicitly designed rubric. It reformulates the RLHF training objective: instead of maximizing response quality given the input alone, it maximizes the quality gap of the response conditioned on a reference response.
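To make the reformulated objective concrete, here is a minimal sketch of a gap-style reward and a pairwise loss over one preference pair. It is not the paper's implementation: `toy_quality`, `gap_reward`, and `pairwise_gap_loss` are hypothetical names, the distinct-word-count score is a placeholder for a learned model, and the logistic loss is an assumed, common choice for preference data.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def toy_quality(response: str) -> float:
    """Hypothetical stand-in for a learned quality score (distinct-word count)."""
    return float(len(set(response.lower().split())))


def gap_reward(x: str, y_ref: str, y: str) -> float:
    """Hypothetical gap reward: how much better y is than y_ref for input x.

    The toy score ignores x; a learned gap model would condition on it.
    """
    return toy_quality(y) - toy_quality(y_ref)


def pairwise_gap_loss(x: str, y_preferred: str, y_rejected: str) -> float:
    """Assumed logistic loss on one human preference pair.

    It pushes the gap to be positive when moving from the rejected response
    to the preferred one, and negative in the reverse direction.
    """
    improve = gap_reward(x, y_rejected, y_preferred)
    degrade = gap_reward(x, y_preferred, y_rejected)
    return -math.log(sigmoid(improve)) - math.log(sigmoid(-degrade))
```

The key difference from a standard reward r(x, y) is the extra `y_ref` argument: the score is relative to a reference response, so the improvement target is read off the preference pairs rather than written into a rubric.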

Motivation

Achievement

Figure 2: Reward distribution of

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper presents a practical and innovative approach that enables LLM self-improvement while removing the cost of designing explicit rubrics. Its contribution is valuable for the simplicity and effectiveness of the RLHF reformulation and for its data efficiency, and it is suitable as an ICLR-level publication.

Related papers worth reading

Foundational work
Presents a method for LLMs to learn self-improvement implicitly, and serves as the foundation for the self-improvement paper.
Foundational work
Paper 314 covers a methodology by which LLMs evolve creativity through self-improvement, and provides a theoretical basis that can be combined with the model-invariant, multi-faceted embeddings of study 565.
Alternative approach
A study that applies a different approach to question answering over combined heterogeneous data.
Alternative approach
Presents a different reinforcement-learning-based approach to LLM self-improvement.
Alternative approach
Focuses on automated evaluation of self-improvement, and offers a complementary view of LLM strategy adaptation and learning mechanisms from an active-collection perspective.
Alternative approach
Another implicit learning method that improves LLM performance without explicit feedback.
Alternative approach
An alternative prompting and training framework for LLM self-improvement.
Alternative approach
Paper 447 proposes a new way for LLMs to achieve self-improvement through self-incentivization and iterative self-reinforcement learning, in contrast to the PIT approach of paper 314.
Alternative approach
Presents a different approach to improving the convergence speed and generalization of prompt optimization.
Alternative approach
A paper discussing the ability of LLMs to learn self-improvement implicitly; it offers insight into self-evolving structures for biosignal embeddings and into improving time-series prediction.
Alternative approach
Paper 314 is a framework for evaluating LLM-based self-improvement learning workflows, and offers an experimental approach to evaluation over time, the core theme of LAFA (3147).
Follow-up work
Paper 470 provides systematic empirical evidence on LLM self-improvement (self-correction), experimentally extending the effect of PIT, the self-improvement framework proposed in paper 314.
Follow-up work
Analyzes LLM self-growth and self-improvement mechanisms from multiple angles, and can be read as an experimental extension of the PIT framework of paper 314.
Follow-up work
Paper 314 further advances LLM self-improvement and alignment techniques, and connects to the real-time improvement setting that BiasFilter targets.
Application
The Mind the gap paper empirically analyzes the limits of LLM self-improvement and its actual effect, examining the practical impact of the ImPlicit Self-ImprovemenT framework.