Training socially aligned language models in simulated human society

Authors: Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi | Date: 2023 | DOI: arXiv:2305.16960


Essence

Figure 1

๊ธฐ์กด์˜ RLHF์™€ ๋‹ฌ๋ฆฌ Stable Alignment์€ ์‹œ๋ฎฌ๋ ˆ์ด์…˜๋œ ์‚ฌํšŒ์  ์ƒํ˜ธ์ž‘์šฉ์„ ํ†ตํ•ด ์ง์ ‘ ์–ธ์–ด๋ชจ๋ธ์„ ์ •๋ ฌํ•œ๋‹ค

๋ณธ ๋…ผ๋ฌธ์€ ์‹œ๋ฎฌ๋ ˆ์ด์…˜๋œ ์‚ฌํšŒ์  ์ƒํ˜ธ์ž‘์šฉ์„ ํ†ตํ•ด ์–ธ์–ด๋ชจ๋ธ์„ ์‚ฌํšŒ์ ์œผ๋กœ ์ •๋ ฌ(socially aligned)์‹œํ‚ค๋Š” ์ƒˆ๋กœ์šด ํ•™์Šต ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ œ์‹œํ•œ๋‹ค. ๊ธฐ์กด ๊ฐ๋… ํ•™์Šต์ด๋‚˜ ๋ณด์ƒ ๋ชจ๋ธ๋ง์˜ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์ค‘ ์—์ด์ „ํŠธ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ™˜๊ฒฝ(SANDBOX)์—์„œ ์ƒ์„ฑ๋œ ์ƒํ˜ธ์ž‘์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ณด๋‹ค ๊ฒฌ๊ณ ํ•˜๊ณ  ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ์ •๋ ฌ ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค.

Motivation

Achievement

Figure 3

Alignment analysis across various language models: a larger model scale does not necessarily yield a large gain in alignment performance.

Figure 2

Generating interaction data with Back-Scatter and constructing three types of alignment data (imitation, self-critic, realignment).

  1. Superior benchmark performance: outperforms existing methods on six alignment benchmarks, with markedly improved robustness to adversarial attacks (jailbreaking).
  2. Improved scalability and efficiency: no additional reward model is required, so the method is easy to deploy in resource-constrained environments, and human labeling cost is lower than with conventional SFT.
  3. Overcoming the limits of model scale: a 20x scale-up to the 175B GPT-3 model yields only a marginal gain in alignment performance, which suggests that small models can also reach sufficient alignment performance.
  4. Quality of the generated data: a high-quality dataset built from 169k interaction samples, comprising comparative pairs, collective ratings, detailed feedback, and iteratively revised responses.

How

Figure 2

The Back-Scatter mechanism in SANDBOX: a central agent generates an initial response, then receives ratings and feedback from the surrounding agents and improves the response iteratively.

SANDBOX simulation:

Back-Scatter mechanism:
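
The loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `generate`, `reviewers`, and `revise`, the `Round` record, and the fixed number of rounds are all hypothetical stand-ins for LLM-backed agents, and the toy agents exist only so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical signatures standing in for LLM-backed agents (assumptions).
Generate = Callable[[str], str]                    # question -> draft
Review = Callable[[str, str], Tuple[float, str]]   # (question, draft) -> (rating, feedback)
Revise = Callable[[str, str, List[str]], str]      # (question, draft, feedback) -> new draft


@dataclass
class Round:
    draft: str
    mean_rating: float
    feedback: List[str]


def back_scatter(question: str, generate: Generate, reviewers: Sequence[Review],
                 revise: Revise, rounds: int = 3) -> List[Round]:
    """Central agent drafts; surrounding agents rate and comment; draft is revised."""
    if not reviewers:
        raise ValueError("at least one reviewing agent is required")
    draft = generate(question)
    history: List[Round] = []
    for i in range(rounds):
        reviews = [review(question, draft) for review in reviewers]
        ratings = [rating for rating, _ in reviews]
        feedback = [comment for _, comment in reviews]
        history.append(Round(draft, sum(ratings) / len(ratings), feedback))
        if i < rounds - 1:  # no revision after the final round
            draft = revise(question, draft, feedback)
    return history


# Toy agents: drafts are "v0", "v1", ...; the rating grows with the version number.
def toy_generate(question: str) -> str:
    return "v0"


def toy_reviewer(question: str, draft: str) -> Tuple[float, str]:
    return float(int(draft[1:]) + 1), f"improve {draft}"


def toy_revise(question: str, draft: str, feedback: List[str]) -> str:
    return f"v{int(draft[1:]) + 1}"
```

Each `Round` keeps the draft together with the ratings and feedback it received, which is the kind of record from which comparative pairs and revised responses could later be extracted.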

Stable Alignment three-stage training:

  1. Imitation stage: acquires basic alignment ability by learning from demonstrations of aligned responses.
  2. Self-Critic stage: develops the ability to recognize negative responses by learning from detailed feedback.
  3. Realignment stage: achieves the final improvement by learning from iteratively revised responses.
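
One way to picture how a single interaction record might feed the three stages above is sketched below. Everything in it is an assumption made for illustration: the `Interaction` fields, the `min_rating` threshold, and the prompt templates are hypothetical and are not the paper's data format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Interaction:
    # Hypothetical schema for one simulated interaction (assumption).
    question: str
    initial_response: str
    feedback: str
    revised_response: str
    initial_rating: float
    revised_rating: float


def build_alignment_data(records: List[Interaction],
                         min_rating: float = 5.0) -> Dict[str, List[dict]]:
    """Split interaction records into imitation, self-critic and realignment samples."""
    data: Dict[str, List[dict]] = {"imitation": [], "self_critic": [], "realignment": []}
    for r in records:
        # Imitation: well-rated responses serve as demonstrations.
        if r.revised_rating >= min_rating:
            data["imitation"].append({"prompt": r.question, "target": r.revised_response})
        # Self-critic: learn to produce the feedback given a question and a response.
        data["self_critic"].append({
            "prompt": f"{r.question}\nResponse: {r.initial_response}",
            "target": r.feedback,
        })
        # Realignment: learn the revision, but only when it actually improved the rating.
        if r.revised_rating > r.initial_rating:
            data["realignment"].append({
                "prompt": f"{r.question}\nResponse: {r.initial_response}\nFeedback: {r.feedback}",
                "target": r.revised_response,
            })
    return data
```

Filtering the realignment samples on an improved rating is a design choice of this sketch, so that the model is not trained on revisions the reviewers judged no better than the original.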

Pareto optimality criterion: the simulation terminates when the product of the alignment and engagement ratings no longer increases.
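
The stopping rule can be illustrated with a short sketch. It assumes one (alignment, engagement) rating pair per simulation round; how the ratings are aggregated is not stated here, and the `tol` parameter is a hypothetical addition rather than something taken from the paper.

```python
from typing import Iterable, Optional, Tuple


def stopping_round(scores: Iterable[Tuple[float, float]], tol: float = 0.0) -> Optional[int]:
    """Return the index of the first round whose alignment * engagement product
    does not exceed the previous round's product by more than `tol`.
    Return None if the product increases in every round."""
    previous = None
    for i, (alignment, engagement) in enumerate(scores):
        product = alignment * engagement
        if previous is not None and product <= previous + tol:
            return i
        previous = product
    return None


# Products are 12, 16, 20, 20: the plateau is detected at index 3.
plateau_at = stopping_round([(3, 4), (4, 4), (5, 4), (5, 4)])
```

Using the product rather than either rating alone means a round counts as progress only if a gain in one rating is not cancelled out by a loss in the other.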

Originality

Limitation & Further Study

Evaluation

Novelty: 4.5/5 | Technical Soundness: 4/5 | Significance: 4.5/5 | Clarity: 4/5 | Overall: 4.2/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ ๊ธฐ์กด์˜ ๊ฐ๋… ํ•™์Šต๊ณผ ๋ณด์ƒ ๋ชจ๋ธ๋ง์˜ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์‹œ๋ฎฌ๋ ˆ์ด์…˜๋œ ์‚ฌํšŒ์  ์ƒํ˜ธ์ž‘์šฉ์„ ํ™œ์šฉํ•˜๋Š” ํ˜์‹ ์ ์ด๊ณ  ์‹ค์šฉ์ ์ธ ์ ‘๊ทผ์„ ์ œ์‹œํ•˜๋ฉฐ, ๋ฒค์น˜๋งˆํฌ์™€ ์ ๋Œ€์  ๊ณต๊ฒฉ์— ๋Œ€ํ•œ ๊ฒฌ๊ณ ์„ฑ์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค€๋‹ค. ๋‹ค๋งŒ ์‹œ๋ฎฌ๋ ˆ์ด์…˜-ํ˜„์‹ค ๊ฐ„๊ทน, ๋ช…์‹œ์  ๊ทœ์น™ ์ •์˜, ๋‹ค๋ฌธํ™”์  ์ผ๋ฐ˜ํ™” ์ธก๋ฉด์—์„œ ๊ฐœ์„ ์˜ ์—ฌ์ง€๊ฐ€ ์žˆ๋‹ค.

Related papers worth reading

Foundational work
From individual to society explains the mechanisms of agent training based on social simulation and underpins socially aligned LLMs.
Foundational work
Takes a methodological look at techniques for LLM social alignment and behavioral safety, and at the issues to consider in practical design.
Foundational work
Connects the discussion of emergent cognition in human-AI collaboration with AI alignment theory based on social interaction.
Foundational work
The social-learning design for LLMs in 838 builds on the qualitative discussion in 041 of the potential and limits of research-assistant AI.
Foundational work
Provides groundwork through research on aligning LLMs with social norms in social planning and in-simulation training strategies.
Alternative approach
A similar survey that analyzes the capabilities and limitations of large language models.
Alternative approach
Cultural evolution in populations of large language models studies how LLM culture and behavior evolve through multi-agent interaction.
Alternative approach
Training socially aligned language models in simulated human society presents an experimental implementation of social simulation and multi-layered interaction among LLMs and agents.
Alternative approach
Covers approaches that simulate not only multi-agent interaction but also human social context in AI language models.
Alternative approach
413 centers on a human-AI collaboration framework, whereas 838 focuses on social interaction and alignment training for LLMs.
Alternative approach
Training socially aligned language models focuses on simulation-based social alignment and offers a perspective complementary to BiasFilter.
Alternative approach
An alternative study that addresses similar themes in LLM-based automation of scientific discovery.
Follow-up work
Training socially aligned language models treats social cooperation and alignment as central to inter-agent learning, pointing to a direction for research on cooperation mechanisms grounded in social psychology.
Follow-up work
The socially aligned LLM training paradigm of 838 could be combined with the human-AI collaboration framework proposed in 413 (BCI research, among others) and developed into a way to improve the social trustworthiness of AI tools.
Follow-up work
A paper showing extensibility to causal reasoning and data-analysis evaluation within real experimental workflows.