OpenAI o1 System Card

Authors: OpenAI (Aaron Jaech, Adam Tauman Kalai, Adam Lerer, et al.) | Date: 2024 | DOI: -


Essence

Figure 1: Jailbreak evaluation performance of GPT-4o, o1, o1-preview, and o1-mini.

The OpenAI o1 models are trained with large-scale reinforcement learning to reason using chain-of-thought. Compared with GPT-4o, they are substantially safer and more robust, and their resistance to jailbreak attacks in particular is markedly improved.

Motivation

Achievement

  1. Markedly improved jailbreak resistance: on the StrongReject benchmark, o1 performs substantially better than GPT-4o (see Figure 1). The o1 model family outperforms GPT-4o on every jailbreak evaluation, including production jailbreaks and human-sourced jailbreaks.
  2. Stronger refusal of harmful content: on the Challenging Refusal Evaluation, o1 scores 0.92-0.934 on not_unsafe versus 0.713 for GPT-4o, a relative improvement of 29-31%. On WildChat it reaches 0.98, exceeding 0.945 for GPT-4o.
  3. Less overrefusal: on multimodal inputs, o1's not_overrefuse score is 0.96, double GPT-4o's 0.48, which means fewer refusals of benign requests.
  4. Fewer hallucinations: o1's hallucination rate is 0.44 on SimpleQA (GPT-4o: 0.61) and 0.20 on PersonQA (GPT-4o: 0.30), relative reductions of about 28% and 33%. Accuracy improves at the same time (SimpleQA accuracy: 0.47 vs. 0.38).
  5. Reduced bias: on the BBQ evaluation, o1 reaches 93-94% accuracy on unambiguous questions versus 72% for GPT-4o, a gain of 21-22 percentage points. On ambiguous questions, o1 also improves on o1-preview (63% → 96%).
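The percentages in the list above mix relative changes with percentage-point differences, so it can help to recompute them from the raw scores. The following is a minimal sketch, not part of the system card: the helper names `relative_change` and `point_change` and the `COMPARISONS` table are hypothetical, and the scores are copied from the review text, assuming each second value is the GPT-4o baseline.

```python
def relative_change(new: float, old: float) -> float:
    """Change of `new` relative to `old`, expressed as a fraction of `old`."""
    return (new - old) / old


def point_change(new: float, old: float) -> float:
    """Absolute difference in percentage points for scores on a 0-1 scale."""
    return (new - old) * 100


# (label, o1 score, GPT-4o score), copied from the review text above.
# The labels are illustrative names, not identifiers from the system card.
COMPARISONS = [
    ("Challenging Refusal not_unsafe (low)", 0.92, 0.713),
    ("Challenging Refusal not_unsafe (high)", 0.934, 0.713),
    ("Multimodal not_overrefuse", 0.96, 0.48),
    ("SimpleQA hallucination rate", 0.44, 0.61),
    ("PersonQA hallucination rate", 0.20, 0.30),
    ("BBQ unambiguous accuracy (low)", 0.93, 0.72),
    ("BBQ unambiguous accuracy (high)", 0.94, 0.72),
]

for label, o1_score, gpt4o_score in COMPARISONS:
    rel = relative_change(o1_score, gpt4o_score)
    pts = point_change(o1_score, gpt4o_score)
    print(f"{label}: {rel:+.1%} relative, {pts:+.1f} points")
```

For rates where lower is better (the hallucination rows), a negative value indicates an improvement. Reporting both quantities side by side makes it clear whether a quoted "22%" refers to a ratio or to a difference in points.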

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 4/5 | Significance: 5/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This report shows empirically that chain-of-thought reasoning opens a new dimension of defensive alignment in the safety evaluation of large language models, and its multi-layered, systematic evaluation framework is an important contribution to both academia and industry. However, the deception risks that chain-of-thought itself may introduce and the lack of domain-specific evaluations remain important topics for future research.

Related papers worth reading

Foundational work
The GPT-4 Technical Report provides the technical and philosophical foundation for the performance and safety improvements in OpenAI's next-generation LLM (o1).
Foundational work
The GPT-4o System Card covers the main predecessor of the OpenAI o1 model and shows the principles and development path underlying its safety improvements.
Alternative approach
A study that applies a different taxonomy or methodology to uncertainty quantification in deep learning.
Alternative approach
585 (Openai o1 system card) covers the detailed technical results and risks of the o1-preview model and is suitable for comparative evaluation against 322.
Alternative approach
630 focuses on using LLMs to predict AI research outcomes, offering a different perspective on evaluating AI performance and safety from that of the recent system card in 585.
Follow-up work
TrustLLM discusses a framework for evaluating and strengthening the trustworthiness of LLMs and offers many insights relevant to the safety evaluation of OpenAI o1.
Follow-up work
The Openai o1 system card paper describes in detail the safety and risk evaluation framework for systems similar to GPT-4o, which helps in following trends in the safety evaluation of multimodal models.
Application
Paper 592 uses a specialized LLM for peer review (compared against models such as GPT-4o), providing a practical application case for the safety and robustness issues emphasized in 585.
Application
Analyzing the AGI agent safety and evaluation results of the OpenAI o1 model can demonstrate the gains in safety and robustness on benchmarks.