A practical review of mechanistic interpretability for transformer-based language models

Authors: Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao | Date: 2024 | URL: https://arxiv.org/abs/2407.02646


Essence

Figure 3

Figure 3: Beginner's roadmap to MI, designed to help newcomers quickly pick up the field. The MI study is

A comprehensive review of mechanistic interpretability (MI), the effort to understand transformer-based language models by reverse-engineering their internal computations, presented as a practical guide for beginners.

Motivation

Achievement

How

Figure 4

Figure 4: Logit lens implementation at (1) RS, (2) attention head, and (3) FF sublayer.
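
The three application points named in the Figure 4 caption can be illustrated with a toy sketch. The logit lens projects an intermediate activation through the unembedding matrix to see which vocabulary tokens it favors. Everything below is an illustrative assumption rather than the paper's implementation: the vocabulary, the matrix `W_U`, the activation values, the helper names `layer_norm` and `logit_lens`, and the choice to apply layer normalization only to the residual stream (RS) and not to individual attention-head or feed-forward (FF) outputs.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def logit_lens(activation, W_U, vocab, apply_ln=True):
    """Project a d_model-sized activation onto the vocabulary.

    W_U is a (d_model x vocab_size) unembedding matrix given as nested lists.
    Returns (token, probability) pairs sorted from most to least likely.
    """
    h = layer_norm(activation) if apply_ln else activation
    logits = [
        sum(h[i] * W_U[i][j] for i in range(len(h)))
        for j in range(len(vocab))
    ]
    # Numerically stable softmax over the vocabulary.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(vocab, probs), key=lambda pair: -pair[1])


# Hypothetical toy model: d_model = 4, vocabulary of 3 tokens.
vocab = ["cat", "dog", "car"]
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]

# Hypothetical activations at one token position in one layer.
resid_pre = [0.1, 0.0, 0.2, 0.0]   # residual stream entering the layer
head_out = [0.0, 2.0, 0.0, 0.0]    # what one attention head writes to the RS
ff_out = [1.5, 0.0, 0.0, 0.5]      # what the FF sublayer writes to the RS
resid_post = [r + a + f for r, a, f in zip(resid_pre, head_out, ff_out)]

lens = {
    "residual stream": logit_lens(resid_post, W_U, vocab),
    "attention head": logit_lens(head_out, W_U, vocab, apply_ln=False),
    "FF sublayer": logit_lens(ff_out, W_U, vocab, apply_ln=False),
}

for name, ranking in lens.items():
    token, prob = ranking[0]
    print(f"{name}: top token = {token!r}, p = {prob:.3f}")
```

In this toy setup the attention head pushes toward one token while the FF sublayer pushes toward another, and the residual stream reflects their sum. With a real model the activations would come from hooks on the forward pass, and the final layer norm would normally carry learned parameters.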

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper offers a practical, comprehensive guide to the fast-growing MI field for everyone from beginners to experienced researchers, and it sets a new standard for interpretability research through a task-centered taxonomy and concrete workflows. It is especially valuable for presenting practical considerations for real-world application alongside future directions.

Related papers worth reading

Foundational work
Prior work that provides the theoretical and empirical basis for the Matthew effect.
Foundational work
A study that provides the theoretical foundation for mechanistic interpretability of transformer-based language models.
Foundational work
A foundational study that provides a mechanistic-interpretability-based methodology for cross-domain policy transfer.
Foundational work
A review of mechanistic interpretability for transformer-family models; it serves as the theoretical background for the network-to-brain neural comparison approach of 3232.
Foundational work
017 summarizes the latest trends in interpreting transformer mechanisms and provides a direct theoretical basis for 3281's methodology for interpreting the internal representations of foundation models.
Foundational work
017 is a survey of interpretability and mechanism analysis for transformer-based models; it serves as theoretical background for 3263's design combining a type system with a compiler and for its discussion of internal interpretability.
Alternative approach
A study covering a similar approach to interpreting the internal mechanisms of transformer models.
Alternative approach
Comprehensively analyzes interpretability tools and methodologies from an AI safety standpoint and compares the different perspectives of transformer interpretation frameworks.
Alternative approach
Related work on the interpretability of neural network models and explainable AI.
Alternative approach
A similar study that analyzes the internal workings of large language models.
Alternative approach
Treats the analysis of LLM internal representations from a multidimensional safety-alignment perspective, which allows a discussion of how it differs from the MI review and its practical application.
Alternative approach
A review of interpretability for pretrained transformers that offers a complementary view on interpreting the structural differences between PLMs and NLMs.
Alternative approach
An in-depth analysis of bias, data bias, and model reliability within transformer interpretation frameworks; it offers a contrasting interpretation of the peptide-design bias problem described in 3248.
Follow-up work
The paper A practical review of mechanistic interpretability for transformers discusses mechanistic interpretability methods for transformer-family models in detail, complementing the theoretical review of 527 so that it can be applied in practice.
Follow-up work
Analyzes when and how interpretability is effective in evaluating the uncertainty and reliability of LLMs, extending the scope of the practical guide.
Follow-up work
Surveys interpretation paradigms for AI memory and internal structure, making it possible to connect the transformer MI discussion with recent memory-based trends.
Follow-up work
A review of mechanism-interpretation methodology for transformer models; it can be cross-examined against the CLT approach proposed in ProtoMech.