BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova | Date: 2018 | DOI: 10.48550/ARXIV.1810.04805


Essence

Figure 1

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architec-

BERT proposes a language representation model that pre-trains deep bidirectional Transformer representations using two objectives: masked language modeling (MLM) and next sentence prediction (NSP). Unlike earlier unidirectional language models, it conditions on both left and right context, and with fine-tuning alone it achieves state-of-the-art results on 11 NLP tasks.
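As a rough illustration of the MLM objective named above, the following Python sketch corrupts a token sequence and records the prediction targets. The 15% selection rate and the 80/10/10 replacement split come from the BERT paper rather than from this review. The function name `mask_tokens`, the toy vocabulary, and the seed are hypothetical choices made for illustration, and real implementations work on WordPiece ids rather than strings.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a masked-LM training example.

    labels[i] holds the original token at every selected position and
    None elsewhere, so the loss is computed only on selected positions.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:
            continue  # never select [CLS] / [SEP]
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK  # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
toy_vocab = ["my", "dog", "is", "hairy", "cat", "apple"]
corrupted, targets = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=3)
print(corrupted)
print(targets)
```

Because the model cannot tell which input tokens were replaced or left alone, it has to build a contextual representation of every position from both sides, which is the sense in which the pre-training is bidirectional.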

Motivation

Achievement

GLUE benchmark: 80.5% (7.7-point absolute improvement), MultiNLI accuracy: 86.7% (4.6-point absolute improvement), SQuAD v1.1 F1: 93.2 (1.5-point absolute improvement), SQuAD v2.0 F1: 83.1 (5.1-point absolute improvement). In total, BERT sets the state of the art on 11 NLP tasks, and it supports a wide range of tasks with only small task-specific architecture changes.
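"Absolute improvement" here means a difference in score points, not a relative percentage. As a small worked check, the sketch below subtracts each quoted gain from the quoted score to recover the prior best result implied by the figures above. The dictionary layout and the helper name `implied_previous_best` are hypothetical, and the numbers are only those quoted in this review.

```python
# (reported score, absolute gain in points), as quoted in this review
results = {
    "GLUE": (80.5, 7.7),
    "MultiNLI": (86.7, 4.6),
    "SQuAD v1.1 F1": (93.2, 1.5),
    "SQuAD v2.0 F1": (83.1, 5.1),
}


def implied_previous_best(score, gain):
    """Prior best score implied by an absolute (point) improvement."""
    return round(score - gain, 1)


implied = {
    name: implied_previous_best(score, gain)
    for name, (score, gain) in results.items()
}

for name, prev in implied.items():
    print(f"{name}: implied previous best = {prev}")
```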

How

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 5/5 | Significance: 5/5 | Clarity: 5/5 | Overall: 5/5

Overall assessment: Through bidirectional pre-training, BERT marked a turning point in language representation modeling, achieving state-of-the-art performance across a broad range of NLP tasks with a single unified architecture. It is very strong in technical completeness, experimental validation, and real-world impact, and it is a key paper that laid the foundation of modern NLP.

Related Papers

Foundational work
BERT's bidirectional pre-training structure is the core methodological foundation for multilingual models such as XLM-R.
Foundational work
BioBERT applies a domain-specific corpus (PubMed and others) to the BERT pre-training framework, so it shares the same theoretical and methodological basis.
Foundational work
Experimentally verifies the practical benefit of further domain- and task-specific adaptation of pre-trained language models such as BERT.
Foundational work
DeepSeek-V3 is likewise an evolution of BERT's pre-training, self-attention, and large-scale data strategies, with strong methodological continuity.
Foundational work
Provides the theoretical basis for Transformer architecture improvements and language model training.
Alternative approach
SciBERT is another domain adaptation case that pre-trains BERT on scientific papers, and it is useful for comparing scientific language models.
Alternative approach
A similar study on the performance of pre-trained language models on NLP tasks.
Alternative approach
A related study on the various applications of Transformer-based language models.
Alternative approach
A related study on bidirectional language representation learning or fine-tuning-based NLP.
Alternative approach
A related study on applying pre-trained language models to downstream tasks.
Follow-up work
BioBERT extends BERT's architecture through pre-training tailored to the biomedical domain, making it directly relevant to domain-specific NLP models.
Follow-up work
The Don't Stop Pretraining paper systematically explores pre-training strategies for adapting models such as BERT to new domains.
Follow-up work
A related study on language models that extend or improve on BERT.