Can large language models provide useful feedback on research papers? A large-scale empirical analysis

Authors: Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding | Date: 2023.10 | DOI: 10.48550/arXiv.2310.01783


Essence

Figure 2. Retrospective analysis of LLM and human scientific feedback. a, Retrospective overlap analysis

This paper systematically analyzes whether GPT-4 can automatically generate useful feedback on scientific papers. Using 3,096 papers from Nature journals and 1,709 papers from ICLR, it compares the overlap between feedback from the LLM and from human reviewers, and it evaluates the usefulness of LLM feedback through a user survey of 308 researchers.

Motivation

Achievement

Figure 1. Characterizing the capability of LLM in providing helpful feedback to researchers. a, Pipeline for

LLM-human feedback overlap: On average, 30.85% for Nature journals and 39.23% for ICLR, comparable to the overlap between human reviewers (Nature 28.58%, ICLR 35.25%).
User perception: 57.4% of researchers rated GPT-4 feedback as helpful or very helpful, and 82.4% judged it more useful than feedback from some human reviewers.
Performance on weaker papers: On rejected ICLR papers the overlap rises to 43.80%, suggesting that LLM feedback aligns more closely with human reviewers on lower-quality papers.
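The overlap percentages above can be read as hit rates. The following is a minimal sketch, assuming overlap is measured as the fraction of comments in one review that are matched to at least one comment in another review. The function name `hit_rate`, the comment IDs, and the match pairs are hypothetical illustrations and are not taken from the paper's code.

```python
from typing import Iterable, Sequence, Tuple


def hit_rate(comments_a: Sequence[str], matches: Iterable[Tuple[str, str]]) -> float:
    """Fraction of comments in A matched to at least one comment in B.

    `matches` holds (a_id, b_id) pairs judged to raise the same point.
    How those pairs are produced (e.g. by a semantic matcher) is out of scope here.
    """
    if not comments_a:
        return 0.0
    # A comment counts once, no matter how many comments in B it matches.
    matched_a = {a_id for a_id, _ in matches}
    hits = sum(1 for a_id in comments_a if a_id in matched_a)
    return hits / len(comments_a)


# Hypothetical example: four LLM comments, two of which match a human comment.
llm_comments = ["llm-1", "llm-2", "llm-3", "llm-4"]
matched_pairs = [("llm-1", "human-2"), ("llm-3", "human-1"), ("llm-3", "human-4")]
print(hit_rate(llm_comments, matched_pairs))  # 0.5
```

Under this assumed definition the measure is asymmetric: swapping the roles of A and B generally gives a different value, because the denominator changes.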

How

Figure 3. LLM based feedback emphasizes certain aspects more than humans. LLM comments on the

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 5/5 Clarity: 4/5 Overall: 4/5

Overall assessment: This paper is an important contribution, the first to show with large-scale empirical data that LLMs can provide real value in generating scientific feedback. The comparison with human reviewers is systematic, and the user survey strengthens the case for practical usefulness. However, the paper proposes no remedies for the LLM's methodological weaknesses and topical bias, which limits its practical application.

Related papers worth reading

Foundational work
The SciCode benchmark from paper 712, like paper 184, serves as supporting evidence for evaluating the role of LLMs in real research support (feedback, coding, and so on).
Foundational work
Prior work that provides the theoretical foundation for RAG-based text generation.
Alternative approach
Paper 184 is another evaluation of how useful LLM feedback is for paper review, and can be read as complementary to paper 1087.
Alternative approach
Takes a similar approach, studying the ability of LLMs to evaluate academic papers and generate feedback.
Alternative approach
Related work that analyzes the ability of LLMs to evaluate scientific text.
Alternative approach
Empirically tests whether LLMs can give useful feedback in academic peer review, offering a view that contrasts with the security-risk discussion in paper 104.
Alternative approach
Compares how LLMs provide qualitative feedback relative to human reviewers when evaluating papers and research, broadening the perspective on human/AI comparison.
Alternative approach
A similar study that evaluates the utility of AI-based academic review systems.
Alternative approach
A similar study that compares performance on text-evaluation tasks using GPT models.
Alternative approach
Tests, through a variety of evaluation frameworks, how suitable AI-generated reviews are for real evaluation.
Follow-up work
Paper 184 examines from multiple angles whether LLMs give real help with paper feedback and review, thereby assessing the effectiveness of the automated feedback system proposed in paper 227.
Follow-up work
The Peer Review as A Multi-Turn Dialogue paper analyzes LLM-based review from a multi-turn dialogue perspective, extending the discussion of real-world applicability.
Follow-up work
Adds an analysis, in terms of scale and realism, of the scalability and use of LLM-based paper review evaluation.
Follow-up work
Related work that extends LLM-based systems for generating feedback on research papers.
Follow-up work
The Can large language models provide useful feedback on research paper evaluates the actual ability of LLMs to critique and give feedback in reviews, extending the practical evaluation items of the AAAR-1.0 benchmark.
Application
The CoAuthor paper uses large-scale data to analyze the collaborative feedback and writing-support capabilities of LLMs during actual paper writing, drawing implications for LLM feedback in the writing process as well as at the peer-review stage.
Counterpoint/critique
Paper 104 presents the opposing view, addressing the risks and vulnerabilities that LLMs can exhibit in peer review.
Counterpoint/critique
Paper 184 evaluates the limits and practical outcomes of the feedback LLMs can give on scientific literature, offering a critical view of the QA performance improvement claims made in paper 530.