Automatically evaluating the paper reviewing capability of large language models

Authors: Mourad Ouzzani, Hossam M. Hammady, Zbys Fedorowicz, Ahmed K. Elmagarmid | Date: 2025 | URL: https://arxiv.org/abs/2502.17086


Essence

Figure 1: We introduce a focus-level evaluation frame-

To evaluate whether LLM-generated paper reviews focus on the same important aspects as human expert reviewers, the paper proposes a focus-level evaluation framework. It finds that LLMs focus excessively on technical soundness while overlooking the assessment of novelty.
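As a rough illustration of what a focus-level comparison could look like, the sketch below counts how often each aspect is raised in a set of review points and reports the per-aspect difference in share between LLM and human reviews. The aspect labels, the toy data, and the function names `focus_distribution` and `focus_gap` are assumptions made for this example; they are not taken from the paper, and the paper's actual measure may differ.

```python
from collections import Counter


def focus_distribution(labels):
    """Return the share of review points assigned to each aspect label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aspect: n / total for aspect, n in counts.items()}


def focus_gap(llm_labels, human_labels):
    """Per-aspect difference in focus share (LLM minus human).

    A positive value means the LLM reviews raise the aspect more often
    than the human reviews do; a negative value means less often.
    """
    llm = focus_distribution(llm_labels)
    human = focus_distribution(human_labels)
    aspects = sorted(set(llm) | set(human))
    return {a: llm.get(a, 0.0) - human.get(a, 0.0) for a in aspects}


# Toy, made-up labels (hypothetical): one aspect label per strength/weakness point.
human_points = ["novelty", "novelty", "soundness", "clarity"]
llm_points = ["soundness", "soundness", "soundness", "clarity"]

gap = focus_gap(llm_points, human_points)
for aspect, diff in gap.items():
    print(f"{aspect}: {diff:+.2f}")
```

In this toy data the gap is negative for novelty and positive for soundness, which mirrors the direction of the finding summarized above without reproducing any of the paper's numbers.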

Motivation

Achievement

Figure 4: A visualization of focus distributions by target/aspect and strength/weakness, in a descending order of

How

Figure 2: The overall process of automated focus-level evaluation. We first extracted strengths and weaknesses

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

Overall assessment: This paper introduces a new focus-level perspective to the evaluation of LLM reviews, compensating for blind spots in existing evaluations, and enables large-scale analysis through an automated framework. In particular, the finding that LLMs consistently overlook novelty concretely exposes a quality problem in academic reviewing, and the public dataset is expected to make an important contribution to follow-up research.

Papers worth reading together

Foundational work
Prior work that provides the methodological foundation for synthesizing peer review opinions and generating meta-reviews.
Foundational work
Paper 128 is a framework for automatically evaluating the review-writing ability of LLMs, and provides relevant theory for evaluating and developing the OpenReviewer system.
Foundational work
537 presents a framework for focus-level evaluation of LLM reviews and the blind-spot problem, and is the theoretical basis for the review bias analysis in 128.
Foundational work
Builds on the benchmark framework and evaluation criteria for key evaluation aspects (Aspect-focused Review Analysis).
Alternative approach
A similar study on automating and evaluating paper review that offers a complementary perspective.
Alternative approach
Can be seen as a case that measures the paper-reviewing ability of AI using different evaluation criteria and datasets.
Alternative approach
128 proposes a method for automatically evaluating the reviewing capability of LLMs, and can be compared with 678 with respect to the role of peer review assistant.
Alternative approach
Related work that takes a different approach to evaluating the quality of LLM-based automated paper review.
Alternative approach
Presents both an automated method for evaluating whether LLMs can actually perform well as reviewers and the experimental limitations of that method.
Alternative approach
A study with the similar goal of evaluating the bias or quality of AI-based paper review systems.
Alternative approach
An example of automated evaluation of LLM-based paper review; it shows the direction of academic evaluation for techniques that evaluate and detect AI-generated text.
Alternative approach
Addresses the similar problem of evaluating the quality of LLM-generated reviews from a different perspective.
Alternative approach
Presents a similar approach to evaluating the paper reviewing ability of LLMs.
Alternative approach
ReviewEval also quantitatively measures the paper reviewing capability of LLMs, allowing a complementary comparison with this paper.
Alternative approach
Automatically evaluating the paper reviewing capability of llms presents different evaluation metrics and a different framework for assessing LLM reviewing ability, which makes it possible to compare aspect-wise evaluation methodologies.
Alternative approach
Paper 128 compares and analyzes various metrics and methodologies for LLM-based automatic review evaluation, and complements the "lazy review" detection in paper 481.
Alternative approach
Paper 128 addresses the similar problem of evaluating the review-writing ability of LLMs, but differs in its evaluation metrics and experimental setup.
Alternative approach
Because 128 proposes a method for automatically evaluating the paper reviewing ability of LLMs, it is well suited to reading alongside 809 (AI-based peer review automation) to compare effectiveness and limitations.
Follow-up work
Through LLM-based automatic evaluation of related work in papers and a comparison of summarization performance, it shows the evolution from classical joint attention to recent approaches.
Follow-up work
128 analyzes in more detail the focus-level evaluation of LLM review generation and the comparison with human expertise proposed in 126, and derives performance limitations and areas for improvement.
Follow-up work
The paper Automatically evaluating the paper reviewing capability of llms covers comparisons of AI natural language processing performance, including LLM language-based back-translation, and extends the experimental insights of 690.
Follow-up work
Paper 128 presents an LLM-based tool for automatically evaluating paper reviews, showing what becomes possible when the AI search engine of paper 904 is extended to research verification and evaluation.
Follow-up work
Further extends the evaluation framework for LLM-based review analysis to comprehensively benchmark a variety of research agents.
Counterargument/Critique
183 evaluates the role of LLMs in detecting misinformation in scientific papers, and can be linked from a critical perspective to 128, which discusses the limitations and role of LLM reviewing.
Counterargument/Critique
Analyzes the limitations of LLM review generation from an automatic evaluation perspective, and presents the limitations of the peer-review-based evaluation approach of Pre along with ways to complement it.