SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents

Authors: Kunlun Zhu, Jiaxun Zhang, Ziheng Qi, Ning Shang, Zijia Liu, Pengfei Han, Yue Su, Haofei Yu, Jiaxuan You | Date: 2025 | DOI: arXiv:2505.23559


Essence

Figure 1

SafeScientist responds to malicious or dangerous prompts with a refusal and, unlike general AI scientist frameworks, uses risk awareness to handle high-risk topics safely.

This paper proposes the SafeScientist framework to systematically address the ethical and safety problems that arise when LLM-based AI scientist agents automate scientific discovery. It integrates multi-layer defense mechanisms (prompt monitoring, agent collaboration monitoring, tool-use monitoring, ethical reviewer) to ensure safety across the whole scientific research pipeline.

Motivation

Achievement

Figure 2

SafeScientist's end-to-end pipeline: input screening (Prompt Monitor), multi-agent discussion (Discussion Stage), tool use (Tool Use Stage), and paper writing (Writing Stage), integrated with SciSafetyBench-based attack/defense evaluation.

  1. SafeScientist framework: integrates four defense mechanisms (Prompt Monitor, Agent Collaboration Monitor, Tool-Use Monitor, Paper Ethic Reviewer) into existing lightweight frameworks such as AI Scientist and Tiny Scientist, providing safety oversight across the whole scientific research pipeline. Safety performance improves by 34.69% over existing AI scientist frameworks.
  2. SciSafetyBench benchmark: consists of 240 high-risk scientific discovery tasks across six scientific domains (physics, chemistry, biology, materials science, computer science, medicine), plus 30 scientific tools and 120 tool-specific risk scenarios. Robustness is verified against a range of adversarial attacks (Base64, DAN, Inception, and others).
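
Base64 is listed above as one of the adversarial attacks: the prompt is encoded so that a screen reading only the surface text sees nothing suspicious. The sketch below illustrates one possible countermeasure, which is to decode before screening. It is not taken from the paper; `decode_if_base64`, `screen`, and the `is_risky` callback are hypothetical names introduced only for illustration.

```python
import base64
import binascii
from typing import Callable


def decode_if_base64(text: str) -> str:
    """Return the decoded text if `text` is valid Base64 holding UTF-8,
    otherwise return `text` unchanged."""
    candidate = text.strip()
    # Base64 output length is always a multiple of 4; skip very short strings.
    if len(candidate) < 8 or len(candidate) % 4 != 0:
        return text
    try:
        raw = base64.b64decode(candidate, validate=True)
        return raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return text


def screen(text: str, is_risky: Callable[[str], bool]) -> bool:
    """Flag the prompt if either its surface form or its decoded form is risky.

    `is_risky` is a stand-in for whatever safety classifier is in use
    (an assumption, not the paper's component).
    """
    return is_risky(text) or is_risky(decode_if_base64(text))
```

Checking both the raw and the decoded form keeps the screen from being bypassed by encoding alone, while ordinary prompts pass through `decode_if_base64` unchanged.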

How

Figure 2

Defense mechanisms (Defense Methods):

Research pipeline:

  1. User instruction input → safety check by the Prompt Monitor
  2. Domain/task-type analysis → dynamic activation of a group of expert agents
  3. Collaborative multi-agent discussion (overseen by the Agent Collaboration Monitor)
  4. Calls to scientific tools/search modules (results verified by the Tool-Use Monitor)
  5. Writing/refinement module → final check by the Paper Ethic Reviewer
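
The five steps above can be read as a chain of checkpoints, where any monitor can stop the run before the next stage starts. The following is a minimal sketch of that control flow under stated assumptions, not the authors' implementation: `looks_risky` is a toy keyword check standing in for a real safety classifier, and `RunLog`, `select_experts`, and `run_pipeline` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Placeholder phrases standing in for a real safety classifier (assumption).
RISKY_TERMS = ("nerve agent", "weaponize", "bypass containment")


def looks_risky(text: str) -> bool:
    """Toy check: True if the text contains any placeholder risky phrase."""
    lowered = text.lower()
    return any(term in lowered for term in RISKY_TERMS)


@dataclass
class RunLog:
    passed: List[str] = field(default_factory=list)  # checkpoints cleared
    refused_at: Optional[str] = None                 # checkpoint that refused
    paper: Optional[str] = None                      # final draft, if any


def select_experts(prompt: str) -> List[str]:
    """Step 2 stand-in: pick expert agents from a crude domain guess."""
    domain = "chemistry" if "reaction" in prompt.lower() else "general"
    return [f"{domain}_expert", "safety_expert"]


def run_pipeline(
    prompt: str,
    discuss: Callable[[str, List[str]], str],
    call_tool: Callable[[str], str],
    write: Callable[[str, str], str],
) -> RunLog:
    log = RunLog()

    def cleared(checkpoint: str, text: str) -> bool:
        """Record the checkpoint as passed, or mark the run as refused."""
        if looks_risky(text):
            log.refused_at = checkpoint
            return False
        log.passed.append(checkpoint)
        return True

    # Step 1: screen the user instruction.
    if not cleared("prompt_monitor", prompt):
        return log
    # Step 2: activate expert agents for the detected domain.
    experts = select_experts(prompt)
    # Step 3: multi-agent discussion, screened before it is used.
    plan = discuss(prompt, experts)
    if not cleared("agent_collaboration_monitor", plan):
        return log
    # Step 4: tool or search call, with the result screened.
    result = call_tool(plan)
    if not cleared("tool_use_monitor", result):
        return log
    # Step 5: draft the paper and run the final ethics review.
    draft = write(plan, result)
    if cleared("paper_ethic_reviewer", draft):
        log.paper = draft
    return log
```

Passing the discussion, tool, and writing stages in as callables keeps the sketch independent of any particular agent or tool backend. A refused run reports which checkpoint stopped it in `refused_at`.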

Originality

Limitation & Further Study

Evaluation

Overall assessment: SafeScientist presents a timely and comprehensive framework for the ethical and safe deployment of LLM-based AI scientists, and SciSafetyBench is a valuable asset for systematically evaluating the risks specific to scientific contexts. However, reducing the false positive rate in real scientific settings and strengthening defenses against more sophisticated adversarial attacks remain future work.

Related papers worth reading

Foundational work
LLM trustworthiness evaluation methodology underpins SafeScientist's risk awareness and the evaluation of multi-layer AI scientist frameworks.
Foundational work
The Guided by guardrails paper discusses in detail the control methods that form the theoretical basis of the safety and ethical defense mechanisms SafeScientist proposes.
Foundational work
The Multi-agent risks from advanced AI paper addresses the long-term risks and safety problems caused by multiple agents, and provides theoretical grounding for the design of SafeScientist's multi-layer defense structure.
Foundational work
Theoretical discussion of the reliability and safety of AI agents provides a basis for understanding the limitations that arise where SafeScientist's risk-management-centered framework is not applied.
Different approach
The Piflow paper focuses on information-theoretic principles and fundamental optimization in the scientific discovery process, whereas SafeScientist emphasizes risk awareness and ethical control.
Different approach
The SafeScientist paper covers risk awareness and mitigation protocols for LLM-based scientific experiments, presenting an alternative approach that can be compared with reinforcement learning safety.
Follow-up work
A discussion of the changes and challenges that LLM- and AI-agent-based scientific discovery will bring, and of ways to control them, which adds critical insight into ethics- and safety-grounded AI scientist systems.
Follow-up work
Covers LLM trustworthiness evaluation and tool-use, ethics, and safety evaluation; 692 in particular extends the evaluation scope of 846 through risk awareness in the scientific discovery process and a multi-layer safety system.
Follow-up work
The Toward Reliable Scientific Hypothesis Generation paper focuses on ways to secure reliability in the scientific discovery process, and connects directly to SafeScientist's risk-aware safety mechanisms.