The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment

์ €์ž: Jonas Wilinski | ๋‚ ์งœ: 2026 | DOI: 10.48550/ARXIV.2603.03126 📄 PDF


Essence

Figure 1: Temporal coverage by source (symlog scale). Publication-year distributions for DOI-

๋ณธ ๋…ผ๋ฌธ์€ DuckDB์™€ Parquet ํŒŒ์ผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ตฌ์ถ•๋œ Science Data Lake๋ฅผ ์ œ์‹œํ•˜๋ฉฐ, Semantic Scholar, OpenAlex, SciSciNet ๋“ฑ 8๊ฐœ์˜ ๊ฐœ๋ฐฉํ˜• ๋ฐ์ดํ„ฐ ์†Œ์Šค๋กœ๋ถ€ํ„ฐ 2์–ต 9,300๋งŒ ๊ฐœ์˜ ๋…ผ๋ฌธ์„ ํ†ตํ•ฉํ•˜๊ณ , BGE-large ์ž„๋ฒ ๋”ฉ์„ ์‚ฌ์šฉํ•œ ์˜จํ†จ๋กœ์ง€ ์ •๋ ฌ์„ ํ†ตํ•ด OpenAlex์˜ 4,516๊ฐœ ์ฃผ์ œ๋ฅผ 13๊ฐœ์˜ ๊ณผํ•™ ์˜จํ†จ๋กœ์ง€์— ๋งคํ•‘ํ•œ๋‹ค.

Motivation

Achievement

Figure 5: Ontology reach heatmap showing the number of high-quality mappings (similarity ≥

Key achievements:

How

Figure 4: UMAP projection of BGE-large embeddings for OpenAlex topics (points) and matched

Originality

Limitation & Further Study

Evaluation

Novelty: 4/5 Technical Soundness: 4/5 Significance: 4/5 Clarity: 4/5 Overall: 4/5

์ดํ‰: ๋ณธ ๋…ผ๋ฌธ์€ ํ•™์ˆ  ๋ฐ์ดํ„ฐ ํ†ตํ•ฉ์˜ ์˜ค๋žœ ๋‚œ์ œ๋ฅผ ์‹ค์งˆ์ ์œผ๋กœ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋œ ์ž˜ ๊ตฌ์„ฑ๋œ ์ธํ”„๋ผ ์‹œ์Šคํ…œ์œผ๋กœ, ๋‹ค์ค‘ ์†Œ์Šค ์Šคํ‚ค๋งˆ ๋ณด์กด, ๋™์  ์ ์‘์„ฑ, ๊ทœ๋ชจ ์žˆ๋Š” ์˜จํ†จ๋กœ์ง€ ์ •๋ ฌ ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ๊ณผํ•™ ๋ฉ”ํŠธ๋ฆญ์Šค ์—ฐ๊ตฌ ์ปค๋ฎค๋‹ˆํ‹ฐ์— ์œ ์šฉํ•œ ์ž์›์„ ์ œ๊ณตํ•œ๋‹ค. ๋‹ค๋งŒ ์Šค๋ƒ…์ƒท ๊ธฐ๋ฐ˜ ์„ค๊ณ„์™€ ์ œํ•œ๋œ ๊ฒ€์ฆ ๊ทœ๋ชจ๊ฐ€ ์šด์˜์ƒ ๋ฐ ๋ฐฉ๋ฒ•๋ก ์  ๊ฐœ์„ ์˜ ์—ฌ์ง€๋ฅผ ๋‚จ๊ธด๋‹ค.

๊ฐ™์ด ๋ณด๋ฉด ์ข‹์€ ๋…ผ๋ฌธ

๊ธฐ๋ฐ˜ ์—ฐ๊ตฌ
๊ณผํ•™ ๋ฐ์ดํ„ฐ ํ†ตํ•ฉ๊ณผ ๊ฒ€์ƒ‰์„ ์œ„ํ•œ ๊ธฐ๋ฐ˜ ๋ฐ์ดํ„ฐ ์ธํ”„๋ผ๋ฅผ ์ œ๊ณตํ•˜๋Š” ๊ธฐ์ดˆ ์—ฐ๊ตฌ์ด๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
๋Œ€๊ทœ๋ชจ ํ•™์ˆ  ๋ฐ์ดํ„ฐ ์ฝ”ํผ์Šค๋ฅผ ๋‹ค๋ฅธ ๋ฐฉ์‹์œผ๋กœ ๊ตฌ์ถ•ํ•˜๊ณ  ์ œ๊ณตํ•˜๋Š” ๋Œ€์•ˆ์  ์—ฐ๊ตฌ์ด๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
๋Œ€๊ทœ๋ชจ ํ•™์ˆ  ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ฉํ•˜๊ณ  ์ ‘๊ทผ ๊ฐ€๋Šฅํ•œ ์ธํ”„๋ผ๋ฅผ ๊ตฌ์ถ•ํ•˜๋Š” ์œ ์‚ฌํ•œ ๋ชฉ์ ์˜ ์—ฐ๊ตฌ์ด๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
๊ธฐ์ดˆ ๋ชจ๋ธ์„ ๊ณผํ•™ ๋ฐœ๊ฒฌ์— ํ™œ์šฉํ•˜๋Š” ๋‹ค๋ฅธ ๋ฒค์น˜๋งˆํฌ๋‚˜ ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์‹œํ•œ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
์˜คํ”ˆ ํ•™์ˆ  ๋ฐ์ดํ„ฐ ์†Œ์Šค๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ ๋Œ€๊ทœ๋ชจ ๊ณผํ•™ ๋ฐ์ดํ„ฐ ๋ถ„์„ ํ™˜๊ฒฝ์„ ๊ตฌ์ถ•ํ•˜๋Š” ์œ ์‚ฌํ•œ ์ ‘๊ทผ๋ฒ•์„ ์ทจํ•œ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
๋Œ€๊ทœ๋ชจ ํ•™์ˆ  ๋ฐ์ดํ„ฐ๋ฅผ LLM๊ณผ NLP๋กœ ๋ถ„๋ฅ˜ ๋ฐ ๋ถ„์„ํ•˜๋Š” ์œ ์‚ฌํ•œ ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์„ ์ทจํ•œ๋‹ค.
๋‹ค๋ฅธ ์ ‘๊ทผ
LLM ๊ธฐ๋ฐ˜ ํ•™์ˆ  ์ฝ˜ํ…์ธ  ํ‰๊ฐ€์— ๊ด€ํ•œ ์œ ์‚ฌํ•œ ๋ฐฉ๋ฒ•๋ก ์  ์ ‘๊ทผ์˜ ์—ฐ๊ตฌ์ด๋‹ค.