MIRAI: Prediction and Generation of High-Impact Academic Research

Essence

Figure 2: Impact prediction model architecture. The title and abstract are encoded by a frozen text

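MIRAI is a deep learning framework that predicts a paper's impact five years after publication using only its title, abstract, and publication date. It predicts PageRank and citation counts on an arXiv academic graph. For papers published in 2021, it achieves a Spearman's ρ of 0.4686 for PageRank prediction and 0.6192 for citation prediction.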

Motivation

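Known: Prior work relied mainly on post-publication data or hand-crafted features. More recent approaches use deep text embeddings, but most are limited to binary classification or to a specific domain. In addition, the use of LLMs for scientific ideation is on the rise.
Gap: Existing impact prediction methods suffer from the delay inherent in citation-based metrics and are vulnerable to bias. They either depend on post-publication data or use only shallow text features. Content-based impact prediction directly at publication time has not yet been realized in a scalable and equitable way.
Why: The exponential growth in the volume of scientific publications, the spread of low-quality AI-generated papers, shrinking research funding, and the limits of peer review have made it urgent to identify high-impact research efficiently and fairly. A fast, low-bias approach based on language models is needed.
Approach: The title and abstract are encoded into a universal text embedding and fed into a regression model. An arXiv academic citation graph is constructed using the Semantic Scholar API, and PageRank and citation counts at a five-year horizon serve as the impact labels. The trained prediction model is then used in a proposed pipeline that generates research ideas steered toward high impact.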
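
To make the PageRank label concrete, here is a minimal sketch of power-iteration PageRank on a toy citation graph. The damping factor of 0.85, the function name `pagerank`, and the four-paper graph are illustrative assumptions. This summary does not state which PageRank settings or implementation the authors used.

```python
def pagerank(citations, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on a citation graph.

    `citations` maps each paper id to the list of paper ids it cites,
    so rank flows from the citing paper to the cited paper.
    The damping value is a conventional default, not taken from the paper.
    """
    # Collect every paper that appears as a citer or as a cited work.
    nodes = set(citations)
    for cited_list in citations.values():
        nodes.update(cited_list)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Papers that cite nothing ("dangling") spread their rank uniformly.
        dangling = sum(rank[p] for p in nodes if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}
        for citing, cited_list in citations.items():
            if cited_list:
                share = damping * rank[citing] / len(cited_list)
                for cited in cited_list:
                    new_rank[cited] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: A and B both cite C, and C cites D.
toy_graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
for paper, score in sorted(pagerank(toy_graph).items()):
    print(paper, round(score, 4))
```

Unlike a raw citation count, this score also depends on how highly ranked the citing papers are, which is one reason the two labels can behave differently as prediction targets.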

Achievement

Figure 3: Performance as measured by Spearman’s ρ for both impact targets across different test

Dataset: Built a dataset of about 3 million arXiv papers with authors, citations, and network-based impact labels (citation count, PageRank).
Impact prediction: Using only information available at publication time, achieved a Spearman's ρ of 0.62 for 5-year citation prediction and 0.47 for PageRank prediction.
Research generation: Proposed a research ideation pipeline built on the impact prediction framework. An LLM judge rated its outputs as higher-impact than the baseline at a 4:3 ratio.
Public release: The 5-year citation prediction model is publicly available (https://predict-paper-impact.vercel.app).
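
Both reported numbers are Spearman rank correlations between predicted and realized impact. The sketch below shows how ρ can be computed as the Pearson correlation of average ranks. The scores in the example are made up for illustration and are not data from the paper, and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(predicted, actual):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(actual))


# Made-up example: model scores vs. citation counts for four papers.
predicted_scores = [0.1, 0.4, 0.35, 0.8]
citation_counts = [3, 10, 50, 120]
print(spearman_rho(predicted_scores, citation_counts))
```

Because ρ depends only on rank order, it rewards a model for ordering papers correctly even when the predicted magnitudes are off, which suits heavy-tailed targets such as citation counts.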

Limitation & Further Study

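- The arXiv corpus over-represents computer science, mathematics, and physics, which limits generalization (extension to other fields and journals is left as future work).
- With a 5-year impact label, the true impact of recent papers may not yet have fully materialized.
- There is a gap in predictive performance between citation count and PageRank (0.62 vs 0.47), and the practical utility of PageRank prediction is not sufficiently validated.
- The evaluation of research generation relies solely on an LLM judge presumed to be unbiased, with no human expert evaluation.
- Post-publication metrics (journal prestige, author reputation, etc.) are entirely excluded, and the effect of this exclusion on actual predictive power is not analyzed.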

Evaluation

Novelty: 4/5
Technical Soundness: 4/5
Significance: 4/5
Clarity: 4/5
Overall: 4/5

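Overall assessment: In response to the pressing problem of rapidly growing scientific literature, this paper proposes the MIRAI framework, which predicts paper impact from content alone at publication time. The scalable and fair approach built on deep text embeddings, the large-scale dataset, and the extension to research generation are meaningful contributions. There is room for improvement, however, in domain generalization, the evaluation methodology (only an LLM judge is used), PageRank prediction performance, and validation of the real-world impact of the generated research ideas.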