Benchmark for evaluation and analysis of citation recommendation models

Author: Puja Maharjan | Date: 2024 | DOI: N/A

Essence

Figure 3: Data distribution of papers according to various

This paper proposes a standardized benchmark for systematically evaluating and comparing citation recommendation models. Using diagnostic datasets derived from S2ORC and S2AG, it aims to evaluate local citation recommendation systems across several metrics.

Motivation

Known: Citation recommendation research uses a wide range of methods and datasets, and systems are classified as global or local. Prior work has shown that context size, metadata, and the choice of neural network architecture all affect performance.
Gap: Unlike NLP, which has benchmarks such as GLUE and SuperGLUE, citation recommendation lacks standardized datasets and evaluation metrics, which makes consistent comparison across studies difficult.
Why: Citation recommendation is a practical application that improves the productivity of academic authors, and effective comparison of models and datasets can accelerate progress in the field.
Approach: Diagnostic datasets covering citation context, metadata, citation position, POS tagging, and other aspects are generated from S2ORC and S2AG. BM25 serves as the baseline model and is evaluated with Recall and Mean Reciprocal Rank (MRR).
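The two metrics named in the approach can be sketched as follows. This is a minimal illustration that uses the standard definitions of Recall@k and MRR, not the paper's released evaluation code; the function names and the toy paper IDs are assumptions made for the example.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant (actually cited) papers found in the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant paper, or 0.0 if none was retrieved."""
    for rank, paper_id in enumerate(ranked, start=1):
        if paper_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], relevants: Sequence[Set[str]]
) -> float:
    """Average reciprocal rank over all citation contexts (queries)."""
    pairs = list(zip(rankings, relevants))
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in pairs) / len(pairs)


# Toy data (hypothetical IDs): one ranking and one gold set per citation context.
rankings = [["p3", "p1", "p7"], ["p2", "p9", "p4"]]
gold = [{"p1"}, {"p4"}]

print(recall_at_k(rankings[0], gold[0], k=2))
print(mean_reciprocal_rank(rankings, gold))  # (1/2 + 1/3) / 2
```

In local citation recommendation each context usually has a single gold paper, so Recall@k reduces to a hit-or-miss indicator per context, while MRR also rewards placing the gold paper higher in the list.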

Achievement

Figure 5: Citation count distribution based on fields, where

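Diagnostic dataset construction: datasets built to account for citation position, POS tagging, field-based distribution, publication year, and other characteristics
Standardized evaluation metrics: a consistent evaluation scheme based on Recall and MRR
Public resources: source code, diagnostic datasets, and benchmark models released on GitHub and Google Drive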
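To make the idea of a diagnostic slice concrete, the sketch below tags each citation sentence with a coarse position bucket and then averages reciprocal rank per bucket. Everything here is an assumption for illustration: the `[CIT]` placeholder, the three-way bucketing, and the function names are hypothetical, and the paper may define citation position differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

CITE_MARK = "[CIT]"  # hypothetical placeholder for the citation marker


def position_bucket(sentence: str, marker: str = CITE_MARK) -> str:
    """Bucket the marker by its relative token position in the sentence."""
    tokens = sentence.split()
    if marker not in tokens:
        return "unknown"
    rel = tokens.index(marker) / max(len(tokens) - 1, 1)
    if rel < 1 / 3:
        return "beginning"
    if rel < 2 / 3:
        return "middle"
    return "end"


def mrr_by_slice(
    examples: Iterable[Tuple[str, Optional[int]]]
) -> Dict[str, float]:
    """examples: (slice label, 1-based rank of the gold paper, or None if missed)."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for label, rank in examples:
        totals[label] += 1.0 / rank if rank else 0.0
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in counts}


sentences = [
    "[CIT] proposed a ranking model",
    "The model of [CIT] uses attention here",
    "Ranking is done with BM25 [CIT]",
]
print([position_bucket(s) for s in sentences])
print(mrr_by_slice([("end", 1), ("end", None), ("middle", 2)]))
```

The whitespace split means a marker fused with punctuation (for example `[CIT].`) falls into the "unknown" bucket; a real pipeline would need a proper tokenizer. The same grouping works for any other attribute, such as field of study or publication year.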

How

Figure 2: Combined preceding POS of the citation.

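Extract and filter diagnostic datasets from the S2ORC dataset
Analyze citation contexts by position, grammatical features (POS), and distribution across academic fields
Measure baseline performance with the BM25 baseline model
Apply the Recall@k and MRR metrics
Compare performance across the different characteristics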
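The BM25 baseline step can be illustrated with a small, dependency-free scorer that ranks candidate papers against a citation context. This is a sketch of the standard Okapi BM25 formula, not the paper's implementation: the parameters `k1=1.5` and `b=0.75` are common defaults assumed here, the IDF is the non-negative variant used by Lucene, and the whitespace tokenizer and toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer; a real pipeline would likely do more."""
    return text.lower().split()


class BM25:
    """Okapi BM25 over a small in-memory candidate pool."""

    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        if not docs:
            raise ValueError("docs must not be empty")
        self.k1, self.b = k1, b
        self.ids = list(docs)
        self.tfs = [Counter(tokenize(docs[i])) for i in self.ids]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = (sum(self.lens) / len(self.lens)) or 1.0
        n = len(self.ids)
        df = Counter(term for tf in self.tfs for term in tf)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5))
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query_terms: List[str], idx: int) -> float:
        tf, length = self.tfs[idx], self.lens[idx]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query_terms
            if t in tf
        )

    def rank(self, query: str) -> List[str]:
        """Return candidate IDs sorted by descending BM25 score."""
        terms = tokenize(query)
        scored = [(self.score(terms, i), pid) for i, pid in enumerate(self.ids)]
        scored.sort(key=lambda pair: -pair[0])  # stable: ties keep input order
        return [pid for _, pid in scored]


# Toy candidate pool (hypothetical IDs and texts standing in for title/abstract).
candidates = {
    "p1": "BM25 probabilistic relevance framework for ranking documents",
    "p2": "graph convolutional networks for node classification",
    "p3": "BERT pre-training of deep bidirectional transformers",
}
context = "we rank candidate papers with the BM25 relevance function"
print(BM25(candidates).rank(context))
```

The ranked list returned by `rank` is the kind of output that Recall@k and MRR are then computed over, one list per citation context.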

Originality

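First systematic benchmark proposed for the citation recommendation field
Methodology for fine-grained, per-characteristic analysis using diagnostic datasets
Standardized evaluation scheme centered on local citation recommendation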

Limitation & Further Study

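Limited scope: the work focuses only on local citation recommendation and offers no integrated evaluation with global recommendation methods
Insufficient model evaluation: no benchmark results are presented for the variety of actual citation recommendation models (LSTM, BERT, GCN, etc.)
Further study needed: empirical evaluation and analysis of more models, analysis of performance differences across domains, and additional diagnostic aspects such as time-based filtering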

Evaluation

Novelty: 4/5 | Technical Soundness: 3/5 | Significance: 4/5 | Clarity: 4/5 | Overall: 4/5

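Overall assessment: The paper is valuable in proposing the standardized benchmark that the citation recommendation field has long needed. However, although it describes how the diagnostic datasets are generated, it reports too few benchmark results on actual models to demonstrate the benchmark's usefulness. An integrated evaluation of global and local methods and performance comparisons across a wider range of models are still needed.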

Related Papers

Foundational work

When large language models meet citation: A survey

This survey of LLM-based citation tasks provides theoretical support for the evaluation criteria of citation recommendation systems.

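Foundational work

Cited text spans for citation text generation

Benchmark for evaluation and analysis of citation recommendation models provides foundational data for citation recommendation and evaluation, and serves as an evaluation reference for citation text generation systems.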

Foundational work

Citebart: Learning to generate citations for local citation recommendation

Citebart is a foundational model that learns local citation generation, and it informs the evaluation guidelines for citation recommendation systems in this benchmark.

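Foundational work

ILCiteR: Evidence-grounded interpretable local citation recommendation

Benchmark for evaluation and analysis of citation recommendation models provides a framework for evaluating citation recommendation performance and serves as a basis for the design of ILCiteR's evaluation scheme.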

Foundational work

Taxonomy tree generation from citation graph

Taxonomy and hierarchical structure analysis in citation networks and in the citation recommendation task is a major research area of the benchmark paper.

Foundational work

Scirgc: Multi-granularity citation recommendation and citation sentence preference alignment

This work provides a methodological basis for citation sentence generation.

Alternative approach

Semantic Scholar

Related work on scholarly literature search and citation recommendation.

Alternative approach

S2ORC: The Semantic Scholar Open Research Corpus

A similar study that addresses evaluation methodology for citation recommendation systems.

Alternative approach

ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

The benchmark paper focuses on evaluation and benchmark construction for the citation recommendation task, so its approach to the problem differs from ScholarCopilot's.

Alternative approach

Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE

Related work on paper citation network analysis or recommendation systems.

Alternative approach

Public Profile Matters: A Scalable Integrated Approach to Recommend Citations in the Wild

A similar approach to citation recommendation benchmarks or evaluation metrics.

Follow-up work

OARelatedWork: A large-scale dataset of related work sections with full-texts from open access sources

OARelatedWork is a dataset of citation-related sections that provides practical data and use cases applicable to the design of this citation recommendation benchmark.

Follow-up work

HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction

The benchmark paper serves as a benchmark for various citation classification and recommendation tasks and indicates practical directions for evaluation.

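Follow-up work

Vulnerability of text-matching in ml/ai conference reviewer assignments to collusions

Issues of fairness and metric consistency in reviewer assignment and citation recommendation connect directly to the vulnerabilities of ML conference review systems that the reviewer-assignment paper examines.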

Follow-up work

NSF-SCIFY: Mining the NSF Awards Database for Scientific Claims

NSF-SCIFY extracts and analyzes scientific citation networks from US NSF data, extending the potential uses of this citation recommendation benchmark in terms of data scale.

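Follow-up work

Scirgc: Multi-granularity citation recommendation and citation sentence preference alignment

Scirgc provides frameworks for evaluating various forms of citation recommendation and citation robustness, extending research on citation recommendation benchmarks.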

Application

Wordcraft: A human-ai collaborative editor for story writing

The benchmark paper focuses on evaluation and recommendation problems in LLM-based creative domains, which connects to how the performance of creative writing tools is evaluated.

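Application

Vulnerability of text-matching in ml/ai conference reviewer assignments to collusions

This work can broaden the discussion of practical vulnerabilities and manipulation risks in evaluation metrics, including citation recommendation, reviewer assignment, and citation patterns between reviewers and authors.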
