End-to-end pipeline to analyze how influential citations (Fellow authors) refer to a target paper. Given a target paper title, the system pulls citing papers, filters for published works, checks Fellow status for top authors, extracts citation contexts from PDFs, scores each context with an LLM, and outputs reports plus trace logs for reproducibility.
- Overview
- Key Features
- Pipeline Overview (T1-T9)
- Requirements
- Installation and Setup
- Quick Start
- Outputs
- FAQ

## Overview

This project analyzes citations to a specified target paper, focusing on how influential authors (e.g., IEEE/ACM/AAAI Fellows) refer to it and the semantic stance of their remarks. The system automatically collects citing papers, filters for published works, identifies influential authors, extracts citation contexts from PDFs, scores each context with an LLM, and generates reproducible reports and trace logs.

## Key Features

- Automatic retrieval of the target paper and its citing papers
- Author metrics enrichment and Fellow status verification
- Full-text PDF extraction and citation-context extraction
- LLM-based scoring and report generation
- End-to-end trace logs for reproducibility

## Pipeline Overview (T1-T9)

- T1: RunContext initialization and parameter snapshot (run_id, directories, trace)
- T2: OpenAlex title match (target paper id/doi)
- T3: Cited-by retrieval (TopK)
- T4: DBLP publication-status check (keep published works only)
- T5: Author metrics enrichment (h-index, affiliation)
- T6: Fellow verification for TopK authors (IEEE/ACM/AAAI)
- T7: Candidate paper selection and fallback strategy (max h-index)
- T8: PDF download, full-text extraction, and citation-context extraction
- T9: LLM scoring and report output (per-paper report + summary report)
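The stages above read and write a per-paper workspace under `trace_log/`. As an illustrative sketch only (the pipeline creates this layout itself; directory names are taken from the Outputs section, and `example_paper` stands in for the target paper name):

```shell
# Illustrative only: the pipeline creates this layout itself during a run.
base="trace_log/example_paper"   # real runs use the target paper name here
mkdir -p "$base"/res/{pdf,fulltext,paper_ref_contexts,paper_ref_contexts_scored,reports/paper} "$base"/log
find "$base" -type d | sort     # list the created workspace directories
```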
## Requirements

- Linux
- Conda
- Python 3.10
- Chrome and Chromedriver (for `pcra.get_pdf`)
- minerU service (for PDF structured parsing; 5 GB GPU VRAM recommended)
- Network access to DuckDuckGo (proxy required)

## Installation and Setup

Create the conda environment and install dependencies:

```bash
conda create -n pcraPaper python=3.10
conda activate pcraPaper
pip install -r requirements.txt
```

`pcra.get_pdf` requires Chrome and Chromedriver binaries at fixed paths under the repo root:

- `chrome_bin/chrome-linux64/chrome`
- `chrome_bin/chromedriver-linux64/chromedriver`
These files come from the official Chrome for Testing releases. Download matching versions of `chrome-linux64.zip` and `chromedriver-linux64.zip`, then extract them into `chrome_bin/`. Example (replace the version number as needed):
```bash
VER=143.0.7499.40
mkdir -p chrome_bin
wget -O /tmp/chrome-linux64.zip "https://storage.googleapis.com/chrome-for-testing-public/${VER}/linux64/chrome-linux64.zip"
wget -O /tmp/chromedriver-linux64.zip "https://storage.googleapis.com/chrome-for-testing-public/${VER}/linux64/chromedriver-linux64.zip"
unzip -q /tmp/chrome-linux64.zip -d chrome_bin
unzip -q /tmp/chromedriver-linux64.zip -d chrome_bin
```

Verify versions:

```bash
chrome_bin/chrome-linux64/chrome --version
chrome_bin/chromedriver-linux64/chromedriver --version
```

Set up the minerU service in a separate conda environment:

```bash
conda create -n MinerUService python=3.12
conda activate MinerUService
export UV_DEFAULT_INDEX=https://mirrors.aliyun.com/pypi/simple/
pip install uv -i https://pypi.org/simple/
uv pip install "mineru[core]"
```

On first run, use a mirror to download the required model files, then convert a file once to trigger downloading all models:

```bash
export MINERU_MODEL_SOURCE=modelscope
mineru -p <input_path> -o <output_path>
```

Choose a directory for temporary files, then start the minerU service there:

```bash
cd <minerUtemp>
export MINERU_MODEL_SOURCE=modelscope
mineru-api --host 0.0.0.0 --port 18543
```

The service listens on port 18543.
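Before starting a run, it can help to confirm the minerU API is actually listening. The helper below is ours, not part of the project; it is a minimal TCP probe that assumes bash's `/dev/tcp` support:

```shell
# port_open HOST PORT -> exit 0 if something is listening (bash /dev/tcp probe)
port_open() { (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; }

if port_open 127.0.0.1 18543; then
  echo "minerU API reachable"
else
  echo "minerU API not reachable"
fi
```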
Copy the config template `config/llm_model_template.yaml`:

```bash
cp config/llm_model_template.yaml config/llm_model.yaml
```

Fill in your OpenRouter API key, keeping `api_key_env` and `api_key` consistent. The `openrouter_web_search` field must use an OpenRouter model; the `text` field can use any OpenAI-compatible model.
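The template file defines the actual schema; the fragment below is only an illustrative sketch of the fields called out above (key format and model names are placeholders, not recommendations):

```yaml
# Illustrative sketch only -- consult config/llm_model_template.yaml for the real schema.
api_key_env: OPENROUTER_API_KEY                   # env var name; keep consistent with api_key
api_key: "sk-or-..."                              # placeholder for your OpenRouter API key
openrouter_web_search: "openrouter/some-model"    # must be an OpenRouter model
text: "any-openai-compatible-model"               # any OpenAI-compatible model works here
```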
## Quick Start

```bash
conda activate pcraPaper
bash e2e_scripts/run_one_paper/run_one_paper.sh
```

Before running, set the following variables in `e2e_scripts/run_one_paper/run_one_paper.sh`:
```bash
# Target paper (cited) title
PAPER_TO_ANALYZE="Efficient Personalized PageRank Computation: The Power of Variance-Reduced Monte Carlo Approaches"
# Target paper author
TARGET_AUTHOR="Rong-Hua Li"
# Authors to ignore (citing papers that include these authors will be skipped)
IGNORE_AUTHORS='["Guoren Wang","Rong-Hua Li"]'
# Custom ID for later aggregation
RUN_ID="102"
```

The default output location is `trace_log/`.
It is recommended to run with tmux in the background. A single paper takes about 5 minutes.
Aggregate results under `trace_log/` into an Excel file:

```bash
conda activate pcraPaper
python e2e_scripts/export_result/export_summary_to_excel.py
```

## Outputs

Results and logs for a single paper are located at `trace_log/<target_paper_name>/`:
- `res/paper_ref_contexts/{paper_id}.json` (PDF, full text, citation contexts)
- `res/paper_ref_contexts_scored/{paper_id}.json` (LLM scores)
- `res/fulltext/{paper_id}.md`
- `res/pdf/{paper_id}.pdf`
- `res/reports/paper/{paper_id}.md`
- `res/reports/summary.md`
- `res/summary.json`
- `log/{run_id}.ndjson` (stage trace logs)
Aggregated results for multiple papers are written to `e2e_scripts/export_result/`.
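The stage trace log is NDJSON: one JSON record per line, appended as each stage finishes, which makes it easy to grep or post-process. A self-contained sketch (the `stage` field name here is illustrative, not the project's actual record schema):

```shell
# Build a tiny sample trace log and count its records.
log=$(mktemp)
printf '%s\n' \
  '{"stage":"T1_init","run_id":"102"}' \
  '{"stage":"T9_report","run_id":"102"}' > "$log"
grep -c '"stage"' "$log"   # one record per line -> prints 2
```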
## FAQ

- Cannot access DuckDuckGo: configure a proxy and set the related environment variables.
- Chrome or Chromedriver version mismatch: ensure both versions match and both binaries are placed in `chrome_bin/`.
- minerU service unavailable: confirm the service is running on port 18543 and that the runtime has GPU access.
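For the DuckDuckGo item above, a typical setup exports the standard proxy variables before running (the address below is a placeholder for your own proxy endpoint):

```shell
# Placeholder proxy address -- substitute your real proxy endpoint.
export HTTP_PROXY="http://127.0.0.1:7890"
export HTTPS_PROXY="$HTTP_PROXY"
echo "HTTPS proxy set to: $HTTPS_PROXY"
```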

