This directory contains the complete analysis pipeline for the TransCisPredict framework, an approach to estimating protein expression weights that incorporates both cis- and trans-variants, for performing proteome-wide association studies (PWAS) with predicted protein expression levels. The pipeline consists of eight sequential steps, each implemented as one or more standalone R scripts with a standardized configuration section.
Required R packages:
- tidyverse - Data manipulation and visualization
- data.table - High-performance data processing
- plink2R - Genotype data handling
- pecotmr - Statistical methods implementation
- rsample - Cross-validation and resampling
- broom - Model output formatting
- broom.mixed - Mixed model output formatting
- janitor - Data cleaning utilities
- geepack - Generalized estimating equations
- igraph - Graph analysis for kinship matrices
- future - Parallel processing framework
- furrr - Future-based parallel mapping
- R.utils - Additional utility functions
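
Before running the pipeline, it can help to confirm that the packages above are available. The following is a minimal sketch using only base R; it reports missing packages and does not install anything, since the installation source for each package is not stated here.

```r
# Packages listed above; the check itself uses only base R.
required_pkgs <- c(
  "tidyverse", "data.table", "plink2R", "pecotmr", "rsample", "broom",
  "broom.mixed", "janitor", "geepack", "igraph", "future", "furrr", "R.utils"
)

# requireNamespace() returns FALSE when a package cannot be loaded,
# and does not attach the package to the search path.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```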
- Sequential Execution: Run steps 1-8 in order, as each step depends on outputs from previous steps.
- Path Configuration: Each script has a standardized CONFIGURATION section at the top. Modify the placeholder paths for your environment:
  - Input data directories
  - Output directories
  - Genotype data locations
  - Reference files
- No Additional Setup Required: After configuring paths, scripts run autonomously without requiring further user input.
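
As an illustration of the path configuration described above, a CONFIGURATION section might look roughly like the sketch below. The variable names and placeholder paths are hypothetical and are not taken from the actual scripts; the check at the end is an optional addition that reports directories that do not exist yet.

```r
# ==== CONFIGURATION (hypothetical names; replace the placeholder paths) ====
input_dir     <- "/path/to/input"      # input data directory
output_dir    <- "/path/to/output"     # output directory
genotype_dir  <- "/path/to/genotypes"  # genotype data location
reference_dir <- "/path/to/reference"  # reference files

# Optional sanity check: list configured directories that are not found.
config_paths <- c(
  input = input_dir, output = output_dir,
  genotype = genotype_dir, reference = reference_dir
)
missing_dirs <- config_paths[!dir.exists(config_paths)]
if (length(missing_dirs) > 0) {
  message("Directories not found: ", paste(names(missing_dirs), collapse = ", "))
}
```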
Script: step1_process_olink_data.R
Process raw protein expression levels, perform quality control, and create individual protein expression files.
Script: step2_covariate_regression.R
Remove covariate effects (e.g., age, sex, age×sex, body mass index (BMI), and genetic ancestry principal components (PCs) 1-20) from protein expression levels to generate normalized protein expression (NPX) residuals for each protein.
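
The idea of covariate regression can be sketched with a linear model on simulated data. This is an illustration only: the column names, the simulated values, and the use of `lm()` are assumptions, and the actual script may differ.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; all column names are hypothetical.
dat <- data.frame(
  npx = rnorm(n),
  age = runif(n, 40, 70),
  sex = rbinom(n, 1, 0.5),
  bmi = rnorm(n, 27, 4)
)
pcs <- matrix(rnorm(n * 20), nrow = n,
              dimnames = list(NULL, paste0("PC", 1:20)))
dat <- cbind(dat, pcs)

# "age * sex" expands to age + sex + age:sex (the age x sex interaction).
covar_formula <- reformulate(
  c("age * sex", "bmi", paste0("PC", 1:20)),
  response = "npx"
)

# na.exclude keeps residuals aligned with the input rows if values are missing.
fit <- lm(covar_formula, data = dat, na.action = na.exclude)
dat$npx_resid <- residuals(fit)
```

By construction, the residuals are uncorrelated with every covariate in the model, which is what allows them to be carried forward as covariate-adjusted NPX values.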
Script: step3_LD_block_selection.R
Identify genomic regions with significant genetic signal using false discovery rate (FDR)-based selection to include in weight estimation.
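
FDR-based selection can be sketched as follows. The sketch assumes one summary p-value per LD block, the Benjamini-Hochberg procedure, and a 0.05 threshold; none of these details are specified above, and the data are simulated.

```r
set.seed(2)

# Hypothetical per-block summary: one p-value per LD block.
# Blocks 1-95 are simulated nulls; blocks 96-100 carry a strong signal.
blocks <- data.frame(
  block_id = paste0("block_", 1:100),
  p_value  = c(runif(95, min = 0.1, max = 1), rep(1e-8, 5))
)

fdr_threshold <- 0.05  # assumed threshold

# Benjamini-Hochberg adjusted p-values (assumed FDR procedure).
blocks$fdr <- p.adjust(blocks$p_value, method = "BH")

# Keep only blocks passing the FDR threshold for weight estimation.
selected_blocks <- blocks[blocks$fdr < fdr_threshold, ]
```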
Script: step4_cross_validation.R
Perform cross-validation to evaluate multiple statistical methods (BayesR, SuSiE, LASSO, and Elastic Net) for protein prediction using both cis- and trans- variants.
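
The cross-validation loop can be sketched in base R. In this illustration, ordinary least squares stands in for the four methods named above (it is not one of them), the genotypes are simulated, the number of folds is assumed, and CV-$R^2$ is assumed to be the squared correlation between out-of-fold predictions and observed values.

```r
set.seed(3)
n <- 300  # individuals
p <- 10   # variants
k <- 5    # folds (assumed)

# Simulated genotype dosages (0/1/2) standing in for cis- and trans-variants.
X <- matrix(rbinom(n * p, 2, 0.3), nrow = n,
            dimnames = list(NULL, paste0("snp", 1:p)))
y <- as.numeric(X[, 1:3] %*% c(0.5, -0.4, 0.3) + rnorm(n))

# Randomly assign each individual to one of k folds.
fold_id <- sample(rep(seq_len(k), length.out = n))
pred <- rep(NA_real_, n)

for (fold in seq_len(k)) {
  test_idx <- fold_id == fold
  train <- data.frame(y = y[!test_idx], X[!test_idx, , drop = FALSE])
  fit <- lm(y ~ ., data = train)  # stand-in for BayesR / SuSiE / LASSO / Elastic Net
  pred[test_idx] <- predict(fit, newdata = data.frame(X[test_idx, , drop = FALSE]))
}

# Pooled out-of-fold performance.
cv_r  <- cor(pred, y)
cv_r2 <- cv_r^2
```

Every individual is predicted exactly once by a model that never saw that individual during fitting, so `cv_r` and `cv_r2` are not inflated by overfitting.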
Scripts:
- step5a_evaluate_cv_performance.R - Evaluate cross-validation performance for four methods per protein
- step5b_identify_best_method.R - Select optimal method for each protein
- step5c_summarize_all_methods.R - Generate a comprehensive results summary for all proteins (optional), e.g., $CV-r$ and $CV-R^2$
Evaluate cross-validation results and identify the "optimal" method for each protein.
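
Selecting a method per protein can be sketched as below. The selection criterion (highest CV-$R^2$), the column names, and the protein names are assumptions, and the numbers are made up purely for illustration.

```r
# Made-up CV results: one row per protein-method pair.
cv_summary <- data.frame(
  protein = rep(c("protein_A", "protein_B"), each = 4),
  method  = rep(c("BayesR", "SuSiE", "LASSO", "ElasticNet"), times = 2),
  cv_r2   = c(0.12, 0.15, 0.10, 0.11, 0.30, 0.22, 0.31, 0.29)
)

# For each protein, keep the row with the highest CV-R^2 (assumed criterion).
best_rows <- lapply(
  split(cv_summary, cv_summary$protein),
  function(d) d[which.max(d$cv_r2), ]
)
best_method <- do.call(rbind, best_rows)
rownames(best_method) <- NULL
```

Note that `which.max()` returns the first maximum, so ties are broken by row order in this sketch.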
Script: step6_whole_sample_analysis.R
Estimate the weights in the complete reference sample, applying the "optimal" method identified for each protein.
Scripts:
- step7a_predict_npx_population.R - Predict NPX residuals for each protein for every individual in the target sample
- step7b_combine_npx_files.R - Combine the predicted NPX residual levels for all proteins into a single file
Apply weights to predict NPX residual levels in the target sample.
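
Applying weights amounts to a weighted sum of genotype dosages. The sketch below assumes a dosage matrix with variants in columns and a named weight vector; all names and values are hypothetical, and variants missing from the target genotypes are simply dropped, which may not match the handling in the actual scripts.

```r
set.seed(4)
n_target <- 5
snps <- paste0("snp", 1:4)

# Simulated target-sample dosages (individuals in rows, variants in columns).
G <- matrix(rbinom(n_target * length(snps), 2, 0.3), nrow = n_target,
            dimnames = list(paste0("id", 1:n_target), snps))

# Hypothetical weights for one protein; snp5 is absent from the target genotypes.
weights <- c(snp1 = 0.20, snp3 = -0.10, snp5 = 0.05)

# Restrict to variants present in both the weights and the genotype matrix.
shared <- intersect(names(weights), colnames(G))

# Predicted NPX residual = sum over variants of dosage x weight.
pred_npx <- as.numeric(G[, shared, drop = FALSE] %*% weights[shared])
names(pred_npx) <- rownames(G)
```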
Script: step8_pwas_analysis.R
Perform proteome-wide association studies using generalized estimating equations (GEE) to test associations between predicted NPX residual levels and complex traits.
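
A single association test with GEE can be sketched using `geepack::geeglm()`. The clustering variable (a family identifier), the exchangeable working correlation, the continuous trait, and the simulated data are all assumptions made for illustration; the actual script may define clusters and models differently.

```r
set.seed(5)
n_fam <- 100

# Simulated data: two related individuals per (hypothetical) family cluster.
dat <- data.frame(family_id = rep(seq_len(n_fam), each = 2))
dat$pred_npx <- rnorm(nrow(dat))
dat$trait <- 0.3 * dat$pred_npx +
  rep(rnorm(n_fam, sd = 0.5), each = 2) +  # shared within-family effect
  rnorm(nrow(dat))

# geeglm() expects rows from the same cluster to be contiguous.
dat <- dat[order(dat$family_id), ]

if (requireNamespace("geepack", quietly = TRUE)) {
  fit <- geepack::geeglm(
    trait ~ pred_npx,
    id = family_id,
    data = dat,
    family = gaussian,
    corstr = "exchangeable"  # assumed working correlation structure
  )
  # Estimate, robust standard error, Wald statistic, and p-value for pred_npx.
  print(coef(summary(fit))["pred_npx", ])
}
```

In a full PWAS, a model of this form would be fitted once per protein-trait pair, with the predicted NPX residual for that protein as the exposure.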
Directory: utilities/
Scripts:
- pqtl_functions.R - Core utility functions for data processing and genetic prediction
- pqtl_weights.R - Statistical method implementations (BayesR, SuSiE, LASSO, and Elastic Net)
- timing_function.R - Runtime calculation utilities
Contains common functions and method implementations used throughout the pipeline.