TransCisPredict Analysis Pipeline

This directory contains the complete analysis pipeline for the TransCisPredict framework, an approach to estimating protein expression weights that incorporates both cis- and trans-variants. The resulting weights are used to perform proteome-wide association studies (PWAS) with predicted protein expression levels. The pipeline consists of eight sequential steps, each implemented as a standalone R script with a standardized configuration section.

Pipeline Overview and Usage

Requirements and Setup

Required R packages:

  • tidyverse - Data manipulation and visualization
  • data.table - High-performance data processing
  • plink2R - Genotype data handling
  • pecotmr - Statistical methods implementation
  • rsample - Cross-validation and resampling
  • broom - Model output formatting
  • broom.mixed - Mixed model output formatting
  • janitor - Data cleaning utilities
  • geepack - Generalized estimating equations
  • igraph - Graph analysis for kinship matrices
  • future - Parallel processing framework
  • furrr - Future-based parallel mapping
  • R.utils - Additional utility functions
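
Before running any step, it can help to confirm that the packages above are available. The following is a minimal sketch using only base R; the helper name `missing_packages` is illustrative and not part of the pipeline.

```r
# Packages listed in this README.
required_pkgs <- c(
  "tidyverse", "data.table", "plink2R", "pecotmr", "rsample", "broom",
  "broom.mixed", "janitor", "geepack", "igraph", "future", "furrr", "R.utils"
)

# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing <- missing_packages(required_pkgs)
if (length(missing) > 0) {
  message("Missing packages: ", paste(missing, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

How each missing package should be installed depends on where it is distributed, which this README does not specify.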

Running the Pipeline

  1. Sequential Execution: Run steps 1-8 in order, as each step depends on outputs from previous steps.

  2. Path Configuration: Each script has a standardized CONFIGURATION section at the top. Modify the placeholder paths for your environment:

    • Input data directories
    • Output directories
    • Genotype data locations
    • Reference files

  3. No Additional Setup Required: Once paths are configured, the scripts run without further user input.
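
Because the steps must run in order, a small driver can make the sequence explicit. This is a sketch under assumptions: `run_pipeline` is a hypothetical helper, it assumes `Rscript` is on the PATH, and the script names for steps 5 and 7 are left out because this README does not list them.

```r
# Hypothetical driver: run scripts in order, stopping at the first failure.
run_pipeline <- function(scripts, script_dir = ".") {
  for (script in scripts) {
    path <- file.path(script_dir, script)
    if (!file.exists(path)) stop("Script not found: ", path)
    message("Running ", script)
    status <- system2("Rscript", shQuote(path))
    if (status != 0) {
      stop("Step failed: ", script, " (exit status ", status, ")")
    }
  }
  invisible(TRUE)
}

# Script names documented in this README. The scripts for steps 5 and 7
# are not named here, so add them at the marked positions before running.
scripts <- c(
  "step1_process_olink_data.R",
  "step2_covariate_regression.R",
  "step3_LD_block_selection.R",
  "step4_cross_validation.R",
  # step 5 script(s) go here
  "step6_whole_sample_analysis.R",
  # step 7 script(s) go here
  "step8_pwas_analysis.R"
)

# Example call (the directory is a placeholder):
# run_pipeline(scripts, script_dir = "path/to/pipeline")
```

Stopping at the first non-zero exit status prevents later steps from running on incomplete outputs.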

Pipeline Steps

Step 1: Data Processing

Script: step1_process_olink_data.R

Process raw protein expression levels, perform quality control, and create individual protein expression files.

Step 2: Covariate Regression

Script: step2_covariate_regression.R

Regress out covariate effects (e.g., age, sex, age×sex, body mass index (BMI), and genetic ancestry principal components (PCs) 1-20) from protein expression levels to generate normalized protein expression (NPX) residuals for each protein.
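
The idea can be illustrated with an ordinary least-squares fit on simulated data. This is a sketch, not the code in step2_covariate_regression.R: the column names are made up, only two PCs are used for brevity, and the choice of `lm()` is an assumption.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; column names are illustrative only.
dat <- data.frame(
  age = rnorm(n, 55, 8),
  sex = rbinom(n, 1, 0.5),
  bmi = rnorm(n, 27, 4),
  pc1 = rnorm(n),
  pc2 = rnorm(n)
)
dat$npx <- 0.02 * dat$age + 0.3 * dat$sex + 0.05 * dat$bmi + rnorm(n)

# age * sex expands to age + sex + age:sex (the age x sex interaction).
fit <- lm(npx ~ age * sex + bmi + pc1 + pc2, data = dat)

# Residuals: the part of NPX not explained by the covariates.
dat$npx_resid <- resid(fit)
```

By construction, the residuals have zero mean and are uncorrelated with every covariate in the model.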

Step 3: LD Block Selection

Script: step3_LD_block_selection.R

Use false discovery rate (FDR)-based selection to identify genomic regions (LD blocks) with significant genetic signal, and include those regions in weight estimation.
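
A minimal sketch of FDR-based selection, assuming one p-value per block and the Benjamini-Hochberg procedure. The p-values, the block names, and the 0.05 threshold are all hypothetical; the script may compute and threshold its statistics differently.

```r
# Hypothetical per-block p-values; how the pipeline derives them is not shown.
block_p <- c(block1 = 0.0001, block2 = 0.004, block3 = 0.02,
             block4 = 0.2, block5 = 0.8)

# Benjamini-Hochberg adjusted p-values.
block_q <- p.adjust(block_p, method = "BH")

fdr_threshold <- 0.05  # assumed threshold
selected_blocks <- names(block_q)[block_q <= fdr_threshold]
print(selected_blocks)
```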

Step 4: Cross Validation

Script: step4_cross_validation.R

Perform cross-validation to evaluate multiple statistical methods (BayesR, SuSiE, LASSO, and Elastic Net) for protein prediction using both cis- and trans- variants.
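
The cross-validation loop itself can be sketched in base R. Here `lm()` is only a stand-in for the methods named above, the data are simulated, and the number of folds is an assumption.

```r
set.seed(3)
n <- 100
k <- 5  # assumed number of folds

# Simulated predictors and outcome.
x <- matrix(rnorm(n * 3), nrow = n)
y <- drop(x %*% c(0.5, -0.3, 0.2)) + rnorm(n)

# Randomly assign each sample to one of k equally sized folds.
fold <- sample(rep(seq_len(k), length.out = n))

# Out-of-fold predictions: fit on k - 1 folds, predict the held-out fold.
pred <- numeric(n)
for (f in seq_len(k)) {
  train <- fold != f
  fit <- lm(y ~ ., data = data.frame(y = y[train], x[train, , drop = FALSE]))
  pred[!train] <- predict(fit, newdata = data.frame(x[!train, , drop = FALSE]))
}

# Squared correlation between out-of-fold predictions and observed values.
cv_r2 <- cor(pred, y)^2
```

Every sample is predicted by a model that never saw it, so `cv_r2` reflects out-of-sample performance.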

Step 5: CV Evaluation

Scripts:

Evaluate cross-validation results and identify the "optimal" method for each protein.
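
One way to pick a method per protein is to average a performance metric across folds and keep the best. The sketch below assumes the metric is R² and uses made-up values; the criterion actually used by the pipeline is not stated in this README.

```r
# Hypothetical cross-validation summary: one row per protein, method, and fold.
cv <- data.frame(
  protein = rep(c("P1", "P2"), each = 8),
  method  = rep(rep(c("BayesR", "SuSiE", "LASSO", "ElasticNet"), each = 2),
                times = 2),
  r2      = c(0.10, 0.12, 0.15, 0.17, 0.08, 0.09, 0.11, 0.10,
              0.30, 0.28, 0.22, 0.25, 0.31, 0.33, 0.29, 0.30)
)

# Mean R^2 across folds for each protein and method.
mean_r2 <- aggregate(r2 ~ protein + method, data = cv, FUN = mean)

# For each protein, keep the method with the highest mean R^2.
best <- do.call(rbind, lapply(split(mean_r2, mean_r2$protein), function(d) {
  d[which.max(d$r2), ]
}))
print(best)
```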

Step 6: Whole Sample Analysis

Script: step6_whole_sample_analysis.R

Estimate the weights in the complete reference sample, applying the "optimal" method identified for each protein.

Step 7: Prediction of NPX Residual Levels in Target Sample

Scripts:

Apply weights to predict NPX residual levels in the target sample.
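
Applying weights amounts to a weighted sum of genotype dosages per individual. The sketch below uses simulated dosages and hypothetical variant IDs and weights; it does not cover allele alignment or any scaling the pipeline may apply.

```r
set.seed(2)
n_samples <- 5
variants <- c("rs1", "rs2", "rs3", "rs4")  # hypothetical variant IDs

# Simulated genotype dosages (0, 1, or 2 copies of the effect allele).
geno <- matrix(
  sample(0:2, n_samples * length(variants), replace = TRUE),
  nrow = n_samples,
  dimnames = list(paste0("id", seq_len(n_samples)), variants)
)

# Hypothetical weights for one protein; rs4 is absent from the weight set.
weights <- c(rs1 = 0.20, rs2 = -0.10, rs3 = 0.05)

# Align on shared variants so genotype columns and weights share one order.
shared <- intersect(colnames(geno), names(weights))
predicted <- drop(geno[, shared, drop = FALSE] %*% weights[shared])
```

Matching by variant ID rather than by position guards against silent misalignment when the target genotype data and the weight set contain different variants.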

Step 8: PWAS Analysis

Script: step8_pwas_analysis.R

Perform proteome-wide association studies using generalized estimating equations (GEE) to test associations between predicted NPX residual levels and complex traits.

Utilities

Directory: utilities/ Scripts:

Contain common functions and method implementations used throughout the pipeline.