sbreuch/ORCA

Orca: Omni RNA Modification Characterization and Annotation

Orca is a command-line toolkit for RNA modification analysis, featuring preprocessing (pileup & eventalign), feature extraction, prediction using pretrained models, and genomic annotation.

Installation

Dependency

Software:
    minimap2 >= 2.21
    samtools >= 1.11
    f5c >= 1.11
    slow5tools >= 0.8.0

The recommended Python version is 3.10.
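
Before installing, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of Orca: the helper name `missing_tools` is hypothetical, and it only checks that each executable is on the PATH. It does not verify the minimum versions listed above, which still need to be checked by hand.

```python
import shutil

# Executable names taken from the dependency list above.
REQUIRED_TOOLS = ["minimap2", "samtools", "f5c", "slow5tools"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found; check their versions manually.")
```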

Installation with wheel

Create a virtual environment with conda and install Orca using the wheel file provided in the dist folder.

conda create -n Orca python=3.10
conda activate Orca
wget https://github.com/bioinfo-biols/ORCA/releases/download/v-0.1.2/ORCA-0.1.2-py3-none-any.whl
pip install ./ORCA-0.1.2-py3-none-any.whl

The installation takes less than 2 hours in a stable network environment.

Information of test files

To test Orca, please start from Step 2 (Prediction). We provide the test files listed in the table below:

| # | name | for command | description |
|---|------|-------------|-------------|
| 1 | MCF7.Example.eventalign | orca-pred_signal_feature_ext | Eventalign file of a human sample from the f5c eventalign command |
| 2 | MCF7.Example.pileup | orca-pred_bascal_feature_ext | Pileup file of a human sample from the samtools mpileup command |
| 3 | Answer_from_RMBase_and_DirectRMDB_NGS.csv | orca-annotation | A CSV file containing the genomic coordinates of RNA modifications |
| 4 | hg38.gtf | orca-genomic_locator | GTF file from Ensembl |

Please click Here for download.

Usage

Step1. Basecalling & alignments

Perform basecalling on FAST5 files using GUPPY:

guppy_basecaller -i /path/to/FAST5 -s /path/to/output --config /path/to/configuration --fast5-out

Align FASTQ sequences to the transcriptome with minimap2 and samtools:

minimap2 -ax splice -N 0 -uf -k14 --cs -t threads <transcriptome> <fastq> | samtools sort -@ threads -o <bam>
samtools index -@ threads <bam>

Convert FAST5 files using slow5tools:

slow5tools f2s <fast5_dir> --allow -d <blow5_dir> -p threads
slow5tools merge <blow5_dir> -o merged_blow5 -t threads

Use f5c/nanopolish to align signals to reference sequences (example using f5c):

f5c index -t threads --slow5 blow5 <fastq>
f5c eventalign --rna --signal-index --scale-events --threads threads --slow5 blow5 --reads fastq --bam bam --secondary=no --collapse-events --genome transcriptome --summary summary_path > eventalign_file

Step2. Prediction

While running Orca, make sure to run all commands in Step 2 and Step 3 with the same --work_dir folder and the same --prefix string.

  1. Signal Feature Extraction

    # Extract signal alignment features from the eventalign results:
    usage: orca-pred_signal_feature_ext [-h] [--n_processes N_PROCESSES] --eventalign EVENTALIGN [--chunk_size CHUNK_SIZE] [--prefix PREFIX] --work_dir WORK_DIR
    
    Extract signal alignment features from the eventalign results.
    
    options:
      -h, --help            show this help message and exit
      --n_processes N_PROCESSES
                            Number of parallel processes. Default: All available CPU cores
      --eventalign EVENTALIGN
                            Path to the eventalign file.
      --chunk_size CHUNK_SIZE
                            Chunk size for reading eventalign files for indexing. Default: 100000
      --prefix PREFIX       prefix of output file. Default: data
      --work_dir WORK_DIR   Working directory of your job.
  2. Basecalling Feature Extraction

    # Extract basecalling features from pileup results
    usage: orca-pred_bascal_feature_ext [-h] --pileup PILEUP [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]
    
    options:
      -h, --help            show this help message and exit
      --pileup PILEUP       pileup FILE PATH from samtools mpileup
      --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
      --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
      --n_processes N_PROCESSES
                            Number of parallel processes. Default: All available CPU cores
  3. Feature Merge

    # Merge both types of features:
    usage: orca-pred_feature_merge [-h] [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]
    
    options:
      -h, --help            show this help message and exit
      --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
      --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
      --n_processes N_PROCESSES
                            Number of parallel processes. Default: All available CPU cores
  4. Run Prediction

    # RNA modification sites prediction based on pretrained models
    usage: prediction.py [-h] [--prefix PREFIX] --work_dir WORK_DIR
                         [--default_model_version {002,004} | --extractor_path EXTRACTOR_PATH] [--classifier_path CLASSIFIER_PATH]
    
    Run prediction on sample feature data using prediction models.
    
    options:
      -h, --help            show this help message and exit
      --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
      --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
      --default_model_version {002,004}
                            Use built-in default model weights for a given platform version (002 or 004). If provided,
                            extractor_path/classifier_path should NOT be set.
      --extractor_path EXTRACTOR_PATH
                            Path to the feature extractor model (use together with --classifier_path).
      --classifier_path CLASSIFIER_PATH
                            Path to the class classifier model (use together with --extractor_path).
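
Because every Step 2 command must share the same --work_dir and --prefix, one way to avoid mismatches is to build the command lines from a single pair of values. The sketch below is an illustration only: the function name `step2_feature_commands`, the directory name and the prefix are hypothetical, and only flags shown in the usage text above are used. The prediction command is left out because its usage text shows `prediction.py` and the installed command name is not stated here.

```python
import shlex


def step2_feature_commands(work_dir, prefix, eventalign, pileup):
    """Build the three feature commands with one shared work_dir and prefix."""
    shared = ["--work_dir", work_dir, "--prefix", prefix]
    return [
        ["orca-pred_signal_feature_ext", "--eventalign", eventalign] + shared,
        ["orca-pred_bascal_feature_ext", "--pileup", pileup] + shared,
        ["orca-pred_feature_merge"] + shared,
    ]


if __name__ == "__main__":
    # Hypothetical work directory and prefix, used only for illustration.
    for cmd in step2_feature_commands(
        "orca_work", "MCF7", "MCF7.Example.eventalign", "MCF7.Example.pileup"
    ):
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the steps in order.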

Step3. Annotation

  1. Genomic Location

    # Map transcriptomic to genomic coordinates:
    usage: orca-genomic_locator [-h] --gtf_path GTF_PATH [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]
    
    Convert transcriptomic coordinates to genomic coordinates.
    
    options:
      -h, --help            show this help message and exit
      --gtf_path GTF_PATH   Path to GTF file
      --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
      --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
      --n_processes N_PROCESSES
                            Number of parallel processes. Default: All available CPU cores
  2. Filter for Annotation

    usage: orca-anno_bascal_feature_ext [-h] [--prefix PREFIX] --work_dir WORK_DIR
    
    options:
      -h, --help           show this help message and exit
      --prefix PREFIX      prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
      --work_dir WORK_DIR  Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  3. Flatten Signal Features

    usage: orca-anno_signal_feature_ext [-h] [--prefix PREFIX] --work_dir WORK_DIR
    
    options:
      -h, --help           show this help message and exit
      --prefix PREFIX      prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
      --work_dir WORK_DIR  Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  4. Annotation:

    The answer file is a 0-based modification annotation CSV file without a header, containing four columns: chromosome, position, strand, and modification type. See the test folder for an example.

    The base type file is a two-column TSV file specifying the base type for each modification. A header is not required; the first column is the modification name and the second column is the corresponding base type (A/C/G/T/N/unknown; please do not use U).

    The modification names must be consistent between the answer file and the base type file (e.g., pseudouridine should be written consistently as either pseudoU in both files or ψ in both files).

    usage: orca-annotation [-h] --answer_path ANSWER_PATH --ref_path REF_PATH --base_type_path BASE_TYPE_PATH [--threshold THRESHOLD] [--prefix PREFIX] --work_dir WORK_DIR
    
    options:
      -h, --help            show this help message and exit
      --answer_path ANSWER_PATH
                            Path to the NGS-based answers
      --ref_path REF_PATH   Path to the reference GENOME path
      --base_type_path BASE_TYPE_PATH
                            TSV with two columns: modification name and base type
      --threshold THRESHOLD
                            Only consider modifications with at least this number of sites supported by NGS answers. Default: 50
      --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
      --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
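
The file rules described above (four answer columns, two base type columns, matching modification names, no U) can be checked before running orca-annotation. This is a minimal sketch under those stated rules, not Orca's own validation; the names `check_inputs`, `read_answer_mods` and `read_base_types` are hypothetical.

```python
import csv
import io

# Base types accepted in the second column of the base type file.
ALLOWED_BASES = {"A", "C", "G", "T", "N", "unknown"}


def read_answer_mods(handle):
    """Collect modification names from a headerless 4-column answer CSV."""
    mods = set()
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        if len(row) != 4:
            raise ValueError(f"expected 4 columns, got {len(row)}: {row}")
        mods.add(row[3].strip())  # column 4 is the modification type
    return mods


def read_base_types(handle):
    """Read the two-column TSV into {modification name: base type}."""
    base_types = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) == 2:
            base_types[row[0].strip()] = row[1].strip()
    return base_types


def check_inputs(answer_handle, base_type_handle):
    """Return (names lacking a base type, names with a disallowed base)."""
    mods = read_answer_mods(answer_handle)
    base_types = read_base_types(base_type_handle)
    missing = mods - set(base_types)
    # An optional header row in the TSV will also be reported here.
    bad = {name for name, base in base_types.items() if base not in ALLOWED_BASES}
    return missing, bad


if __name__ == "__main__":
    answers = io.StringIO("chr1,100,+,m6A\nchr2,5,-,pseudoU\n")
    bases = io.StringIO("m6A\tA\nψ\tU\n")
    print(check_inputs(answers, bases))
```

In the inline example, pseudouridine is spelled differently in the two files and its base type is given as U, so both problems are reported.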

Outputs

1. Prediction

The prediction results are stored in the your_prefix.preds.per.site file, containing the following columns:

| column | name | description |
|--------|------|-------------|
| 1 | id | transcript ID |
| 2 | position | 0-based transcriptome coordinate |
| 3 | kmer | 5-mer sequence centered at this position |
| 4 | depth | sequencing depth |
| 5 | modScore | RNA modification score |
| 6 | pred_rate | predicted modification proportion |

2. Annotation

The annotation results are stored in the your_prefix.annotation.per.site file, containing the following columns:

| column | name | description |
|--------|------|-------------|
| 1 | id | transcript ID |
| 2 | position | 0-based transcriptome coordinate |
| 3 | kmer | 11-mer sequence centered at this position |
| 4 | contig | chromosome name |
| 5 | gen_position | 0-based genomic coordinate |
| 6 | strand | strand of the transcript |
| 7 | modification | predicted modification |
| 8 | source | source of the modification |
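
Since gen_position is 0-based, a site maps directly onto a single-base BED interval (BED is 0-based and half-open). The sketch below shows that conversion for one row; it is not part of Orca, the function name `site_to_bed` is hypothetical, and the score field is filled with a placeholder 0. How the annotation file itself is delimited is not stated here, so the function takes already-parsed values.

```python
def site_to_bed(contig, gen_position, strand, modification):
    """Convert one 0-based site to a BED6 line covering a single base."""
    start = int(gen_position)
    end = start + 1  # BED intervals are 0-based and half-open
    # BED6 order: chrom, start, end, name, score (placeholder), strand
    return "\t".join([contig, str(start), str(end), modification, "0", strand])


if __name__ == "__main__":
    # Hypothetical site, used only for illustration.
    print(site_to_bed("chr1", 99, "+", "m6A"))
```

For tools that expect 1-based positions, the same site is at `gen_position + 1`.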

Support

If you encounter issues or have questions, please open an issue on our GitHub repository.

Author

Authors: Han Dong (donghan@biols.ac.cn), Jinyang Zhang (zhangjinyang@biols.ac.cn), Fangqing Zhao (zhfq@biols.ac.cn)

Maintainer: Han Dong

License

This project is licensed under the MIT License – see the LICENSE file for details.

Last updated: June 10, 2025
