Orca is a command-line toolkit for RNA modification analysis, featuring preprocessing (pileup & eventalign), feature extraction, prediction using pretrained models, and genomic annotation.
Software requirements:
minimap2 >= 2.21
samtools >= 1.11
f5c >= 1.11
slow5tools >= 0.8.0
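The minimum versions above can be checked programmatically. The following is a minimal sketch, not part of Orca: `parse_version` and `meets_minimum` are hypothetical helper names, the minimums are copied verbatim from the list above, and it is assumed that you obtain each tool's version string yourself (for example from its `--version` output). Versions are compared as integer tuples because a plain string comparison would wrongly rank `1.9` above `1.11`.

```python
import re

# Minimum versions copied from the requirements list above.
MINIMUM_VERSIONS = {
    "minimap2": "2.21",
    "samtools": "1.11",
    "f5c": "1.11",
    "slow5tools": "0.8.0",
}


def parse_version(text):
    """Return the first dotted number found in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(tool, version_output):
    """Compare a reported version string against MINIMUM_VERSIONS[tool]."""
    found = parse_version(version_output)
    required = parse_version(MINIMUM_VERSIONS[tool])
    # Pad with zeros so that "0.8" and "0.8.0" compare as equal.
    width = max(len(found), len(required))
    found += (0,) * (width - len(found))
    required += (0,) * (width - len(required))
    return found >= required


# Example with version strings typed in by hand (not captured from real tools).
print(meets_minimum("samtools", "samtools 1.9"))   # older than 1.11
print(meets_minimum("minimap2", "2.26-r1175"))     # newer than 2.21
```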
The recommended Python version is 3.10.
You can create a virtual environment with conda and install Orca using the wheel file provided in the dist folder:
conda create -n Orca python=3.10
conda activate Orca
wget https://github.com/bioinfo-biols/ORCA/releases/download/v-0.1.2/ORCA-0.1.2-py3-none-any.whl
pip install ./ORCA-0.1.2-py3-none-any.whl
The installation takes less than 2 hours in a stable network environment.
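After installation, a quick way to confirm that the command-line entry points are reachable is to look them up on `PATH`. This is a sketch under the assumption that `pip install` exposes the program names shown in the usage messages of this README as executables; `missing_commands` is a hypothetical helper, not an Orca function.

```python
import shutil

# Program names as they appear in the usage messages of this README.
ORCA_COMMANDS = [
    "orca-pred_signal_feature_ext",
    "orca-pred_bascal_feature_ext",
    "orca-pred_feature_merge",
    "orca-genomic_locator",
    "orca-anno_bascal_feature_ext",
    "orca-anno_signal_feature_ext",
    "orca-annotation",
]


def missing_commands(commands, which=shutil.which):
    """Return the names in `commands` that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]


if __name__ == "__main__":
    missing = missing_commands(ORCA_COMMANDS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All Orca commands were found.")
```

Run it inside the activated conda environment; an empty "not found" list only shows that the commands exist, not that they run correctly.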
To test Orca, please start from Step 2. We provide the test files listed in the table below:
| no. | file name | used by command | description |
|---|---|---|---|
| 1 | MCF7.Example.eventalign | orca-pred_signal_feature_ext | Eventalign file of a human sample, produced by the f5c eventalign command |
| 2 | MCF7.Example.pileup | orca-pred_bascal_feature_ext | Pileup file of a human sample, produced by the samtools mpileup command |
| 3 | Answer_from_RMBase_and_DirectRMDB_NGS.csv | orca-annotation | A CSV file containing the genomic coordinates of RNA modifications |
| 4 | hg38.gtf | orca-genomic_locator | GTF file from Ensembl |
Please click Here to download the test files.
Perform basecalling on FAST5 files using Guppy:
guppy_basecaller -i /path/to/FAST5 -s /path/to/output --config /path/to/configuration --fast5-out
Align FASTQ sequences to the transcriptome with minimap2 and samtools:
minimap2 -ax splice -N 0 -uf -k14 --cs -t threads <transcriptome> <fastq> | samtools sort -@ threads -o <bam>
samtools index -@ threads <bam>
Convert FAST5 files using slow5tools:
slow5tools f2s <fast5_dir> --allow -d <blow5_dir> -p threads
slow5tools merge <blow5_dir> -o merged_blow5 -t threads
Use f5c/nanopolish to align signals to reference sequences (example using f5c):
f5c index -t threads --slow5 blow5 <fastq>
f5c eventalign --rna --signal-index --scale-events --threads threads --slow5 blow5 --reads fastq --bam bam --secondary=no --collapse-events --genome transcriptome --summary summary_path > eventalign_file
While running Orca, make sure to run all commands in Step 2 and Step 3 with the same --work_dir folder and the same --prefix string.
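One way to keep `--work_dir` and `--prefix` identical across commands is to define them once and build every command line from them. The sketch below is illustrative only: `build_step_commands` is a hypothetical helper, the paths are placeholders, and it covers just the three feature-extraction commands, using only flags that appear in their usage messages in this README. The remaining commands take the same two shared flags.

```python
import shlex


def build_step_commands(work_dir, prefix, eventalign, pileup):
    """Build argument lists for the feature steps with shared settings."""
    # Defined once so that every command receives exactly the same values.
    shared = ["--prefix", prefix, "--work_dir", work_dir]
    return [
        ["orca-pred_signal_feature_ext", "--eventalign", eventalign, *shared],
        ["orca-pred_bascal_feature_ext", "--pileup", pileup, *shared],
        ["orca-pred_feature_merge", *shared],
    ]


# Placeholder paths; replace them with your own files and folder.
commands = build_step_commands(
    work_dir="/path/to/work_dir",
    prefix="MCF7",
    eventalign="MCF7.Example.eventalign",
    pileup="MCF7.Example.pileup",
)
for command in commands:
    # Print shell-quoted commands; pass each list to subprocess.run to execute.
    print(shlex.join(command))
```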
- Signal Feature Extraction
Extract signal alignment features from the eventalign results:
usage: orca-pred_signal_feature_ext [-h] [--n_processes N_PROCESSES] --eventalign EVENTALIGN [--chunk_size CHUNK_SIZE] [--prefix PREFIX] --work_dir WORK_DIR

Extract signal alignment features from the eventalign results.

options:
  -h, --help            show this help message and exit
  --n_processes N_PROCESSES
                        Number of parallel processes. Default: All available CPU cores
  --eventalign EVENTALIGN
                        Path to the eventalign file.
  --chunk_size CHUNK_SIZE
                        Chunk size for reading eventalign files for indexing. Default: 100000
  --prefix PREFIX       prefix of output file. Default: data
  --work_dir WORK_DIR   Working directory of your job.
- Basecalling Feature Extraction
Extract basecalling features from pileup results:
usage: orca-pred_bascal_feature_ext [-h] --pileup PILEUP [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]

options:
  -h, --help            show this help message and exit
  --pileup PILEUP       pileup FILE PATH from samtools mpileup
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  --n_processes N_PROCESSES
                        Number of parallel processes. Default: All available CPU cores
- Feature Merge
Merge both types of features:
usage: orca-pred_feature_merge [-h] [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]

options:
  -h, --help            show this help message and exit
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  --n_processes N_PROCESSES
                        Number of parallel processes. Default: All available CPU cores
- Run Prediction
Predict RNA modification sites based on pretrained models:
usage: prediction.py [-h] [--prefix PREFIX] --work_dir WORK_DIR [--default_model_version {002,004} | --extractor_path EXTRACTOR_PATH] [--classifier_path CLASSIFIER_PATH]

Run prediction on sample feature data using prediction models.

options:
  -h, --help            show this help message and exit
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  --default_model_version {002,004}
                        Use built-in default model weights for a given platform version (002 or 004). If provided, extractor_path/classifier_path should NOT be set.
  --extractor_path EXTRACTOR_PATH
                        Path to the feature extractor model (use together with --classifier_path).
  --classifier_path CLASSIFIER_PATH
                        Path to the class classifier model (use together with --extractor_path).
- Genomic Location
Map transcriptomic to genomic coordinates:
usage: orca-genomic_locator [-h] --gtf_path GTF_PATH [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]

Convert transcriptomic coordinates to genomic coordinates.

options:
  -h, --help            show this help message and exit
  --gtf_path GTF_PATH   Path to GTF file
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  --n_processes N_PROCESSES
                        Number of parallel processes. Default: All available CPU cores
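To illustrate what converting a transcriptomic coordinate to a genomic one involves, here is a minimal sketch of the general idea. It is not Orca's implementation: `transcript_to_genome` is a hypothetical function, and it assumes the exons of a transcript are already available as 0-based, half-open genomic intervals. GTF records are 1-based and inclusive, so a GTF exon `start..end` becomes `(start - 1, end)` in this representation.

```python
def transcript_to_genome(tx_pos, exons, strand):
    """Map a 0-based transcript position to a 0-based genomic position.

    exons: list of (start, end) genomic intervals, 0-based and half-open.
    strand: "+" or "-".
    """
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    if tx_pos < 0:
        raise ValueError("transcript position must be non-negative")
    # Walk the exons in transcript order (5' to 3'); on the minus strand
    # the first exon of the transcript has the highest genomic coordinates.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = tx_pos
    for start, end in ordered:
        length = end - start
        if remaining < length:
            if strand == "+":
                return start + remaining
            return end - 1 - remaining
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


# Two made-up exons of 10 bases each.
exons = [(100, 110), (200, 210)]
print(transcript_to_genome(10, exons, "+"))  # first base of the second exon
print(transcript_to_genome(0, exons, "-"))   # last genomic base of the transcript
```

The design point is that introns are skipped by subtracting exon lengths in transcript order, and the minus strand counts backwards from each exon's end.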
- Filter for Annotation
usage: orca-anno_bascal_feature_ext [-h] [--prefix PREFIX] --work_dir WORK_DIR

options:
  -h, --help            show this help message and exit
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
- Flatten Signal Features
usage: orca-anno_signal_feature_ext [-h] [--prefix PREFIX] --work_dir WORK_DIR

options:
  -h, --help            show this help message and exit
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
- Annotation:
The answer file is a headerless CSV file of modification annotations with 0-based positions. It contains four columns: chromosome, position, strand, and modification type. See the test folder for an example.
The base type file is a two-column TSV file specifying the base type for each modification. No header is required; the first column is the modification name and the second column is the corresponding base type (A/C/G/T/N/unknown; please do not use U).
The modification names must be consistent between the answer file and the base type file (e.g., pseudouridine should be written consistently as either pseudoU in both files or ψ in both files).
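The two input files can be cross-checked before running the annotation step. The following sketch is not part of Orca: `check_annotation_inputs` is a hypothetical helper, and it assumes that neither file has a header line and that the allowed base types are the literal strings listed above.

```python
import csv
import io

# Base types listed in this README; "U" is deliberately absent.
ALLOWED_BASE_TYPES = {"A", "C", "G", "T", "N", "unknown"}


def check_annotation_inputs(answer_handle, base_type_handle):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    base_types = {}
    rows = csv.reader(base_type_handle, delimiter="\t")
    for line_no, row in enumerate(rows, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            problems.append(f"base type file line {line_no}: expected 2 columns")
            continue
        name, base = row
        if base not in ALLOWED_BASE_TYPES:
            problems.append(f"base type file line {line_no}: base type {base!r} is not allowed")
        base_types[name] = base

    reported = set()  # report each unknown modification name only once
    for line_no, row in enumerate(csv.reader(answer_handle), start=1):
        if not row:
            continue
        if len(row) != 4:
            problems.append(f"answer file line {line_no}: expected 4 columns")
            continue
        modification = row[3]
        if modification not in base_types and modification not in reported:
            reported.add(modification)
            problems.append(f"answer file line {line_no}: {modification!r} is missing from the base type file")
    return problems


# Made-up example: pseudouridine is spelled differently in the two files,
# and the base type file uses the disallowed value U.
answer = io.StringIO("chr1,100,+,m6A\nchr2,5,-,pseudoU\n")
base_type = io.StringIO("m6A\tA\nψ\tU\n")
for problem in check_annotation_inputs(answer, base_type):
    print(problem)
```

For real files, open them with `open(path, newline="", encoding="utf-8")` and pass the handles instead of the `io.StringIO` objects.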
usage: orca-annotation [-h] --answer_path ANSWER_PATH --ref_path REF_PATH --base_type_path BASE_TYPE_PATH [--threshold THRESHOLD] [--prefix PREFIX] --work_dir WORK_DIR

options:
  -h, --help            show this help message and exit
  --answer_path ANSWER_PATH
                        Path to the NGS-based answers
  --ref_path REF_PATH   Path to the reference GENOME path
  --base_type_path BASE_TYPE_PATH
                        TSV with two columns: modification name and base type
  --threshold THRESHOLD
                        Only consider modifications with at least this number of sites supported by NGS answers. Default: 50
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
The prediction results are stored in the your_prefix.preds.per.site file, containing the following columns:
| column | name | description |
|---|---|---|
| 1 | id | transcript ID |
| 2 | position | 0-based transcriptome coordinate |
| 3 | kmer | 5-mer sequence centered at this position |
| 4 | depth | sequencing depth |
| 5 | modScore | RNA modification score |
| 6 | pred_rate | predicted modification proportion |
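As an example of working with this table, the sketch below keeps only sites that pass a depth and a score cutoff. Several things here are assumptions rather than documented behaviour: that the file is tab-separated, that it starts with a header line using the column names above (set `has_header=False` otherwise), and the cutoff values, which are arbitrary. `select_sites` is a hypothetical helper and the example rows are invented.

```python
import csv
import io

# Column names from the table above.
PRED_COLUMNS = ["id", "position", "kmer", "depth", "modScore", "pred_rate"]


def select_sites(handle, min_depth, min_score, has_header=True):
    """Yield sites whose depth and modScore both reach the given cutoffs."""
    reader = csv.DictReader(
        handle,
        fieldnames=None if has_header else PRED_COLUMNS,
        delimiter="\t",
    )
    for row in reader:
        site = {
            "id": row["id"],
            "position": int(row["position"]),
            "kmer": row["kmer"],
            "depth": int(float(row["depth"])),
            "modScore": float(row["modScore"]),
            "pred_rate": float(row["pred_rate"]),
        }
        if site["depth"] >= min_depth and site["modScore"] >= min_score:
            yield site


# Invented rows, used only to show the call.
example = io.StringIO(
    "id\tposition\tkmer\tdepth\tmodScore\tpred_rate\n"
    "tx1\t41\tGGACT\t35\t0.91\t0.40\n"
    "tx1\t97\tTTACA\t8\t0.95\t0.60\n"
)
for site in select_sites(example, min_depth=20, min_score=0.5):
    print(site["id"], site["position"], site["modScore"])
```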
The annotation results are stored in the your_prefix.annotation.per.site file, containing the following columns:
| column | name | description |
|---|---|---|
| 1 | id | transcript ID |
| 2 | position | 0-based transcriptome coordinate |
| 3 | kmer | 11-mer sequence centered at this position |
| 4 | contig | chromosome name |
| 5 | gen_position | 0-based genomic coordinate |
| 6 | strand | strand of the transcript |
| 7 | modification | predicted modification |
| 8 | source | source of the modification |
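Because `gen_position` is 0-based, each annotated site corresponds to the half-open interval `[gen_position, gen_position + 1)`, which is how BED files express a single base. The sketch below shows that conversion for one row; `annotation_row_to_bed` is a hypothetical helper, the row is assumed to be available as a dictionary keyed by the column names above, and the example values are invented.

```python
def annotation_row_to_bed(row):
    """Format one annotation row as a BED6 line (chrom, start, end, name, score, strand)."""
    start = int(row["gen_position"])
    fields = [
        row["contig"],
        str(start),      # BED start is 0-based
        str(start + 1),  # BED end is exclusive, so a single base spans start..start+1
        row["modification"],
        "0",             # placeholder score; BED6 requires the column
        row["strand"],
    ]
    return "\t".join(fields)


# Invented example row.
row = {
    "id": "tx1",
    "position": "41",
    "kmer": "ACGGACTTCAG",
    "contig": "chr1",
    "gen_position": "99",
    "strand": "+",
    "modification": "m6A",
    "source": "example",
}
print(annotation_row_to_bed(row))
```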
If you encounter issues or have questions, please open an issue on our GitHub repository.
Authors: Han Dong (donghan@biols.ac.cn), Jinyang Zhang (zhangjinyang@biols.ac.cn), Fangqing Zhao (zhfq@biols.ac.cn)
Maintainer: Han Dong
This project is licensed under the MIT License – see the LICENSE file for details.
Last updated: June 10, 2025