Orca: Omni RNA Modification Characterization and Annotation

Orca is a command-line toolkit for RNA modification analysis, featuring preprocessing (pileup & eventalign), feature extraction, prediction using pretrained models, and genomic annotation.

Installation

Dependency

Software:
    minimap2 >= 2.21
    samtools >= 1.11
    f5c >= 1.11
    slow5tools >= 0.8.0

The recommended Python version is 3.10.

Installation with wheel

Create a virtual environment with conda and install Orca from the provided wheel file.

conda create -n Orca python=3.10
conda activate Orca
wget https://github.com/bioinfo-biols/ORCA/releases/download/v-0.1.2/ORCA-0.1.2-py3-none-any.whl
pip install ./ORCA-0.1.2-py3-none-any.whl

Installation takes less than 2 hours on a stable network connection.

Information of test files

To test Orca, start from Step 2. The test files are listed in the table below:

column	name	for command	description
1	MCF7.Example.eventalign	orca-pred_signal_feature_ext	Eventalign file of a human sample from f5c eventalign command
2	MCF7.Example.pileup	orca-pred_bascal_feature_ext	Pileup file of a human sample from samtools mpileup command
3	Answer_from_RMBase_and_DirectRMDB_NGS.csv	orca-annotation	A CSV file containing the genomic coordinates of RNA modifications
4	hg38.gtf	orca-genomic_locator	GTF file from Ensembl

Please click Here for download.

Usage

Step1. Basecalling & alignments

Perform basecalling on FAST5 files using Guppy:

guppy_basecaller -i /path/to/FAST5 -s /path/to/output --config /path/to/configuration --fast5-out

Align FASTQ sequences to the transcriptome with minimap2 and samtools:

minimap2 -ax splice -N 0 -uf -k14 --cs -t <threads> <transcriptome> <fastq> | samtools sort -@ <threads> -o <bam>
samtools index -@ <threads> <bam>

Convert FAST5 files to BLOW5 and merge them with slow5tools:

slow5tools f2s <fast5_dir> --allow -d <blow5_dir> -p <threads>
slow5tools merge <blow5_dir> -o <merged_blow5> -t <threads>

Use f5c/nanopolish to align signals to reference sequences (example using f5c):

f5c index -t <threads> --slow5 <merged_blow5> <fastq>
f5c eventalign --rna --signal-index --scale-events --threads <threads> --slow5 <merged_blow5> --reads <fastq> --bam <bam> --secondary=no --collapse-events --genome <transcriptome> --summary <summary_path> > <eventalign_file>

Step2. Prediction

While running Orca, make sure to run all commands in Step 2 and Step 3 with the same --work_dir folder and the same --prefix string.
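
One way to keep the two values consistent is to define them once and build every command from them. The Python sketch below does this for the three feature commands documented in this step. The helper name build_step2_commands, the paths, and the prefix are hypothetical placeholders, not part of Orca; only flags that appear in the usage messages of this README are used.

```python
import shlex


def build_step2_commands(work_dir, prefix, eventalign, pileup):
    """Return the Step 2 feature commands, all sharing one work_dir and prefix.

    Hypothetical helper for illustration; it is not part of Orca.
    """
    shared = ["--work_dir", work_dir, "--prefix", prefix]
    return [
        ["orca-pred_signal_feature_ext", "--eventalign", eventalign, *shared],
        ["orca-pred_bascal_feature_ext", "--pileup", pileup, *shared],
        ["orca-pred_feature_merge", *shared],
    ]


# Placeholder values; replace them with your own paths and prefix.
commands = build_step2_commands(
    work_dir="orca_run",
    prefix="MCF7",
    eventalign="MCF7.Example.eventalign",
    pileup="MCF7.Example.pileup",
)
for cmd in commands:
    # Print each command for review before running it.
    print(shlex.join(cmd))
```

To execute the commands instead of printing them, each list can be passed to subprocess.run(cmd, check=True) once Orca is installed.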

Signal Feature Extraction

# Extract signal alignment features from the eventalign results:
usage: orca-pred_signal_feature_ext [-h] [--n_processes N_PROCESSES] --eventalign EVENTALIGN [--chunk_size CHUNK_SIZE] [--prefix PREFIX] --work_dir WORK_DIR

Extract signal alignment features from the eventalign results.

options:
  -h, --help            show this help message and exit
  --n_processes N_PROCESSES
                        Number of parallel processes. Default: All available CPU cores
  --eventalign EVENTALIGN
                        Path to the eventalign file.
  --chunk_size CHUNK_SIZE
                        Chunk size for reading eventalign files for indexing. Default: 100000
  --prefix PREFIX       prefix of output file. Default: data
  --work_dir WORK_DIR   Working directory of your job.

Basecalling Feature Extraction

# Extract basecalling features from pileup results
usage: orca-pred_bascal_feature_ext [-h] --pileup PILEUP [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]

options:
  -h, --help            show this help message and exit
  --pileup PILEUP       pileup FILE PATH from samtools mpileup
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  --n_processes N_PROCESSES
                        Number of parallel processes. Default: All available CPU cores

Feature Merge

# Merge both types of features:
usage: orca-pred_feature_merge [-h] [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]

options:
  -h, --help            show this help message and exit
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  --n_processes N_PROCESSES
                        Number of parallel processes. Default: All available CPU cores

Run Prediction

# RNA modification sites prediction based on pretrained models
usage: prediction.py [-h] [--prefix PREFIX] --work_dir WORK_DIR
                     [--default_model_version {002,004} | --extractor_path EXTRACTOR_PATH] [--classifier_path CLASSIFIER_PATH]

Run prediction on sample feature data using prediction models.

options:
  -h, --help            show this help message and exit
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  --default_model_version {002,004}
                        Use built-in default model weights for a given platform version (002 or 004). If provided,
                        extractor_path/classifier_path should NOT be set.
  --extractor_path EXTRACTOR_PATH
                        Path to the feature extractor model (use together with --classifier_path).
  --classifier_path CLASSIFIER_PATH
                        Path to the class classifier model (use together with --extractor_path).

Step3. Annotation

Genomic Location

# Map transcriptomic to genomic coordinates:
usage: orca-genomic_locator [-h] --gtf_path GTF_PATH [--prefix PREFIX] --work_dir WORK_DIR [--n_processes N_PROCESSES]

Convert transcriptomic coordinates to genomic coordinates.

options:
  -h, --help            show this help message and exit
  --gtf_path GTF_PATH   Path to GTF file
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.
  --n_processes N_PROCESSES
                        Number of parallel processes. Default: All available CPU cores

Filter for Annotation

usage: orca-anno_bascal_feature_ext [-h] [--prefix PREFIX] --work_dir WORK_DIR

options:
  -h, --help           show this help message and exit
  --prefix PREFIX      prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR  Working directory of your job, please keep it THE SAME AS the one used in previous steps.

Flatten Signal Features

usage: orca-anno_signal_feature_ext [-h] [--prefix PREFIX] --work_dir WORK_DIR

options:
  -h, --help           show this help message and exit
  --prefix PREFIX      prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR  Working directory of your job, please keep it THE SAME AS the one used in previous steps.

Annotation

The answer file is a headerless CSV file of known modification sites with 0-based positions. It contains four columns: chromosome, position, strand, and modification type. See the test folder for an example.

The base type file is a two-column TSV file specifying the base type for each modification. A header is not required; the first column is the modification name and the second column is the corresponding base type (A, C, G, T, N, or unknown; do not use U).

The modification names must be consistent between the answer file and the base type file (e.g., pseudouridine should be written consistently as either pseudoU in both files or ψ in both files).
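
The two files can be checked against each other before running the annotation. The following Python sketch is one possible check, not part of Orca: check_annotation_inputs is a hypothetical helper, both files are assumed to have no header, and the example rows are made up for illustration.

```python
import csv

# Base types accepted by the base type file, as listed above (U is not accepted).
ALLOWED_BASES = {"A", "C", "G", "T", "N", "unknown"}


def check_annotation_inputs(answer_text, base_type_text):
    """Return a list of problems found in the answer and base type files.

    Hypothetical helper for illustration; assumes both files are headerless.
    """
    problems = []
    base_types = {}
    for row in csv.reader(base_type_text.splitlines(), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        name, base = row[0], row[1]
        if base not in ALLOWED_BASES:
            problems.append(f"base type file: {name} has unsupported base type {base}")
        base_types[name] = base

    for line_no, row in enumerate(csv.reader(answer_text.splitlines()), start=1):
        if not row:
            continue
        if len(row) != 4:
            problems.append(f"answer file line {line_no}: expected 4 columns, found {len(row)}")
            continue
        modification = row[3]
        if modification not in base_types:
            problems.append(f"answer file line {line_no}: {modification} is not in the base type file")
    return problems


# Made-up example: pseudouridine is spelled differently in the two files.
answer_text = "chr1,12345,+,m6A\nchr2,678,-,pseudoU\n"
base_type_text = "m6A\tA\nψ\tT\n"  # T stands in for U, which is not accepted
for problem in check_annotation_inputs(answer_text, base_type_text):
    print(problem)
```

In practice the two strings would be read from the files passed to --answer_path and --base_type_path.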

usage: orca-annotation [-h] --answer_path ANSWER_PATH --ref_path REF_PATH --base_type_path BASE_TYPE_PATH [--threshold THRESHOLD] [--prefix PREFIX] --work_dir WORK_DIR

options:
  -h, --help            show this help message and exit
  --answer_path ANSWER_PATH
                        Path to the NGS-based answers
  --ref_path REF_PATH   Path to the reference GENOME path
  --base_type_path BASE_TYPE_PATH
                        TSV with two columns: modification name and base type
  --threshold THRESHOLD
                        Only consider modifications with at least this number of sites supported by NGS answers. Default: 50
  --prefix PREFIX       prefix of output file, please keep it THE SAME AS the one used in previous steps. Default: data
  --work_dir WORK_DIR   Working directory of your job, please keep it THE SAME AS the one used in previous steps.

Outputs

1. Prediction

The prediction results are stored in the your_prefix.preds.per.site file, containing the following columns:

column	name	description
1	id	transcript ID
2	position	0-based transcriptome coordinate
3	kmer	5-mer sequence centered at this position
4	depth	sequencing depth
5	modScore	RNA modification score
6	pred_rate	predicted modification proportion
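
As a rough illustration of how this file could be consumed, the Python sketch below keeps the sites whose modScore reaches a cutoff. It rests on assumptions not stated in this README: the file is taken to be tab-delimited with the six columns in the order listed above, a header line starting with "id" is skipped if present, and the cutoff of 0.5 is an arbitrary example value, not a recommended threshold. The function name read_predictions and the example rows are hypothetical.

```python
import csv

COLUMNS = ["id", "position", "kmer", "depth", "modScore", "pred_rate"]


def read_predictions(lines, min_score=0.5):
    """Yield per-site records whose modScore is at least min_score.

    Assumes tab-delimited text with the six documented columns in order.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0] == "id":
            continue  # skip blank lines and an optional header line
        record = dict(zip(COLUMNS, row))
        record["position"] = int(record["position"])
        record["depth"] = int(float(record["depth"]))
        record["modScore"] = float(record["modScore"])
        record["pred_rate"] = float(record["pred_rate"])
        if record["modScore"] >= min_score:
            yield record


# Made-up rows for illustration; real input would come from
# open("your_prefix.preds.per.site").
example = "ENST_A\t10\tGGACT\t30\t0.91\t0.40\nENST_B\t25\tTTACA\t12\t0.08\t0.05\n"
for site in read_predictions(example.splitlines(), min_score=0.5):
    print(site["id"], site["position"], site["modScore"])
```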

2. Annotation

The annotation results are stored in the your_prefix.annotation.per.site file, containing the following columns:

column	name	description
1	id	transcript ID
2	position	0-based transcriptome coordinate
3	kmer	11-mer sequence centered at this position
4	contig	chromosome name
5	gen_position	0-based genomic coordinate
6	strand	strand of the transcript
7	modification	predicted modification
8	source	source of the modification
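
Because gen_position is 0-based, a site converts to a BED interval by using gen_position as the start and gen_position + 1 as the exclusive end. The Python sketch below shows that conversion for one row. It assumes the eight columns appear in the order listed above; the function name annotation_row_to_bed and the example values are hypothetical, and the BED score field is filled with 0 because the annotation file has no corresponding column.

```python
def annotation_row_to_bed(row):
    """Convert one annotation row (eight fields) into a six-column BED line.

    Hypothetical helper for illustration; it is not part of Orca.
    """
    transcript_id, position, kmer, contig, gen_position, strand, modification, source = row
    start = int(gen_position)  # already 0-based, as BED requires
    end = start + 1            # BED end coordinates are exclusive
    return "\t".join([contig, str(start), str(end), modification, "0", strand])


# Made-up row for illustration.
example_row = ["ENST_A", "100", "ACGTAAGTACG", "chr1", "5000", "+", "m6A", "example_source"]
print(annotation_row_to_bed(example_row))
```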

Support

If you encounter issues or have questions, please open an issue on our GitHub repository.

Author

Authors: Han Dong (donghan@biols.ac.cn), Jinyang Zhang (zhangjinyang@biols.ac.cn), Fangqing Zhao (zhfq@biols.ac.cn)

Maintainer: Han Dong

License

This project is licensed under the MIT License – see the LICENSE file for details.

Last updated: June 10, 2025
