Count the number of distal junctions (DJ) of the ribosomal DNA (rDNA)
This repository provides tools for estimating the copy number of the DJs in a genome from sequencing reads.
The copy number of DJs is determined from 1) the sequencing coverage of mapped reads or 2) the k-mer multiplicity in raw reads.
Copy number of the DJ can be estimated with the following approaches:
The mapping-based approach is recommended when the reads are already aligned to one of the following references:
- GRCh38/hg38: Homo_sapiens_assembly38.fasta.gz (1000 Genomes Project Broad ver., suitable for UKBioBank) or GRCh38_full_analysis_set_plus_decoy_hla.fa (1000 Genomes Project NYGC ver., suitable for 1KGP NYGC 30x cram)
  Requires chr21, chr17_GL000205v2_random and chrUn_GL000195v1.
- GRCh37/hg19: human_g1k_v37.fasta.gz (1000 Genomes Project ver.)
  Requires chr7_gl000195_random and chr17_gl000205_random.
- T2T-CHM13/hs1 (Will be updated soon)
For hg38 or hg19, use samtools to confirm that the required sequences exist in the BAM header, for example: samtools view -H in.bam | grep chr17_GL000205v2_random.
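The header check above can also be scripted so that all required sequences are tested at once. The following is a minimal sketch, not part of this repository: the function name `missing_contigs` and the `REQUIRED_CONTIGS` table are hypothetical, and the contig names are taken from the reference list above. It expects the text printed by samtools view -H as input.

```python
# Hypothetical helper: contig names copied from the reference list above.
REQUIRED_CONTIGS = {
    "hg38": ["chr21", "chr17_GL000205v2_random", "chrUn_GL000195v1"],
    "hg19": ["chr7_gl000195_random", "chr17_gl000205_random"],
}


def missing_contigs(header_text, build):
    """Return the required contigs for `build` that have no @SQ line in the header."""
    present = set()
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # @SQ lines are tab-separated TAG:VALUE fields; SN holds the sequence name.
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                present.add(field[3:])
    return [name for name in REQUIRED_CONTIGS[build] if name not in present]
```

To use it on a real file, the header text can be captured with, for example, `subprocess.run(["samtools", "view", "-H", "in.bam"], capture_output=True, text=True, check=True).stdout`. An empty list means every required sequence is present.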
Read coverage is assessed on the mapped BAM file for the target DJ region and compared against the background coverage collected from autosomes.
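As a rough illustration of the comparison described above, the sketch below converts a DJ depth and an autosomal background depth into a copy number. It assumes that the background represents two copies per genome (diploid autosomes) and that both depths are mean per-base coverages; the function name and the scaling are assumptions for illustration, not necessarily the exact calculation used by the tools in this repository.

```python
def dj_copy_number(dj_mean_depth, background_mean_depth, background_copies=2):
    """Estimate DJ copy number from the ratio of DJ depth to background depth.

    Assumption: the autosomal background corresponds to `background_copies`
    copies per genome (2 for a diploid sample).
    """
    if background_mean_depth <= 0:
        raise ValueError("background depth must be positive")
    return background_copies * dj_mean_depth / background_mean_depth


# Made-up depths: 150x over the DJ region against a 30x autosomal background.
print(dj_copy_number(150.0, 30.0))  # 10.0
```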
The k-mer based approach is reference-free.
This approach is recommended when reads are aligned to hg38 or hg19 without any decoy sequences, or are in raw FASTQ form. A pre-built set of target k-mers is used to query the k-mer multiplicity of the DJ, which is then compared against the single-copy / 2-copy multiplicity estimates inferred from the k-mer multiplicity histogram.
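The k-mer comparison can be sketched in the same way. The example below assumes that the multiplicities of the target DJ k-mers have already been counted in the reads and that the 2-copy peak multiplicity has already been read off the k-mer histogram. Using the median of the target multiplicities is an illustrative choice, and the function name is hypothetical; the repository may summarize the k-mers differently.

```python
from statistics import median


def kmer_copy_number(target_multiplicities, two_copy_multiplicity):
    """Estimate DJ copy number from target k-mer multiplicities.

    Assumption: `two_copy_multiplicity` is the multiplicity at which 2-copy
    k-mers peak in the genome-wide k-mer histogram.
    """
    if two_copy_multiplicity <= 0:
        raise ValueError("2-copy multiplicity must be positive")
    if not target_multiplicities:
        raise ValueError("no target k-mer multiplicities given")
    # The median is less sensitive than the mean to a few repetitive k-mers.
    return 2 * median(target_multiplicities) / two_copy_multiplicity


# Made-up counts for five DJ k-mers, with a 2-copy peak at multiplicity 10.
print(kmer_copy_number([48, 50, 52, 50, 49], 10))  # 10.0
```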
v0.1 (2024-07-17)
* First commit

v0.2 (2024-07-25)
* Changing the background coverage estimation method from samtools idxstats to samtools coverage.
* Removing the step of saving temporary files; instead, we assign everything to variables.

v0.2.1 (2024-07-29)
* Add background and fragment size to the output file.
* Fix the command line used for calculating the background to ensure it works correctly.