marbl/DJCounter


DJ Counter

Count the number of distal junctions (DJ) of the ribosomal DNA (rDNA)

Purpose

This repository provides tools for estimating the copy number of the DJs in a genome from sequencing reads.

Principle

The copy number of DJs is determined from either 1) the sequencing coverage of mapped reads or 2) the k-mer multiplicity in raw reads.

Main workflow

Copy number of the DJ can be estimated with the following approaches:

  1. Mapping based approach
  2. K-mer based, reference-free approach

Mapping based estimates

The mapping based approach is recommended when the reads are already aligned to one of the following references.

  1. GRCh38/hg38: Homo_sapiens_assembly38.fasta.gz (1000 Genomes Project, Broad version; suitable for UK Biobank) or GRCh38_full_analysis_set_plus_decoy_hla.fa (1000 Genomes Project, NYGC version; suitable for 1KGP NYGC 30x CRAM)

    Requires chr21, chr17_GL000205v2_random and chrUn_GL000195v1.

  2. GRCh37/hg19: human_g1k_v37.fasta.gz (1000 Genomes Project version)

    Requires chr7_gl000195_random and chr17_gl000205_random.

  3. T2T-CHM13/hs1 (Will be updated soon)

For hg38 or hg19, use samtools to confirm that the required sequences are present in the BAM header, for example with samtools view -H in.bam | grep chr17_GL000205v2_random.
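The header check can also be scripted to test all required sequences at once. The following is a minimal sketch, not part of this repository: the function names and the `REQUIRED_CONTIGS` table are hypothetical, and the contig names are copied from the requirements listed above. It parses the `@SQ` lines of the text printed by `samtools view -H`.

```python
import subprocess

# Contig names copied from the requirements listed above (assumed complete).
REQUIRED_CONTIGS = {
    "hg38": ["chr21", "chr17_GL000205v2_random", "chrUn_GL000195v1"],
    "hg19": ["chr7_gl000195_random", "chr17_gl000205_random"],
}


def contigs_in_header(header_text):
    """Return the set of sequence names found on @SQ lines of a SAM header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def missing_contigs(header_text, reference):
    """List the required contigs for `reference` that are absent from the header."""
    present = contigs_in_header(header_text)
    return [name for name in REQUIRED_CONTIGS[reference] if name not in present]


def read_header(bam_path):
    """Run `samtools view -H` (samtools must be on PATH) and return its output."""
    result = subprocess.run(
        ["samtools", "view", "-H", bam_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For a real file, `missing_contigs(read_header("in.bam"), "hg38")` returns an empty list when all required sequences are present.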

Read coverage is assessed on the mapped BAM file for the target DJ region and compared against the background coverage collected from autosomes.
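The comparison against background coverage can be illustrated with a small calculation. This is a sketch under stated assumptions rather than the exact formula used by the tools here: it assumes the autosomal background represents two copies (diploid) and that coverage scales linearly with copy number. The function name and parameters are hypothetical.

```python
def estimate_dj_copies(dj_coverage, background_coverage, background_ploidy=2):
    """Estimate DJ copy number from mean coverage values.

    Assumes the background (autosomal) coverage corresponds to
    `background_ploidy` copies and that coverage is proportional to copy number.
    """
    if background_coverage <= 0:
        raise ValueError("background_coverage must be positive")
    return background_ploidy * dj_coverage / background_coverage


# Illustrative numbers only: 150x over the DJ region, 30x autosomal background.
print(estimate_dj_copies(150.0, 30.0))  # 10.0
```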

K-mer based approach

The k-mer based approach is reference-free.

This approach is recommended when the reads are aligned to hg38 or hg19 without any decoy sequences, or when they are still in raw FASTQ form. A pre-built set of target k-mers is used to query the k-mer multiplicity of the DJ, which is then compared against the single-copy / 2-copy multiplicity inferred from the k-mer multiplicity histogram.
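The k-mer comparison can be sketched in the same way. This example is an assumption-laden illustration, not the repository's implementation: it takes the median multiplicity of the target k-mers as the DJ signal and divides it by the per-copy multiplicity derived from the 2-copy peak of the histogram. The function name, the use of the median, and the input values are all hypothetical.

```python
from statistics import median


def estimate_dj_copies_kmer(target_counts, two_copy_peak):
    """Estimate DJ copy number from target k-mer multiplicities.

    target_counts: multiplicities of the DJ target k-mers in the reads.
    two_copy_peak: multiplicity of the 2-copy (diploid) peak in the
                   k-mer multiplicity histogram.
    """
    if two_copy_peak <= 0:
        raise ValueError("two_copy_peak must be positive")
    per_copy = two_copy_peak / 2  # multiplicity contributed by one copy
    return median(target_counts) / per_copy


# Illustrative numbers only: target k-mers seen about 150 times, 2-copy peak at 30.
print(estimate_dj_copies_kmer([148, 150, 152, 149, 151], 30))  # 10.0
```

The median is used here because it is less sensitive than the mean to a few target k-mers that also occur elsewhere in the genome.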

Change logs

v0.1 (2024-07-17)
* First commit.

v0.2 (2024-07-25)
* Changing the background coverage estimation method from samtools idxstats to samtools coverage.
* Removing the step of saving temporary files; instead, we assign everything to variables.

v0.2.1 (2024-07-29)
* Add background and fragment size to the output file.
* Fix the command line used for calculating the background to ensure it works correctly.

v0.2.2 (2025-11-26)
* Add BED file for the ROI on hg19.

About

This repository can be used to count the DJs in a genome from reads aligned to GRCh38 (Broad reference).
