
Commit 3ee4bc4

ad3002 and claude committed
Release 1.7.0: Unified output format, streaming writes, summary file
Major changes:
- Unified format for .hors.tsv and .monomers.tsv (same 16 columns + parent_idx)
- Added .summary.tsv with per-array statistics and consensus for both HOR and monomer levels
- Streaming writes: constant memory usage regardless of input size
- External sorting with type priority (pred_array → flank → monomer → array → consensus)

New features:
- Consensus sequences (with IUPAC and quality) in summary file
- Edit distance metrics for base monomers
- Global indexing for base monomers with parent HOR linkage

Performance:
- Memory: radically reduced (no longer accumulates all results)
- Sorting: moved to external sort after streaming writes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 4e0fc09 commit 3ee4bc4

5 files changed

Lines changed: 819 additions & 229 deletions


README.md

Lines changed: 108 additions & 16 deletions
@@ -45,46 +45,138 @@ All output is **deterministically sorted by chromosome and genomic position** (c
 | File | Description |
 |------|-------------|
 | `.decomposed.fasta` | Monomers with orientation info in headers |
-| `.hors.tsv` | HOR-level decomposition with metrics per HOR monomer |
-| `.monomers.tsv` | Base-level monomers from recursive HOR decomposition |
+| `.hors.tsv` | HOR-level decomposition (16 columns) |
+| `.monomers.tsv` | Base-level monomers from recursive decomposition (17 columns) |
+| `.summary.tsv` | One-row-per-array summary with HOR and monomer statistics (23 columns) |
 | `.lengths` | Fragment lengths for each array |

+### Summary TSV Columns (`.summary.tsv`)
+
+One row per array combining HOR-level and monomer-level statistics. Useful for overview analysis.
+
+| Column | Description |
+|--------|-------------|
+| `array_id` | Array identifier (chr_start_end_len_period_type) |
+| `array_length` | Total array length in bp |
+| `orientation` | `fwd` or `rev` (reverse complemented to canonical) |
+| `method` | Detection method used (`autocorr`, `classic`) |
+| **HOR-level stats** | |
+| `hor_period` | Detected HOR period in bp |
+| `hor_autocorr` | Autocorrelation at HOR period |
+| `hor_n_monomers` | Number of HOR-level monomers |
+| `hor_mean_ed_tmpl` | Mean edit distance to HOR consensus |
+| `hor_mean_ed_prev` | Mean edit distance between adjacent HORs |
+| `hor_cv` | Coefficient of variation for HOR lengths |
+| `hor_consensus` | Consensus sequence at HOR level |
+| `hor_iupac` | IUPAC ambiguity codes (bases ≥20% frequency) |
+| `hor_quality` | Per-position support (digit 0-9, 9=90-100%) |
+| **Monomer-level stats** | |
+| `mono_period` | Median base monomer period |
+| `mono_autocorr` | Mean autocorrelation at monomer level |
+| `mono_n_monomers` | Total number of base monomers |
+| `mono_mean_ed_tmpl` | Mean edit distance to monomer consensus |
+| `mono_mean_ed_prev` | Mean edit distance between adjacent monomers |
+| `mono_cv` | Mean coefficient of variation |
+| `mono_consensus` | Consensus sequence at monomer level |
+| `mono_iupac` | IUPAC ambiguity codes |
+| `mono_quality` | Per-position support |
+| `cut_sequence` | Anchor k-mer used for splitting |
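Note on `hor_iupac` and `hor_quality`: the table gives a 20% frequency threshold for IUPAC codes and a 0-9 support digit, but not how either is computed. The sketch below is one plausible reading, not ArraySplitter's implementation. It assumes that "support" means the share of the most common base in an alignment column and that the digit is that share in tenths, capped at 9. The function name `summarize_column` and its parameter are hypothetical.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def summarize_column(bases, min_pct=20):
    """Return (consensus base, IUPAC code, quality digit) for one alignment column.

    Assumptions (not confirmed by the README): support is the share of the most
    common base, and the quality digit is floor(10 * support), capped at 9.
    """
    counts = Counter(bases)
    total = len(bases)
    top_base, top_count = counts.most_common(1)[0]
    # Integer arithmetic avoids floating-point edge cases at exactly 20%.
    frequent = frozenset(b for b, c in counts.items() if c * 100 >= min_pct * total)
    quality = min(9, (10 * top_count) // total)
    # Fall back to N for symbol sets outside A/C/G/T (for example gaps).
    return top_base, IUPAC.get(frequent, "N"), str(quality)


mixed = summarize_column("AAAAAAAAGG")   # 80% A, 20% G
clean = summarize_column("CCCCCCCCCC")   # 100% C
```

Under these assumptions, a column with 80% A and 20% G reports consensus `A`, IUPAC code `R`, and quality digit `8`.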
+
 ### HORs TSV Columns (`.hors.tsv`)

-Contains the primary decomposition into HOR (Higher Order Repeat) monomers.
+Contains the primary decomposition into HOR (Higher Order Repeat) monomers. Multiple rows per array.
+
+**Row types** (in order):
+1. `pred_array` - Array-level prediction/header row
+2. `flank` - Terminal fragments <70% of period
+3. `monomer` - Full HOR monomers (sorted by idx)
+4. `array` - Summary statistics row
+5. `consensus` - Consensus sequence row
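Note on row ordering: the list above matches the type priority named in the commit message (pred_array → flank → monomer → array → consensus). A minimal sketch of such an ordering as a sort key follows. It is illustrative only: `TYPE_PRIORITY` and `row_sort_key` are hypothetical names, rows are assumed to be dicts of strings as read from the TSV, and an in-memory `sorted()` stands in for the external sort the commit describes.

```python
# Priority order taken from the row-type list; the rest is an assumption.
TYPE_PRIORITY = {"pred_array": 0, "flank": 1, "monomer": 2, "array": 3, "consensus": 4}


def row_sort_key(row):
    """Order the rows of a single array by row type, then by idx within a type."""
    return (TYPE_PRIORITY[row["type"]], int(row["idx"]))


# Rows of one array in arbitrary order (only the fields the key needs).
rows = [
    {"type": "consensus", "idx": "10"},
    {"type": "monomer", "idx": "1"},
    {"type": "array", "idx": "10"},
    {"type": "monomer", "idx": "0"},
    {"type": "pred_array", "idx": "10"},
]
ordered = sorted(rows, key=row_sort_key)
ordered_types = [(r["type"], r["idx"]) for r in ordered]
```

Ordering across arrays (by chromosome and genomic position) is deliberately left out, since the README does not spell out how `array_id` is compared.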

 | Column | Description |
 |--------|-------------|
 | `array_id` | Array identifier (chr_start_end_len_period_type) |
 | `type` | `pred_array`, `monomer`, `flank`, `array`, `consensus` |
-| `idx` | Monomer index within array |
-| `length` | Sequence length |
-| `source` | Detection method (`anchor`, `split_2x`, etc.) |
+| `idx` | Monomer index within array (0-based) |
+| `length` | Sequence length in bp |
+| `source` | Detection method: `anchor`, `split_2x`, `split_3x`, `left_flank`, `right_flank` |
 | `ed_tmpl` | Edit distance to consensus template |
 | `ed_prev` | Edit distance to previous monomer |
 | `ed_next` | Edit distance to next monomer |
-| `period` | Detected repeat period |
-| `autocorr` | Autocorrelation value at period |
+| `period` | Detected repeat period in bp |
+| `autocorr` | Autocorrelation value at detected period |
+| `n_expected` | Expected count of monomers (array_len / period) |
+| `ed_per_bp` | Normalized edit distance (ed / length) |
+| `cv` | Coefficient of variation for lengths |
 | `cut_sequence` | Anchor sequence used for splitting |
 | `orientation` | `fwd` or `rev` (reverse complemented) |
-| `sequence` | Actual DNA sequence |
+| `sequence` | Actual DNA sequence (or `-` for pred_array/array rows) |
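Note on the derived columns: `n_expected`, `ed_per_bp`, and `cv` are defined in the table by simple formulas, which can be sketched as below. The function names mirror the column names but are hypothetical helpers, not part of ArraySplitter. Two points are assumptions: the README does not say whether `n_expected` is rounded, and it does not say whether `cv` uses the population or the sample standard deviation (the population form is used here).

```python
from statistics import mean, pstdev


def n_expected(array_len, period):
    """Expected monomer count: array length divided by the detected period."""
    return array_len / period


def ed_per_bp(ed, length):
    """Edit distance normalized by sequence length."""
    return ed / length


def cv(lengths):
    """Coefficient of variation of lengths (population std / mean; an assumption)."""
    return pstdev(lengths) / mean(lengths)


expected = n_expected(5120, 512)      # array of 5120 bp with a 512 bp period
normalized = ed_per_bp(5, 500)        # 5 edits over 500 bp
uniform_cv = cv([171, 171, 171])      # identical lengths give zero variation
```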

 ### Monomers TSV Columns (`.monomers.tsv`)

-Contains base-level monomers after recursive decomposition of HORs. Each HOR is recursively decomposed until no further periodicity is detected (autocorrelation ≤ 0.5) or minimum length (5bp) is reached.
+Contains base-level monomers after recursive HOR decomposition. **Unified format** matching `.hors.tsv` plus `parent_idx`.
+
+Each HOR is recursively decomposed until:
+- No further periodicity detected (autocorrelation ≤ 0.5)
+- Minimum length (5bp) reached
+
+**Row types** (in order):
+1. `pred_array` - Array-level summary row
+2. `base_monomer` - Base-level monomers from recursive decomposition
+3. `monomer` - Non-decomposable monomers (e.g., telomeres)

 | Column | Description |
 |--------|-------------|
 | `array_id` | Array identifier |
-| `hor_idx` | Index of parent HOR from primary decomposition |
-| `sub_idx` | Index within parent HOR (hierarchical for nested decomposition) |
-| `level` | Recursion depth (1 = direct child of HOR) |
-| `length` | Sequence length |
-| `period` | Detected period at this level (0 if base monomer) |
-| `autocorr` | Autocorrelation value at detected period |
+| `type` | `pred_array`, `base_monomer`, `monomer` |
+| `idx` | Global index within array (0-based) |
+| `length` | Sequence length in bp |
 | `source` | `recursive_anchor`, `recursive_split`, `base`, `recursive_flank` |
+| `ed_tmpl` | Edit distance to submonomer consensus |
+| `ed_prev` | Edit distance to previous base monomer |
+| `ed_next` | Edit distance to next base monomer |
+| `period` | Detected period at this level (0 if base) |
+| `autocorr` | Autocorrelation value |
+| `n_expected` | Always 1 for individual monomers |
+| `ed_per_bp` | Normalized edit distance |
+| `cv` | Coefficient of variation within parent group |
+| `cut_sequence` | Inherited anchor sequence |
+| `orientation` | Inherited from array (`fwd`/`rev`) |
+| `parent_idx` | Index of parent HOR from `.hors.tsv` |
 | `sequence` | Actual DNA sequence |

+### Example: α-satellite HOR Decomposition
+
+For a typical α-satellite HOR (512bp → 3×171bp monomers):
+
+**`.hors.tsv`** - 10 HOR monomers (~512bp each):
+```
+array_id type idx length period ...
+chr1_centromere pred_array 10 5120 512 ...
+chr1_centromere monomer 0 512 512 ...
+chr1_centromere monomer 1 512 512 ...
+...
+chr1_centromere array 10 5120 512 ...
+chr1_centromere consensus 10 512 512 ... [consensus seq]
+```
+
+**`.monomers.tsv`** - 30 base monomers (~171bp each):
+```
+array_id type idx length parent_idx ...
+chr1_centromere pred_array 30 5120 - ...
+chr1_centromere base_monomer 0 171 0 ...
+chr1_centromere base_monomer 1 171 0 ...
+chr1_centromere base_monomer 2 170 0 ...
+chr1_centromere base_monomer 3 171 1 ...
+...
+```
+
+**`.summary.tsv`** - Single row with both levels:
+```
+array_id length hor_period hor_n_monomers mono_period mono_n_monomers ...
+chr1_centromere 5120 512 10 171 30 ...
+```
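Note on `parent_idx`: the example rows show how base monomers link back to their parent HOR. A small sketch of using that linkage is given below, assuming the file is tab-separated with a header row (the examples suggest this, but the diff does not state it). Only a subset of the 17 columns is reproduced, and `base_lengths_by_parent` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict


def base_lengths_by_parent(handle):
    """Map parent_idx -> list of base-monomer lengths, in file order."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["type"] == "base_monomer":  # skip pred_array and monomer rows
            groups[int(row["parent_idx"])].append(int(row["length"]))
    return dict(groups)


# Illustrative subset of columns, mirroring the .monomers.tsv example rows.
EXAMPLE = "\n".join("\t".join(fields) for fields in [
    ("array_id", "type", "idx", "length", "parent_idx"),
    ("chr1_centromere", "pred_array", "30", "5120", "-"),
    ("chr1_centromere", "base_monomer", "0", "171", "0"),
    ("chr1_centromere", "base_monomer", "1", "171", "0"),
    ("chr1_centromere", "base_monomer", "2", "170", "0"),
    ("chr1_centromere", "base_monomer", "3", "171", "1"),
])

groups = base_lengths_by_parent(io.StringIO(EXAMPLE))
```

For the rows shown, the three base monomers with `parent_idx` 0 have lengths 171, 171, and 170, which sum to the 512 bp of their parent HOR.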
+
 ## Algorithm

 ArraySplitter employs an autocorrelation-based algorithm for detecting repeat periods and decomposing satellite DNA arrays.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "ArraySplitter"
-version = "1.6.0"
+version = "1.7.0"
 description = "De Novo Decomposition of Satellite DNA Arrays into Monomers within Telomere-to-Telomere Assemblies"
 readme = "README.md"
 license = {text = "MIT"}

src/rust/arraysplitter/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "arraysplitter_rs"
-version = "1.6.0"
+version = "1.7.0"
 edition = "2021"
 authors = ["Aleksey Komissarov <ad3002@gmail.com>"]
 description = "De novo decomposition of satellite DNA arrays into monomers"
