PSHIC decomposes a mixed (unphased) diploid Hi-C contact matrix into two homolog-specific contact maps by learning latent structural embeddings for each homolog. It supports both simulated and real bulk Hi-C data and can optionally exploit inter-homolog contacts for improved phasing.
reconstruction/ Core algorithm
run_sim_pshic.py Main PSHIC entry point
KR_norm.py Knight-Ruiz normalization
ensemble_inits.py Ensemble multiple random initialisations
batch_process.py Batch runner for multiple datasets
baseline/ Comparison methods
allele_certain/ Allele-certain phasing
mate_rescue/ Mate-rescue phasing
pastis/ PASTIS 3D reconstruction
ashic/ ASHIC wrapper
simulation/ Simulation framework
datasets_simulation.py Contact map generation
eval/ Evaluation metrics, scripts, and plots
resolution_analysis/ Resolution scaling experiments
example_data/ Quick-start example matrix
PSHIC requires Python ≥ 3.8 and the following packages:
pip install numpy scipy scikit-learn torchFor KR normalization and evaluation utilities:
pip install pandas statsmodelsFor Hi-C format conversion (see Input data):
pip install cooler hic-strawPSHIC takes as input a symmetric, dense contact matrix for a single chromosome, stored as either:
- a NumPy binary file (
.npy) — recommended - a whitespace-delimited text file (
.txt) — one row per line
import cooler, numpy as np
# Single-resolution .cool
clr = cooler.Cooler("sample.cool")
mat = clr.matrix(balance=False).fetch("chr22")
np.save("chr22_raw.npy", np.nan_to_num(mat).astype(np.float64))
# Multi-resolution .mcool — pick a resolution
clr = cooler.Cooler("sample.mcool::/resolutions/100000")
mat = clr.matrix(balance=False).fetch("chr22")
np.save("chr22_100kb_raw.npy", np.nan_to_num(mat).astype(np.float64))Option A — hic-straw (Python):
import hicstraw, numpy as np
hic = hicstraw.HiCFile("sample.hic")
resolution = 100_000
chrom = "22" # no "chr" prefix for human in .hic files
result = hicstraw.straw("observed", "NONE", "sample.hic",
chrom, chrom, "BP", resolution)
chrom_sizes = {c.name: c.length for c in hic.getChromosomes()}
n_bins = (chrom_sizes[f"chr{chrom}"] + resolution - 1) // resolution
mat = np.zeros((n_bins, n_bins), dtype=np.float64)
for r in result:
i, j = r.binX // resolution, r.binY // resolution
if i < n_bins and j < n_bins:
mat[i, j] += r.counts
mat[j, i] += r.counts
np.save(f"chr{chrom}_100kb_raw.npy", mat)Option B — Juicer Tools dump (command-line):
java -jar juicer_tools.jar dump observed NONE \
sample.hic 22 22 BP 100000 chr22_100kb.txtThen convert the three-column sparse output to a dense matrix:
import numpy as np
resolution, chrom_size = 100_000, 50_818_468 # adjust chrom_size
n_bins = (chrom_size + resolution - 1) // resolution
mat = np.zeros((n_bins, n_bins), dtype=np.float64)
with open("chr22_100kb.txt") as f:
for line in f:
p = line.split()
i, j, v = int(p[0])//resolution, int(p[1])//resolution, float(p[2])
if i < n_bins and j < n_bins:
mat[i, j] += v; mat[j, i] += v
np.save("chr22_100kb_raw.npy", mat)To coarsen a matrix (e.g. 10 kb → 100 kb):
import numpy as np
mat = np.load("chr22_10kb_raw.npy")
factor = 10
M = mat.shape[0] // factor * factor
mat = mat[:M, :M].reshape(M//factor, factor, M//factor, factor).sum(axis=(1,3))
np.save("chr22_100kb_raw.npy", mat)# Simulated / clean data
python reconstruction/run_sim_pshic.py \
-m <matrix.npy> -t <task_name> -o <output_dir> \
-k 5 --device auto
# Real bulk Hi-C (outlier removal + KR normalization recommended)
python reconstruction/run_sim_pshic.py \
-m <matrix.npy> -t <task_name> -o <output_dir> \
-k 32 --real --kr-norm --device auto
# With inter-homolog contacts (Drosophila or other organisms with labeled reads)
python reconstruction/run_sim_pshic.py \
-m <matrix.npy> -t <task_name> -o <output_dir> \
-k 32 --real --kr-norm --inter --device autoKey arguments:
| Argument | Description |
|---|---|
-m |
Path to input matrix (.npy or .txt) |
-t |
Task name (used for output file naming) |
-o |
Output directory |
-k |
Embedding dimension (see Choosing k) |
--real |
Enable outlier-bin removal (recommended for real data) |
--kr-norm |
Apply KR normalization (recommended for real data) |
--inter |
Enable inter-homolog Phase 2 (for data with labeled reads) |
--device |
cpu, cuda, or auto |
Output files (written to <output_dir>/):
| File | Description |
|---|---|
p_hic_<task>.txt |
Predicted paternal contact matrix |
m_hic_<task>.txt |
Predicted maternal contact matrix |
p_struct_<task>.txt |
Paternal structural embedding (N × k) |
m_struct_<task>.txt |
Maternal structural embedding (N × k) |
loss.txt |
Final training loss (used by ensemble script) |
PSHIC uses random initialisation — running multiple times and ensembling improves stability:
for i in $(seq 0 9); do
python reconstruction/run_sim_pshic.py \
-m <matrix.npy> -t <task_name> \
-o output/init${i} \
-k 5 --device auto
done
python reconstruction/ensemble_inits.py \
-d output -t <task_name> -k 5A simulated diploid Hi-C matrix (150 × 150 bins, β = 2000) is provided in
example_data/chd.txt to verify your installation:
python reconstruction/run_sim_pshic.py \
-m example_data/chd.txt \
-t example \
-o example_data/output/init0 \
-k 5 -l 1e-5 --momentum 0.999 --noise 15 --radius 10 \
--device cpu --verboseThe -k parameter controls the dimensionality of the latent structural embedding —
it is a model-capacity parameter, not a physical space dimension. The 3D
visualizations in the paper are PCA projections of the k-dimensional embeddings
included for illustration only; all quantitative analyses use the full
k-dimensional representations.
- Population-averaged contacts. Bulk Hi-C aggregates over millions of cells and conformational states. Restricting k = 3 can be too rigid to approximate this effective contact-frequency profile.
- Coarse-binning aggregation. Contact counts between coarse bins sum over many fine-scale interactions and do not reduce to a single distance-decay term. A larger k compensates for this effect.
- Diploid decomposition capacity. Inferring two homolog-specific maps from one mixed observation requires sufficient representational flexibility for allele-specific structural differences.
| Matrix size (n bins) | Recommended k |
|---|---|
| ≤ 50 | 4 – 8 |
| 50 – 150 | 6 – 16 |
| 150 – 500 | 12 – 32 |
| > 500 | ≥ 32 |
The smallest k achieving mean SCC > 0.9 in our simulations: 50 × 50 → k = 4 | 200 × 200 → k = 6 | 500 × 500 → k = 12
To keep the optimisation well-constrained:
k < (n − 1) / 4
For example, n = 50 gives k < 12.25, consistent with k = 4–8 being sufficient.
- Simulated / low-noise data — use the lower end of the range.
- Real bulk Hi-C — use the upper end; real data are more heterogeneous and may contain stronger allele-specific structural differences.