Skip to content

deepomicslab/PSHIC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PSHIC: Reveals Allele-Specific Chromatin Organization by Phasing-Free Structural Modeling

PSHIC decomposes a mixed (unphased) diploid Hi-C contact matrix into two homolog-specific contact maps by learning latent structural embeddings for each homolog. It supports both simulated and real bulk Hi-C data and can optionally exploit inter-homolog contacts for improved phasing.


Repository structure

reconstruction/          Core algorithm
  run_sim_pshic.py       Main PSHIC entry point
  KR_norm.py             Knight-Ruiz normalization
  ensemble_inits.py      Ensemble multiple random initialisations
  batch_process.py       Batch runner for multiple datasets

baseline/                Comparison methods
  allele_certain/        Allele-certain phasing
  mate_rescue/           Mate-rescue phasing
  pastis/                PASTIS 3D reconstruction
  ashic/                 ASHIC wrapper

simulation/              Simulation framework
  datasets_simulation.py Contact map generation
  eval/                  Evaluation metrics, scripts, and plots
  resolution_analysis/   Resolution scaling experiments

example_data/            Quick-start example matrix

Installation

PSHIC requires Python ≥ 3.8 and the following packages:

pip install numpy scipy scikit-learn torch

For KR normalization and evaluation utilities:

pip install pandas statsmodels

For Hi-C format conversion (see Input data):

pip install cooler hic-straw

Input data

PSHIC takes as input a symmetric, dense contact matrix for a single chromosome, stored as either:

  • a NumPy binary file (.npy) — recommended
  • a whitespace-delimited text file (.txt) — one row per line

From .cool / .mcool (cooler format)

import cooler, numpy as np

# Single-resolution .cool
clr = cooler.Cooler("sample.cool")
mat = clr.matrix(balance=False).fetch("chr22")
np.save("chr22_raw.npy", np.nan_to_num(mat).astype(np.float64))

# Multi-resolution .mcool — pick a resolution
clr = cooler.Cooler("sample.mcool::/resolutions/100000")
mat = clr.matrix(balance=False).fetch("chr22")
np.save("chr22_100kb_raw.npy", np.nan_to_num(mat).astype(np.float64))

From .hic (Juicer format)

Option A — hic-straw (Python):

import hicstraw, numpy as np

hic        = hicstraw.HiCFile("sample.hic")
resolution = 100_000
chrom      = "22"   # no "chr" prefix for human in .hic files

result = hicstraw.straw("observed", "NONE", "sample.hic",
                        chrom, chrom, "BP", resolution)

chrom_sizes = {c.name: c.length for c in hic.getChromosomes()}
n_bins = (chrom_sizes[f"chr{chrom}"] + resolution - 1) // resolution
mat    = np.zeros((n_bins, n_bins), dtype=np.float64)

for r in result:
    i, j = r.binX // resolution, r.binY // resolution
    if i < n_bins and j < n_bins:
        mat[i, j] += r.counts
        mat[j, i] += r.counts

np.save(f"chr{chrom}_100kb_raw.npy", mat)

Option B — Juicer Tools dump (command-line):

java -jar juicer_tools.jar dump observed NONE \
    sample.hic 22 22 BP 100000 chr22_100kb.txt

Then convert the three-column sparse output to a dense matrix:

import numpy as np
resolution, chrom_size = 100_000, 50_818_468   # adjust chrom_size
n_bins = (chrom_size + resolution - 1) // resolution
mat    = np.zeros((n_bins, n_bins), dtype=np.float64)
with open("chr22_100kb.txt") as f:
    for line in f:
        p = line.split()
        i, j, v = int(p[0])//resolution, int(p[1])//resolution, float(p[2])
        if i < n_bins and j < n_bins:
            mat[i, j] += v; mat[j, i] += v
np.save("chr22_100kb_raw.npy", mat)

Resolution binning

To coarsen a matrix (e.g. 10 kb → 100 kb):

import numpy as np
mat    = np.load("chr22_10kb_raw.npy")
factor = 10
M      = mat.shape[0] // factor * factor
mat    = mat[:M, :M].reshape(M//factor, factor, M//factor, factor).sum(axis=(1,3))
np.save("chr22_100kb_raw.npy", mat)

Running PSHIC

# Simulated / clean data
python reconstruction/run_sim_pshic.py \
    -m <matrix.npy> -t <task_name> -o <output_dir> \
    -k 5 --device auto

# Real bulk Hi-C (outlier removal + KR normalization recommended)
python reconstruction/run_sim_pshic.py \
    -m <matrix.npy> -t <task_name> -o <output_dir> \
    -k 32 --real --kr-norm --device auto

# With inter-homolog contacts (Drosophila or other organisms with labeled reads)
python reconstruction/run_sim_pshic.py \
    -m <matrix.npy> -t <task_name> -o <output_dir> \
    -k 32 --real --kr-norm --inter --device auto

Key arguments:

Argument Description
-m Path to input matrix (.npy or .txt)
-t Task name (used for output file naming)
-o Output directory
-k Embedding dimension (see Choosing k)
--real Enable outlier-bin removal (recommended for real data)
--kr-norm Apply KR normalization (recommended for real data)
--inter Enable inter-homolog Phase 2 (for data with labeled reads)
--device cpu, cuda, or auto

Output files (written to <output_dir>/):

File Description
p_hic_<task>.txt Predicted paternal contact matrix
m_hic_<task>.txt Predicted maternal contact matrix
p_struct_<task>.txt Paternal structural embedding (N × k)
m_struct_<task>.txt Maternal structural embedding (N × k)
loss.txt Final training loss (used by ensemble script)

Multiple initialisations and ensembling

PSHIC uses random initialisation — running multiple times and ensembling improves stability:

for i in $(seq 0 9); do
    python reconstruction/run_sim_pshic.py \
        -m <matrix.npy> -t <task_name> \
        -o output/init${i} \
        -k 5 --device auto
done

python reconstruction/ensemble_inits.py \
    -d output -t <task_name> -k 5

Quick-start example

A simulated diploid Hi-C matrix (150 × 150 bins, β = 2000) is provided in example_data/chd.txt to verify your installation:

python reconstruction/run_sim_pshic.py \
    -m example_data/chd.txt \
    -t example \
    -o example_data/output/init0 \
    -k 5 -l 1e-5 --momentum 0.999 --noise 15 --radius 10 \
    --device cpu --verbose

Choosing the embedding dimension k

The -k parameter controls the dimensionality of the latent structural embedding — it is a model-capacity parameter, not a physical space dimension. The 3D visualizations in the paper are PCA projections of the k-dimensional embeddings included for illustration only; all quantitative analyses use the full k-dimensional representations.

Why k > 3?

  1. Population-averaged contacts. Bulk Hi-C aggregates over millions of cells and conformational states. Restricting k = 3 can be too rigid to approximate this effective contact-frequency profile.
  2. Coarse-binning aggregation. Contact counts between coarse bins sum over many fine-scale interactions and do not reduce to a single distance-decay term. A larger k compensates for this effect.
  3. Diploid decomposition capacity. Inferring two homolog-specific maps from one mixed observation requires sufficient representational flexibility for allele-specific structural differences.

Recommended k by matrix size

Matrix size (n bins) Recommended k
≤ 50 4 – 8
50 – 150 6 – 16
150 – 500 12 – 32
> 500 ≥ 32

The smallest k achieving mean SCC > 0.9 in our simulations: 50 × 50 → k = 4  |  200 × 200 → k = 6  |  500 × 500 → k = 12

Upper-bound heuristic

To keep the optimisation well-constrained:

k  <  (n − 1) / 4

For example, n = 50 gives k < 12.25, consistent with k = 4–8 being sufficient.

Data complexity

  • Simulated / low-noise data — use the lower end of the range.
  • Real bulk Hi-C — use the upper end; real data are more heterogeneous and may contain stronger allele-specific structural differences.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors