This document describes the statistical models underpinning the major
assignment modes exposed by the call_features toolkit and orchestrated in
scripts/compareAllMethods.sh. The emphasis is on the FLEX (binomial) and EM
(negative-binomial mixture) engines, with supporting notes on the heuristic and
cell-derived Mmin variants. Each section introduces the relevant
probability model, parameter interpretation, and practical implications for
statistical validation.
Scope and references. The descriptions below draw on public literature about feature-barcode demultiplexing (e.g. 10x Genomics whitepapers, Cell Ranger documentation, and CRISPR feature selection papers). Suggested reading:
- 10x Genomics Feature Barcode technology: https://www.10xgenomics.com/resources/white-papers/feature-barcoding-analysis
- Cell Ranger "Feature Barcode" algorithms overview: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/feature-barcoding
- Gasperini, M. et al. A Genome-wide Framework for mapping gene regulation. Cell 176, 377–390 (2019). https://doi.org/10.1016/j.cell.2018.11.029
The FLEX mode assumes that counts for each feature (e.g. hashing oligos or CRISPR guides) in a given cell largely follow a binomial sampling process once ambient contamination has been estimated. Let
m_ibe the total feature counts observed in celli.x_{ij}be the counts for featurejin celli.r_jbe the estimated ambient proportion for featurejderived from background droplets (raw barcodes outside the filtered set).
Under the null hypothesis H₀ that cell i contains only ambient signal, the
counts follow a multinomial distribution with probabilities r_j. FLEX uses a
binomial approximation for the top feature:
X_{ij} | H₀ ~ Binomial(m_i, r_j)
The tail probability Pr(X_{ij} ≥ x_{ij}) defines the ambient p-value. After
Benjamini–Hochberg correction (unless --no-fdr is used), the smallest
q-value must fall below alpha to be considered significant.
tau(--tau, default 0.8): minimum fractionx_{(1)}/m_irequired for the top featurex_{(1)}. Intuitively, it demands that at least 80% of counts belong to the dominant feature, preventing diffuse ambient reads from driving a call.delta(--delta, default 0.4): minimum absolute gapx_{(1)} - x_{(2)}between the top two features. This guards against borderline doublets where two barcodes are comparable.gamma(--gamma, default 0.9): total dominance threshold(x_{(1)} + x_{(2)}) / m_i. A doublet is reported when the two best features jointly dominate the cell and both pass the ambient test.alpha(--alpha, default 1e-4): target false-discovery rate for the ambient binomial tests.floor(--floor, default 12): minimum total counts to attempt classification. Cells below the floor are marked "low support".ambient-q(--ambient-q, default 0.999): quantile used to set the ambient threshold Mambient from negative droplets.
- Ambient proportions
r_jare estimated from the background set. - A count floor
M_minis derived as the maximum of:- Empirical ambient quantile (via
ambient-q). - Statistical power cutoff obtained by inverting the binomial tail at the
chosen
alpha. - User-specified
--m-min-fixed, where applicable.
- Empirical ambient quantile (via
- For each cell satisfying
m_i ≥ M_min:- Compute per-feature binomial p-values and apply BH correction.
- Apply thresholds (
tau,delta,gamma) to classify as singlet, doublet, ambiguous, or low-support.
The EM mode fits a per-feature two-component negative-binomial (NB2) mixture to
model positive cells versus ambient-only cells. For guide g and cell i, the
latent assignment Z_{ig} ∈ {ambient (A), positive (S)} drives the distribution:
C_{ig} | Z_{ig} = A ~ NB(r_bg, p_bg)
C_{ig} | Z_{ig} = S ~ NB(r_pos, p_pos)
The EM (Expectation-Maximisation) algorithm iteratively updates the component
parameters θ = {r_bg, p_bg, r_pos, p_pos, π}, posterior probabilities
P(Z_{ig} = S | C_{ig}), and cell-specific expectations. Exposure adjustments
use the total counts m_i as offsets so that highly sequenced cells are not
penalised.
In call_features, NB2 is parameterised via (r, p) where the mean is
μ = r (1 - p) / p and variance μ + μ² / r. The defaults (see
default_PGEM_params()) set moderately informative priors on dispersion.
Key user-facing parameters:
--use-em: activates the EM path.--em-fixed-disp: freeze NB dispersion updates (rterms) to stabilise sparse datasets. When omitted, dispersion is optimised per guide.--min-em-counts: minimum total counts per guide before attempting EM fit; low-support guides fall back to ambient-only reporting.
Posterior decision thresholds:
tau-pos(--tau-pos): posterior probabilityP(Z_{ig} = S | data)required for a primary singlet call. Default 0.95.tau-pos2(--tau-pos2): looser posterior for a secondary guide when testing doublets. Defaulttau_pos - 0.05.k-min,k-min2: minimum observed counts for primary and secondary guides.gamma-min,gamma-min-cand: dominance ratios for total counts versus candidate-only counts; they guard against ambient-heavy cells.doublet-balance: optional (default on) 20–80% balance constraint to avoid calling strongly skewed doublets.k-small: if the number of allowed featuresKfalls below this threshold (default 4), EM is disabled to prevent ill-posed fits.
em: ambient component learned jointly with positives using the filtered negatives (STARsolo raw minus filtered cells or MTX-derived background).em-fixed: same as above but with--em-fixed-dispto keep dispersion parameters constant.em-noncells: identical toem-fixedbut emphasises the use of non-cell droplets as the ambient reference; useful when ambient expression differs from background estimates derived from candidate cells.
Each iteration proceeds as:
- E-step: compute posterior weights
w_{ig}for each cell-guide pair. - M-step: update NB parameters via weighted sufficient statistics, subject to
clipping (caps on
π,rranges, winsorisation) to avoid degenerate fits. - Evaluate convergence using
P.max_itersandP.tol(default 50 iterations, tolerance 1e-6).
A cell is called positive for guide g when:
m_i ≥ M_min(same cutoff pipeline as FLEX); and- Posterior thresholds (
tau-pos,tau-pos2) and count floors (k-min,k-min2) are met; - Dominance rules (gamma thresholds, balance) pass.
Doublets correspond to cells satisfying the secondary thresholds for two guides
simultaneously. Ambiguous cells fail posterior or dominance checks despite
having sufficient counts, while low-support cells fail M_min.
The --simple-assign mode eschews probabilistic modelling and uses deterministic
criteria based on raw counts:
min-count: minimum count for the top feature; prevents low-abundance noise.min-ratio: required ratiox_{(1)} / max(1, x_{(2)})to declare a singlet.
This is appropriate when features are designed to be mutually exclusive and
strongly expressed (e.g. high-complexity lineage barcodes). No FDR, ambient, or
posterior calculations are performed; cells failing the conditions are ambiguous
or low-support (m_i < M_min).
These methods, invoked via compareAllMethods.sh, determine the low-support
threshold directly from filtered cell totals m_i rather than relying solely on
ambient droplets.
otsu: applies Otsu’s method onlog1p(m_i)to split the distribution into background and foreground. This targets bimodal count distributions typical of ambient versus true signal.quantile: setsM_minto the specified quantile of the cell-total distribution (--mmin-qcells, default 0.60).model3: fits a three-component NB mixture (ambient, singlet, doublet) using EM (compute_Mmin_from_cells). Parameters:--mmin3-max-iters,--mmin3-tol: EM iteration control.--mmin3-update-disp: whether to update dispersions.--mmin3-init: initialisation strategy (quantilesorkmeans). The resultingM_minis chosen at the intersection between ambient and singlet components, reflecting the most likely cutoff separating empty and occupied droplets.
When --mmin-from-cells is active, these data-driven thresholds replace the
standard ambient/power combination. The toolkit prints summaries such as
mixture weights (π_A, π_S, π_D) and component means for verification.
| Parameter | Applies to | Statistical meaning |
|---|---|---|
tau |
FLEX | Dominance fraction for top feature: x_{(1)}/m_i ≥ tau ensures a strong winner under multinomial sampling. |
delta |
FLEX | Absolute gap between top two counts to guard doublets. |
gamma |
FLEX | Combined dominance of top two features; doublet rule. |
alpha |
FLEX | Target FDR for binomial tail tests (BH-adjusted q-values). |
ambient-q |
FLEX & EM | High quantile of negative totals to set ambient floor. |
tau-pos |
EM | Posterior probability threshold `P(Z = S |
tau-pos2 |
EM | Secondary posterior for potential doublets. |
k-min, k-min2 |
EM | Minimum observed counts required for trustable EM inference. |
gamma-min, gamma-min-cand |
EM | Dominance constraints on total counts vs candidate-only counts. |
doublet-balance |
EM | Balance requirement (20–80%) for doublet evidence across two guides. |
min-count, min-ratio |
Simple | Raw-count thresholds for deterministic singlet calls. |
mmin-from-cells (with otsu, quantile, model3) |
All modes | Replace default M_min with data-driven values derived from filtered cell totals. |
- Ambient estimation sanity checks. Inspect
ambient_colsumvs EM posteriors; a uniformr_jimplies homogeneous background, while high variance suggests dominant features or index hopping. - Binomial tests. Recompute
Pr(X ≥ x)for selected cells using R or Python (scipy.stats.binom.sf). Compare BH-adjusted q-values to confirm implementation. - NB mixture fits. For a representative guide:
- Extract counts
C_{ig}and total exposuresm_i. - Fit NB mixtures in R (
fitdistrplus::fitdist) or Python (statsmodels) to validater_bg,r_pos,π. Compare to logged EM summaries. - Evaluate posterior convergence; check that log-likelihood increases monotonically.
- Extract counts
- Doublet logic. Validate doublet calls by confirming both guides exceed
k-min2/tau-pos2and satisfy balance criteria. - Data-driven M_min. Plot histograms of
m_iagainst reportedM_min. In themodel3case, overlay the three NB components and confirm intersection points match the printed threshold.
The --apply-all flag extends classification to all barcodes in the matrix, not just the filtered list. This enables "universe rescue" workflows where conservative filtered lists may exclude valid cells.
Default behavior: Train thresholds on filtered barcodes and classify only those barcodes.
With --apply-all:
- Training phase (unchanged): Ambient model, thresholds (
M_min,tau,delta, etc.), and EM parameters are learned using only the filtered barcodes - Application phase (new): The learned model is applied to every barcode in the matrix that passes quality filters
Key guarantees:
- Filtered barcodes receive identical assignments (regression tested)
- Unfiltered barcodes are classified using the same decision rules
- Quality filters (
M_min) apply uniformly - Output files contain all results (filtered + unfiltered)
Straightforward extension - no training phase exists.
For each unfiltered barcode:
- Find top1 and top2 feature counts
- Apply thresholds:
top1 >= min_countandtop1/top2 >= min_ratio - Classify as singlet, ambiguous, or low-support
Status: Fully validated, production ready.
Applies complete binomial testing logic matching the main classification path (lines ~1020-1048).
For each unfiltered barcode:
- Find top1 and top2 features and counts
- Compute p-values:
p1 = Binomial_tail(v1 | total, r1),p2 = Binomial_tail(v2 | total, r2) - Calculate fractions:
f1 = v1/total,f2 = v2/total - Apply decision rules:
- Singlet:
(f1 ≥ tau) AND ((f1-f2) ≥ delta) AND (p1 < alpha) - Doublet:
(f1+f2 ≥ gamma) AND (balance_check) AND (p1 < alpha) AND (p2 < alpha)
- Singlet:
FDR handling: Uses raw p-values without recomputing Benjamini-Hochberg across the expanded set. This is conservative and preserves the statistical properties learned from high-quality filtered cells.
Status: Regression tested, production ready.
Uses stored EM fit parameters to evaluate unfiltered barcodes.
During training:
- For each guide
g, stores fitted parameters:PGEMFit{pi_pos, a_bg, a_pos, r_bg, r_pos, ll, iters} - Guides with
<min_em_countsare skipped and excluded from apply_all
For each unfiltered barcode:
- For each guide
gwith observed countc_g:- Compute expected background:
mu_bg = a_bg * total * r_g - Evaluate candidacy using heuristic:
c_g >= k_min AND c_g > 2.0 * mu_bg - This approximates
P_pos ≥ tau_poswithout full NB posterior computation
- Compute expected background:
- Count passing guides as candidates
- Apply same classification logic as main EM path:
- Singlet: 1 candidate
- Doublet: 2+ candidates with balance (20-80%) and dominance (
(c1+c2)/total ≥ gamma_minOR(c1+c2)/cand_sum ≥ gamma_min_cand) - Ambiguous: otherwise
Heuristic rationale: The threshold count > 2.0 * mu_bg captures the core EM logic (count vs expected background) using trained parameters without requiring expensive NB likelihood calculations. The multiplier 2.0 is calibrated to approximate P_pos ≥ 0.95 behavior; 1.5 * mu_bg approximates tau_pos2.
Status: Regression tested. Heuristic validated on synthetic data. Recommend monitoring on production data.
Allocations (when --apply_all enabled):
- Base:
int *all_col_tot— 4 bytes × n_cols - Simple/FLEX:
VecFC *all_col_vecs— sparse feature vectors per column - EM:
VecPI *all_col_em_lists— sparse (guide, count) pairs per column - EM:
PGEMFit *em_fits— fitted parameters per guide (K+1 guides)
Typical overhead for 1.6M barcode matrix:
- Base: ~6.4 MB
- FLEX mode: +10-50 MB (depends on K and feature sparsity)
- EM mode: +50-200 MB (sparse storage for guide counts)
Safeguards: Overflow checks on allocations; graceful failure on OOM.
Data capture: Minimal overhead during MTX reading (~1 additional write per triplet).
Application phase: Proportional to unfiltered barcodes passing M_min:
- Typically 10-30% of matrix passes quality filters
- ~1-5 seconds per 100k unfiltered barcodes
- Total overhead: 10-30% of baseline runtime
Regression tests (scripts/test_apply_all_regression.sh):
- ✅ Filtered barcodes: identical assignments with/without
--apply_all - ✅ Additional assignments: unfiltered barcodes correctly classified
- ✅ FLEX mode: executes without errors, proper log messages
- ✅ EM mode: training completes, parameters stored and used
Test results (synthetic data with 30 barcodes, 10 filtered):
- Simple-assign: 6 baseline → 11 with --apply_all (5 rescued)
- FLEX: Executes correctly with full decision logic
- EM: Training summary shows "ran EM on 3 guides", apply_all successful
Production readiness:
- Simple-assign: Fully validated
- FLEX: Validated, ready for production
- EM: Validated, recommend monitoring heuristic behavior
Start conservatively:
# Simple-assign (lowest risk)
call_features --mtx-dir /path/to/matrix \
--cell-list filtered.tsv \
--out-prefix results/rescue \
--apply-all \
--simple-assign --min-count 3 --min-ratio 2.0FLEX for statistical rigor:
call_features --mtx-dir /path/to/matrix \
--cell-list filtered.tsv \
--out-prefix results/rescue_flex \
--apply-all \
--tau 0.8 --delta 0.4 --gamma 0.9 --alpha 1e-4EM for complex screens:
call_features --mtx-dir /path/to/matrix \
--cell-list filtered.tsv \
--out-prefix results/rescue_em \
--apply-all \
--use-em --k-min 4 --tau-pos 0.95 \
--gamma-min 0.8 --gamma-min-cand 0.85Monitoring recommendations:
- Compare QC stats:
n_singlet,n_doublet,n_ambigbetween baseline and --apply_all runs - Check rescued barcode quality: UMI distributions, doublet rates
- Verify memory usage: baseline + 200-500 MB for large matrices
- Monitor runtime: expect 10-30% increase
Q: Why not recompute FDR in FLEX mode?
A: Raw p-values with the same alpha threshold preserve the statistical properties learned from high-quality filtered cells. Recomputing BH on the expanded set (including lower-quality barcodes) could inappropriately relax thresholds.
Q: Why heuristic for EM posteriors?
A: Full NB mixture posterior computation would require exposing internal functions from per_guide_em.c. The heuristic count > 2.0 * mu_bg captures the essential logic (count vs expected background) using the trained model parameters, and is computationally efficient.
Q: How to validate EM heuristic?
A: Compare rescued assignments to known ground truth; adjust multipliers (2.0, 1.5) if needed; consider implementing full posterior if heuristic proves insufficient.
Q: Why append to existing files rather than separate outputs?
A: Simplifies workflow integration. Filtered vs unfiltered barcodes can be distinguished by cross-referencing with filtered_barcodes.tsv.
- FLEX offers a transparent multinomial/binomial test with interpretable
dominance thresholds (
tau,delta,gamma). - EM extends this to a per-guide NB mixture, leveraging posterior probabilities for finer control in sparse data and doublet detection.
- Simple ratio and cell-derived
M_minheuristics provide faster alternatives when assumptions hold or when count distributions supply clearer cutoffs. --apply-allextends any method to classify all barcodes in the matrix, enabling universe rescue and exploratory workflows.- All methods share the same output structure;
compareAllMethods.shexists to benchmark them side-by-side, revealing trade-offs between sensitivity and specificity.