Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
249 changes: 243 additions & 6 deletions docs/stats.md
Original file line number Diff line number Diff line change
Expand Up @@ -684,14 +684,251 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
and {math}`v_j` is the covariance of the trait with the j-th covariate.


(sec_stats_multi_site)=

## Multi site statistics

:::{todo}
Document statistics that use information about correlation between sites, such as
LdCalculator (and perhaps reference {ref}`sec_identity`). Note that if we have a general
framework which has the same calling conventions as the single site stats,
we can rework the sections above.
:::
(sec_stats_two_locus)=

### Two-locus statistics

The {meth}`~TreeSequence.ld_matrix` method provides an interface to
a collection of two-locus statistics with predefined summary functions (see
{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch`
{ref}`modes <sec_stats_mode>`. The LD matrix method differs from other
statistics methods in that it provides a unified API with an argument to
specify different two-locus summaries of the data. It otherwise behaves
similarly to most other functions with respect to `sample_sets` and `indexes`.

Two-locus statistics can be computed using two modes, either `site` or
`branch`, and these should be interpreted in the same way as these modes in the
single-site statistics. Within this framework, we also implement polarisation,
but do not expose it to the user, opting to provide statistics that are
polarised where appropriate.

(sec_stats_two_locus_site)=

#### Site

The `site` mode computes two-locus statistics summarized over alleles between
all pairs of specified sites. The default behavior, leaving `sites`
unspecified, this method will compute a matrix for all pairs of sites, with
rows and columns representing each site in the tree sequence (i.e., an n×n
matrix where n is the number of sites in the tree sequence). We can also
restrict the output to a subset of sites, either by specifying a single vector
for both rows and columns or a pair of vectors for the row sites and column
sites separately.

The following computes a matrix of the {math}`r^2` measure of linkage
disequilibrium (LD) computed pairwise between the first 4 sites in the tree
sequence among all samples. In our computations, row sites are used as the row
("left-hand") loci and column sites are used as the column ("right-hand")
locus, and with a single list of sites specified, we obtain a symmetric square
matrix.

```{code-cell} ipython3
ld = ts.ld_matrix(sites=[[0, 1, 2, 3]])
print(ld)
```

The following demonstrates how we can specify the row and column sites
independently of each other. We're specifying 3 columns and 2 rows, which
computes a subset of the matrix shown above.

```{code-cell} ipython3
ld = ts.ld_matrix(sites=[[0, 1], [1, 2, 3]])
print(ld)
```

Because we implement two-locus statistics for multi-allelic data, we require
a method for combining the statistical results from each pair of alleles into
one summary for a pair of sites. There two methods for combining results from
multiple alleles, `hap_weighted` and `total_weighted`, which are
statistic-specific and not chosen by the user:

(sec_stats_two_locus_branch)=

#### Branch

The `branch` mode computes two-locus statistics between pairs of trees. The
resulting statistic is a summary depending on marginal tree topologies and the
branch lengths of the trees at a pair of positions. To perform this
computation, we consider the counts of all possible two-locus haplotypes that
may be generated by possible mutations on each pair of branches between the two
trees, which is summarized using an LD statistic. We then sum each pairwise
comparison weighted by the product of the two branch lengths above the nodes on
each tree.

Similar to the single-site statistics computed in `branch` mode, this results
in a statistic that is proportional to the expected statistic under an infinite
sites model (with mutation rate 1), conditioned on the pair of trees.

The time complexity of this method is quadratic, due to the pairwise
comparisons of branches from a pair of trees. By default, this method computes
a symmetric matrix for all pairs of trees, with rows and columns representing
each tree in the tree sequence. Similar to the site method, we can restrict the
output to a subset of trees, either by specifying a vector of positions or
a pair of vectors for row and column positions separately. To select a specific
tree, the specified positions must land in the tree span (`[start, end)`).

In the following, we compute a matrix of expected {math}`r^2` within and
between the first 4 trees in the tree sequence. The tree breakpoints are
a convenient way to specify those first four trees tree.

```{code-cell} ipython3
ld = ts.ld_matrix(mode="branch", positions=[ts.breakpoints(as_array=True)[0:4]])
print(ld)
```

Again, we can specify the row and column trees separately.

```{code-cell} ipython3
breakpoints = ts.breakpoints(as_array=True)
ld = ts.ld_matrix(mode="branch", positions=[breakpoints[[0]], breakpoints[0:4]])
print(ld)
```

(sec_stats_two_locus_sample_sets)=

#### Sample Sets

The two-locus API provides a mechanism by which to subset the samples under
consideration, providing the ability to compute a separate LD matrix for each
sample set or an LD matrix between sample sets. Output dimensions are handled
in the same manner as the rest of the stats API (see
{ref}`sec_stats_output_dimensions`). The two-locus API differs from the rest of
the stats API in that we handle one-way and two-way statistics in the same
function call.

To compute a two-way two-locus statistic, the `index` argument must be
provided. The statistics are selected in the same way (with the `stat`
argument), but we provide a restricted set of two-way statistics (see
{ref}`sec_stats_two_locus_summary_functions_two_way`). The dimension-dropping
rules for the result follow the rest of the tskit stats API in that scalar
indexes will produce a single two-dimensional matrix, while list of indexes
will produce a three-dimensional array, with the outer dimension of length
equal to the length of the list of indexes.

For concreteness, we would expect the following dimensions with the specified
`sample_sets` and `indexes` arguments.

```
# one-way
ts.ld_matrix(sample_sets=None) -> 2 dimensions
ts.ld_matrix(sample_sets=[0, 1, 2, 3]) -> 2 dimensions
ts.ld_matrix(sample_sets=[[0, 1, 2, 3]]) -> 3 dimensions
# two-way
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=(0, 1)) -> 2 dimensions
ts.ld_matrix(sample_sets=[[0, 1, 2, 3], [4, 5, 6, 7]], indexes=[(0, 1)]) -> 3 dimensions
```

(sec_stats_two_locus_sample_one_way_stats)=

##### One-way Statistics

One-way statistics are summaries of two loci in a single sample set, using a
triple of haplotype counts {math}`\{w_{Ab}, w_{aB}, w_{AB}\}` and the size of
the sample set {math}`n`, where the capitalized and lowercase letters in our notation
represent alternate alleles.

(sec_stats_two_locus_sample_two_way_stats)=

##### Two-way Statistics

Two-way statistics are summaries of haplotype counts between two sample sets,
which operate on the three haplotype counts (as in one-way stats, above)
computed from each sample set, indexed by `(i, j)`. These statistics take on
a different meaning from their one-way counterparts. For instance {math}`D_i
D_j` can be thought of as the covariance between two loci when computed for
a haplotype in a single sample set.

Only a subset of our summary functions are two-way statistics (see
{ref}`sec_two_locus_summary_functions_two_way`). Note that the unbiased two-way
statistics expect non-overlapping sample sets, and we do not make any
assertions about the sample sets and assume that `i` and `j` represent disjoint
sets of samples (see also the note in {meth}`~TreeSequence.divergence`).

(sec_stats_two_locus_summary_functions)=

#### Summary Functions

(sec_stats_two_locus_summary_functions_one_way)=

##### One-way

The two-locus summary functions all take haplotype counts and sample set size as
input. Each of our summary functions has the signature
{math}`f(w_{Ab}, w_{aB}, w_{AB}, n)`, converting to haplotype frequencies
{math}`\{p_{Ab}, p_{aB}, p_{AB}\}` where appropriate by dividing by {math}`n`.

`D`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}`

This statistic is polarised, as the unpolarised result, which averages over
allele labelings, is zero. Uses the `total` normalisation method.

`D_prime`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{D_{\max}}`

Where {math}```
D_{\max} = \begin{cases}
\min\{p_A (1-p_B), p_B (1-p_B)\} & \textrm{if }D>=0 \\
\min\{p_A p_B, (1-p_B) (1-p_B)\} & \textrm{otherwise}
\end{cases}```

and {math}`D` is defined above.

`D2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = (p_{ab} - p_{a} p_{b})^2`

Unpolarised, total weighted.

`Dz`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D (1 - 2 p_{a})(1 - 2p_{b})`

Where {math}`D` is defined above. Unpolarised, total weighted.

`pi2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{a}p_{b}(1-p_{a})(1-p_{b})`

Unpolarised, total weighted.

`r`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D}{\sqrt{p_{a}p_{b}(1-p_{a})(1-p_{b})}}`

Where {math}`D` is defined above. Polarised, total weighted.

`r2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{D^{2}}{p_{a}p_{b}(1-p_{a})(1-p_{b})}`

Where {math}`D` is defined above. Unpolarised, haplotype weighted.

Unbiased two-locus statistics from the Hill-Robertson (1968) system are
computed from haplotype counts. Derivations for these unbiased estimators can
be found in [Ragsdale and Gravel
(2020)](https://doi.org/10.1093/molbev/msz265). They require at least 4 samples
to be valid and are specified as `stat=D2_unbiased`, `Dz_unbiased`, or
`pi2_unbiased`.

(sec_two_locus_summary_functions_two_way)=

(sec_stats_two_locus_summary_functions_two_way)=

##### Two-way

Two-way statistics are indexed by sample sets {math}`i, j` and compute values
using haplotype counts within pairs of sample sets.

`D2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = D_i * D_j`

Where {math}`D` is defined above.

`r2`
: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = \frac{(p_{AB_i} - (p_{A_i} p_{B_i})) (p_{AB_j} - (p_{A_j} p_{B_j}))}{\sqrt{p_{A_i} (1 - p_{A_i}) p_{B_i} (1 - p_{B_i})}\sqrt{p_{A_j} (1 - p_{A_j}) p_{B_j} (1 - p_{B_j})}}`

And `D2_unbiased`, which can be found in [Ragsdale and Gravel
(2020)](https://doi.org/10.1093/molbev/msz265).


(sec_stats_notes)=
Expand Down
6 changes: 6 additions & 0 deletions python/CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,12 @@ In development
- Add ``json+struct`` metadata codec that allows storing binary data using a struct
schema alongside JSON metadata. (:user:`benjeffery`, :pr:`3306`)

**Features**

- Add ``TreeSequence.ld_matrix`` stats method and documentation, for computing
two-locus statistics in site and branch mode.
(:user:`lkirk`, :user:`apragsdale`, :pr:`3416`)

--------------------
[1.0.2] - 2026-03-06
--------------------
Expand Down
81 changes: 79 additions & 2 deletions python/tskit/trees.py
Original file line number Diff line number Diff line change
Expand Up @@ -10930,12 +10930,89 @@ def impute_unknown_mutations_time(
def ld_matrix(
self,
sample_sets=None,
sites=None,
positions=None,
mode="site",
stat="r2",
sites=None,
positions=None,
indexes=None,
):
r"""

Returns a matrix of the specified two-locus statistic (default
:math:`r^2`) computed from sample allelic states or branch lengths.
The resulting linkage disequilibrium (LD) matrix represents either the
two-locus statistic as computed between all pairs of specified
``sites`` ("site" mode, producing a num_sites-by-num_sites sized
matrix), or as computed from the branch structures at marginal trees
between pairs of trees at all specified ``positions`` ("branch" mode,
producing a num_positions-by-num_positions sized matrix).

In the site mode, the sites under consideration can be restricted using
the ``sites`` argument. Sites can be passed as a list of lists,
specifying the ``[[row_sites], [col_sites]]``, resulting in a
rectangular matrix, or by specifying a single list of ``[sites]``, in
which a square matrix will be produced (see
:ref:`sec_stats_two_locus_site` for examples).

Similarly, in the branch mode, the ``positions`` argument specifies
loci for which the expectation for the two-locus statistic is computed
over pairs of trees at those positions. LD stats are computed between
trees whose ``[start, end)`` contains the given position (such that
repeats of trees are possible). Similar to the site mode, a nested list
of row and column positions can be specified separately (resulting in a
rectangular matrix) or a single list of a specified positions results
in a square matrix (see :ref:`sec_stats_two_locus_branch` for
examples).

Some LD statistics are defined for two sample sets instead of within a
single set of samples. If the ``indexes`` argument is specified, at
least two sample sets must also be specified. ``indexes`` specifies the
sample set indexes between which to compute LD.

For more on how the ``indexes`` and ``sample_sets`` interact with the
output dimensions, see the :ref:`sec_stats_two_locus_sample_sets`
section.

**Available Stats** (use ``Stat Name`` in the ``stat`` keyword
argument).

======================= ========== ================ ==============
Stat Polarised Multi Sample Set Stat Name
======================= ========== ================ ==============
:math:`r^2` n y "r2"
:math:`r` y n "r"
:math:`D^2` n y "D2"
:math:`D` y n "D"
:math:`D'` y n "D_prime"
:math:`D_z` n n "Dz"
:math:`\pi2` n n "pi2"
:math:`\widehat{D^2}` n y "D2_unbiased"
:math:`\widehat{D_z}` n n "Dz_unbiased"
:math:`\widehat{\pi_2}` n n "pi2_unbiased"
======================= ========== ================ ==============

:param list sample_sets: A list of lists of sample Node IDs, specifying
the groups of nodes to compute the statistic with. Defaults to all
samples grouped by population.
:param str mode: A string giving the "type" of the statistic to be
computed. Defaults to "site", can be "site" or "branch".
:param str stat: A string giving the selected two-locus statistic to
compute. Defaults to "r2".
:param list sites: A list of sites over which to compute LD. Can be
specified as a list of lists to control the row and column sites.
Only applicable in site mode. Specify as
``[[row_sites], [col_sites]]`` or ``[all_sites]``.
:param list positions: A list of genomic positions where expected LD is
computed. Only applicable in branch mode. Can be specified as a list
of lists to control the row and column positions. Specify as
``[[row_positions], [col_positions]]`` or ``[all_positions]``.
:param list indexes: A list of 2-tuples or a single 2-tuple, specifying
the indexes of two sample sets over which to compute a two-way LD
statistic. Only :math:`r^2`, :math:`D^2`, and :math:`\widehat{D^2}`
are implemented for two-way statistics.
:return: A 2D or 3D array of LD matrices.
:rtype: numpy.ndarray
"""
one_way_stats = {
"D": self._ll_tree_sequence.D_matrix,
"D2": self._ll_tree_sequence.D2_matrix,
Expand Down