Can you compare branch length and per site Fst #2320

hyanwong · 2022-06-07T17:15:59Z

hyanwong
Jun 7, 2022
Maintainer

In the docs it says: "Most statistics have the property that mode="branch" and mode="site" are “dual” in the sense that they are equal, on average, under a high neutral mutation rate. Fst() and Tajimas_D() do not have this property (since both are ratios of statistics that do have this property)."

I don't quite understand what's being said here. I think it's reasonable to compare the branch-length version of Fst (across the entire genome) with the sitewise one (and in fact, you don't need to multiply by a mutation rate, because that cancels out on the top and bottom). The duality paper talks about how these ratio-based statistics aren't additive across windows, which I can see is the case. But could someone explain to me in simpler language what the paragraph above actually means?

Here, for example, is something I'd like to use in the docs to illustrate the reduced variance when using mode="branch". I think this is reasonable, isn't it?

import msprime
import matplotlib.pyplot as plt
import numpy as np

L = 1e6  # Simulate 1 megabase of sequence (could increase for a larger example)
rho = 1e-8  # Human-like recombination rate
subpop_size = 1e4
migration_rate = 1e-4
ploidy = 2

mu = 1e-10  # Low mutation rate: emphasises random mutational noise
n_reps = 20
ts_reps = list(msprime.sim_ancestry(
    samples={"pop_0": 10, "pop_1": 10},
    demography=msprime.Demography.island_model([subpop_size] * 2, migration_rate),
    ploidy=ploidy,
    recombination_rate=rho,
    sequence_length=L,
    random_seed=123,
    num_replicates=n_reps,
))

ts_mutated_reps = [
    msprime.sim_mutations(ts, rate=mu, random_seed=i+4) for i, ts in enumerate(ts_reps)
]

# Define sample sets as all samples from each population (uses all pairwise comparisons)
def sample_sets(ts):
    return ts.samples(population=0), ts.samples(population=1)

Fst_genealogy_based = np.array([
    ts.Fst(sample_sets(ts), mode="branch")
    for ts in ts_reps
])

Fst_mutation_based = np.array([
    ts.Fst(sample_sets(ts))
    for ts in ts_mutated_reps
])

plt.scatter(["Genetic variation"] * n_reps, Fst_mutation_based)
plt.scatter(["Genealogy"] * n_reps, Fst_genealogy_based)
plt.xlabel("Basis of estimate")
plt.ylabel(f"Fst\n({n_reps} replicates)")
plt.show()

petrelharp · 2022-06-07T19:31:14Z

petrelharp
Jun 7, 2022
Maintainer

Good question. It's totally reasonable to compare the two. What that section means is: Fst = X / Y where X and Y are things computed from the genotypes. Now let X' and Y' be the corresponding branch stats; these are defined so that E[X] = X' and E[Y] = Y', where the expectations are conditioned on the trees. So, you would reasonably expect X' / Y' to be close to Fst-computed-from-the-genotypes; however, there is no equality of expectations: it is not true that E[Fst] = X' / Y'.

In other words, even though X and Y are unbiased estimators of X' and Y', Fst is a biased estimator of X' / Y', although the bias goes away as the size of the window increases.
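A minimal numerical sketch of that last point, not using tskit at all: here X and Y are hypothetical independent Poisson counts (real Fst numerators and denominators are correlated, so this is only an illustration), `a` and `b` are made-up per-unit expectations, and `scale` stands in for window size.

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = 1.0, 2.0      # hypothetical per-unit expectations, so X'/Y' = a/b = 0.5
n_draws = 200_000    # number of independent "mutation overlays" to average over


def mean_ratio(scale):
    """Average X/Y over many draws for a 'window' of the given size."""
    x = rng.poisson(a * scale, size=n_draws)
    y = rng.poisson(b * scale, size=n_draws)
    keep = y > 0     # X/Y is undefined when Y == 0, so drop those draws
    return np.mean(x[keep] / y[keep])


target = a / b
bias_small = mean_ratio(5) - target     # small window: few mutations
bias_large = mean_ratio(500) - target   # large window: many mutations
print(f"small-window bias: {bias_small:+.4f}")
print(f"large-window bias: {bias_large:+.4f}")
```

Both X and Y are unbiased for their targets at every scale, yet the average of X/Y only approaches X'/Y' as the scale grows.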

(Does this count as "simpler language"?)

5 replies

petrelharp Jun 7, 2022
Maintainer

In even simpler language: if you computed site divergence lots of times on the same tree sequence across lots of different independent applications of mutations, then the average would be equal to branch divergence. However, this is not strictly true for Fst, although the difference between the-average-of-site-Fst-across-many-independently-simulated-batches-of-mutations and branch-Fst will be small if the window is big.

petrelharp Jun 7, 2022
Maintainer

And, yes - the example is good!

hyanwong Jun 7, 2022
Maintainer Author

Thanks. That does clarify it for me. Although (a) I presume the bias is not very great (an example where it is large would perhaps be useful), and (b) the example above is not conditional on the trees, so I suspect that the bias is not an issue here, right?

petrelharp Jun 7, 2022
Maintainer

The bias would be large if you computed it per-site (e.g., with windows="sites"). And, um, let's see - because of Jensen's inequality, E[1/Y] > 1/E[Y], and so if X and Y were independent we'd have E[site-Fst] = E[X/Y] = E[X] E[1/Y] > E[X] / E[Y] = branch-Fst; but because X and Y are not independent, that might not be true. (So, there is still mathematically some bias; i.e. mean(branch-Fst) and mean(site-Fst) are not the same across independent simulations.) However, in your example it's clearly small and I'd say you totally don't need to even mention it.
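For what it's worth, the role of the dependence between X and Y can be made explicit with the standard second-order (delta-method) approximation for the mean of a ratio. This is a generic approximation rather than anything specific to tskit, using the thread's notation E[X] = X' and E[Y] = Y':

```latex
\mathbb{E}\!\left[\frac{X}{Y}\right]
  \;\approx\;
  \frac{X'}{Y'}
  \;-\; \frac{\operatorname{Cov}(X, Y)}{(Y')^{2}}
  \;+\; \frac{X'\,\operatorname{Var}(Y)}{(Y')^{3}}
```

If X and Y are independent the covariance term vanishes and the remaining correction is positive, in line with Jensen's inequality; a positive covariance pulls in the opposite direction, so the sign of the bias is not fixed in general. If the means, variances and covariance all grow roughly in proportion to the window size, both correction terms shrink like 1/(window size) relative to X'/Y'.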

hyanwong Jun 27, 2022
Maintainer Author

Perhaps an extreme example would help clarify. For instance, if in one of the random genealogies, Y (the denominator in your terminology) happens to be tiny (perhaps even zero), then Fst will be correspondingly huge (or infinite). Even if there are a large number of other replicates with Y values that are not so tiny, the arithmetic mean will be skewed by this huge outlier.
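A toy illustration of that outlier effect, using made-up numbers rather than simulated genealogies: 20 hypothetical replicates share the same numerator, and one of them happens to have a tiny denominator.

```python
from statistics import mean, median

# Made-up numerators and denominators for 20 replicates (purely illustrative).
X = [1.0] * 20
Y = [10.0] * 19 + [0.01]  # one replicate has a tiny denominator

ratios = [x / y for x, y in zip(X, Y)]

mean_of_ratios = mean(ratios)        # dominated by the single huge ratio
median_of_ratios = median(ratios)    # unaffected by the outlier
ratio_of_means = mean(X) / mean(Y)   # barely moved by the outlier

print("mean of ratios:  ", mean_of_ratios)
print("median of ratios:", median_of_ratios)
print("ratio of means:  ", ratio_of_means)
```

Nineteen of the ratios are 0.1, yet the arithmetic mean comes out around 5 because of the single replicate with Y = 0.01, while the ratio of means stays close to 0.1.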
