Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TPR can be greater than 1.0 due to multiple hits #11

Open
petrelharp opened this issue Dec 20, 2024 · 4 comments
Open

TPR can be greater than 1.0 due to multiple hits #11

petrelharp opened this issue Dec 20, 2024 · 4 comments

Comments

@petrelharp
Copy link
Contributor

For instance, in our documentation:
Screenshot from 2024-12-15 08-20-07
The issue here is that "best match" can be many-to-one: in other words; recall that α(n1) is the mapping from the nodes of T1 to T2 that gives us the "best match" in T2; there can be many n1 that all map to the same n2. Potential resolutions:

  1. Leave it as-is. I don't think this is ideal.
  2. Average over all the nodes that map to the same n2.
  3. Redefine α so that it is one-to-one, by letting only the "best" n1 match a given n2.
  4. Redefine sim(T1, T2) so that it only sums over a subset of nodes for which α(n1) is one-to-one (and picking the "best" ones).

For reference, here's the definitions:
Screenshot from 2024-12-20 08-41-00

So, option (4) is proposing something like: for each node in N2, let β(n2) denote the node n1 such that α(n1)=n2 and n1 has the best agreement out of all such n1; then sim(T1, T2) sums over nodes in N2 for which β(n2) is defined. Option (3) is similar but redefines α(n1).

At first I liked, conceptually, options (3) and (4) since a 1-to-1 map seems more like what we want to be measuring, but I think they open a can of worms because: suppose that both n and m map to n2, but n maps better. So, we say (effectively) that n maps to n2 while m is unmatched. But, then we should map m to its second-best match... but, maybe there's a different node mapping there? Well, I guess we could use the Gale-Shapely algorithm to resolve this?

Option (2) (thx to @hfr1tz3) would certainly be simpler.

One practical consideration is: under the current definition of TPR, one can increase TPR by adding a bunch of entirely-unary nodes above any correctly matching node. And, I suppose that those nodes would not be wrong. So, maybe (2) would be doing the right thing here, while (3&4) would be a bit wrong (since they'd penalize for those nodes).

On the other hand, I'm vaguely imagining a situation where the "right answer" should be that n1 -> n2 and m1 -> m2 but things are arranged so that both n1 and m1 map to n2; in this case, a version of (3&4) with Gale-Shapely would do the right thing. This could happen if m2 is unary above n2 for a long span, but then another branch comes in between the two; and then in the inferred ARG we get this almost right except that the branch coming in has a wrong sample and the span of that new branch is a bit longer than it should be.

Thoughts?

@nspope
Copy link

nspope commented Dec 20, 2024

Hmm ... I don't have a strong opinion and you've clearly thought this through more carefully, but I suspect we can always come up with hypothetical situations where one choice will do better than the other. I also don't think we have a straightforward way of measuring how frequent these situations will be in practice (e.g. the scenario in your last paragraph). So, I lean towards (2) which is the simplest to implement and explain.

Let's say we choose (2) and regret it in the future -- then, we could introduce a ties="average" argument that toggles the original behavior, and introduce another method (ties="best" or something) that becomes the default.

@hfr1tz3
Copy link
Collaborator

hfr1tz3 commented Dec 20, 2024

A wandering thought: If we use the symmetrized definitions (where $\overline{\textnormal{sim}}(T1,T2) = \frac{1}{2}(\textnormal{sim}(T_1, T_2)+\textnormal{sim}(T_2,T_1))$ would the TPR be less than 1? I am assuming that if $2>TPR(T_1,T_2)>1$ than most likely (hopefully guaranteed) $TPR(T_2,T_1)<1$ so then their average is less than 1. Could be an additional easy option.

Maybe we want to add an option for compare to give the symmetrized difference anyway?

I also like @nspope 's option to do a ties="average" argument.

@petrelharp
Copy link
Contributor Author

AFACT, TPR is not bounded above: consider as T2 the two-node tree sequence in which node 0 inherits from node 1 on the entire chromosome; and for T1 insert an additional n unary nodes in; this I think has TPR=n+1.

I'm not seeing an intuitive reason for the symmetrized version?

@hfr1tz3
Copy link
Collaborator

hfr1tz3 commented Dec 20, 2024

I'm not seeing an intuitive reason for the symmetrized version?

The basis would be to actually have the 'metric version' of ARF and TPR being used, however you bring up a good counter-example. So appending option 2 would be best instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants