Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: Interpolate between closest ranks #2995

Merged
merged 4 commits into from
Oct 22, 2024
Merged

Conversation

edugfilho
Copy link
Collaborator

@edugfilho edugfilho commented Oct 22, 2024

implements #2985

totalFreq += histogram[i].value;
cumFreq.push({ bin: histogram[i].bin, cumFreq: totalFreq });
}
const normalizer = cumFreq.at(-1).cumFreq

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just totalFreq, fwiw. Except in the edge case where histogram is empty, but then the .cumFreq would throw an error. (It's probably impossible?)

}
const normalizer = cumFreq.at(-1).cumFreq
if (normalizer != 1) {
cumFreq = cumFreq.map((item) => {return {bin: item.bin, cumFreq: item.cumFreq/normalizer}})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it really matters, but rather than creating a new set of objects, this could do an in-place update:

  const normalizer = 1 / totalFreq;
  cumFreq.forEach(item => { item.cumFreq *= normalizer });

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you. I appreciate the review!

if (targetFreq <= cumFreq[0].cumFreq) {
percentileValues[percentile] = cumFreq[0].bin;
}
if (targetFreq >= cumFreq.at(-1).cumFreq) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (targetFreq >= totalFreq)

var y0 = cumFreq[i].bin;
var y1 = cumFreq[i + 1].bin;
// Linear interpolation formula
var percentileValue = y0 + ((targetFreq - x0) * (y1 - y0)) / (x1 - x0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works, but I'm not sure how the bin values are given. If one bucket is for values 3-7, then will .bin be 3, 7, or (3+7)/2=5? With exponentially increasing bucket sizes, it seems like it would make a difference. ...after more research, I think your formula is exactly right as long as the bins are given as the lowest value in each bucket, which seems like it must be the case. (Then the range of values that fall in some bucket i are [cumFreq[i].bin, cumFreq[i+1].bin) which is exactly what you're using here, so I should shut up already.)

It's a little sketchy, because the linear interpolation assumes that the values are uniformly distributed within the bucket. Apparently sometimes people will use a logarithmic interpolation instead, which is for when the values are exponentially distributed within the bucket. I'm not at all sure about this stuff, but I don't think we can assume any particular distribution. And fortunately, the linear interpolation won't be very far off most other distributions. I think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your assumptions are correct. I added a note about this in the tooltip that will show next to the check-box that toggles the feature.
I also think that if we need to add different types of interpolation based on something (e.g. probe metadata) we can do that quickly now that we already have the core feature implemented

</h3>
<span
use:tooltipAction={{
text: 'Applies a moving average to smooth out short-term fluctuations on percentile values.',
text: 'Generates percentiles using the Between Closest Ranks Linear Interpolation. This can show an innacurate representation of the data if the underlying distribution is not continuous and/or the data between bins is not uniformly distributed.',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/innacurate/inaccurate/

Also, this isn't quite right. This does not depend on the distribution of data between bins, only on the distribution of data within bins. It will actually handle any distribution between bins, which is fortunate since it's usually going to some weird multimodal exponential-ish thing.

Copy link
Collaborator Author

@edugfilho edugfilho Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, my bad. I'll open another PR for this. I thought you were done.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, my bad. I'll open another PR for this. I thought you were done.

Sorry, I thought I was too.

@edugfilho edugfilho merged commit cd25a09 into main Oct 22, 2024
4 checks passed
@edugfilho edugfilho deleted the interpolate-between-ranks branch October 22, 2024 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants