-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: Interpolate between closest ranks #2995
Conversation
totalFreq += histogram[i].value; | ||
cumFreq.push({ bin: histogram[i].bin, cumFreq: totalFreq }); | ||
} | ||
const normalizer = cumFreq.at(-1).cumFreq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just totalFreq
, fwiw. Except in the edge case where histogram
is empty, but then the .cumFreq
would throw an error. (It's probably impossible?)
} | ||
const normalizer = cumFreq.at(-1).cumFreq | ||
if (normalizer != 1) { | ||
cumFreq = cumFreq.map((item) => {return {bin: item.bin, cumFreq: item.cumFreq/normalizer}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that it really matters, but rather than creating a new set of objects, this could do an in-place update:
const normalizer = 1 / totalFreq;
cumFreq.forEach(item => { item.cumFreq *= normalizer });
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you. I appreciate the review!
if (targetFreq <= cumFreq[0].cumFreq) { | ||
percentileValues[percentile] = cumFreq[0].bin; | ||
} | ||
if (targetFreq >= cumFreq.at(-1).cumFreq) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (targetFreq >= totalFreq)
var y0 = cumFreq[i].bin; | ||
var y1 = cumFreq[i + 1].bin; | ||
// Linear interpolation formula | ||
var percentileValue = y0 + ((targetFreq - x0) * (y1 - y0)) / (x1 - x0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works, but I'm not sure how the bin values are given. If one bucket is for values 3-7, then will .bin
be 3, 7, or (3+7)/2=5? With exponentially increasing bucket sizes, it seems like it would make a difference. ...after more research, I think your formula is exactly right as long as the bins are given as the lowest value in each bucket, which seems like it must be the case. (Then the range of values that fall in some bucket i are [cumFreq[i].bin, cumFreq[i+1].bin) which is exactly what you're using here, so I should shut up already.)
It's a little sketchy, because the linear interpolation assumes that the values are uniformly distributed within the bucket. Apparently sometimes people will use a logarithmic interpolation instead, which is for when the values are exponentially distributed within the bucket. I'm not at all sure about this stuff, but I don't think we can assume any particular distribution. And fortunately, the linear interpolation won't be very far off most other distributions. I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your assumptions are correct. I added a note about this in the tooltip that will show next to the check-box that toggles the feature.
I also think that if we need to add different types of interpolation based on something (e.g. probe metadata) we can do that quickly now that we already have the core feature implemented
</h3> | ||
<span | ||
use:tooltipAction={{ | ||
text: 'Applies a moving average to smooth out short-term fluctuations on percentile values.', | ||
text: 'Generates percentiles using the Between Closest Ranks Linear Interpolation. This can show an innacurate representation of the data if the underlying distribution is not continuous and/or the data between bins is not uniformly distributed.', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/innacurate/inaccurate/
Also, this isn't quite right. This does not depend on the distribution of data between bins, only on the distribution of data within bins. It will actually handle any distribution between bins, which is fortunate since it's usually going to some weird multimodal exponential-ish thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, my bad. I'll open another PR for this. I thought you were done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, my bad. I'll open another PR for this. I thought you were done.
Sorry, I thought I was too.
implements #2985