Error bound check for CKMS appears to be off #27
Is it possible for you to share your test code, or at least the data you're using? Regarding the paper, that's the one per this discussion, but back when I originally implemented this library the version available freely online had a bug in its algorithm. I know the IEEE version is accurate, per discussion with Graham Cormode. If you can share your test data I'll try to reproduce the issue in this library's test code.
My Elixir code is a mess at the moment, but I wrote a small test in Rust that shows what I mean. What I'm unsure about is how to correctly compute the exact quantile: what should the index be?

```rust
use quantiles::ckms::CKMS;
use rand::Rng;
use std::cmp;

fn main() {
    let error = 0.001;
    let n = 10000;
    let mut data = Vec::<u16>::new();
    let mut ckms = CKMS::<u16>::new(error);
    for _ in 0..n {
        let v = rand::thread_rng().gen_range(u16::MIN, u16::MAX);
        data.push(v);
        ckms.insert(v);
    }
    data.sort();
    for p in 0..1001 {
        let q = p as f64 / 1000.0;
        // Nearest-rank index into the sorted data, clamped to at least 1.
        let idx = cmp::max(1, (q * n as f64).ceil() as usize);
        let expected = data[idx - 1];
        let (_rank, actual) = ckms.query(q).unwrap();
        let diff = expected as f64 - actual as f64;
        if diff.abs() > error {
            println!(
                "q {}: expected {} actual {} diff {} > {}",
                q, expected, actual, diff, error
            );
        }
    }
}
```
Excellent. Thank you. I'm going to convert your code here into a test in the suite. Will report back.
This commit demonstrates the error reported by @c0deaddict. In some cases, for reasons unknown, when elements are inserted into the CKMS the error bound is not obeyed. This appears to be related to the total number of elements inserted into the structure and not the points themselves.
I think the issue pointed out in #27 is an error in the user-supplied quantile code. Note that `error_nominal_test` and `error_nominal_with_merge_test` assert that error bounds are maintained without triggering the reported failure. This commit simplifies the supplied test code and indicates that the disagreement happens in the user's quantile calculation. Note that the calculated rank and the user's index are quite different.
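To illustrate the rank-versus-value distinction, here is a plain-Rust sketch (not this library's code; the data and names are made up for illustration) of why a value-based comparison can fail badly while a rank-based one holds:

```rust
// Sketch (not the library's code): CKMS-style summaries bound the error on
// *rank*, not on *value*. For wide-spread data two elements can differ by
// thousands in value while their ranks differ by one, so an assertion like
// `|expected - actual| > error` on raw values is not the paper's guarantee.
fn rank_of(sorted: &[u16], v: u16) -> usize {
    // 1-based rank: the number of elements <= v in the sorted sample.
    sorted.iter().filter(|&&x| x <= v).count()
}

fn main() {
    let sorted = vec![10u16, 1_000, 2_000, 30_000, 65_000];
    let n = sorted.len() as f64;
    let (q, epsilon) = (0.5, 0.2);
    let target_rank = (q * n).ceil(); // rank 3, i.e. the value 2_000
    // 30_000 is 28_000 away from the exact median in value, yet its rank
    // error is only |4 - 3| = 1 <= epsilon * n = 1, so it is an acceptable
    // answer under a rank-based bound.
    let rank_err = (rank_of(&sorted, 30_000) as f64 - target_rank).abs();
    println!("rank error: {} (allowed: {})", rank_err, epsilon * n);
}
```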
@c0deaddict I think the error is in your comparison code. I put together a branch that includes your test code and trims it down some. This is it: Lines 293 to 366 in cde2a22
I first wondered if the reported issue could be triggered without resorting to random data, and it could. I found somewhat accidentally that the issue could be triggered by a drastically smaller number of points, 10 in the above snippet. I went back to the paper and for the data set 0 .. 10 calculated the rank and approximation the paper suggests. That's what these lines are: Lines 320 to 338 in cde2a22
If you run the code this portion of the test will pass. Good sign. At least my understanding of the paper and the implementation correspond. The remaining portion of the test that fails is: Lines 340 to 365 in cde2a22
This fails like so:
The important thing to note here is that Lines 419 to 448 in cde2a22 and Lines 457 to 499 in cde2a22 both assert that the error guarantee holds for randomly generated datasets and queries. If you buy this analysis, I think what today's debugging session suggests to me is:
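To make the rank-based acceptance criterion concrete for the `0..10` dataset mentioned above, here is a sketch in plain Rust (my reading of the paper, with a hypothetical helper name; not code from this repository):

```rust
// Sketch: under a rank-based epsilon guarantee, any element whose 1-based
// rank r satisfies |r - ceil(q * n)| <= epsilon * n is an acceptable answer
// for the quantile q.
fn acceptable(sorted: &[u16], q: f64, epsilon: f64) -> Vec<u16> {
    let n = sorted.len() as f64;
    let target = (q * n).ceil();
    sorted
        .iter()
        .enumerate()
        .filter(|&(i, _)| ((i as f64 + 1.0) - target).abs() <= epsilon * n)
        .map(|(_, &v)| v)
        .collect()
}

fn main() {
    let data: Vec<u16> = (0..10).collect();
    // q = 0.5, epsilon = 0.2: target rank 5, allowed slack 2, so the
    // values at ranks 3 through 7 (i.e. 2, 3, 4, 5, 6) all qualify.
    println!("{:?}", acceptable(&data, 0.5, 0.2));
}
```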
@blt First of all, thanks for your thorough investigation, and apologies for my late reply.
How should the odd/even members be taken into account? There seem to be different methods of calculating percentiles. NumPy, for example, can do linear interpolation when the quantile does not fall at an exact array index: https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L3865-L3969 I think the main question is: what method for calculating exact quantiles did the authors of the paper use?
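Two of the common methods can be sketched side by side (plain Rust, assumed helper names, not from either library), and they disagree whenever the quantile falls between array positions:

```rust
// Sketch of two common "exact" quantile definitions over sorted data. They
// agree only when the quantile lands exactly on an element, which matters
// when checking an approximate summary against an "exact" reference.
fn nearest_rank(sorted: &[f64], q: f64) -> f64 {
    // Nearest-rank: the element at rank ceil(q * n), clamped to [1, n].
    let n = sorted.len();
    let idx = ((q * n as f64).ceil() as usize).clamp(1, n);
    sorted[idx - 1]
}

fn linear_interp(sorted: &[f64], q: f64) -> f64 {
    // NumPy-style "linear" interpolation: interpolate between the two
    // elements bracketing the fractional position q * (n - 1).
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - pos.floor())
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0];
    println!("nearest-rank: {}", nearest_rank(&data, 0.5));
    println!("linear:       {}", linear_interp(&data, 0.5));
}
```

For the median of four elements, nearest-rank picks the element at rank 2 while linear interpolation returns the midpoint of ranks 2 and 3.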
These tests look pretty solid. The only thing that seems strange is that the assertion checks if the actual value
Yes indeed. Is this the paper you based it on: https://ieeexplore.ieee.org/abstract/document/4274974/ Maybe it's a good idea to include a link to it in the source code? Fortunately there are ways to work around the paywall :-)
That would be very useful!
Not a problem! Things take the time they take.
This is a very good question, and not necessarily something I appreciated when I first wrote this library. I won't likely have a chance to get to the paper until at least next weekend, but this does strike me as a very important question to pursue.
Ah, indeed. Yes that is an oversight. That should be corrected. I'd be curious to see what turns up when it is.
It's linked in the README but that's a little distant. What you get in the source documentation is not the best: Lines 6 to 12 in 3f72442
Cool. When I have a little time next I'll fix the test bug you pointed out above and see about including this in the project. Sounds fun.
I'm trying to make Elixir bindings for this library. During testing of the bindings I noticed that in my tests quantile queries were off by more than the error bound (compared to the quantiles computed on a sorted list). Reading through the tests in this project, I noticed that the error bound check is not using `.abs()`: quantiles/src/ckms/mod.rs, Line 376 in 04b0351. The test unfortunately fails when the mentioned line is changed to `(v - percentile(&data, prcnt)).abs() < err`.

Is this the paper on which the algorithm is based? http://www.dimacs.rutgers.edu/~graham/pubs/papers/bq-pods.pdf I tried to read through it to discover what the error bound is, but could not find it.
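The `.abs()` point can be seen with a toy check (a sketch with made-up numbers and helper names, not code from this repository):

```rust
// Sketch: a signed comparison `approx - exact < err` passes whenever the
// approximation undershoots, no matter how badly; only taking the absolute
// value catches violations in both directions.
fn passes_signed(approx: f64, exact: f64, err: f64) -> bool {
    approx - exact < err
}

fn passes_abs(approx: f64, exact: f64, err: f64) -> bool {
    (approx - exact).abs() < err
}

fn main() {
    // A wildly low estimate sails through the signed check (-60.0 < 1.0)
    // but is rejected by the absolute check (60.0 >= 1.0).
    println!("signed: {}", passes_signed(40.0, 100.0, 1.0));
    println!("abs:    {}", passes_abs(40.0, 100.0, 1.0));
}
```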