
Error bound check for CKMS appears to be off #27

Open
c0deaddict opened this issue Jul 28, 2020 · 6 comments
@c0deaddict

c0deaddict commented Jul 28, 2020

I'm trying to make Elixir bindings for this library. During testing of the bindings I noticed that in my test quantile queries were off by more than the error bound (compared to the quantiles on a sorted list). Reading through the tests in this project, I noticed that the error bound check is not using .abs():

(v - percentile(&data, prcnt)) < err,

The test unfortunately fails when the mentioned line is changed to (v - percentile(&data, prcnt)).abs() < err.

Is this the paper on which the algorithm is based: http://www.dimacs.rutgers.edu/~graham/pubs/papers/bq-pods.pdf? I tried to read through it to work out what the error bound is, but could not find it.

@blt
Collaborator

blt commented Jul 28, 2020

Is it possible for you to share your test code, or at least the data you're using?

Regarding the paper, that's the one per this discussion, but back when I originally implemented this library the version available freely online had a bug in its algorithm. I know the IEEE version is accurate per discussion with Graham Cormode.

If you can share your test data I'll try and reproduce the issue in this library's test code.

@c0deaddict
Author

My Elixir code is a mess at the moment, but I wrote a small test in Rust that shows what I mean. The program below works fine for n = 100 or n = 1000, but fails for n = 10000 or bigger. It does not seem to depend on the random seed.

What I'm unsure about is how to correctly compute the exact quantile: should the index be ceil, floor, or round?

use quantiles::ckms::CKMS;
use rand::Rng;
use std::cmp;

fn main() {
    let error = 0.001;
    let n = 10000;

    let mut data = Vec::<u16>::new();
    let mut ckms = CKMS::<u16>::new(error);

    for _ in 0..n {
        let v = rand::thread_rng().gen_range(u16::MIN, u16::MAX);
        data.push(v);
        ckms.insert(v);
    }

    data.sort();

    for p in 0..1001 {
        let q = p as f64 / 1000.0;
        let idx = cmp::max(1, (q * n as f64).ceil() as usize);
        let expected = data[idx - 1];
        let (_rank, actual) = ckms.query(q).unwrap();
        let diff = expected as f64 - actual as f64;
        if diff.abs() > error {
            println!(
                "q {}: expected {} actual {} diff {} > {}",
                q, expected, actual, diff, error
            );
        }
    }
}

@blt
Collaborator

blt commented Aug 1, 2020

Excellent. Thank you. I'm going to convert your code here into a test in the suite. Will report back.

blt added a commit that referenced this issue Aug 2, 2020
This commit demonstrates the error reported by @c0deaddict. In some
cases, for reasons unknown, when elements are inserted into the CKMS
the error bound is not obeyed. This appears to be related to the total
number of elements inserted into the structure and not the points
themselves.
blt added a commit that referenced this issue Aug 2, 2020
I think the issue pointed out in #27 is an error in the user-supplied
quantile code. Note that `error_nominal_test` and
`error_nominal_with_merge_test` assert that error bounds are maintained
without triggering the reported failure. This commit simplifies the
supplied test code and indicates that the disagreement happens in the
user quantile calculation. Note that the calculated rank and user index
are quite different.
@blt
Collaborator

blt commented Aug 2, 2020

@c0deaddict I think the error is in your comparison code. I put together a branch that includes your test code and trims it down some. This is it:

quantiles/src/ckms/mod.rs

Lines 293 to 366 in cde2a22

#[cfg(test)]
mod test {
    use super::*;
    use crate::ckms::store::invariant;
    use quickcheck::{QuickCheck, TestResult};
    use std::cmp;
    use std::cmp::Ordering;
    use std::f64::consts::E;

    fn percentile(data: &[f64], prcnt: f64) -> f64 {
        let idx = (prcnt * (data.len() as f64)) as usize;
        data[idx]
    }

    #[test]
    fn test_insert() {
        let error = 0.001;
        let total_points = 10;
        let mut data = Vec::<u16>::new();
        let mut ckms = CKMS::<u16>::new(error);
        for v in 0..total_points {
            data.push(v);
            ckms.insert(v);
        }
        assert_eq!(ckms.query(0.0), Some((1, 0)));
        assert_eq!(ckms.query(0.001), Some((1, 0)));
        assert_eq!(ckms.query(0.01), Some((1, 0)));
        assert_eq!(ckms.query(0.1), Some((1, 0)));
        assert_eq!(ckms.query(0.101), Some((1, 0)));
        assert_eq!(ckms.query(0.14), Some((1, 0)));
        assert_eq!(ckms.query(0.145), Some((1, 0)));
        assert_eq!(ckms.query(0.146), Some((1, 0)));
        assert_eq!(ckms.query(0.15), Some((2, 1)));
        assert_eq!(ckms.query(0.199), Some((2, 1)));
        assert_eq!(ckms.query(0.2), Some((2, 1)));
        assert_eq!(ckms.query(0.3), Some((3, 2)));
        assert_eq!(ckms.query(0.4), Some((4, 3)));
        assert_eq!(ckms.query(0.5), Some((5, 4)));
        assert_eq!(ckms.query(0.6), Some((6, 5)));
        assert_eq!(ckms.query(0.7), Some((7, 6)));
        assert_eq!(ckms.query(0.8), Some((8, 7)));
        assert_eq!(ckms.query(0.9), Some((9, 8)));
        assert_eq!(ckms.query(1.0), Some((10, 9)));
        for q in &[
            0.0, 0.001, 0.01, 0.1, 0.101, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
        ] {
            let idx = cmp::max(1, (q * total_points as f64).ceil() as usize);
            let expected = data[idx - 1];
            let (rank, actual) = ckms.query(*q).unwrap();
            let diff = if expected > actual {
                expected - actual
            } else {
                actual - expected
            };
            let result = (diff as f64).partial_cmp(&error);
            assert!(result.is_some());
            assert_ne!(
                result,
                Some(Ordering::Greater),
                "total_points: {} | idx: {} | (rank, actual): ({}, {}) | quantile: {} | {} > {}",
                total_points,
                idx,
                rank,
                actual,
                q,
                diff,
                error
            );
        }
    }

I first wondered if the reported issue could be triggered without resorting to random data, and it could. I found somewhat accidentally that the issue could be triggered by a drastically smaller number of points, 10 in the above snippet. I went back to the paper and for the data set 0 .. 10 calculated the rank and approximation the paper suggests. That's what these lines are:

quantiles/src/ckms/mod.rs

Lines 320 to 338 in cde2a22

assert_eq!(ckms.query(0.0), Some((1, 0)));
assert_eq!(ckms.query(0.001), Some((1, 0)));
assert_eq!(ckms.query(0.01), Some((1, 0)));
assert_eq!(ckms.query(0.1), Some((1, 0)));
assert_eq!(ckms.query(0.101), Some((1, 0)));
assert_eq!(ckms.query(0.14), Some((1, 0)));
assert_eq!(ckms.query(0.145), Some((1, 0)));
assert_eq!(ckms.query(0.146), Some((1, 0)));
assert_eq!(ckms.query(0.15), Some((2, 1)));
assert_eq!(ckms.query(0.199), Some((2, 1)));
assert_eq!(ckms.query(0.2), Some((2, 1)));
assert_eq!(ckms.query(0.3), Some((3, 2)));
assert_eq!(ckms.query(0.4), Some((4, 3)));
assert_eq!(ckms.query(0.5), Some((5, 4)));
assert_eq!(ckms.query(0.6), Some((6, 5)));
assert_eq!(ckms.query(0.7), Some((7, 6)));
assert_eq!(ckms.query(0.8), Some((8, 7)));
assert_eq!(ckms.query(0.9), Some((9, 8)));
assert_eq!(ckms.query(1.0), Some((10, 9)));

If you run the code, this portion of the test will pass. Good sign: at least my understanding of the paper and the implementation correspond. The remaining portion of the test, the part that fails, is:

quantiles/src/ckms/mod.rs

Lines 340 to 365 in cde2a22

for q in &[
    0.0, 0.001, 0.01, 0.1, 0.101, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
] {
    let idx = cmp::max(1, (q * total_points as f64).ceil() as usize);
    let expected = data[idx - 1];
    let (rank, actual) = ckms.query(*q).unwrap();
    let diff = if expected > actual {
        expected - actual
    } else {
        actual - expected
    };
    let result = (diff as f64).partial_cmp(&error);
    assert!(result.is_some());
    assert_ne!(
        result,
        Some(Ordering::Greater),
        "total_points: {} | idx: {} | (rank, actual): ({}, {}) | quantile: {} | {} > {}",
        total_points,
        idx,
        rank,
        actual,
        q,
        diff,
        error
    );
}

This fails like so:

---- ckms::test::test_insert stdout ----
thread 'ckms::test::test_insert' panicked at 'assertion failed: `(left != right)`
  left: `Some(Greater)`,
 right: `Some(Greater)`: total_points: 10 | idx: 2 | (rank, actual): (1, 0) | quantile: 0.101 | 1 > 0.001', src/ckms/mod.rs:353:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    ckms::test::test_insert

The important thing to note here is that the idx disagrees with the rank. If you adjust from ceil to floor then the test will pass, but only incidentally. Quantile calculation like this requires you to take into account whether the dataset has an odd or even number of members, which the calculation in your example doesn't do. If you adjust to 11 total samples -- and comment out the hand-calculation bit -- then you'll find the test will pass. I believe this means that the calculation you've supplied is faulty. I feel pretty confident about that after having gone back over the paper, plus remembering that the tests

quantiles/src/ckms/mod.rs

Lines 419 to 448 in cde2a22

#[test]
fn error_nominal_test() {
    fn inner(mut data: Vec<f64>, prcnt: f64) -> TestResult {
        data.sort_by(|a, b| a.partial_cmp(b).unwrap());
        if !between_inclusive(prcnt, 0.0, 1.0) || data.is_empty() {
            return TestResult::discard();
        }
        let err = 0.001;
        let mut ckms = CKMS::<f64>::new(err);
        for d in &data {
            ckms.insert(*d);
        }
        if let Some((_, v)) = ckms.query(prcnt) {
            debug_assert!(
                (v - percentile(&data, prcnt)) < err,
                "v: {} | percentile: {} | prcnt: {} | data: {:?}",
                v,
                percentile(&data, prcnt),
                prcnt,
                data
            );
            TestResult::passed()
        } else {
            TestResult::failed()
        }
    }
    QuickCheck::new().quickcheck(inner as fn(Vec<f64>, f64) -> TestResult);
}

and

quantiles/src/ckms/mod.rs

Lines 457 to 499 in cde2a22

#[test]
fn error_nominal_with_merge_test() {
    fn inner(lhs: Vec<f64>, rhs: Vec<f64>, prcnt: f64, err: f64) -> TestResult {
        if !between_inclusive(prcnt, 0.0, 1.0)
            || !between_inclusive(err, 0.0, 1.0)
            || (lhs.len() + rhs.len()) < 1
            || lhs.is_empty()
            || rhs.is_empty()
        {
            return TestResult::discard();
        }
        let mut data = lhs.clone();
        data.append(&mut rhs.clone());
        data.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let err = 0.001;
        let mut ckms = CKMS::<f64>::new(err);
        for d in &lhs {
            ckms.insert(*d);
        }
        let mut ckms_rhs = CKMS::<f64>::new(err);
        for d in &rhs {
            ckms_rhs.insert(*d);
        }
        ckms += ckms_rhs;
        if let Some((_, v)) = ckms.query(prcnt) {
            debug_assert!(
                (v - percentile(&data, prcnt)) < err,
                "v: {} | percentile: {} | prcnt: {} | data: {:?}",
                v,
                percentile(&data, prcnt),
                prcnt,
                data
            );
            TestResult::passed()
        } else {
            TestResult::failed()
        }
    }
    QuickCheck::new().quickcheck(inner as fn(Vec<f64>, Vec<f64>, f64, f64) -> TestResult);
}

both assert that the error guarantee holds for randomly generated datasets and queries. If you buy this analysis, what today's debugging session suggests to me is:

  1. That the paper laying out the data structure is behind a paywall is a severe impediment to trust in the structure, and
  2. this library would be greatly improved by a reference quantile structure that consumes arbitrarily large space but returns exact results for queries.

@c0deaddict
Author

@blt First of all thanks for your thorough investigation, and apologies for my late reply.

The important things to note here is that the idx disagrees with the rank. If you adjust from ceil to floor then the test will pass, but that's only incidentally. Quantile calculation like this requires you to take into account whether the dataset has an odd or even number of members, which the calculation in your example doesn't do. If you adjust to 11 total samples -- and comment out the hand-calculation bit -- then you'll find the test will pass. I believe what this means is that the calculation you've supplied is faulty.

How should the odd/even number of members be taken into account? Using ceil or floor for one or the other doesn't give the right results either. I get the best results by using round.

There seem to be different methods of calculating percentiles. NumPy, for example, can do linear interpolation if the quantile does not fall on an exact array index: https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L3865-L3969 I think the main question is: what method for calculating exact quantiles did the authors of the paper use?

I feel pretty confident about that after having gone back over the paper, plus remembering that the tests
both assert that the error guarantee holds, for randomly generated datasets and queries.

These tests look pretty solid. The only thing that seems strange is that the assertion only checks that the actual value v does not exceed the expected value percentile(&data, prcnt) by more than err; it could still happen that v is smaller than percentile(&data, prcnt) by more than err. For example, if v = 80 and percentile(&data, prcnt) = 100, this would slip through, but it is an error, right?

If you buy this analysis I think what today's debugging session suggests to me is:

1. That the paper that lays out the data structure is behind a paywall is a severe impediment to trust of the structure and

Yes indeed. Is this the paper you based it on: https://ieeexplore.ieee.org/abstract/document/4274974/? Maybe it's a good idea to include a link to it in the source code?

Fortunately there are ways to work around the paywall :-)

2. this library would be greatly improved with a reference quantile structure that consumed arbitrarily large space but returned exact results for queries.

That would be very useful!

@blt
Collaborator

blt commented Aug 9, 2020

@blt First of all thanks for your thorough investigation, and apologies for my late reply.

Not a problem! Things take the time they take.

How should the odd/even members be taken into account? Using ceil or floor for one or the other doesn't give right results either. I get the best results by using round.

There seem to be different methods of calculating percentiles. Numpy for example can do a linear interpolation if the quantile is not at an exact array index: https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L3865-L3969 I think the main question is: what method for calculating exact quantiles did the authors of the paper use?

This is a very good question, and not necessarily something I appreciated when I first wrote this library. I won't likely have a chance to get to the paper until at least next weekend, but this does strike me as a very important question to pursue.

These tests look pretty solid. Only thing that seems strange is that the assertion checks if the actual value v is not bigger than by more than err than the expected value percentile(&data, prcnt), but it could happen that v is smaller than percentile(&data, prcnt) by more than err. For example if v = 80 and percentile(&data, prcnt) = 100, this would slip through but is an error, right?

Ah, indeed. Yes, that is an oversight and should be corrected. I'd be curious to see what turns up when it is.

Yes indeed. Is this the paper you based it on: https://ieeexplore.ieee.org/abstract/document/4274974/ Maybe it's a good idea to include a link to it in the source code?

It's linked in the README but that's a little distant. What you get in the source documentation is not the best:

//! As of this writing you _must_ use the presentation in the IEEE version of
//! the paper. The authors' self-published copy of the paper is incorrect and
//! this implementation will _not_ make sense if you follow along using that
//! version. Only the 'full biased' invariant is used. The 'targeted quantiles'
//! variant of this algorithm is fundamentally flawed, an issue which the
//! authors correct in their "Space- and Time-Efficient Deterministic Algorithms
//! for Biased Quantiles over Data Streams"

That would be very useful!

Cool. When I have a little time next I'll fix the test bug you pointed out above and see about including this in the project. Sounds fun.
