
Update the way we use K-S tests to assist breakpoint selection #145

Closed
ATimHewson opened this issue Jan 19, 2021 · 12 comments

@ATimHewson

We need a more complete way of using K-S tests to advise the user on potential breakpoints.

The key outputs we would like to send to the screen, to advise the user, are the standard K-S test outputs:

  1. the D-statistic (the maximum vertical separation between the two CDFs), and
  2. the P-value (the probability of seeing a CDF separation at least this large by chance, if both samples were drawn from the same distribution)
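For reference, the two standard K-S outputs can be sketched in pure Python (illustrative only; a real implementation would use `scipy.stats.ks_2samp`, and the p-value below uses the standard asymptotic approximation):

```python
import math

def ks_two_sample(sample1, sample2):
    """Two-sample K-S test: D-statistic plus an asymptotic p-value.

    Pure-Python sketch for illustration; in practice use scipy.stats.ks_2samp.
    """
    x1, x2 = sorted(sample1), sorted(sample2)
    n, m = len(x1), len(x2)
    # D = maximum vertical separation between the two empirical CDFs,
    # evaluated at every observed data point
    d = 0.0
    for x in x1 + x2:
        cdf1 = sum(v <= x for v in x1) / n
        cdf2 = sum(v <= x for v in x2) / m
        d = max(d, abs(cdf1 - cdf2))
    # asymptotic p-value via the Kolmogorov distribution
    # (small-sample correction terms follow Numerical Recipes)
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 1e-8:          # identical samples: separation is certain by chance
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)
```

A small D with a large p-value means the two sides of the candidate breakpoint look statistically alike; a large D with a tiny p-value supports placing a breakpoint there.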

A. We would like to see 1 and 2, from the outset, across a range of possible breakpoints for all the data (this differs from the current behaviour, where the code successively removes left-of-breakpoint data before each re-application of the K-S test).
So the first set of K-S test runs like this might result in the user selecting one breakpoint.

B. After A the user will likely want to repeat the above for all data on one side of the selected breakpoint, to see if another breakpoint could be utilised there. And then they might look at the portion of data that lies on the other side of the first breakpoint in the same way. And then they might want to further divide up in the same way.

To start A we need a way of deciding how many independent K-S tests to run, and a way to output the results. It is difficult to advise on how many tests would be optimal - it is a compromise between run time and dataset size. Perhaps 20 would be a good value to start with (this could perhaps be user-defined?).

Then how do we decide which 20 potential breakpoints to test? There are two main options. One would be to evenly divide up the governing variable range, between max and min, into 21 subsets and use those 20 breakpoints. A second would be to rank all governing variable values and divide into 21 equally sized subsets, and accordingly use the dividing points that correspond; these would of course be less evenly spread than in the first option. Probably the second case is preferable to give a general overview.
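For illustration, the two candidate-selection options could be sketched as follows (hypothetical helper names; small breakpoint counts are used here instead of 20 just to keep the example readable):

```python
def even_range_breakpoints(values, n_breakpoints):
    # Option 1: divide [min, max] of the governing variable into
    # n_breakpoints + 1 equal-width intervals, breakpoints at the divisions
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_breakpoints + 1)
    return [lo + step * i for i in range(1, n_breakpoints + 1)]

def equal_count_breakpoints(values, n_breakpoints):
    # Option 2: rank all values and cut into n_breakpoints + 1 equally
    # sized subsets, i.e. breakpoints at the empirical quantiles
    ranked = sorted(values)
    size = len(ranked)
    return [ranked[size * i // (n_breakpoints + 1)]
            for i in range(1, n_breakpoints + 1)]

# Skewed data shows the difference between the two options:
vals = [0, 1, 2, 3, 4, 1000]
print(even_range_breakpoints(vals, 1))   # midpoint of the value range
print(equal_count_breakpoints(vals, 1))  # median of the observations
```

On skewed data the equal-count option places breakpoints where the observations actually are, which is why it probably gives the better general overview.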

As regards conveying the output of the 20(?) tests, we could have
i) tabular text, and/or
ii) a graph.
The graph is visually more appealing but would be more work. In each case we need to convey what the breakpoint is (in governing variable units), and what the values of 1 and 2 above are. If we went for the graphical option we would need two y-axes: a linear one for the D-statistic and a non-linear one (some form of logarithmic?) for the P-value. If we express the result as a confidence level, values may be capped at 99.99 (2 decimal places), if what I remember from my limited Python experience is correct. This may prove a bit of a limitation, because as I understand it the extremely high confidence levels (i.e. extremely small P-values) which could be useful to us may be beyond standard floating-point precision (!), but let's see.
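As a sketch of the tabular option (a hypothetical formatting helper; the column layout, units, and the choice of -log10(p) as the reported quantity are all illustrative):

```python
import math

def format_ks_table(breakpoints, d_stats, p_values, units="mm"):
    """Hypothetical tabular summary of a batch of K-S runs."""
    lines = [f"{'Breakpoint [' + units + ']':>18} {'D-stat':>8} {'-log10(p)':>10}"]
    for bp, d, p in zip(breakpoints, d_stats, p_values):
        # guard against p-values that have underflowed to zero
        neg_log_p = math.inf if p == 0.0 else -math.log10(p)
        lines.append(f"{bp:>18.2f} {d:>8.3f} {neg_log_p:>10.2f}")
    return "\n".join(lines)

print(format_ks_table([5.0, 10.0], [0.12, 0.31], [1e-4, 0.0]))
```

Reporting -log10(p) rather than p itself keeps very significant results readable in a table, which matters given the concern above about precision.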

With the above implemented the process of semi-subjective but informed selection of breakpoints by the user should be much improved (with clear graphical output as supporting evidence, if we can achieve that).

Priority for this issue should be to get something working with tabular output; the graphical output may take quite a bit longer and so is lower priority unless it can be done quickly.

@ATimHewson ATimHewson added the Science Science discussions about new implementations in the software label Jan 19, 2021
@ATimHewson ATimHewson added this to the Science Discussions milestone Jan 19, 2021
@FatimaPillosu FatimaPillosu added priority 1 STOP. This is preventing EVERYONE from using the software. and removed Science Science discussions about new implementations in the software labels Jan 21, 2021
@FatimaPillosu
Contributor

FatimaPillosu commented Jan 21, 2021

Hi @onyb ,

on the basis of what Tim mentioned, I have created some Matlab code to show you what we might need in Python.
The Matlab code considers all the cases in the node that you are analysing. Then it runs the K-S test for a pre-determined number of breakpoints, determining the K-S D-statistic and the p-value of the test (i.e. the significance level of the K-S test). Notice that the smaller the p-value, the more significant the test result is. Finally, the code plots the D-statistic and -1*log(pvalue) in the same diagram for each of the considered breakpoints (see figure below), using two different y-axis scales. The code provides the layout for the diagram that we might want to have, but if you have better ideas, please feel free to comment on them.

I have included two different cases in the code (case 1: a node at the top of the decision tree; case 2: a node at the bottom of the decision tree), to highlight the differences that you will encounter in these two situations. For case 1, the p-values are so small that, even after applying the log transformation, you cannot see anything; they are just set to 0. However, at the bottom of the tree (where the node has a smaller number of cases), the p-values start increasing and you can see something in the diagram. Matlab applies very high precision, so you can see that the p-values, although very small, are not 0. I don't know how Python will behave here; we need to see.
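On the underflow point: one workaround is to never form the tiny p-value at all, and compute -log(p) directly. A minimal sketch, assuming the leading-term asymptotic of the Kolmogorov distribution, p ≈ 2·exp(-2·λ²) with λ = D·sqrt(n1·n2/(n1+n2)) (a good approximation precisely when p is very small, which is the regime where underflow bites):

```python
import math

def ks_neg_log_pvalue(d_stat, n1, n2):
    """Approximate -log(p) for the two-sample K-S test without underflow.

    Uses only the leading term of the Kolmogorov distribution, so it is
    only reliable for large lambda (very significant results).
    """
    lam = d_stat * math.sqrt(n1 * n2 / (n1 + n2))
    return 2.0 * lam * lam - math.log(2.0)

# p itself would underflow to 0.0 here, but -log(p) is perfectly representable
print(ks_neg_log_pvalue(0.2, 100_000, 100_000))  # ~3999.3
```

This way the top-of-tree nodes would still show distinguishable values on the -log(p) axis instead of collapsing to 0.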

I'm putting this in priority 1 as, from what @ATimHewson told me, @EstiGascon and @AugustinVintzileos need this implementation as soon as possible. Thus, if there is something that is not clear from Tim's explanation or my code, please let's have a meeting as soon as you can. I think it would be quicker to discuss any problems directly rather than via email or via issues.

I'm adding the Matlab code and the diagram for case 2 directly in the issue. I have also added to Google Drive the PDT.csv file that I used to create the plots, so we both start from the same data, and if the results differ we can discuss them.

Cheers,

Fatima


Matlab Code

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                                                               %
% TEST ON NEW WAY ON CREATING THE BREAKPOINTS FOR DECISION TREE %
%                                                               %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% The used K-S function is described here:
% https://uk.mathworks.com/help/stats/kstest2.html#btno0gd-ks2stat

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INPUTS
InputData = 'PDT.mat';
NumBP = 20;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

load(InputData)

% Selection of the decision tree node to analyse and the predictor to consider

% % CASE 1
% predictor = "cpr";
% node = "Top of the tree";
% units = "-";
% VAR = cpr;

% CASE 2
predictor = "sr24h";
node = "(cpr>0.25 & cpr<0.75) & (tp>10 & tp<=inf) & (wspd700>20 & wspd700<=inf) & (cape>-inf & cape<=50)";
units = "J/kg";
VAR = sr24h(cpr>0.25 & cpr<0.75 & tp_acc>10 & tp_acc<=9999 & wspd700>20 & wspd700<=9999 & cape_wa>-9999 & cape_wa<=50);

[VAR, ind] = sort(VAR);
SizeVAR = length(VAR);
Len_SplitsVAR = round(SizeVAR/(NumBP+1));
BPs_ind = (1:NumBP)' * Len_SplitsVAR;
BPs = VAR(BPs_ind);
BPs = unique(BPs);

indBP_1 = min(VAR);
indBP_3 = max(VAR);
pvalue_vec = zeros(length(BPs),1);
Dstat_vec = zeros(length(BPs),1);

for i = 1 : length(BPs)
    
    disp(i)
    
    indBP_2 = BPs(i);
    
    data2test_1 = FER(VAR>=indBP_1 & VAR<indBP_2);
    data2test_2 = FER(VAR>=indBP_2 & VAR<=indBP_3); 
    
    if ~isempty(data2test_1) && ~isempty(data2test_2)
        [h,pvalue,Dstat] = kstest2(data2test_1,data2test_2, 'Alpha', 0.05);
    
        pvalue_vec(i) = pvalue;
        Dstat_vec(i) = Dstat;
    end
    
end

figure
yyaxis left
plot(BPs,Dstat_vec, "o-", "LineWidth", 1.5)
title({["Node"], node})
xlabel(strcat("Breakpoints for ", predictor, " [", units, "]"))
ylabel("K-S Dstat")

yyaxis right
plot(BPs, (-1 * log(pvalue_vec)), "o-", "LineWidth", 1.5)
ylabel("-log(pvalue)")

Example (diagram for case 2)

@FatimaPillosu
Contributor

Here is an idea of how the GUI for the new way of using the K-S test might look.
Scan 28 Jan 2021.pdf

@FatimaPillosu
Contributor

Added the mockup for the new way of running the K-S test.
Mockup.pptx

@AugustinVintzileos

AugustinVintzileos commented Feb 26, 2021 via email

@FatimaPillosu
Contributor

Hi Augustin,
those numbers are totally made up; please don't give them any physical meaning.
They were just there to help Anirudha visualize the GUI for the K-S test.
Cheers,
Fatima

@FatimaPillosu
Contributor

Hi @onyb ,

I'm suggesting the following changes to the GUI for the K-S test:
image

image

@onyb
Contributor

onyb commented Mar 15, 2021

@FatimaPillosu I have a question regarding the bug that's causing spikes in the graph. If I recall correctly, our hypothesis was that we were considering the first and last item in the (sorted) predictor, and agreed to not consider them as breakpoints.

Just wanted to check if my understanding is correct with the following example:

Imagine we have 100 values in our predictor, and want 5 breakpoints. Which of the following sets of breakpoint indexes should be chosen? Note that the indexes range from 0 to 99.

A. [ 0, 25, 50, 74, 99]
Remark: this is the current version


B. [ 1, 25, 50, 74, 98]
Remark: pick 5 breakpoints, excluding first index (0) and last index (99)


C. [16, 33, 50, 66, 82]
Remark: pick 7 breakpoints ([0, 16, 33, 50, 66, 82, 99]), then exclude first index (0) and last index (99)

@FatimaPillosu
Contributor

Hi @onyb ,

the right answer is C. Indeed, you will have:
n. of elements in each class = n. of tot points / (n. of breakpoints + 1) = 100 / (5 + 1) = 16.66666
If you round this number (up or down, it makes little difference) you will get the indexes in C.
There is no need at all to include the min and max value of your dataset in the list of indexes.
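For illustration, this arithmetic can be sketched with a hypothetical helper (note it relies on Python's round-half-to-even behaviour to reproduce the exact indices listed in option C):

```python
def candidate_breakpoint_indices(n_values, n_breakpoints):
    # Evenly space n_breakpoints + 2 candidate indices across 0..n_values-1,
    # then drop the first and last so the min and max values of the dataset
    # are never proposed as breakpoints
    step = (n_values - 1) / (n_breakpoints + 1)
    indices = [round(i * step) for i in range(n_breakpoints + 2)]
    return indices[1:-1]

print(candidate_breakpoint_indices(100, 5))  # [16, 33, 50, 66, 82]
```

This matches option C: 7 evenly spaced points including the endpoints, with index 0 and index 99 excluded.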

Cheers,

Fatima

@FatimaPillosu
Contributor

Hi Anirudha,
The release 0.23.0 looks really good. Just a small point:
in the diagram below, it would be really good to also have the units of the variable on the x-axis.
image

@onyb
Contributor

onyb commented Mar 16, 2021

@FatimaPillosu here is the list of things I did not include in v0.23.0, because they would've otherwise taken me longer to ship the release. Expect these items in the next release:

  • The scale above the figure indicating min and max ranges.
  • Display range of definitive breakpoint selected.
  • Make the < and > buttons selectable, as in the mockups.
  • Make definitive breakpoints editable and add a delete button to each.

Regarding the units on the x-axis, it turned out to be trickier than I thought. In the second module of the software we don't have access to the predictor files, hence it's not possible to extract the units. We can try to read the units from the comments in the ASCII tables, but they are not structured in a way that's easy to parse. It may be slightly easier to do this with Parquet files, which have dedicated storage for metadata where we can keep the units in a structured format.

@ecmwf ecmwf deleted a comment from EstiGascon Mar 17, 2021
onyb added a commit that referenced this issue Mar 28, 2021
@onyb
Contributor

onyb commented Mar 31, 2021

Sneak peek

No breakpoint selected

Screenshot 2021-03-31 at 14 27 15

Primary breakpoint selected (not definitive)

Screenshot 2021-03-31 at 14 28 15

I'm going to keep this issue open until the next release is published.

@onyb
Contributor

onyb commented Apr 6, 2021

Fixed in v0.24.0.

Note: displaying the number of elements for a given definitive breakpoint range was not implemented, since the value must be recomputed whenever the breakpoints are edited. Will tackle this separately.

@onyb onyb closed this as completed Apr 6, 2021