
Update the way we use K-S tests to assist breakpoint selection #145

Closed
ATimHewson opened this issue Jan 19, 2021 · 12 comments

@ATimHewson

We need a more complete way of using K-S tests to advise the user on potential breakpoints.

The key outputs we would like to send to the screen, to advise the user, are the standard K-S test outputs:

  1. the D-statistic (the maximum vertical separation between the two CDFs), and
  2. the P-value (the probability of seeing a CDF separation at least this large by chance, if both samples were drawn from the same distribution)
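For reference, the two standard K-S outputs can be sketched in pure Python (illustrative only; a real implementation would use `scipy.stats.ks_2samp`, and the p-value below uses the standard asymptotic approximation):

```python
import math

def ks_two_sample(sample1, sample2):
    """Two-sample K-S test: D-statistic plus an asymptotic p-value.

    Pure-Python sketch for illustration; in practice use scipy.stats.ks_2samp.
    """
    x1, x2 = sorted(sample1), sorted(sample2)
    n, m = len(x1), len(x2)
    # D = maximum vertical separation between the two empirical CDFs,
    # evaluated at every observed data point
    d = 0.0
    for x in x1 + x2:
        cdf1 = sum(v <= x for v in x1) / n
        cdf2 = sum(v <= x for v in x2) / m
        d = max(d, abs(cdf1 - cdf2))
    # asymptotic p-value via the Kolmogorov distribution
    # (small-sample correction terms follow Numerical Recipes)
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 1e-8:          # identical samples: separation is certain by chance
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)
```

A small D with a large p-value means the two sides of the candidate breakpoint look statistically alike; a large D with a tiny p-value supports placing a breakpoint there.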

A. We would like to see 1 and 2, from the outset, across a range of possible breakpoints for all the data (this differs from the current behaviour, where the code successively removes left-of-breakpoint data before each re-application of the K-S test).
So the first set of K-S test runs like this might result in the user selecting one breakpoint.

B. After A the user will likely want to repeat the above for all data on one side of the selected breakpoint, to see if another breakpoint could be utilised there. And then they might look at the portion of data that lies on the other side of the first breakpoint in the same way. And then they might want to further divide up in the same way.

To start A we need a way of deciding how many independent K-S tests to run, and a way to output the results. It is difficult to advise on how many tests would be optimal - it is a compromise between run time and dataset size. Perhaps 20 would be a good value to start with (this could perhaps be user-defined?).

Then how do we decide which 20 potential breakpoints to test? There are two main options. One would be to evenly divide up the governing variable range, between max and min, into 21 subsets and use those 20 breakpoints. A second would be to rank all governing variable values and divide into 21 equally sized subsets, and accordingly use the dividing points that correspond; these would of course be less evenly spread than in the first option. Probably the second case is preferable to give a general overview.
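For illustration, the two candidate-selection options could be sketched as follows (hypothetical helper names; small breakpoint counts are used here instead of 20 just to keep the example readable):

```python
def even_range_breakpoints(values, n_breakpoints):
    # Option 1: divide [min, max] of the governing variable into
    # n_breakpoints + 1 equal-width intervals, breakpoints at the divisions
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_breakpoints + 1)
    return [lo + step * i for i in range(1, n_breakpoints + 1)]

def equal_count_breakpoints(values, n_breakpoints):
    # Option 2: rank all values and cut into n_breakpoints + 1 equally
    # sized subsets, i.e. breakpoints at the empirical quantiles
    ranked = sorted(values)
    size = len(ranked)
    return [ranked[size * i // (n_breakpoints + 1)]
            for i in range(1, n_breakpoints + 1)]

# Skewed data shows the difference between the two options:
vals = [0, 1, 2, 3, 4, 1000]
print(even_range_breakpoints(vals, 1))   # midpoint of the value range
print(equal_count_breakpoints(vals, 1))  # median of the observations
```

On skewed data the equal-count option places breakpoints where the observations actually are, which is why it probably gives the better general overview.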

As regards conveying the output of the 20(?) tests, we could have
i) tabular text, and/or
ii) a graph.
The graph is visually more appealing but would be more work. In each case we need to convey what the breakpoint is (in governing variable units), and what the values of 1 and 2 above are. If we went for the graphical option we would need two y-axes: a linear one for the D-statistic and a non-linear one (some form of logarithmic?) for the P-value. If we express the result as a confidence level, values may be capped at 99.99 (2 decimal places), if what I remember from my limited Python experience is correct. This may prove a bit of a limitation, because as I understand it the extremely high confidence levels (i.e. extremely small P-values) which could be useful to us may be beyond standard floating-point precision (!), but let's see.
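As a sketch of the tabular option (a hypothetical formatting helper; the column layout, units, and the choice of -log10(p) as the reported quantity are all illustrative):

```python
import math

def format_ks_table(breakpoints, d_stats, p_values, units="mm"):
    """Hypothetical tabular summary of a batch of K-S runs."""
    lines = [f"{'Breakpoint [' + units + ']':>18} {'D-stat':>8} {'-log10(p)':>10}"]
    for bp, d, p in zip(breakpoints, d_stats, p_values):
        # guard against p-values that have underflowed to zero
        neg_log_p = math.inf if p == 0.0 else -math.log10(p)
        lines.append(f"{bp:>18.2f} {d:>8.3f} {neg_log_p:>10.2f}")
    return "\n".join(lines)

print(format_ks_table([5.0, 10.0], [0.12, 0.31], [1e-4, 0.0]))
```

Reporting -log10(p) rather than p itself keeps very significant results readable in a table, which matters given the concern above about precision.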

With the above implemented the process of semi-subjective but informed selection of breakpoints by the user should be much improved (with clear graphical output as supporting evidence, if we can achieve that).

Priority for this issue should be to get something working with tabular output; the graphical output may take quite a bit longer and so is lower priority unless it can be done quickly.

@ATimHewson ATimHewson added the Science Science discussions about new implementations in the software label Jan 19, 2021
@ATimHewson ATimHewson added this to the Science Discussions milestone Jan 19, 2021
@FatimaPillosu FatimaPillosu added priority 1 STOP. This is preventing EVERYONE from using the software. and removed Science Science discussions about new implementations in the software labels Jan 21, 2021
@FatimaPillosu
Contributor

FatimaPillosu commented Jan 21, 2021

Hi @onyb ,

on the basis of what Tim mentioned, I have created some Matlab code to show you what we might need in Python.
The Matlab code considers all the cases in the node that you are analysing. Then it runs the K-S test for a pre-determined number of breakpoints, determining the K-S D-statistic and the p-value of the test (i.e. the significance level of the K-S test). Notice that the smaller the p-value, the more significant the test result is. Finally, the code plots the D-statistic and -1*log(pvalue) in the same diagram for each of the considered breakpoints (see figure below), using two different y-axis scales. The code provides the layout for the diagram that we might want to have, but if you have better ideas, please feel free to comment on them.

I have included two different cases in the code (case 1: a node at the top of the decision tree; case 2: a node at the bottom of the decision tree), to highlight the differences that you will encounter in these two situations. For case 1, the p-values are so small that, even after applying the log transformation, you cannot see anything; they are just set to 0. However, at the bottom of the tree (where the node has a smaller number of cases), the p-values start increasing and you can see something in the diagram. Matlab applies very high precision, so you can see that the p-values, although very small, are not 0. I don't know how Python will behave here; we need to see.
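On the underflow point: one workaround is to never form the tiny p-value at all, and compute -log(p) directly. A minimal sketch, assuming the leading-term asymptotic of the Kolmogorov distribution, p ≈ 2·exp(-2·λ²) with λ = D·sqrt(n1·n2/(n1+n2)) (a good approximation precisely when p is very small, which is the regime where underflow bites):

```python
import math

def ks_neg_log_pvalue(d_stat, n1, n2):
    """Approximate -log(p) for the two-sample K-S test without underflow.

    Uses only the leading term of the Kolmogorov distribution, so it is
    only reliable for large lambda (very significant results).
    """
    lam = d_stat * math.sqrt(n1 * n2 / (n1 + n2))
    return 2.0 * lam * lam - math.log(2.0)

# p itself would underflow to 0.0 here, but -log(p) is perfectly representable
print(ks_neg_log_pvalue(0.2, 100_000, 100_000))  # ~3999.3
```

This way the top-of-tree nodes would still show distinguishable values on the -log(p) axis instead of collapsing to 0.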

I'm putting this in priority 1 as, from what @ATimHewson told me, @EstiGascon and @AugustinVintzileos need this implementation as soon as possible. Thus, if there is something that is not clear from Tim's explanation or my code, please let's have a meeting as soon as you can. I think it would be quicker to discuss any problems directly rather than via email or via issues.

I'm adding the Matlab code and the diagram for case 2 directly in the issue. I have also added to Google Drive the PDT.csv file that I used to create the plots, so we both start from the same data, and if the results differ we can discuss them.

Cheers,

Fatima


Matlab Code

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                                                               %
% TEST ON NEW WAY ON CREATING THE BREAKPOINTS FOR DECISION TREE %
%                                                               %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% The used K-S function is described here:
% https://uk.mathworks.com/help/stats/kstest2.html#btno0gd-ks2stat

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INPUTS
InputData = 'PDT.mat';
NumBP = 20;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

load(InputData)

% Selection of the decision tree node to analyse and the predictor to consider

% % CASE 1
% predictor = "cpr";
% node = "Top of the tree";
% units = "-";
% VAR = cpr;

% CASE 2
predictor = "sr24h";
node = "(cpr>0.25 & cpr<0.75) & (tp>10 & tp<=inf) & (wspd700>20 & wspd700<=inf) & (cape>-inf & cape<=50)";
units = "J/kg";
VAR = sr24h(cpr>0.25 & cpr<0.75 & tp_acc>10 & tp_acc<=9999 & wspd700>20 & wspd700<=9999 & cape_wa>-9999 & cape_wa<=50);

[VAR, ind] = sort(VAR);
SizeVAR = length(VAR);
Len_SplitsVAR = round(SizeVAR/(NumBP+1));
BPs_ind = (1:NumBP)' * Len_SplitsVAR;
BPs = VAR(BPs_ind);
BPs = unique(BPs);

indBP_1 = min(VAR);
indBP_3 = max(VAR);
pvalue_vec = zeros(length(BPs),1);
Dstat_vec = zeros(length(BPs),1);

for i = 1 : length(BPs)
    
    disp(i)
    
    indBP_2 = BPs(i);
    
    data2test_1 = FER(VAR>=indBP_1 & VAR<indBP_2);
    data2test_2 = FER(VAR>=indBP_2 & VAR<=indBP_3); 
    
    if ~isempty(data2test_1) && ~isempty(data2test_2)
        [h,pvalue,Dstat] = kstest2(data2test_1,data2test_2, 'Alpha', 0.05);
    
        pvalue_vec(i) = pvalue;
        Dstat_vec(i) = Dstat;
    end
    
end

figure
yyaxis left
plot(BPs,Dstat_vec, "o-", "LineWidth", 1.5)
title({["Node"], node})
xlabel(strcat("Breakpoints for ", predictor, " [", units, "]"))
ylabel("K-S Dstat")

yyaxis right
plot(BPs, (-1 * log(pvalue_vec)), "o-", "LineWidth", 1.5)
ylabel("-log(pvalue)")

Example (diagram for case 2)

@FatimaPillosu
Contributor

Here is an idea of how the GUI for the new way of using the K-S test might look.
Scan 28 Jan 2021.pdf

@FatimaPillosu
Contributor

Added the mockup for the new way of running the K-S test.
Mockup.pptx

@AugustinVintzileos

AugustinVintzileos commented Feb 26, 2021 via email

@FatimaPillosu
Contributor

Hi Augustin,
those numbers are totally made up; please don't give them any physical meaning.
They were just there to help Anirudha visualize the GUI for the K-S test.
Cheers,
Fatima

@FatimaPillosu
Contributor

Hi @onyb ,

I'm suggesting the following changes to the GUI for the K-S test:
image

image

@onyb
Contributor

onyb commented Mar 15, 2021

@FatimaPillosu I have a question regarding the bug that's causing spikes in the graph. If I recall correctly, our hypothesis was that we were considering the first and last item in the (sorted) predictor, and agreed to not consider them as breakpoints.

Just wanted to check if my understanding is correct with the following example:

Imagine we have 100 values in our predictor, and want 5 breakpoints. Which of the following sets of breakpoint indexes should be chosen? Note that the indexes range from 0 to 99.

A. [ 0, 25, 50, 74, 99]
Remark: this is the current version


B. [ 1, 25, 50, 74, 98]
Remark: pick 5 breakpoints, excluding first index (0) and last index (99)


C. [16, 33, 50, 66, 82]
Remark: pick 7 breakpoints ([0, 16, 33, 50, 66, 82, 99]), then exclude first index (0) and last index (99)

@FatimaPillosu
Contributor

Hi @onyb ,

the right answer is C. Indeed, you will have:
n. of elements in each class = n. of tot points / (n. of breakpoints + 1) = 100 / (5 + 1) = 16.66666
If you round this number (up or down, it makes little difference) you will get the indexes in C.
There is no need at all to include the min and max value of your dataset in the list of indexes.
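For illustration, this arithmetic can be sketched with a hypothetical helper (note it relies on Python's round-half-to-even behaviour to reproduce the exact indices listed in option C):

```python
def candidate_breakpoint_indices(n_values, n_breakpoints):
    # Evenly space n_breakpoints + 2 candidate indices across 0..n_values-1,
    # then drop the first and last so the min and max values of the dataset
    # are never proposed as breakpoints
    step = (n_values - 1) / (n_breakpoints + 1)
    indices = [round(i * step) for i in range(n_breakpoints + 2)]
    return indices[1:-1]

print(candidate_breakpoint_indices(100, 5))  # [16, 33, 50, 66, 82]
```

This matches option C: 7 evenly spaced points including the endpoints, with index 0 and index 99 excluded.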

Cheers,

Fatima

@FatimaPillosu
Contributor

Hi Anirudha,
The release 0.23.0 looks really good. Just a small point:
in the diagram below, it would be really good to also have the units of the variable on the x-axis.
image

@onyb
Contributor

onyb commented Mar 16, 2021

@FatimaPillosu here is the list of things I did not include in v0.23.0, because they would've otherwise taken me longer to ship the release. Expect these items in the next release:

  • The scale above the figure indicating min and max ranges.
  • Display range of definitive breakpoint selected.
  • Make the < and > buttons selectable, as in the mockups.
  • Make definitive breakpoints editable and add a delete button to each.

Regarding the units on the x-axis, it turned out to be trickier than I thought. In the second module of the software we don't have access to the predictor files, hence it's not possible to extract the units. We can try to read the units from the comments in the ASCII tables, but they are not structured in a way that's easy to parse. It may be slightly easier to do this with Parquet files, which have dedicated storage for metadata where we can keep the units in a structured format.

@ecmwf ecmwf deleted a comment from EstiGascon Mar 17, 2021
onyb added a commit that referenced this issue Mar 28, 2021
@onyb
Contributor

onyb commented Mar 31, 2021

Sneak peek

No breakpoint selected

Screenshot 2021-03-31 at 14 27 15

Primary breakpoint selected (not definitive)

Screenshot 2021-03-31 at 14 28 15

I'm going to keep this issue open until the next release is published.

@onyb
Contributor

onyb commented Apr 6, 2021

Fixed in v0.24.0.

Note: displaying the number of elements for a given definitive breakpoint range was not implemented, since the value must be recomputed whenever the breakpoints are edited. Will tackle this separately.

@onyb onyb closed this as completed Apr 6, 2021