Update the way we use K-S tests to assist breakpoint selection #145
Comments
Hi @onyb, based on what Tim mentioned, I have created Matlab code to show you what we might need in Python. The code covers two different cases (case 1: a node at the top of the decision tree; case 2: a node at the bottom of the decision tree), to highlight the differences you will encounter in these two situations. For case 1, the p-values are so small that, even after applying the log transformation, you cannot see anything: they are simply set to 0. However, at the bottom of the tree (where the node has a smaller number of cases), the p-values start increasing and you can then see something in the diagram. Matlab applies very high precision, so you can see that the p-values, although extremely small, are not 0. I don't know how Python will behave here; we need to see.

I'm putting this at priority 1 because, from what @ATimHewson told me, @EstiGascon and @AugustinVintzileos need this implementation as soon as possible. So, if anything is unclear from Tim's explanation or my code, please let's have a meeting as soon as you can; I think it would be quicker to discuss any problems directly rather than via email or via issues.

I'm adding the Matlab code and the diagram for case 2 directly in the issue. I have also added to Google Drive the PDT.csv file that I used to create the plots, so we both start from the same data, and if results differ we can discuss them.

Cheers,
Fatima

Matlab Code
Here is an idea of how the GUI for the new way of using the K-S test might look.
Added the mockup for the new way of running the K-S test.
Hi Augustin,
Hi @onyb, I'm suggesting the following changes to the GUI for the K-S test:
@FatimaPillosu I have a question regarding the bug that's causing spikes in the graph. If I recall correctly, our hypothesis was that we were considering the first and last items in the (sorted) predictor, and we agreed not to consider them as breakpoints. I just wanted to check my understanding with the following example: imagine we have 100 values in our predictor and want 5 breakpoints. Which of the following breakpoint indexes should be chosen? Note that the indexes range from
Hi @onyb, the right answer is C. Indeed, you will have: Cheers, Fatima
@FatimaPillosu I was going to cover the list of things that I did not include in v0.23.0 because it would've otherwise taken me longer to ship this release. Here's the list of items to expect in the next release:
Regarding the units on the x-axis, it turned out to be trickier than I thought. In the second module of the software, we don't have access to the predictor files, hence it's not possible to extract the units. We can try to read the units from the comments in the ASCII tables, but they are not structured in a way that's easy to parse. It may be slightly easier to do this with Parquet files, which have dedicated storage for metadata where we can store the units in a structured format.
Fixed in v0.24.0. Note: displaying the number of elements for a given definitive breakpoint range was not implemented, since the value must be recomputed whenever the breakpoints are edited. Will tackle this separately.
We need a more complete way of using K-S tests to advise the user on potential breakpoints.
The key outputs we would like to send to the screen, to advise the user, are the standard K-S test outputs:
1. the D-statistic
2. the P-value
A. We would like to see 1 and 2, from the outset, across a range of possible breakpoints for all the data (this is different to what is currently done, where the code assumes successive removal of left-of-breakpoint data prior to each re-application of the K-S test).
So the first set of K-S test runs like this might result in the user selecting one breakpoint.
B. After A the user will likely want to repeat the above for all data on one side of the selected breakpoint, to see if another breakpoint could be utilised there. And then they might look at the portion of data that lies on the other side of the first breakpoint in the same way. And then they might want to further divide up in the same way.
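The scan in A (and its reapplication to the subsets in B) might be sketched as follows, using scipy's two-sample K-S test; function and variable names here are illustrative, not from the codebase:

```python
# Sketch: for each candidate breakpoint, split the data by the governing
# variable and run a two-sample K-S test over ALL the data (no successive
# removal of left-of-breakpoint data between tests).
import numpy as np
from scipy.stats import ks_2samp


def ks_scan(predictor, target, breakpoints):
    """Return (breakpoint, D-statistic, p-value) for each candidate."""
    results = []
    for bp in breakpoints:
        left = target[predictor <= bp]
        right = target[predictor > bp]
        if len(left) == 0 or len(right) == 0:
            continue  # degenerate split, nothing to compare
        res = ks_2samp(left, right)
        results.append((bp, res.statistic, res.pvalue))
    return results
```

For step B, the same function would simply be called again on the subset of rows falling on one side of the breakpoint the user selected.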
To start A we need a way of deciding how many independent K-S tests to run, and a way to output the results. It is difficult to advise on how many tests would be optimal - it is a compromise between run time and dataset size. Perhaps 20 would be a good value to start out with (this could perhaps be a user-defined value?).
Then how do we decide which 20 potential breakpoints to test? There are two main options. One would be to evenly divide the governing-variable range, between max and min, into 21 subsets and use those 20 dividing points as breakpoints. A second would be to rank all governing-variable values, divide them into 21 equally sized subsets, and use the corresponding dividing points; these would of course be less evenly spread over the range than in the first option. The second option is probably preferable for giving a general overview.
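The two candidate-selection strategies could be sketched like this (names are illustrative; `n` would be the user-defined count):

```python
# Sketch of the two ways of choosing n candidate breakpoints.
import numpy as np


def candidates_even_range(x, n=20):
    # Option 1: evenly divide [min, max] into n+1 subsets and
    # use the n interior dividing points.
    return np.linspace(x.min(), x.max(), n + 2)[1:-1]


def candidates_equal_count(x, n=20):
    # Option 2: rank the values and take the n quantiles that split
    # the data into n+1 equally populated subsets.
    return np.quantile(x, np.linspace(0.0, 1.0, n + 2)[1:-1])
```

For uniformly distributed data the two give nearly the same points; for skewed data the rank-based candidates crowd into the dense part of the distribution, which is what makes option 2 better for a general overview.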
As regards conveying the output of the 20(?) tests, we could have
i) tabular text, and/or
ii) a graph.
The graph is visually more appealing but would be more work. In each case we need to convey what the breakpoint is (in governing-variable units), and what the values of 1 and 2 above are. If we went for the graphical option we would need two y-axes: a linear one for the D-statistic and a non-linear one for the P-value (some form of logarithmic?). Values for the P-value may be limited to a maximum of 99.99 (2 decimal places), if what I remember from my limited Python experience is correct. This may prove a bit of a limitation, because as I understand it the very extreme P-values that could be useful to us would strictly need a supercomputer to derive (!), but let's see.
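On the precision question: in double precision the extreme p-values underflow to exactly 0 (the behaviour Fatima saw at the top of the tree), so a logarithmic P-value axis needs a floor before the log is taken. A minimal sketch, with an assumed floor value:

```python
# Sketch: prepare K-S p-values for a logarithmic y-axis. Underflowed
# p-values (exactly 0 in double precision) are clipped to a floor so
# log10 stays finite; 1e-300 is an illustrative choice of floor.
import numpy as np


def log10_pvalues(pvalues, floor=1e-300):
    p = np.asarray(pvalues, dtype=float)
    return np.log10(np.clip(p, floor, 1.0))
```

Clipped points could be drawn differently (e.g. as open markers on the axis floor) so the user knows they are "smaller than we can represent" rather than genuinely equal.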
With the above implemented the process of semi-subjective but informed selection of breakpoints by the user should be much improved (with clear graphical output as supporting evidence, if we can achieve that).
Priority for this issue should be to get something working with tabular output; the graphical output may take quite a bit longer and so is lower priority unless it can be done quickly.