Merge pull request #109 from orchardbirds/language_sprucing
Language sprucing
Mateusz Garbacz authored Mar 25, 2021
2 parents 1936ddb + d4a64ba commit 1c61ba3
Showing 23 changed files with 18,994 additions and 238 deletions.
18 changes: 9 additions & 9 deletions CONTRIBUTING.md
@@ -5,7 +5,7 @@ We're very much open to contributions but there are some things to keep in mind:

- Discuss the feature and implementation you want to add on Github before you write a PR for it. On disagreements, maintainer(s) will have the final word.
- Features need a somewhat general usecase. If the usecase is very niche it will be hard for us to consider maintaining it.
- - If you’re going to add a feature consider if you could help out in the maintenance of it.
+ - If you’re going to add a feature, consider if you could help out in the maintenance of it.
- When issues or pull requests are not going to be resolved or merged, they should be closed as soon as possible. This is kinder than deciding this after a long period. Our issue tracker should reflect work to be done.

That said, there are many ways to contribute to probatus, including:
@@ -49,20 +49,20 @@ pre-commit install

### Code structure

- * Model validation modules assume that trained models passed for validation are developed in scikit-learn framework (have predict_proba and other standard functions), or follows scikit-learn API e.g. XGBoost.
- * Every python file used for model validation, needs to be in `/probatus/`
- * Class structure for a given module should have a base class, and specific functionality classes that inherit from base. If a given module implements only single way of computing the output, the base class is not required.
- * Functions should not be as short a possible lines of code. If a lot of code is needed, try to put together snippets of code into
+ * Model validation modules assume that trained models passed for validation are developed in a scikit-learn framework (i.e. have predict_proba and other standard functions), or follow a scikit-learn API e.g. XGBoost.
+ * Every python file used for model validation needs to be in `/probatus/`
+ * Class structure for a given module should have a base class and specific functionality classes that inherit from base. If a given module implements only a single way of computing the output, the base class is not required.
+ * Functions should not be as short as possible in terms of lines of code. If a lot of code is needed, try to put together snippets of code into
other functions. This makes the code more readable, and easier to test.
* Classes follow the probatus API structure:
- * Each class implements fit(), compute() and fit_compute() methods. Fit is used to fit object with provided data (unless no fit is required), and compute calculates the output e.g. DataFrame with report for the user. Lastly, fit_compute applies one after the other.
- * If applicable, plot() method presents user with the appropriate graphs.
- * For compute(), and plot(), check if the object is fitted first.
+ * Each class implements `fit()`, `compute()` and `fit_compute()` methods. `fit()` is used to fit an object with provided data (unless no fit is required), and `compute()` calculates the output e.g. DataFrame with a report for the user. Lastly, `fit_compute()` applies one after the other.
+ * If applicable, the `plot()` method presents the user with the appropriate graphs.
+ * For `compute()` and `plot()`, check if the object is fitted first.


### Documentation

- Documentation is a very crucial part of the project, because it ensures usability of the package. We develop the docs in the following way:
+ Documentation is a very crucial part of the project because it ensures usability of the package. We develop the docs in the following way:

* We use [mkdocs](https://www.mkdocs.org/) with [mkdocs-material](https://squidfunk.github.io/mkdocs-material/) theme. The `docs/` folder contains all the relevant documentation.
* We use `mkdocs serve` to view the documentation locally. Use it to test the documentation everytime you make any changes.
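To make the fit/compute/fit_compute/plot convention described in the class-structure bullets above concrete, here is a minimal, hedged sketch using `ShapRFECV` from `probatus.feature_elimination`. The exact constructor arguments are assumptions based on the probatus docs of this era and should be checked against the installed version.

```python
# Minimal sketch of the probatus API convention (fit / compute / fit_compute / plot)
# using ShapRFECV; argument names are assumptions, verify against your probatus version.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from probatus.feature_elimination import ShapRFECV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
clf = RandomForestClassifier(random_state=42)

shap_elimination = ShapRFECV(clf, step=0.2, cv=5, scoring="roc_auc", random_state=42)
report = shap_elimination.fit_compute(X, y)  # fit() and compute() applied one after the other
shap_elimination.plot()                      # where applicable, plot() presents the graphs
```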
2 changes: 1 addition & 1 deletion docs/api/feature_elimination.md
@@ -1,6 +1,6 @@
# Features Elimination

- This module allows to apply features elimination.
+ This module allows us to apply features elimination.


::: probatus.feature_elimination.feature_elimination
2 changes: 1 addition & 1 deletion docs/api/imputation_selector.md
@@ -1,6 +1,6 @@
# Imputation Selector

- This module allows to select imputation strategies.
+ This module allows us to select imputation strategies.


::: probatus.missing_values.imputation
8 changes: 4 additions & 4 deletions docs/api/metric_volatility.md
@@ -1,14 +1,14 @@
# Metric Volatility

- The aim of this module is analysis of how well a model performs on a given dataset, and how stable the performance is.
+ The aim of this module is the analysis of how well a model performs on a given dataset, and how stable the performance is.

The following features are implemented:

- - **TrainTestVolatility** - Estimation of volatility of metrics. The estimation is done by splitting the data into train and test multiple times and training and scoring a model based on these metrics.
+ - **TrainTestVolatility** - Estimation of the volatility of metrics. The estimation is done by splitting the data into train and test multiple times and training and scoring a model based on these metrics.

- - **SplitSeedVolatility** - Estimates volatility of metrics based on splitting the data into train and test sets multiple times randomly, each time with different seed.
+ - **SplitSeedVolatility** - Estimates the volatility of metrics based on splitting the data into train and test sets multiple times randomly, each time with a different seed.

- - **BootstrappedVolatility** - Estimates volatility of metrics based on splitting the data into train and test with static seed, and bootstrapping train and test set.
+ - **BootstrappedVolatility** - Estimates the volatility of metrics based on splitting the data into train and test with static seed, and bootstrapping the train and test set.


::: probatus.metric_volatility.volatility
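A rough usage sketch of the first class listed above, assuming it follows the standard probatus fit_compute/plot pattern. The constructor arguments shown (`metrics`, `iterations`) are assumptions based on this description rather than a verified signature.

```python
# Hedged sketch of TrainTestVolatility; `metrics` and `iterations` are assumed
# parameter names - check the API reference rendered from
# probatus.metric_volatility.volatility before relying on them.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from probatus.metric_volatility import TrainTestVolatility

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = pd.DataFrame(X)
clf = RandomForestClassifier(random_state=42)

volatility = TrainTestVolatility(clf, metrics="roc_auc", iterations=50, random_state=42)
report = volatility.fit_compute(X, y)  # repeated train/test splits, each scored
volatility.plot()                      # distributions of the train and test metric
```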
4 changes: 2 additions & 2 deletions docs/api/model_interpret.md
@@ -1,7 +1,7 @@
# Model Interpretation using SHAP

- The aim of this module is providing tools for model interpretation using [SHAP](https://shap.readthedocs.io/en/latest/) library.
- The class below is a convenience wrapper, that implements multiple plots for tree-based & linear models.
+ The aim of this module is to provide tools for model interpretation using the [SHAP](https://shap.readthedocs.io/en/latest/) library.
+ The class below is a convenience wrapper that implements multiple plots for tree-based & linear models.

::: probatus.interpret.model_interpret
::: probatus.interpret.shap_dependence
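As an illustration of the wrapper described above, a hedged sketch is given below. The class name `ShapModelInterpreter`, the `fit_compute(X_train, X_test, y_train, y_test)` signature and the `"importance"` plot type follow the probatus docs as remembered, so treat them as assumptions to verify.

```python
# Hedged sketch of the SHAP-based interpretation wrapper; names and signatures
# are assumptions to be verified against probatus.interpret.model_interpret.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from probatus.interpret import ShapModelInterpreter

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

shap_interpreter = ShapModelInterpreter(clf)
feature_importance = shap_interpreter.fit_compute(X_train, X_test, y_train, y_test)
shap_interpreter.plot("importance")  # feature importance plot; other plot types exist
```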
8 changes: 4 additions & 4 deletions docs/api/sample_similarity.md
@@ -2,17 +2,17 @@

The goal of sample similarity module is understanding how different two samples are from a multivariate perspective.

- One of the ways to indicate that is Resemblance Model. Having two datasets say X1 and X2, one can analyse how easy is it to recognize which dataset a randomly selected row comes from. The Resemblance model assigns label 0 to X1 dataset, and label 1 to X2 and trains a binary classification model that to predict, which sample a given row comes from.
- By looking at the test AUC, one can conclude that the samples have different distribution the AUC is significantly higher than 0.5. Further, by analysing feature importance one can understand, which of the features have predictive power.
+ One of the ways to indicate this is Resemblance Model. Having two datasets - say X1 and X2 - one can analyse how easy it is to recognize which dataset a randomly selected row comes from. The Resemblance model assigns label 0 to the dataset X1, and label 1 to X2 and trains a binary classification model to predict which sample a given row comes from.
+ By looking at the test AUC, one can conclude that the samples have a different distribution if the AUC is significantly higher than 0.5. Furthermore, by analysing feature importance one can understand which of the features have predictive power.

<img src="../img/resemblance_model_schema.png"/>


The following features are implemented:

- - **SHAPImportanceResemblance (Recommended)** - The class applies SHAP library, in order to interpret the tree based resemblance model model.
+ - **SHAPImportanceResemblance (Recommended)** - The class applies SHAP library, in order to interpret the tree based resemblance model.

- - **PermutationImportanceResemblance** - The class applies permutation feature importance, in order to understand, which features does the current model rely the most on. The higher the importance of the feature, the more a given feature possibly differs in X2 compared to X1. The importance indicates how much the test AUC drops if a given feature is permuted.
+ - **PermutationImportanceResemblance** - The class applies permutation feature importance in order to understand which features the current model relies on the most. The higher the importance of the feature, the more a given feature possibly differs in X2 compared to X1. The importance indicates how much the test AUC drops if a given feature is permuted.


::: probatus.sample_similarity.resemblance_model
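A hedged sketch of the recommended resemblance workflow described above. The `fit_compute(X1, X2)` call and the plotting step follow the general probatus API pattern; exact argument names and return values are assumptions.

```python
# Hedged sketch of SHAPImportanceResemblance; verify argument names against
# probatus.sample_similarity.resemblance_model before use.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from probatus.sample_similarity import SHAPImportanceResemblance

# Two samples to compare, e.g. an old and a new snapshot of the data.
X1 = pd.DataFrame(make_classification(n_samples=500, n_features=10, random_state=1)[0])
X2 = pd.DataFrame(make_classification(n_samples=500, n_features=10, random_state=2)[0])

clf = RandomForestClassifier(random_state=42)
rm = SHAPImportanceResemblance(clf)

# Internally label 0 is assigned to X1 and label 1 to X2; a test AUC well above 0.5
# suggests the two samples come from different distributions.
report = rm.fit_compute(X1, X2)
rm.plot()
```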
2 changes: 1 addition & 1 deletion docs/api/stat_tests.md
@@ -1,5 +1,5 @@
# Statistical Tests

- This module allows to apply different statistical tests.
+ This module allows us to apply different statistical tests.

::: probatus.stat_tests.distribution_statistics
2 changes: 1 addition & 1 deletion docs/api/utils.md
@@ -1,6 +1,6 @@
# Utility Functions

- This module contains various smaller functionalities, that can be used across the `probatus` package.
+ This module contains various smaller functionalities that can be used across the `probatus` package.

::: probatus.utils.scoring

20 changes: 10 additions & 10 deletions docs/howto/reproducibility.ipynb
@@ -18,17 +18,17 @@
"- Inputs of `probatus` modules,\n",
"- The `random_state` of `probatus` modules.\n",
"\n",
- "Below sections cover how to ensure reproducibility of the results, by controling these aspects\n",
+ "The below sections cover how to ensure reproducibility of the results by controlling these aspects.\n",
"\n",
"## Inputs of probatus modules\n",
"\n",
"There are various parameters that modules of probatus take as input. Below we will cover the most often occurring ones.\n",
"\n",
"### Static dataset\n",
"\n",
- "When using `probatus`, one of the most crucial aspects is the provided dataset. Therefore, the first thing to do, is to ensure that the passed dataset does not change along the way. \n",
+ "When using `probatus`, one of the most crucial aspects is the provided dataset. Therefore, the first thing to do is to ensure that the passed dataset does not change along the way. \n",
"\n",
- "Below is a code snipped of random data preparation. In sklearn, you can ensure this by setting `random_state` parameter. Possibly in your projects, you will use a different dataset, however, always make sure that the input data is static."
+ "Below is a code snippet of random data preparation. In sklearn, you can ensure this by setting the `random_state` parameter. You will probably use a different dataset in your projects, but always make sure that the input data is static."
]
},
{
@@ -50,9 +50,9 @@
"\n",
"Whenever you split the data in any way, you need to make sure that the splits are always the same. \n",
"\n",
- "If you use `train_test_split` functionality from sklearn, this can be enforced by setting the `random_state` parameter. \n",
+ "If you use the `train_test_split` functionality from sklearn, this can be enforced by setting the `random_state` parameter. \n",
"\n",
- "Another crucial aspect, is how you use the `cv` parameter, which defines the folds settings that you will use in the experiments. If the `cv` is set to integer, you don't need to worry about it, the `random_state` of `probatus` will take care of it. However, if you want to pass a custom cv generator object, you have to set the `random_state` there as well.\n",
+ "Another crucial aspect is how you use the `cv` parameter, which defines the folds settings that you will use in the experiments. If the `cv` is set to an integer, you don't need to worry about it - the `random_state` of `probatus` will take care of it. However, if you want to pass a custom cv generator object, you have to set the `random_state` there as well.\n",
"\n",
"Below are some examples of static splits:"
]
@@ -80,7 +80,7 @@
"source": [
"### Static classifier\n",
"\n",
- "Most of `probatus` modules work with the provided classifiers. Whenever, one needs to provide a not fitted classifier, it is enough to set the `random_state`. However, if the classifier needs to be fitted beforehand, you have to make sure that the model training is reproducible as well."
+ "Most of `probatus` modules work with the provided classifiers. Whenever one needs to provide a not-fitted classifier, it is enough to set the `random_state`. However, if the classifier needs to be fitted beforehand, you have to make sure that the model training is reproducible as well."
]
},
{
@@ -100,7 +100,7 @@
"source": [
"### Static search CV for hyperparameter tuning\n",
"\n",
- "Some of the modules e.g. `ShapRFECV`, allow you to perform optimization of the model. Whenever, you use such functionality, make sure that the these classes have set `random_state`. This way, in every round of optimization, you will explore the same set of parameters permutations. In case the search space is also generated based on randomness, make sure that the `random_state` is set to it as well."
+ "Some of the modules, e.g. `ShapRFECV`, allow you to perform optimization of the model. Whenever you use such functionality, make sure that these classes have the `random_state` set. This way, in every round of optimization, you will explore the same set of parameter permutations. In case the search space is also generated based on randomness, make sure that the `random_state` is set there as well."
]
},
{
@@ -129,7 +129,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Before running `probatus` modules think about the inputs, and consider if there is any other type of randomness involved. If there is, one option to possibly solve the issue is setting the random seed at the beginning of the code"
+ "Before running `probatus` modules think about the inputs, and consider if there is any other type of randomness involved. If there is, one option to possibly solve the issue is setting the random seed at the beginning of the code."
]
},
{
@@ -149,7 +149,7 @@
"source": [
"## Reproducibility in probatus\n",
"\n",
- "Most of the modules in `probatus` allow you to set the `random_state`. This setting essentially, makes sure that any code that the functions operate on, has a static flow. So as long as set it and you ensure all other inputs do not cause an additional fluctuations between runs, you can make sure that your results are reproducible"
+ "Most of the modules in `probatus` allow you to set the `random_state`. This setting essentially makes sure that any code that the functions operate on has a static flow. As long as it is set and you ensure all other inputs do not cause additional fluctuations between runs, you can make sure that your results are reproducible."
]
},
{
@@ -299,4 +299,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
- }
+ }
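Condensing the notebook's advice above: pin the `random_state` of the data, the split, the folds, the classifier, and the probatus module itself. A minimal sketch follows; the probatus call in the final comment is only an illustrative example.

```python
# Minimal sketch of the reproducibility checklist: every source of randomness
# gets an explicit random_state.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)  # static dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  # static split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)             # static folds
clf = RandomForestClassifier(random_state=42)                               # static classifier

# Any probatus module used afterwards should receive random_state as well,
# e.g. ShapRFECV(clf, cv=cv, random_state=42) - shown only as an illustration.
```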
8 changes: 4 additions & 4 deletions docs/tutorials/nb_binning.ipynb
@@ -99,7 +99,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The `SimpleBucketer` object creates binning of the values in `x` into equally sized bins. The attributes `counts`, the number of elements per bin, and `boundaries`, the actual boundaries that resulted from the binning strategy are assigned to the object instance. In this example we choose to get 4 bins:"
+ "The `SimpleBucketer` object creates binning of the values of `x` into equally sized bins. The attributes `counts`, the number of elements per bin, and `boundaries`, the actual boundaries that resulted from the binning strategy, are assigned to the object instance. In this example we choose to get 4 bins:"
]
},
{
@@ -589,9 +589,9 @@
"metadata": {},
"source": [
"Comparing the `TreeBucketer` and the `QuantileBucketer` (the dots compare the average distribution of class 1 in the bin): <br>\n",
- "Each buckets obtained by the `TreeBucketer` follow the probability distribution (ie the entries in the bucket have the same probability of being class 1). <br>\n",
+ "Each bucket obtained by the `TreeBucketer` follows the probability distribution (i.e. the entries in the bucket have the same probability of being class 1). <br>\n",
"On the contrary, the `QuantileBucketer` splits the values below 4 in 6 buckets, which all have the same probability of being class 1.<br>\n",
- "Note also that the tree is grown with the maximum depth of 4, which potentially let's it grow up to 16 buckets ($2^4$).<br>\n",
+ "Note also that the tree is grown with the maximum depth of 4, which potentially lets it grow up to 16 buckets ($2^4$).<br>\n",
"\n",
"The learned tree is visualized below, where the splitting according to the step function is visualized clearly.\n",
"\n"
@@ -643,4 +643,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
- }
+ }
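For reference, a hedged sketch of the bucketers discussed in this notebook. The `bin_count` argument and the `counts`/`boundaries` attributes follow the notebook text; exact names may differ between probatus versions, so treat them as assumptions.

```python
# Hedged sketch of the probatus bucketers; attribute and argument names follow
# the notebook text above and should be verified against probatus.binning.
import numpy as np
from probatus.binning import QuantileBucketer, SimpleBucketer

x = np.random.default_rng(42).normal(size=1000)

simple = SimpleBucketer(bin_count=4)      # 4 equally sized (equal-width) bins
simple.fit(x)
print(simple.counts)      # number of elements per bin
print(simple.boundaries)  # boundaries resulting from the binning strategy

quantile = QuantileBucketer(bin_count=4)  # 4 bins with roughly equal counts
quantile.fit(x)
print(quantile.boundaries)
```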