Repo cleanup (#62)

* main readme
* download data
* split data
* train model readme
* evaluate
* interpret model
* validate model
* single cell images
* Update 2.train_model/README.md
  Co-authored-by: Jenna Tomkinson <[email protected]>
* Update 5.validate_model/README.md
  Co-authored-by: Jenna Tomkinson <[email protected]>
* Update 2.train_model/README.md
  Co-authored-by: Jenna Tomkinson <[email protected]>
* Update README.md
  Co-authored-by: Jenna Tomkinson <[email protected]>
* Update README.md
  Co-authored-by: Jenna Tomkinson <[email protected]>
* Update README.md
  Co-authored-by: Jenna Tomkinson <[email protected]>

---------

Co-authored-by: Gregory Way <[email protected]>
Co-authored-by: Jenna Tomkinson <[email protected]>
WayScience · Mar 5, 2024 · 08a3e8d
1 parent 8487535
commit 08a3e8d
Showing 8 changed files with 31 additions and 26 deletions.
diff --git a/0.download_data/README.md b/0.download_data/README.md
@@ -1,14 +1,14 @@
 # Download Data
 
-In this module, a labeled dataset is downloaded from `mitocheck_data`.
+In this module, we download labeled datasets from `mitocheck_data`.
 
 ### Download/Preprocess Data
 
-Complete instructions for data download and preprocessing can be found at: https://github.com/WayScience/mitocheck_data
+Complete instructions for data download and preprocessing are located at: https://github.com/WayScience/mitocheck_data
 
 ### Usage
 
-In this repository, all labeled data is downloaded from a version controlled [mitocheck_data](https://github.com/WayScience/mitocheck_data).
+In this repository, we download all labeled data from a version controlled [mitocheck_data](https://github.com/WayScience/mitocheck_data).
 We specify the path to each set of `mitocheck_data` with `labeled_data_paths` in [download_data.ipynb](download_data.ipynb).
 
 ### Data Preview

diff --git a/1.split_data/README.md b/1.split_data/README.md
@@ -5,8 +5,7 @@ In this module, we split the training data into training and testing datasets.
 Data is split into subsets in [split_data.ipynb](split_data.ipynb).
 The testing dataset is determined by randomly sampling 15% (stratified by phenotypic class) of the single-cell dataset.
 The training dataset is the subset remaining after the testing samples are removed.
-Sample indexes associated with training and testing subsets are stored in [indexes/](indexes/).
-Sample indexes are later used to load subsets from labeled data in [0.download_data/data/](../0.download_data/data/).
+We store sample indexes associated with training and testing subsets in [indexes/](indexes/), and we later use these sample indexes to load subsets from labeled data in [0.download_data/data/](../0.download_data/data/).
 
 ## Step 1: Split Data
 

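Reviewer note on the split described in `1.split_data/README.md`: the sketch below illustrates what sampling 15% of cells stratified by phenotypic class could look like. It uses only the Python standard library, and the function name, seed, and toy labels are assumptions made for illustration; it is not the code in `split_data.ipynb`.

```python
import random
from collections import defaultdict


def stratified_test_indexes(labels, test_fraction=0.15, seed=0):
    """Return sorted test indexes, sampling test_fraction of each class."""
    rng = random.Random(seed)

    # Group sample indexes by their phenotypic class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    test_indexes = []
    for label in sorted(by_class):
        members = by_class[label]
        # Round to the nearest whole cell, keeping at least one per class.
        n_test = max(1, round(len(members) * test_fraction))
        test_indexes.extend(rng.sample(members, n_test))
    return sorted(test_indexes)


# Hypothetical toy labels: 100 cells across three phenotypic classes.
labels = ["anaphase"] * 40 + ["metaphase"] * 40 + ["apoptosis"] * 20

test_indexes = stratified_test_indexes(labels)
test_set = set(test_indexes)
# The training subset is whatever remains after removing the test samples.
train_indexes = [i for i in range(len(labels)) if i not in test_set]
```

Storing only the index lists, as the README describes for `indexes/`, lets later modules reload the same subsets from the labeled data without duplicating the feature tables.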
diff --git a/2.train_model/README.md b/2.train_model/README.md
@@ -4,7 +4,7 @@ In this module, we train ML models to predict phenotypic class from cell feature
 
 In [train_multi_class_models.ipynb](train_multi_class_models.ipynb), we train models to predict the phenotypic class of the cell features from 15 possible classes (anaphase, metaphase, apoptosis, etc).
 In [train_single_class_models.ipynb](train_single_class_models.ipynb), we train models to predict whether the cell features are from a particular phenotypic class or not.
-This means a set of models are made to predict a "yes" or "no" for each of the 15 phenotypic classes used in the multi-class models. 
+Thus, we create a set of models to predict a "yes" or "no" for each of the 15 phenotypic classes used in the multi-class models. 
 
 We use [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for our machine learning models.
 We use the following parameters for each logistic regression model:
@@ -26,13 +26,13 @@ We search over the following parameters: `[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.
 We search over the following parameters: `[1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]`
 
 We train models for each combination of the following model, feature, balance, and dataset types:
-- model_types: final, shuffled_baseline
+- model_types: `final`, `shuffled_baseline`
     - Which version of features the model is trained on. For `shuffled_baseline`, each column of the feature data is shuffled independently to create a shuffled baseline for comparison.
-- feature_types: CP, DP, CP_and_DP, CP_zernike_only, CP_areashape_only
-    - Which features to use for trainining.
-- balance_types: balanced, unbalanced
+- feature_types: `CP`, `DP`, `CP_and_DP`, `CP_zernike_only`, `CP_areashape_only`
+    - Which features to use for training.
+- balance_types: `balanced`, `unbalanced`
     - Whether or not to balance `class_weight` of each model when training.
-- dataset_types: ic, no_ic
+- dataset_types: `ic`, `no_ic`
     - Which `mitocheck_data` dataset to use for feature training. We have datasets extracted with and without illumination correction.
 
 The notebooks save each model in [models/](models/).
@@ -54,4 +54,4 @@ bash train_model.sh
 
 ## Results
 
-The weighted F1 score of the best estimators for the grid searches can be found in [train_model.ipynb](train_model.ipynb).
+The weighted F1 scores of the best estimators for the grid searches are located in [train_model.ipynb](train_model.ipynb).
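
Reviewer note on the model combinations listed in `2.train_model/README.md`: the sketch below enumerates the combinations of the four type lists and shows one way to shuffle each feature column independently for a `shuffled_baseline`. The type values come from the README; the helper name `shuffle_columns`, the seed, and the toy feature rows are assumptions for illustration, not the notebooks' implementation.

```python
import itertools
import random

model_types = ["final", "shuffled_baseline"]
feature_types = ["CP", "DP", "CP_and_DP", "CP_zernike_only", "CP_areashape_only"]
balance_types = ["balanced", "unbalanced"]
dataset_types = ["ic", "no_ic"]

# One entry per (model_type, feature_type, balance_type, dataset_type).
combinations = list(
    itertools.product(model_types, feature_types, balance_types, dataset_types)
)


def shuffle_columns(rows, seed=0):
    """Shuffle every column of a row-major table independently."""
    rng = random.Random(seed)
    columns = [list(column) for column in zip(*rows)]
    for column in columns:
        # Each column gets its own permutation, which breaks the link
        # between features and labels while keeping per-feature values.
        rng.shuffle(column)
    return [list(row) for row in zip(*columns)]


# Hypothetical feature table: four cells, two features.
features = [[1, 10], [2, 20], [3, 30], [4, 40]]
shuffled = shuffle_columns(features)
```

Shuffling columns independently keeps each feature's marginal distribution intact, so a model trained on the shuffled table gives a baseline for what is achievable without real feature-to-label structure.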
diff --git a/3.evaluate_model/README.md b/3.evaluate_model/README.md
@@ -3,7 +3,7 @@
 In this module, we evaluate the final and shuffled baseline ML models.
 
 After training the models in [2.train_model](../2.train_model/), we use these models to predict the labels of the training and testing datasets and evaluate their predictive performance.
-We evaluate each model for each combination of model type (final, shuffled baseline), feature type (CP, DP, CP_and_DP), balance type (balanced, unbalanced), dataset type (ic, no_ic) and dataset (train, test).
+We evaluate each model for each combination of model type (`final`, `shuffled_baseline`), feature type (`CP`, `DP`, `CP_and_DP`, `CP_zernike_only`, `CP_areashape_only`), balance type (`balanced`, `unbalanced`), dataset type (`ic`, `no_ic`) and dataset (`train`, `test`).
 See [2.train_model/README.md](../2.train_model/README.md) for more information on model combinations.
 
 In [get_model_predictions.ipynb](get_model_predictions.ipynb), we derive the predicted and true phenotypic class for each model, feature type, and dataset combination.
@@ -12,7 +12,7 @@ These predictions are saved in [predictions](predictions/).
 In [confusion_matrices.ipynb](confusion_matrices.ipynb), we evaluate these sets of predictions with a confusion matrix to see the true/false positives and negatives (see [sklearn.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for more details).
 The confusion matrix data are saved to [confusion_matrices](evaluations/confusion_matrices/).
 
-In [F1_scores.ipynb](F1_scores.ipynb), we evaluate each model (final, shuffled baseline) trained with each feature type (CP, DP, CP_and_DP) on each dataset (train, test, etc) to determine phenotypic and weighted f1 scores.
+In [F1_scores.ipynb](F1_scores.ipynb), we evaluate each model to determine phenotypic and weighted f1 scores.
 The F1 score measures each model's precision and recall performance for each phenotypic class (see [sklearn.metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) for more details).
 The f1 score data are saved to [F1_scores](evaluations/F1_scores/).
 
@@ -21,7 +21,7 @@ These PR curves are created for each label type of the logistic regression model
 E.g. each multi-class model has 15 labels (1 for each phenotypic class) and 15 PR curves while each single-class model has 2 labels (positive and negative label for its respective phenotype) and 2 PR curves.
 The precision recall curves and their data are saved to [precision_recall_curves](evaluations/precision_recall_curves/).
 
-In [get_LOIO_probabilities.ipynb](get_LOIO_probabilities.ipynb), we use the optimal hyperparameters from each final logistic regression model (DP, CP, CP_and_DP) to fit and evaluate new models in a Leave One Image Out (LOIO) fashion.
+In [get_LOIO_probabilities.ipynb](get_LOIO_probabilities.ipynb), we use the optimal hyperparameters from each final logistic regression model to fit and evaluate new models in a Leave One Image Out (LOIO) fashion.
 These optimal hyperparameters are found with Grid Search Cross Validation in [train_model.ipynb](../2.train_model/train_model.ipynb) and are saved with model data in [models/](../2.train_model/models/).
 LOIO evaluation gives an idea of how well the model will perform on cells that are in an image the model has never seen before.
 If the model performs well in LOIO evaluation, we can be confident it will generalize well to images it has never seen before.
@@ -32,7 +32,7 @@ The LOIO evaluation procedure is as follows:
     - Train a logistic regression model with optimal hyperparameters (`C` and `l1_ratio`) determined for a particular model in [train_model.ipynb](../2.train_model/train_model.ipynb) on every cell that is **not** in the specific image.
     - Predict probabilities on every cell that **is** in the specific image.
 
-These probabilities are saved to [LOIO_probas](evaluations/LOIO_probas/).
+We save these probabilities to [LOIO_probas](evaluations/LOIO_probas/).
 
 **Notes:** 
 1) Intermediate `.tsv` data are stored in tidy format, a standardized data structure (see [Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham for more details).
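
Reviewer note on the LOIO procedure in `3.evaluate_model/README.md`: the sketch below shows only the splitting logic, where every image is held out once and all remaining cells form the training set. The function name and the toy image identifiers are assumptions for illustration; model fitting is left as a comment because the actual code lives in `get_LOIO_probabilities.ipynb`. scikit-learn's `LeaveOneGroupOut` produces the same kind of split when the image identifier is passed as the group.

```python
def leave_one_image_out(image_ids):
    """Yield (held_out_image, train_indexes, test_indexes) for each image."""
    for held_out in sorted(set(image_ids)):
        # Train on every cell that is NOT in the held-out image.
        train = [i for i, image in enumerate(image_ids) if image != held_out]
        # Predict on every cell that IS in the held-out image.
        test = [i for i, image in enumerate(image_ids) if image == held_out]
        yield held_out, train, test


# Hypothetical image identifier for each of six cells.
image_ids = ["img_A", "img_A", "img_B", "img_C", "img_C", "img_C"]

splits = list(leave_one_image_out(image_ids))
for held_out, train, test in splits:
    # In the pipeline, a logistic regression model with the stored `C` and
    # `l1_ratio` would be fit on `train` and would predict probabilities
    # for the cells in `test`.
    print(held_out, "train:", train, "test:", test)
```

Each cell appears in exactly one test set, so the collected probabilities cover the whole dataset while never coming from a model that saw the cell's own image.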

diff --git a/4.interpret_model/README.md b/4.interpret_model/README.md
@@ -22,7 +22,7 @@ The correlations for each pair of coefficient vectors are displayed above their
 
 ## Results
 
-Each model's interpretations can be found in [interpret_model_coefficients.ipynb](interpret_model_coefficients.ipynb).
+Each model's interpretations are located in [interpret_model_coefficients.ipynb](interpret_model_coefficients.ipynb).
 
 **Notes:** 
 1) Intermediate `.tsv` data are stored in tidy format, a standardized data structure (see [Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham for more details).

diff --git a/5.validate_model/README.md b/5.validate_model/README.md
@@ -11,11 +11,11 @@ The Cell Health dataset has cell painting images across 119 CRISPR guide perturb
 More information regarding the generation of this dataset can be found at https://github.com/broadinstitute/cell-health.
 
 In [Cell-Health-Data/4.classify-features](https://github.com/WayScience/cell-health-data/tree/master/4.classify-features), we use the trained models to determine phenotypic class probabilities for each of the Cell Health cells.
-These probabilities are averaged across CRISPR guide/cell line to create 357 *classifiction profiles* (119 CRISPR guides x 3 cell lines).
+We average these probabilities across CRISPR guide/cell line to create 357 *classification profiles* (119 CRISPR guides x 3 cell lines).
 
-As part of [Predicting cell health phenotypes using image-based morphology profiling](https://www.molbiolcell.org/doi/10.1091/mbc.E20-12-0784), Way et al derived cell health indicators.
+Way et al. derived cell health indicators as part of [Predicting cell health phenotypes using image-based morphology profiling](https://www.molbiolcell.org/doi/10.1091/mbc.E20-12-0784).
 These indicators consist of 70 specific cell health phenotypes including proliferation, apoptosis, reactive oxygen species, DNA damage, and cell cycle stage.
-These indicators are averaged across across CRISPR guide/cell line to create 357 [*Cell Health label profiles*](https://github.com/broadinstitute/cell-health/blob/master/1.generate-profiles/data/consensus/cell_health_median.tsv.gz).
+Way et al. averaged these indicators across CRISPR guide/cell line to create 357 [*Cell Health label profiles*](https://github.com/broadinstitute/cell-health/blob/master/1.generate-profiles/data/consensus/cell_health_median.tsv.gz).
 
 We use [pandas.DataFrame.corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) to find the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the *classification profiles* and the *Cell Health label profiles*.
 The Pearson correlation coefficient measures the linear relationship between two datasets, with correlations of -1/+1 implying exact linear inverse/direct relationships respectively.
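
Reviewer note on the correlation step in `5.validate_model/README.md`: the README computes correlations with `pandas.DataFrame.corr`; the standard-library sketch below computes the same Pearson coefficient by hand to make the -1/+1 interpretation concrete. The function name and the toy profile values are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances, without the shared 1/n factor (it cancels).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: one phenotype probability across four perturbations,
# compared with two made-up cell health indicators.
classification_profile = [0.1, 0.2, 0.3, 0.4]
direct_indicator = [1.0, 2.0, 3.0, 4.0]
inverse_indicator = [4.0, 3.0, 2.0, 1.0]

r_direct = pearson(classification_profile, direct_indicator)
r_inverse = pearson(classification_profile, inverse_indicator)
```

Here `r_direct` is approximately +1 (exact direct linear relationship) and `r_inverse` is approximately -1 (exact inverse linear relationship), up to floating-point error.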

diff --git a/6.single_cell_images/README.md b/6.single_cell_images/README.md
@@ -4,7 +4,7 @@ In this module, we use the model on single-cell images to clearly demonstrate it
 
 ## Single-Cell Sample Image Dataset
 
-The [single-cell sample image data](mitocheck_single_cell_sample_images) have kindly been provided by Dr. Thomas Walter of the MitoCheck consortium.
+Dr. Thomas Walter of the MitoCheck consortium kindly provided the [single-cell sample image data](mitocheck_single_cell_sample_images).
 This dataset contains sample single-cell images in the following format:
 
 ```
@@ -16,7 +16,7 @@ mitocheck_single_cell_sample_images
 
 ```
 
-Because the features for these cells have already been extracted in [`mitocheck_data`](https://github.com/WayScience/mitocheck_data), we do not re-extract features from these images in this module.
+Because we already extracted the features for these cells in [`mitocheck_data`](https://github.com/WayScience/mitocheck_data), we do not re-extract features from these images in this module.
 Instead, features are associated with a single-cell image based on the cell's location metadata (plate, well, frame, x, y).
 
 ## Top 5 Performing Classes

diff --git a/README.md b/README.md
@@ -36,11 +36,11 @@ The repository structure is as follows:
 | :---- | :----- | :---------- |
 | [0.download_data](0.download_data/) | Download training data | Download labeled single-cell dataset from [mitocheck_data](https://github.com/WayScience/mitocheck_data) |
 | [1.split_data](1.split_data/) | Create data subsets | Create training and testing data subsets |
-| [2.train_model](2.train_model/) | Train model | Train ML models on training data subset and shuffled baseline training dataset |
+| [2.train_model](2.train_model/) | Train model | Train ML models on combinations of features, data subsets, balance types, model types |
 | [3.evaluate_model](3.evaluate_model/) | Evaluate model | Evaluate ML models on all data subsets |
-| [4.interpret_model](4.interpret_model/) | Interpret model | Interpret ML models |
-| [5.validate_model](5.validate_model/) | Validate model | Validate ML models |
-| [6.single_cell_images](6.single_cell_images/) | Single Cell Images | View single cell images and model interpretation |
+| [4.interpret_model](4.interpret_model/) | Interpret model | Interpret ML model coefficients |
+| [5.validate_model](5.validate_model/) | Validate model | Validate ML models on other datasets |
+| [6.single_cell_images](6.single_cell_images/) | Single cell images | View single cell images and model interpretation |
 | [7.figures](7.figures/) | Figures | Create paper-worthy figures |
 
 ## Data
@@ -49,6 +49,10 @@ Specific data download/preprocessing instructions are available at: https://gith
 This repository downloads labeled single-cell data from a specific version of the [mitocheck_data](https://github.com/WayScience/mitocheck_data) repository.
 For more information see [0.download_data/](0.download_data/).
 
+We use the following 2 datasets from the `mitocheck_data` repository:
+- `ic`: single-cell nuclei features extracted after performing illumination correction on images
+- `no_ic`: single-cell nuclei features extracted without performing illumination correction on images
+
 ### Supplementary Table 1 - Full list of JUMP-CP phenotype enrichment
 
 We report the top 100 most enriched treatments per phenotype in Supplementary Table 1 of our paper.
@@ -72,6 +76,8 @@ We use [seaborn](https://seaborn.pydata.org/) for data visualization.
 
 All parts of the machine learning pipeline are completed with the following feature types:
 - `CP`: Use only CellProfiler features from `MitoCheck` labeled cells
+- `CP_zernike_only`: Use only CellProfiler Zernike shape features from `MitoCheck` labeled cells
+- `CP_areashape_only`: Use only CellProfiler AreaShape features from `MitoCheck` labeled cells
 - `DP`: Use only DeepProfiler features from `MitoCheck` labeled cells
 - `CP_and_DP`: Use CellProfiler and DeepProfiler features from `MitoCheck` labeled cells