Merge branch 'dev'

BesenbacherLab · Dec 7, 2024 · 5b68a04 · 5b68a04
2 parents a7ee9de + 53ab1b0
commit 5b68a04
Show file tree

Hide file tree

Showing 43 changed files with 3,820 additions and 816 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,26 @@
 # Change log
 
+## 1.1.0
+
+This release adds multiple CLI commands that:
+
+1) allow reproducing results from the article and seeing the effect of adding your own datasets:
+
+ - Adds `lionheart cross_validate` command. Perform nested leave-one-dataset-out cross-validation on your dataset(s) and/or the included features.
+ - Adds `lionheart validate` command. Validate a model on the included external dataset or a custom dataset.
+ - Adds `lionheart evaluate_univariates` command. Evaluate each feature (cell-type) separately on your dataset(s) and/or the included features.
+
+2) expands what you can do with your own data:
+
+ - Adds `lionheart customize_thresholds` command. Calculate the ROC curve and probability densities (for deciding probability thresholds) on your data and/or the included features for a custom model or an included model. Allows using probability thresholds suited to your own data when using `lionheart predict_sample` and `lionheart validate`.
+ - Adds `--custom_threshold_dirs` argument in `lionheart predict_sample`. Allows passing the ROC curves and probability densities extracted with `lionheart customize_thresholds`.
+
+Also:
+
+ - Adds `matplotlib` as dependency.
+ - Bumps `generalize` dependency requirement to `0.2.1`.
+ - Bumps `utipy` dependency requirement to `1.0.3`.
+
 ## 1.0.2
 
  - Fixes bug when training model on a single dataset.

diff --git a/README.md b/README.md
@@ -10,8 +10,13 @@ Preprint: https://www.medrxiv.org/content/10.1101/2024.11.26.24317971v1
 
 The code was developed and implemented by [@ludvigolsen](https://github.com/LudvigOlsen).
 
+If you experience an issue, please [report it](https://github.com/BesenbacherLab/lionheart/issues).
+
+
 ## Installation
 
+This section describes the installation of `lionheart` and the custom version of `mosdepth` (exp. time: <10m). The code has only been tested on linux but should also work on Mac and Windows.
+
 Install the main package:
 
 ```
@@ -39,7 +44,7 @@ $ curl https://nim-lang.org/choosenim/init.sh -sSf | sh
 
 # Add to PATH
 # Change the path to fit with your system
-# Tip: Consider adding it to the terminal configuration file (e.g. ~/.bashrc)
+# Tip: Consider adding it to the terminal configuration file (e.g., ~/.bashrc)
 $ export PATH=/home/<username>/.nimble/bin:$PATH
 
 # Install and use nim 1.6.4 
@@ -62,6 +67,22 @@ $ wget https://zenodo.org/records/14215762/files/inference_resources_v002.tar.gz
 $ tar -xvzf inference_resources_v002.tar.gz 
 ```
 
+## Main commands
+
+This section describes the commands in `lionheart` and lists their *main* output files:
+
+| Command                          | Description                                                         | Main Output                                                                         |
+| :------------------------------- | :------------------------------------------------------------------ | :---------------------------------------------------------------------------------- |
+| `lionheart extract_features`     | Extract features from a BAM file.                                   | `feature_dataset.npy` and correction profiles                                       |
+| `lionheart predict_sample`       | Predict cancer status of a sample.                                  | `prediction.csv`                                                                    |
+| `lionheart collect`              | Collect predictions and/or features across samples.                 | `predictions.csv`, `feature_dataset.npy`, and correction profiles *for all samples* |
+| `lionheart customize_thresholds` | Extract ROC curve and more for using custom probability thresholds. | `ROC_curves.json` and `probability_densities.csv`                                   |
+| `lionheart cross_validate`       | Cross-validate the model on new data and/or the included features.  | `evaluation_summary.csv`,  `splits_summary.csv`                                     |
+| `lionheart train_model`          | Train a model on your own data and/or the included features.        | `model.joblib` and training data results                                            |
+| `lionheart validate`             | Validate a model on a validation dataset.                           | `evaluation_scores.csv` and `predictions.csv`                                       |
+| `lionheart evaluate_univariates` | Evaluate the cancer detection potential of each feature separately. | `univariate_evaluations.csv`                                                        |
+
+
 ## Examples
 
 ### Run via command-line interface
@@ -77,9 +98,9 @@ $ lionheart -h
 
 # Extract feature from a given BAM file
 # `mosdepth_path` is the path to the customized `mosdepth` installation
-# E.g. "/home/<username>/mosdepth/mosdepth"
+# E.g., "/home/<username>/mosdepth/mosdepth"
 # `ld_library_path` is the path to the `lib` folder in the conda environment
-# E.g. "/home/<username>/anaconda3/envs/lionheart/lib/"
+# E.g., "/home/<username>/anaconda3/envs/lionheart/lib/"
 $ lionheart extract_features --bam_file {bam_file} --resources_dir {resources_dir} --out_dir {out_dir} --mosdepth_path {mosdepth_path} --ld_library_path {ld_library_path} --n_jobs {cores}
 
 # `sample_dir` is the `out_dir` of `extract_features`
@@ -88,6 +109,7 @@ $ lionheart predict_sample --sample_dir {sample_dir} --resources_dir {resources_
 
 After running these commands for a set of samples, you can use `lionheart collect` to collect features and predictions across the samples. You can then use `lionheart train_model` to train a model on your own data (and optionally the included features).
 
+
 ### Via `gwf` workflow
 
 We provide a simple workflow for submitting jobs to slurm via the `gwf` package. Make a copy of the `workflow` directory, open `workflow.py`, change the paths and list the samples to run `lionheart` on.
@@ -124,3 +146,62 @@ $ gwf run
 $ gwf status
 $ gwf status -f summary
 ```
+
+### Reproduction of results
+
+This section shows how to reproduce the main results (cross-validation and external validation) from the paper. It uses the included features so the reproduction can be run without access to the raw sequencing data.
+
+Note that different compilations of scikit-learn on different operating systems may lead to slightly different results. On linux, the results should match the reported results.
+
+#### Cross-validation analysis
+
+We start by performing the nested leave-one-dataset-out cross-validation analysis from Figure 3A (not including the benchmarks).
+
+Note that the default settings are the ones used in the paper.
+
+```
+# Perform the cross-validation
+# {cv_out_dir} should specify where you want the output files
+$ lionheart cross_validate --out_dir {cv_out_dir} --resources_dir {resources_dir} --use_included_features --num_jobs 10
+```
+
+The output directory should now include multiple files. The main results are in `evaluation_summary.csv` and `splits_summary.csv`. Note that the results are given for multiple probability thresholds. The threshold reported in the paper is the "Max. J Threshold". You can extract the relevant lines of the summaries with:
+
+```
+$ awk 'NR==1 || /Average/ && /J Threshold/' {cv_out_dir}/evaluation_summary.csv
+$ awk 'NR==1 || /Average/ && /J Threshold/' {cv_out_dir}/splits_summary.csv
+```
+
+#### External validation analysis
+
+To reproduce the external validation, we first train a model on all the included training datasets and then validate it on the included validation dataset:
+
+```
+# Train a model on the included datasets
+# {new_model_dir} should specify where you want the model files
+$ lionheart train_model --out_dir {new_model_dir} --resources_dir {resources_dir} --use_included_features
+
+# Validate the model on the included validation dataset
+# {val_out_dir} should specify where you want the output files
+$ lionheart validate --out_dir {val_out_dir} --resources_dir {resources_dir} --model_dir {new_model_dir} --use_included_validation --thresholds 'max_j'
+```
+
+The model training creates the `model.joblib` file along with predictions and evaluations from the *training data* (e.g., `predictions.csv`, `evaluation_scores.csv`, and `ROC_curves.json`).
+
+The validation creates `evaluation_scores.csv` and `predictions.csv` from applying the model on the validation dataset. You will find the reported AUC score in `evaluation_scores.csv`:
+
+```
+$ cat {val_out_dir}/evaluation_scores.csv
+```
+
+#### Univariate analyses
+
+Finally, we reproduce the univariate modeling evaluations in Figure 2D and 2E:
+
+```
+# Evaluate the classification potential of each cell type separately
+# {univariates_dir} should specify where you want the evaluation files
+$ lionheart evaluate_univariates --out_dir {univariates_dir} --resources_dir {resources_dir} --use_included_features --num_jobs 10
+```
+
+This creates the `univariate_evaluations.csv` file with evaluation metrics per cell-type. There are coefficients and p-values (bonferroni-corrected) from univariate logistic regression models and evaluation metrics from per-cell-type leave-one-dataset-out cross-validation.
diff --git a/conftest.py b/conftest.py
@@ -18,6 +18,10 @@ def _run_cli(
         if output_dir_arg not in command_args:
             command_args.extend([output_dir_arg, str(output_dir)])
 
+        command_args = [str(cmd) for cmd in command_args]
+
+        print(" ".join(command_args))
+
         # Run the command
         result = subprocess.run(
             command_args,

diff --git a/environment.yml b/environment.yml
@@ -16,10 +16,11 @@ dependencies:
   - scipy=1.7.3
   - statsmodels=0.14.1
   - seaborn=0.13.2
+  - matplotlib=3.9.1
   - pip:
     - nattrs==0.2.2
     - pytest==7.1.2
     - scikit-learn==1.0.2
-    - utipy==1.0.2
+    - utipy==1.0.3
     - poetry==1.8.2
     - generalize
diff --git a/lionheart/cli.py b/lionheart/cli.py
@@ -1,17 +1,20 @@
 import argparse
 from lionheart.commands import (
     collect_samples,
+    customize_thresholds,
     extract_features,
     predict,
     train_model,
     validate,
     cross_validate,
+    evaluate_univariates,
     guides,
 )
 from lionheart.utils.cli_utils import (
     LION_ASCII,
     LIONHEART_ASCII,
     LIONHEART_STRING,
+    README_STRING,
     CustomRichHelpFormatter,
     wrap_command_description,
 )
@@ -34,6 +37,8 @@ def main():
 Start by <b>extracting</b> the features from a BAM file (hg38 only). Then <b>predict</b> whether a sample is from a cancer patient or not.
 
 Easily <b>train</b> a new model on your own data or perform <b>cross-validation</b> to compare against the paper.
+
+{README_STRING}
         """,
         formatter_class=CustomRichHelpFormatter,
     )
@@ -81,51 +86,87 @@ def main():
         "collect",
         help="Collect predictions and/or features across samples",
         description=wrap_command_description(
-            "COLLECT predictions and/or extracted features for multiple samples."
+            "COLLECT predictions and/or extracted features for multiple samples. "
+            "\nCollecting the features creates a 'dataset' that can be used in "
+            "other `lionheart` commands."
         ),
         formatter_class=parser.formatter_class,
     )
     # Delegate the argument setup to the respective command module
     collect_samples.setup_parser(parser_cl)
 
     # Command 4
+    parser_er = subparsers.add_parser(
+        "customize_thresholds",
+        help="Extract ROC curve and probability densitities for using custom probability thresholds",
+        description=wrap_command_description(
+            "Extract the ROC curve and probability densities for a model's predictions on one or more datasets. "
+            "\nThis allows using probability thresholds fitted to your own data in "
+            "`lionheart predict_sample` and `lionheart validate`."
+        ),
+        formatter_class=parser.formatter_class,
+        epilog=customize_thresholds.EPILOG,
+    )
+    # Delegate the argument setup to the respective command module
+    customize_thresholds.setup_parser(parser_er)
+
+    # # Command 5
+    parser_cv = subparsers.add_parser(
+        "cross_validate",
+        help="Cross-validate the cancer detection model on your own data and/or the included features",
+        description=wrap_command_description(
+            "CROSS-VALIDATE your features with nested leave-one-dataset-out (or classic) cross-validation. "
+            "Use your extracted features and/or the included features. "
+            "\nAllows seeing the effect on generalization of adding your own data to the training. "
+            "\n\nNote: The settings are optimized for use with the included features and optional "
+            "additional datasets. They may not be optimal for more custom designs."
+        ),
+        formatter_class=parser.formatter_class,
+        epilog=cross_validate.EPILOG,
+    )
+    # Delegate the argument setup to the respective command module
+    cross_validate.setup_parser(parser_cv)
+
+    # Command 6
     parser_tm = subparsers.add_parser(
         "train_model",
         help="Train a model on your own data and/or the included features",
         description=wrap_command_description(
-            "TRAIN A MODEL on your extracted features and/or the included features."
+            "TRAIN A MODEL on your extracted features and/or the included features. "
+            "\n\nNOTE: The included evaluation is of predictions of the training data."
         ),
         formatter_class=parser.formatter_class,
         epilog=train_model.EPILOG,
     )
     # Delegate the argument setup to the respective command module
     train_model.setup_parser(parser_tm)
 
-    # # Command 5
-    # parser_va = subparsers.add_parser(
-    #     "validate",
-    #     help="Validate a trained model on one or more validation datasets",
-    #     description=wrap_command_description(
-    #         "VALIDATE your trained model one or more validation datasets, such as the included validation dataset."
-    #     ),
-    #     formatter_class=parser.formatter_class,
-    #     epilog=validate.EPILOG,
-    # )
-    # # Delegate the argument setup to the respective command module
-    # validate.setup_parser(parser_va)
+    # # Command 7
+    parser_va = subparsers.add_parser(
+        "validate",
+        help="Validate a model on a validation dataset",
+        description=wrap_command_description(
+            "VALIDATE your trained model on a validation dataset, such as the included validation dataset."
+        ),
+        formatter_class=parser.formatter_class,
+        epilog=validate.EPILOG,
+    )
+    # Delegate the argument setup to the respective command module
+    validate.setup_parser(parser_va)
 
-    # # Command 6
-    # parser_cv = subparsers.add_parser(
-    #     "cross_validate",
-    #     help="Cross-validate the cancer detection model on your own data and/or the included features",
-    #     description=wrap_command_description(
-    #         "CROSS-VALIDATE your features with nested leave-one-dataset-out (or classic) cross-validation. "
-    #         "Use your extracted features and/or the included features."
-    #     ),
-    #     formatter_class=parser.formatter_class,
-    # )
-    # # Delegate the argument setup to the respective command module
-    # cross_validate.setup_parser(parser_cv)
+    # # Command 8
+    parser_eu = subparsers.add_parser(
+        "evaluate_univariates",
+        help="Evaluate the cancer detection potential of each feature separately on your own data and/or the included features",
+        description=wrap_command_description(
+            "EVALUATE your features separately on their cancer detection potential. "
+            "Use your extracted features and/or the included features. "
+        ),
+        formatter_class=parser.formatter_class,
+        epilog=evaluate_univariates.EPILOG,
+    )
+    # Delegate the argument setup to the respective command module
+    evaluate_univariates.setup_parser(parser_eu)
 
     args = parser.parse_args()
     if args.command == "guide_me":