Comparing changes

base repository: BesenbacherLab/lionheart
base: v1.0.0
head repository: BesenbacherLab/lionheart
compare: main
Showing with 4,876 additions and 852 deletions.
  1. +2 −0 .gitignore
  2. +55 −0 CHANGELOG.md
  3. +21 −0 LICENSE
  4. +96 −6 README.md
  5. +549 −0 conftest.py
  6. +2 −1 environment.yml
  7. +5 −2 lionheart/__init__.py
  8. +80 −26 lionheart/cli.py
  9. +239 −36 lionheart/commands/cross_validate.py
  10. +199 −0 lionheart/commands/customize_thresholds.py
  11. +225 −0 lionheart/commands/evaluate_univariates.py
  12. +1 −1 lionheart/commands/extract_features.py
  13. +1 −1 lionheart/commands/guides.py
  14. +109 −328 lionheart/commands/predict.py
  15. +39 −137 lionheart/commands/train_model.py
  16. +323 −272 lionheart/commands/validate.py
  17. +4 −4 lionheart/features/correction/correction.py
  18. +2 −2 lionheart/features/correction/normalize_megabins.py
  19. +2 −2 lionheart/features/correction/poisson.py
  20. +1 −1 lionheart/features/create_dataset_inference.py
  21. +1 −0 lionheart/modeling/__init__.py
  22. +30 −8 lionheart/modeling/prepare_modeling.py
  23. +245 −0 lionheart/modeling/prepare_modeling_command.py
  24. +3 −3 lionheart/modeling/read_meta_data.py
  25. +376 −0 lionheart/modeling/run_cross_validate.py
  26. +261 −0 lionheart/modeling/run_customize_thresholds.py
  27. +4 −3 lionheart/modeling/run_full_modeling.py
  28. +414 −0 lionheart/modeling/run_predict_single_model.py
  29. +242 −0 lionheart/modeling/run_univariate_analyses.py
  30. 0 lionheart/plotting/__init__.py
  31. +161 −0 lionheart/plotting/plot_inner_scores.py
  32. +1 −0 lionheart/utils/__init__.py
  33. +1 −1 lionheart/utils/cli_utils.py
  34. +10 −1 lionheart/utils/global_vars.py
  35. +6 −0 lionheart/utils/utils.py
  36. +22 −11 pyproject.toml
  37. +1 −1 remap/workflow.py
  38. +112 −0 tests/test_common_workflows.py
  39. +149 −0 tests/test_cross_validate.py
  40. +77 −0 tests/test_customize_thresholds.py
  41. +112 −0 tests/test_evaluate_univariates.py
  42. +137 −0 tests/test_predict_sample.py
  43. +102 −0 tests/test_predict_single_model.py
  44. +201 −0 tests/test_train_model.py
  45. +246 −0 tests/test_validate.py
  46. +7 −5 workflow/target_creators.py
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -172,3 +172,5 @@ poetry.lock
dist/

lionheart/__pycache__/

TOCHANGE.md
55 changes: 55 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,55 @@
# Changelog

## 1.1.*

- Adds `lionheart --version` command to CLI.

## 1.1.5

- Fixes package specification in pyproject.toml

**Future note**: An *upcoming* version will contain completely recomputed resource files with changed bin coordinates to reduce RAM usage of the `mosdepth` coverage extraction. At the same time, we will be updating the exclusion bin index files to fix a small discrepancy between the shared features and the features extracted with the current `lionheart` version. Stay tuned for updates in the coming month(s).

## 1.1.4

- Adds project URLs to the package so they are listed on the PyPI site.

## 1.1.2

- Fixes writing of README in `lionheart predict_sample`. Thanks to @LauraAndersen for detecting the problem.
- Improvements to installation guide in repository README.
- Workflow example improvements.

## 1.1.1

- Improves CLI documentation for some commands (in `--help` pages).

## 1.1.0

This release adds multiple CLI commands that:

1) allow reproducing results from the article and seeing the effect of adding your own datasets:

- Adds `lionheart cross_validate` command. Perform nested leave-one-dataset-out cross-validation on your dataset(s) and/or the included features.
- Adds `lionheart validate` command. Validate a model on the included external dataset or a custom dataset.
- Adds `lionheart evaluate_univariates` command. Evaluate each feature (cell-type) separately on your dataset(s) and/or the included features.

2) expands what you can do with your own data:

- Adds `lionheart customize_thresholds` command. Calculate the ROC curve and probability densities (for deciding probability thresholds) on your data and/or the included features for a custom model or an included model. Allows using probability thresholds suited to your own data when using `lionheart predict_sample` and `lionheart validate`.
- Adds `--custom_threshold_dirs` argument in `lionheart predict_sample`. Allows passing the ROC curves and probability densities extracted with `lionheart customize_thresholds`.

Also:

- Adds `matplotlib` as dependency.
- Bumps `generalize` dependency requirement to `0.2.1`.
- Bumps `utipy` dependency requirement to `1.0.3`.

## 1.0.2

- Fixes bug when training model on a single dataset.
- Adds tests for a subset of the CLI tools.

## 1.0.1

- Fixed model name.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Ludvig Renbo Olsen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
102 changes: 96 additions & 6 deletions README.md
@@ -6,14 +6,22 @@ This software lets you run feature extraction and predict the cancer status of y

Developed for hg38. See the `remap` directory for the applied remapping pipeline.

Preprint: https://www.medrxiv.org/content/10.1101/2024.11.26.24317971v1

The code was developed and implemented by [@ludvigolsen](https://github.com/LudvigOlsen).

If you experience an issue, please [report it](https://github.com/BesenbacherLab/lionheart/issues).


## Installation

This section describes the installation of `lionheart` and the custom version of `mosdepth` (expected time: <10 minutes). The code has only been tested on Linux but should also work on macOS and Windows.

Install the main package:

```
# Create and activate conda environment
$ conda config --set channel_priority true
$ conda env create -f https://raw.githubusercontent.com/BesenbacherLab/lionheart/refs/heads/main/environment.yml
$ conda activate lionheart
@@ -37,19 +45,27 @@ $ curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Add to PATH
# Change the path to fit with your system
# Tip: Consider adding it to the terminal configuration file (e.g., ~/.bashrc)
$ export PATH=/home/<username>/.nimble/bin:$PATH
# Install and use nim 1.6.14
# NOTE: This step should be done even when nim is already installed
$ choosenim 1.6.14
```

Now that nim is installed, we can install the custom mosdepth. To avoid overriding an existing mosdepth installation, we install it into a separate directory:

```
# Make a directory for installing the nim packages into
$ mkdir mosdepth_installation
# Install modified mosdepth
$ NIMBLE_DIR=mosdepth_installation nimble install -y https://github.com/LudvigOlsen/mosdepth
# Get path to mosdepth binary to use in the software
$ find mosdepth_installation/pkgs/ -name "mosdepth*"
>> mosdepth_installation/pkgs/mosdepth-0.x.x/mosdepth
```

## Get Resources
@@ -60,7 +76,21 @@ $ wget https://zenodo.org/records/14215762/files/inference_resources_v002.tar.gz
$ tar -xvzf inference_resources_v002.tar.gz
```

## Main commands

This section describes the commands in `lionheart` and lists their *main* output files:

| Command | Description | Main Output |
| :------------------------------- | :------------------------------------------------------------------ | :---------------------------------------------------------------------------------- |
| `lionheart extract_features` | Extract features from a BAM file. | `feature_dataset.npy` and correction profiles |
| `lionheart predict_sample` | Predict cancer status of a sample. | `prediction.csv` |
| `lionheart collect` | Collect predictions and/or features across samples. | `predictions.csv`, `feature_dataset.npy`, and correction profiles *for all samples* |
| `lionheart customize_thresholds` | Extract ROC curve and more for using custom probability thresholds. | `ROC_curves.json` and `probability_densities.csv` |
| `lionheart cross_validate` | Cross-validate the model on new data and/or the included features. | `evaluation_summary.csv`, `splits_summary.csv` |
| `lionheart train_model` | Train a model on your own data and/or the included features. | `model.joblib` and training data results |
| `lionheart validate` | Validate a model on a validation dataset. | `evaluation_scores.csv` and `predictions.csv` |
| `lionheart evaluate_univariates` | Evaluate the cancer detection potential of each feature separately. | `univariate_evaluations.csv` |


## Examples

@@ -77,9 +107,9 @@ $ lionheart -h
# Extract features from a given BAM file
# `mosdepth_path` is the path to the customized `mosdepth` installation
# E.g., "/home/<username>/mosdepth/mosdepth"
# `ld_library_path` is the path to the `lib` folder in the conda environment
# E.g., "/home/<username>/anaconda3/envs/lionheart/lib/"
$ lionheart extract_features --bam_file {bam_file} --resources_dir {resources_dir} --out_dir {out_dir} --mosdepth_path {mosdepth_path} --ld_library_path {ld_library_path} --n_jobs {cores}
# `sample_dir` is the `out_dir` of `extract_features`
@@ -88,6 +118,7 @@ $ lionheart predict_sample --sample_dir {sample_dir} --resources_dir {resources_

After running these commands for a set of samples, you can use `lionheart collect` to collect features and predictions across the samples. You can then use `lionheart train_model` to train a model on your own data (and optionally the included features).


### Via `gwf` workflow

We provide a simple workflow for submitting jobs to slurm via the `gwf` package. Make a copy of the `workflow` directory, open `workflow.py`, change the paths and list the samples to run `lionheart` on.
@@ -124,3 +155,62 @@ $ gwf run
$ gwf status
$ gwf status -f summary
```

### Reproduction of results

This section shows how to reproduce the main results (cross-validation and external validation) from the paper. It uses the included features so the reproduction can be run without access to the raw sequencing data.

Note that different compilations of scikit-learn on different operating systems may lead to slightly different results. On Linux, the results should match the reported results.

#### Cross-validation analysis

We start by performing the nested leave-one-dataset-out cross-validation analysis from Figure 3A (not including the benchmarks).

Note that the default settings are the ones used in the paper.

```
# Perform the cross-validation
# {cv_out_dir} should specify where you want the output files
$ lionheart cross_validate --out_dir {cv_out_dir} --resources_dir {resources_dir} --use_included_features --num_jobs 10
```

The output directory should now include multiple files. The main results are in `evaluation_summary.csv` and `splits_summary.csv`. Note that the results are given for multiple probability thresholds. The threshold reported in the paper is the "Max. J Threshold". You can extract the relevant lines of the summaries with:

```
$ awk 'NR==1 || /Average/ && /J Threshold/' {cv_out_dir}/evaluation_summary.csv
$ awk 'NR==1 || /Average/ && /J Threshold/' {cv_out_dir}/splits_summary.csv
```
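The "Max. J Threshold" corresponds to the ROC-curve point that maximizes Youden's J statistic (sensitivity + specificity − 1). As a rough, self-contained illustration of how such a threshold is found (the data and function name below are made up; `lionheart` derives its thresholds internally from its stored ROC curves):

```python
# Hypothetical sketch: pick the probability threshold maximizing
# Youden's J = sensitivity + specificity - 1. Not lionheart's own code.

def max_j_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing Youden's J over observed probabilities."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_j, best_thresh = -1.0, None
    for thresh in sorted(set(y_prob)):
        preds = [p >= thresh for p in y_prob]
        tp = sum(1 for pred, y in zip(preds, y_true) if pred and y == 1)
        tn = sum(1 for pred, y in zip(preds, y_true) if not pred and y == 0)
        j = tp / n_pos + tn / n_neg - 1  # sensitivity + specificity - 1
        if j > best_j:
            best_j, best_thresh = j, thresh
    return best_thresh, best_j

# Made-up labels (1 = cancer) and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]
thresh, j = max_j_threshold(y_true, y_prob)  # thresh == 0.7, J == 0.75
```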

#### External validation analysis

To reproduce the external validation, we first train a model on all the included training datasets and then validate it on the included validation dataset:

```
# Train a model on the included datasets
# {new_model_dir} should specify where you want the model files
$ lionheart train_model --out_dir {new_model_dir} --resources_dir {resources_dir} --use_included_features
# Validate the model on the included validation dataset
# {val_out_dir} should specify where you want the output files
$ lionheart validate --out_dir {val_out_dir} --resources_dir {resources_dir} --model_dir {new_model_dir} --use_included_validation --thresholds 'max_j'
```

The model training creates the `model.joblib` file along with predictions and evaluations from the *training data* (e.g., `predictions.csv`, `evaluation_scores.csv`, and `ROC_curves.json`).

The validation creates `evaluation_scores.csv` and `predictions.csv` from applying the model on the validation dataset. You will find the reported AUC score in `evaluation_scores.csv`:

```
$ cat {val_out_dir}/evaluation_scores.csv
```

#### Univariate analyses

Finally, we reproduce the univariate modeling evaluations in Figure 2D and 2E:

```
# Evaluate the classification potential of each cell type separately
# {univariates_dir} should specify where you want the evaluation files
$ lionheart evaluate_univariates --out_dir {univariates_dir} --resources_dir {resources_dir} --use_included_features --num_jobs 10
```

This creates the `univariate_evaluations.csv` file with evaluation metrics per cell type. It contains coefficients and p-values (Bonferroni-corrected) from univariate logistic regression models, along with evaluation metrics from per-cell-type leave-one-dataset-out cross-validation.
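As a rough illustration of the Bonferroni adjustment applied to such per-feature p-values (a minimal sketch with made-up values; the actual models and corrections are computed inside `lionheart`):

```python
# Minimal sketch of a Bonferroni correction for a univariate feature
# screen. Illustrative only; the p-values below are made up.

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw_p = [0.001, 0.02, 0.2, 0.04]  # hypothetical per-feature p-values
adj_p = bonferroni(raw_p)         # -> [0.004, 0.08, 0.8, 0.16]
```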