Comparing changes

base repository: BesenbacherLab/lionheart
base: v1.0.0
head repository: BesenbacherLab/lionheart
compare: main
Showing with 4,876 additions and 852 deletions.
  1. +2 −0 .gitignore
  2. +55 −0 CHANGELOG.md
  3. +21 −0 LICENSE
  4. +96 −6 README.md
  5. +549 −0 conftest.py
  6. +2 −1 environment.yml
  7. +5 −2 lionheart/__init__.py
  8. +80 −26 lionheart/cli.py
  9. +239 −36 lionheart/commands/cross_validate.py
  10. +199 −0 lionheart/commands/customize_thresholds.py
  11. +225 −0 lionheart/commands/evaluate_univariates.py
  12. +1 −1 lionheart/commands/extract_features.py
  13. +1 −1 lionheart/commands/guides.py
  14. +109 −328 lionheart/commands/predict.py
  15. +39 −137 lionheart/commands/train_model.py
  16. +323 −272 lionheart/commands/validate.py
  17. +4 −4 lionheart/features/correction/correction.py
  18. +2 −2 lionheart/features/correction/normalize_megabins.py
  19. +2 −2 lionheart/features/correction/poisson.py
  20. +1 −1 lionheart/features/create_dataset_inference.py
  21. +1 −0 lionheart/modeling/__init__.py
  22. +30 −8 lionheart/modeling/prepare_modeling.py
  23. +245 −0 lionheart/modeling/prepare_modeling_command.py
  24. +3 −3 lionheart/modeling/read_meta_data.py
  25. +376 −0 lionheart/modeling/run_cross_validate.py
  26. +261 −0 lionheart/modeling/run_customize_thresholds.py
  27. +4 −3 lionheart/modeling/run_full_modeling.py
  28. +414 −0 lionheart/modeling/run_predict_single_model.py
  29. +242 −0 lionheart/modeling/run_univariate_analyses.py
  30. 0 lionheart/plotting/__init__.py
  31. +161 −0 lionheart/plotting/plot_inner_scores.py
  32. +1 −0 lionheart/utils/__init__.py
  33. +1 −1 lionheart/utils/cli_utils.py
  34. +10 −1 lionheart/utils/global_vars.py
  35. +6 −0 lionheart/utils/utils.py
  36. +22 −11 pyproject.toml
  37. +1 −1 remap/workflow.py
  38. +112 −0 tests/test_common_workflows.py
  39. +149 −0 tests/test_cross_validate.py
  40. +77 −0 tests/test_customize_thresholds.py
  41. +112 −0 tests/test_evaluate_univariates.py
  42. +137 −0 tests/test_predict_sample.py
  43. +102 −0 tests/test_predict_single_model.py
  44. +201 −0 tests/test_train_model.py
  45. +246 −0 tests/test_validate.py
  46. +7 −5 workflow/target_creators.py
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -172,3 +172,5 @@ poetry.lock
dist/

lionheart/__pycache__/

TOCHANGE.md
55 changes: 55 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,55 @@
# Changelog

## 1.1.*

- Adds `lionheart --version` command to CLI.

## 1.1.5

- Fixes package specification in pyproject.toml

**Future note**: An *upcoming* version will contain completely recomputed resource files with changed bin coordinates to reduce RAM usage of the `mosdepth` coverage extraction. At the same time, we will be updating the exclusion bin index files to fix a small discrepancy between the shared features and the features extracted with the current `lionheart` version. Stay tuned for updates in the coming month(s).

## 1.1.4

- Adds project URLs to the package so they are listed on the PyPI site.

## 1.1.2

- Fixes writing of README in `lionheart predict_sample`. Thanks to @LauraAndersen for detecting the problem.
- Improvements to installation guide in repository README.
- Workflow example improvements.

## 1.1.1

- Improves CLI documentation for some commands (in `--help` pages).

## 1.1.0

This release adds multiple CLI commands that:

1) allow reproducing results from the article and seeing the effect of adding your own datasets:

- Adds `lionheart cross_validate` command. Perform nested leave-one-dataset-out cross-validation on your dataset(s) and/or the included features.
- Adds `lionheart validate` command. Validate a model on the included external dataset or a custom dataset.
- Adds `lionheart evaluate_univariates` command. Evaluate each feature (cell-type) separately on your dataset(s) and/or the included features.

2) expands what you can do with your own data:

- Adds `lionheart customize_thresholds` command. Calculate the ROC curve and probability densities (for deciding probability thresholds) on your data and/or the included features for a custom model or an included model. Allows using probability thresholds suited to your own data when using `lionheart predict_sample` and `lionheart validate`.
- Adds `--custom_threshold_dirs` argument in `lionheart predict_sample`. Allows passing the ROC curves and probability densities extracted with `lionheart customize_thresholds`.

Also:

- Adds `matplotlib` as dependency.
- Bumps `generalize` dependency requirement to `0.2.1`.
- Bumps `utipy` dependency requirement to `1.0.3`.

## 1.0.2

- Fixes bug when training model on a single dataset.
- Adds tests for a subset of the CLI tools.

## 1.0.1

- Fixed model name.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Ludvig Renbo Olsen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
102 changes: 96 additions & 6 deletions README.md
@@ -6,14 +6,22 @@ This software lets you run feature extraction and predict the cancer status of y

Developed for hg38. See the `remap` directory for the applied remapping pipeline.

Preprint: https://www.medrxiv.org/content/10.1101/2024.11.26.24317971v1

The code was developed and implemented by [@ludvigolsen](https://github.com/LudvigOlsen).

If you experience an issue, please [report it](https://github.com/BesenbacherLab/lionheart/issues).


## Installation

This section describes the installation of `lionheart` and the custom version of `mosdepth` (expected time: <10 minutes). The code has only been tested on Linux but should also work on macOS and Windows.

Install the main package:

```
# Create and activate conda environment
$ conda config --set channel_priority true
$ conda env create -f https://raw.githubusercontent.com/BesenbacherLab/lionheart/refs/heads/main/environment.yml
$ conda activate lionheart
@@ -37,19 +45,27 @@ $ curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Add to PATH
# Change the path to fit with your system
# Tip: Consider adding it to the terminal configuration file (e.g., ~/.bashrc)
$ export PATH=/home/<username>/.nimble/bin:$PATH
# Install and use nim 1.6.14
# NOTE: This step should be done even when nim is already installed
$ choosenim 1.6.14
```

Now that nim is installed, we can install the custom mosdepth. To avoid overriding an existing mosdepth installation, we install it into a separate directory:

```
# Make a directory for installing the nim packages into
$ mkdir mosdepth_installation
# Install modified mosdepth
$ NIMBLE_DIR=mosdepth_installation nimble install -y https://github.com/LudvigOlsen/mosdepth
# Get path to mosdepth binary to use in the software
$ find mosdepth_installation/pkgs/ -name "mosdepth*"
>> mosdepth_installation/pkgs/mosdepth-0.x.x/mosdepth
```

## Get Resources
@@ -60,7 +76,21 @@ $ wget https://zenodo.org/records/14215762/files/inference_resources_v002.tar.gz
$ tar -xvzf inference_resources_v002.tar.gz
```

## Main commands

This section describes the commands in `lionheart` and lists their *main* output files:

| Command | Description | Main Output |
| :------------------------------- | :------------------------------------------------------------------ | :---------------------------------------------------------------------------------- |
| `lionheart extract_features` | Extract features from a BAM file. | `feature_dataset.npy` and correction profiles |
| `lionheart predict_sample` | Predict cancer status of a sample. | `prediction.csv` |
| `lionheart collect` | Collect predictions and/or features across samples. | `predictions.csv`, `feature_dataset.npy`, and correction profiles *for all samples* |
| `lionheart customize_thresholds` | Extract ROC curve and more for using custom probability thresholds. | `ROC_curves.json` and `probability_densities.csv` |
| `lionheart cross_validate` | Cross-validate the model on new data and/or the included features. | `evaluation_summary.csv`, `splits_summary.csv` |
| `lionheart train_model` | Train a model on your own data and/or the included features. | `model.joblib` and training data results |
| `lionheart validate` | Validate a model on a validation dataset. | `evaluation_scores.csv` and `predictions.csv` |
| `lionheart evaluate_univariates` | Evaluate the cancer detection potential of each feature separately. | `univariate_evaluations.csv` |


## Examples

@@ -77,9 +107,9 @@ $ lionheart -h
# Extract features from a given BAM file
# `mosdepth_path` is the path to the customized `mosdepth` installation
# E.g., "/home/<username>/mosdepth/mosdepth"
# `ld_library_path` is the path to the `lib` folder in the conda environment
# E.g., "/home/<username>/anaconda3/envs/lionheart/lib/"
$ lionheart extract_features --bam_file {bam_file} --resources_dir {resources_dir} --out_dir {out_dir} --mosdepth_path {mosdepth_path} --ld_library_path {ld_library_path} --n_jobs {cores}
# `sample_dir` is the `out_dir` of `extract_features`
@@ -88,6 +118,7 @@ $ lionheart predict_sample --sample_dir {sample_dir} --resources_dir {resources_

After running these commands for a set of samples, you can use `lionheart collect` to collect features and predictions across the samples. You can then use `lionheart train_model` to train a model on your own data (and optionally the included features).


### Via `gwf` workflow

We provide a simple workflow for submitting jobs to slurm via the `gwf` package. Make a copy of the `workflow` directory, open `workflow.py`, change the paths and list the samples to run `lionheart` on.
@@ -124,3 +155,62 @@ $ gwf run
$ gwf status
$ gwf status -f summary
```

### Reproduction of results

This section shows how to reproduce the main results (cross-validation and external validation) from the paper. It uses the included features so the reproduction can be run without access to the raw sequencing data.

Note that different compilations of scikit-learn on different operating systems may lead to slightly different results. On Linux, the results should match the reported results.

#### Cross-validation analysis

We start by performing the nested leave-one-dataset-out cross-validation analysis from Figure 3A (not including the benchmarks).

Note that the default settings are the ones used in the paper.

```
# Perform the cross-validation
# {cv_out_dir} should specify where you want the output files
$ lionheart cross_validate --out_dir {cv_out_dir} --resources_dir {resources_dir} --use_included_features --num_jobs 10
```

The output directory should now include multiple files. The main results are in `evaluation_summary.csv` and `splits_summary.csv`. Note that the results are given for multiple probability thresholds. The threshold reported in the paper is the "Max. J Threshold". You can extract the relevant lines of the summaries with:

```
$ awk 'NR==1 || /Average/ && /J Threshold/' {cv_out_dir}/evaluation_summary.csv
$ awk 'NR==1 || /Average/ && /J Threshold/' {cv_out_dir}/splits_summary.csv
```
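The "Max. J Threshold" corresponds to the ROC-curve point that maximizes Youden's J statistic (sensitivity + specificity − 1). As a rough, self-contained illustration of how such a threshold is found (the data and function name below are made up; `lionheart` derives its thresholds internally from its stored ROC curves):

```python
# Hypothetical sketch: pick the probability threshold maximizing
# Youden's J = sensitivity + specificity - 1. Not lionheart's own code.

def max_j_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing Youden's J over observed probabilities."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_j, best_thresh = -1.0, None
    for thresh in sorted(set(y_prob)):
        preds = [p >= thresh for p in y_prob]
        tp = sum(1 for pred, y in zip(preds, y_true) if pred and y == 1)
        tn = sum(1 for pred, y in zip(preds, y_true) if not pred and y == 0)
        j = tp / n_pos + tn / n_neg - 1  # sensitivity + specificity - 1
        if j > best_j:
            best_j, best_thresh = j, thresh
    return best_thresh, best_j

# Made-up labels (1 = cancer) and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]
thresh, j = max_j_threshold(y_true, y_prob)  # thresh == 0.7, J == 0.75
```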

#### External validation analysis

To reproduce the external validation, we first train a model on all the included training datasets and then validate it on the included validation dataset:

```
# Train a model on the included datasets
# {new_model_dir} should specify where you want the model files
$ lionheart train_model --out_dir {new_model_dir} --resources_dir {resources_dir} --use_included_features
# Validate the model on the included validation dataset
# {val_out_dir} should specify where you want the output files
$ lionheart validate --out_dir {val_out_dir} --resources_dir {resources_dir} --model_dir {new_model_dir} --use_included_validation --thresholds 'max_j'
```

The model training creates the `model.joblib` file along with predictions and evaluations from the *training data* (e.g., `predictions.csv`, `evaluation_scores.csv`, and `ROC_curves.json`).

The validation creates `evaluation_scores.csv` and `predictions.csv` from applying the model on the validation dataset. You will find the reported AUC score in `evaluation_scores.csv`:

```
$ cat {val_out_dir}/evaluation_scores.csv
```

#### Univariate analyses

Finally, we reproduce the univariate modeling evaluations in Figure 2D and 2E:

```
# Evaluate the classification potential of each cell type separately
# {univariates_dir} should specify where you want the evaluation files
$ lionheart evaluate_univariates --out_dir {univariates_dir} --resources_dir {resources_dir} --use_included_features --num_jobs 10
```

This creates the `univariate_evaluations.csv` file with evaluation metrics per cell type. It contains coefficients and p-values (Bonferroni-corrected) from univariate logistic regression models, along with evaluation metrics from per-cell-type leave-one-dataset-out cross-validation.
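As a rough illustration of the Bonferroni adjustment applied to such per-feature p-values (a minimal sketch with made-up values; the actual models and corrections are computed inside `lionheart`):

```python
# Minimal sketch of a Bonferroni correction for a univariate feature
# screen. Illustrative only; the p-values below are made up.

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw_p = [0.001, 0.02, 0.2, 0.04]  # hypothetical per-feature p-values
adj_p = bonferroni(raw_p)         # -> [0.004, 0.08, 0.8, 0.16]
```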