Skip to content

Commit

Permalink
Merge branch 'dev'
Browse files Browse the repository at this point in the history
  • Loading branch information
Ludvig committed Dec 7, 2024
2 parents a7ee9de + 53ab1b0 commit 5b68a04
Show file tree
Hide file tree
Showing 43 changed files with 3,820 additions and 816 deletions.
21 changes: 21 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,26 @@
# Change log

## 1.1.0

This release adds multiple CLI commands that:

1) allow reproducing results from the article and seeing the effect of adding your own datasets:

- Adds `lionheart cross_validate` command. Perform nested leave-one-dataset-out cross-validation on your dataset(s) and/or the included features.
- Adds `lionheart validate` command. Validate a model on the included external dataset or a custom dataset.
- Adds `lionheart evaluate_univariates` command. Evaluate each feature (cell-type) separately on your dataset(s) and/or the included features.

2) expands what you can do with your own data:

- Adds `lionheart customize_thresholds` command. Calculate the ROC curve and probability densities (for deciding probability thresholds) on your data and/or the included features for a custom model or an included model. Allows using probability thresholds suited to your own data when using `lionheart predict_sample` and `lionheart validate`.
- Adds `--custom_threshold_dirs` argument in `lionheart predict_sample`. Allows passing the ROC curves and probability densities extracted with `lionheart customize_thresholds`.

Also:

- Adds `matplotlib` as dependency.
- Bumps `generalize` dependency requirement to `0.2.1`.
- Bumps `utipy` dependency requirement to `1.0.3`.

## 1.0.2

- Fixes bug when training model on a single dataset.
Expand Down
87 changes: 84 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,13 @@ Preprint: https://www.medrxiv.org/content/10.1101/2024.11.26.24317971v1

The code was developed and implemented by [@ludvigolsen](https://github.com/LudvigOlsen).

If you experience an issue, please [report it](https://github.com/BesenbacherLab/lionheart/issues).


## Installation

This section describes the installation of `lionheart` and the custom version of `mosdepth` (exp. time: <10m). The code has only been tested on linux but should also work on Mac and Windows.

Install the main package:

```
Expand Down Expand Up @@ -39,7 +44,7 @@ $ curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Add to PATH
# Change the path to fit with your system
# Tip: Consider adding it to the terminal configuration file (e.g. ~/.bashrc)
# Tip: Consider adding it to the terminal configuration file (e.g., ~/.bashrc)
$ export PATH=/home/<username>/.nimble/bin:$PATH
# Install and use nim 1.6.4
Expand All @@ -62,6 +67,22 @@ $ wget https://zenodo.org/records/14215762/files/inference_resources_v002.tar.gz
$ tar -xvzf inference_resources_v002.tar.gz
```

## Main commands

This section describes the commands in `lionheart` and lists their *main* output files:

| Command | Description | Main Output |
| :------------------------------- | :------------------------------------------------------------------ | :---------------------------------------------------------------------------------- |
| `lionheart extract_features` | Extract features from a BAM file. | `feature_dataset.npy` and correction profiles |
| `lionheart predict_sample` | Predict cancer status of a sample. | `prediction.csv` |
| `lionheart collect` | Collect predictions and/or features across samples. | `predictions.csv`, `feature_dataset.npy`, and correction profiles *for all samples* |
| `lionheart customize_thresholds` | Extract ROC curve and more for using custom probability thresholds. | `ROC_curves.json` and `probability_densities.csv` |
| `lionheart cross_validate` | Cross-validate the model on new data and/or the included features. | `evaluation_summary.csv`, `splits_summary.csv` |
| `lionheart train_model` | Train a model on your own data and/or the included features. | `model.joblib` and training data results |
| `lionheart validate` | Validate a model on a validation dataset. | `evaluation_scores.csv` and `predictions.csv` |
| `lionheart evaluate_univariates` | Evaluate the cancer detection potential of each feature separately. | `univariate_evaluations.csv` |


## Examples

### Run via command-line interface
Expand All @@ -77,9 +98,9 @@ $ lionheart -h
# Extract feature from a given BAM file
# `mosdepth_path` is the path to the customized `mosdepth` installation
# E.g. "/home/<username>/mosdepth/mosdepth"
# E.g., "/home/<username>/mosdepth/mosdepth"
# `ld_library_path` is the path to the `lib` folder in the conda environment
# E.g. "/home/<username>/anaconda3/envs/lionheart/lib/"
# E.g., "/home/<username>/anaconda3/envs/lionheart/lib/"
$ lionheart extract_features --bam_file {bam_file} --resources_dir {resources_dir} --out_dir {out_dir} --mosdepth_path {mosdepth_path} --ld_library_path {ld_library_path} --n_jobs {cores}
# `sample_dir` is the `out_dir` of `extract_features`
Expand All @@ -88,6 +109,7 @@ $ lionheart predict_sample --sample_dir {sample_dir} --resources_dir {resources_

After running these commands for a set of samples, you can use `lionheart collect` to collect features and predictions across the samples. You can then use `lionheart train_model` to train a model on your own data (and optionally the included features).


### Via `gwf` workflow

We provide a simple workflow for submitting jobs to slurm via the `gwf` package. Make a copy of the `workflow` directory, open `workflow.py`, change the paths and list the samples to run `lionheart` on.
Expand Down Expand Up @@ -124,3 +146,62 @@ $ gwf run
$ gwf status
$ gwf status -f summary
```

### Reproduction of results

This section shows how to reproduce the main results (cross-validation and external validation) from the paper. It uses the included features so the reproduction can be run without access to the raw sequencing data.

Note that different compilations of scikit-learn on different operating systems may lead to slightly different results. On linux, the results should match the reported results.

#### Cross-validation analysis

We start by performing the nested leave-one-dataset-out cross-validation analysis from Figure 3A (not including the benchmarks).

Note that the default settings are the ones used in the paper.

```
# Perform the cross-validation
# {cv_out_dir} should specify where you want the output files
$ lionheart cross_validate --out_dir {cv_out_dir} --resources_dir {resources_dir} --use_included_features --num_jobs 10
```

The output directory should now include multiple files. The main results are in `evaluation_summary.csv` and `splits_summary.csv`. Note that the results are given for multiple probability thresholds. The threshold reported in the paper is the "Max. J Threshold". You can extract the relevant lines of the summaries with:

```
$ awk 'NR==1 || /Average/ && /J Threshold/' {cv_out_dir}/evaluation_summary.csv
$ awk 'NR==1 || /Average/ && /J Threshold/' {cv_out_dir}/splits_summary.csv
```

#### External validation analysis

To reproduce the external validation, we first train a model on all the included training datasets and then validate it on the included validation dataset:

```
# Train a model on the included datasets
# {new_model_dir} should specify where you want the model files
$ lionheart train_model --out_dir {new_model_dir} --resources_dir {resources_dir} --use_included_features
# Validate the model on the included validation dataset
# {val_out_dir} should specify where you want the output files
$ lionheart validate --out_dir {val_out_dir} --resources_dir {resources_dir} --model_dir {new_model_dir} --use_included_validation --thresholds 'max_j'
```

The model training creates the `model.joblib` file along with predictions and evaluations from the *training data* (e.g., `predictions.csv`, `evaluation_scores.csv`, and `ROC_curves.json`).

The validation creates `evaluation_scores.csv` and `predictions.csv` from applying the model on the validation dataset. You will find the reported AUC score in `evaluation_scores.csv`:

```
$ cat {val_out_dir}/evaluation_scores.csv
```

#### Univariate analyses

Finally, we reproduce the univariate modeling evaluations in Figure 2D and 2E:

```
# Evaluate the classification potential of each cell type separately
# {univariates_dir} should specify where you want the evaluation files
$ lionheart evaluate_univariates --out_dir {univariates_dir} --resources_dir {resources_dir} --use_included_features --num_jobs 10
```

This creates the `univariate_evaluations.csv` file with evaluation metrics per cell-type. There are coefficients and p-values (bonferroni-corrected) from univariate logistic regression models and evaluation metrics from per-cell-type leave-one-dataset-out cross-validation.
4 changes: 4 additions & 0 deletions conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,10 @@ def _run_cli(
if output_dir_arg not in command_args:
command_args.extend([output_dir_arg, str(output_dir)])

command_args = [str(cmd) for cmd in command_args]

print(" ".join(command_args))

# Run the command
result = subprocess.run(
command_args,
Expand Down
3 changes: 2 additions & 1 deletion environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,11 @@ dependencies:
- scipy=1.7.3
- statsmodels=0.14.1
- seaborn=0.13.2
- matplotlib=3.9.1
- pip:
- nattrs==0.2.2
- pytest==7.1.2
- scikit-learn==1.0.2
- utipy==1.0.2
- utipy==1.0.3
- poetry==1.8.2
- generalize
93 changes: 67 additions & 26 deletions lionheart/cli.py
Original file line number Diff line number Diff line change
@@ -1,17 +1,20 @@
import argparse
from lionheart.commands import (
collect_samples,
customize_thresholds,
extract_features,
predict,
train_model,
validate,
cross_validate,
evaluate_univariates,
guides,
)
from lionheart.utils.cli_utils import (
LION_ASCII,
LIONHEART_ASCII,
LIONHEART_STRING,
README_STRING,
CustomRichHelpFormatter,
wrap_command_description,
)
Expand All @@ -34,6 +37,8 @@ def main():
Start by <b>extracting</b> the features from a BAM file (hg38 only). Then <b>predict</b> whether a sample is from a cancer patient or not.
Easily <b>train</b> a new model on your own data or perform <b>cross-validation</b> to compare against the paper.
{README_STRING}
""",
formatter_class=CustomRichHelpFormatter,
)
Expand Down Expand Up @@ -81,51 +86,87 @@ def main():
"collect",
help="Collect predictions and/or features across samples",
description=wrap_command_description(
"COLLECT predictions and/or extracted features for multiple samples."
"COLLECT predictions and/or extracted features for multiple samples. "
"\nCollecting the features creates a 'dataset' that can be used in "
"other `lionheart` commands."
),
formatter_class=parser.formatter_class,
)
# Delegate the argument setup to the respective command module
collect_samples.setup_parser(parser_cl)

# Command 4
parser_er = subparsers.add_parser(
"customize_thresholds",
help="Extract ROC curve and probability densitities for using custom probability thresholds",
description=wrap_command_description(
"Extract the ROC curve and probability densities for a model's predictions on one or more datasets. "
"\nThis allows using probability thresholds fitted to your own data in "
"`lionheart predict_sample` and `lionheart validate`."
),
formatter_class=parser.formatter_class,
epilog=customize_thresholds.EPILOG,
)
# Delegate the argument setup to the respective command module
customize_thresholds.setup_parser(parser_er)

# # Command 5
parser_cv = subparsers.add_parser(
"cross_validate",
help="Cross-validate the cancer detection model on your own data and/or the included features",
description=wrap_command_description(
"CROSS-VALIDATE your features with nested leave-one-dataset-out (or classic) cross-validation. "
"Use your extracted features and/or the included features. "
"\nAllows seeing the effect on generalization of adding your own data to the training. "
"\n\nNote: The settings are optimized for use with the included features and optional "
"additional datasets. They may not be optimal for more custom designs."
),
formatter_class=parser.formatter_class,
epilog=cross_validate.EPILOG,
)
# Delegate the argument setup to the respective command module
cross_validate.setup_parser(parser_cv)

# Command 6
parser_tm = subparsers.add_parser(
"train_model",
help="Train a model on your own data and/or the included features",
description=wrap_command_description(
"TRAIN A MODEL on your extracted features and/or the included features."
"TRAIN A MODEL on your extracted features and/or the included features. "
"\n\nNOTE: The included evaluation is of predictions of the training data."
),
formatter_class=parser.formatter_class,
epilog=train_model.EPILOG,
)
# Delegate the argument setup to the respective command module
train_model.setup_parser(parser_tm)

# # Command 5
# parser_va = subparsers.add_parser(
# "validate",
# help="Validate a trained model on one or more validation datasets",
# description=wrap_command_description(
# "VALIDATE your trained model one or more validation datasets, such as the included validation dataset."
# ),
# formatter_class=parser.formatter_class,
# epilog=validate.EPILOG,
# )
# # Delegate the argument setup to the respective command module
# validate.setup_parser(parser_va)
# # Command 7
parser_va = subparsers.add_parser(
"validate",
help="Validate a model on a validation dataset",
description=wrap_command_description(
"VALIDATE your trained model on a validation dataset, such as the included validation dataset."
),
formatter_class=parser.formatter_class,
epilog=validate.EPILOG,
)
# Delegate the argument setup to the respective command module
validate.setup_parser(parser_va)

# # Command 6
# parser_cv = subparsers.add_parser(
# "cross_validate",
# help="Cross-validate the cancer detection model on your own data and/or the included features",
# description=wrap_command_description(
# "CROSS-VALIDATE your features with nested leave-one-dataset-out (or classic) cross-validation. "
# "Use your extracted features and/or the included features."
# ),
# formatter_class=parser.formatter_class,
# )
# # Delegate the argument setup to the respective command module
# cross_validate.setup_parser(parser_cv)
# # Command 8
parser_eu = subparsers.add_parser(
"evaluate_univariates",
help="Evaluate the cancer detection potential of each feature separately on your own data and/or the included features",
description=wrap_command_description(
"EVALUATE your features separately on their cancer detection potential. "
"Use your extracted features and/or the included features. "
),
formatter_class=parser.formatter_class,
epilog=evaluate_univariates.EPILOG,
)
# Delegate the argument setup to the respective command module
evaluate_univariates.setup_parser(parser_eu)

args = parser.parse_args()
if args.command == "guide_me":
Expand Down
Loading

0 comments on commit 5b68a04

Please sign in to comment.