Polars 1xx #149

Open
wants to merge 65 commits into main
Conversation

ghuls (Member) commented Jul 16, 2024

No description provided.

ghuls added 30 commits June 10, 2024 17:22
Format all Python files with "ruff format".
Clean up code of get_barcodes_passing_qc_for_sample.
Update Polars syntax to the 1.0.0+ version and fix some type checking.
…polars_series".

Skip empty lines when reading barcode file in "read_barcodes_file_to_polars_series".
Add workaround for very rare cases in QC where the KDE for the duplication
ratio cannot be calculated because all duplication ratio values are 0.0%, as
the count for each fragment was 1.
Add mypy stub generation and mypy numpy typing plugin.

Later, stubs can be generated with:

    stubgen -o stubs -p pycisTopic
Only keep one copy of each barcode when reading barcode file in "read_barcodes_file_to_polars_series".
Replace df.columns with df.collect_schema().names() so it also works on lazy Polars DataFrames.
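
A minimal sketch (hypothetical columns) of the lazy-friendly pattern:

    import polars as pl

    lf = pl.LazyFrame({"chrom": ["chr1"], "start": [100]})

    # On a LazyFrame, collect_schema().names() resolves the schema without
    # collecting the data; `lf.columns` is discouraged in Polars 1.0+.
    column_names = lf.collect_schema().names()  # ["chrom", "start"]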
…ader" instead of "has_header".

Update "write_csv" usage to Polars syntax 1.0.0+ by using "include_header" instead of "has_header".
Do not use the "by=" keyword in group_by, as Polars 1.0.0+ treats that
as renaming the grouped column to "by".
Change "sort" and "partition_by" too, to not use the "by" keyword,
although those cases still work as before.
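
A minimal sketch (hypothetical columns) of the changed group_by call:

    import polars as pl

    df = pl.DataFrame({"chrom": ["chr1", "chr1"], "count": [1, 2]})

    # Polars 1.0.0+ would treat group_by(by="chrom") as renaming the grouped
    # column to "by"; pass the column name positionally instead.
    per_chrom = df.group_by("chrom").agg(pl.col("count").sum())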
Reformat TSS profile chained code by adding no-op "clone" calls.
Restrict version numbers for some dependencies.
Expose "engine" option in "pycistopic qc".
Support adding sample ID to cell barcodes when reading fragments or
cell barcode files by adding extra arguments to:
  - `read_fragments_to_polars_df`
  - `read_barcodes_file_to_polars_series`
Remove unused argument from `get_insert_size_distribution` docstring.
…ing_qc_for_sample".

Use greater than or equal for threshold filters in "get_barcodes_passing_qc_for_sample".
Change `pycistopic qc` to `pycistopic qc run`.
Add `pycistopic qc filter` to be able to filter cell barcodes based on QC stats.
…idx".

Fix "get_tss_profile" so it works both with "polars" and "polars-u64-idx".

Previously, None values were not set to 0.0 when run with "polars-u64-idx".
…h "polars" and "polars-u64-idx".

Cast some columns to pl.UInt32 so dataframes have the same schema with
both "polars" and "polars-u64-idx".
Add "create_fragment_matrix_from_fragments" to create directly a
sparse fragment matrix from a fragments file for consensus peaks
of interest. The new code uses a lot less memory as it never builds
a full dense matrix.

    import pycisTopic.cistopic_class
    import pycisTopic.fragments

    # Create fragments matrix for fragment file for consensus regions and
    # for cell barcodes selected after pycistopic QC.
    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        fragments_bed_filename="fragments_GSM7822226_MM_566.tsv.gz",
        regions_bed_filename="consensus_regions.bed",
        barcodes_tsv_filename="cbs_after_pycistopic_qc.tsv",
        blacklist_bed_filename="hg38-blacklist.v2.bed",  # Or None
    )

    # Define sample ID (project name).
    sample_id = "Sample1"

    # Create cisTopic object from sparse fragment matrix.
    cistopic_obj = pycisTopic.cistopic_class.create_cistopic_object(
        fragment_matrix=counts_fragments_matrix,
        cell_names=cbs,
        region_names=region_ids,
        path_to_fragments={sample_id: "fragments.tsv.gz"},
        project=sample_id
    )
…d_word_topics" in the future.

Add "create_regions_topics_frequency_matrix" function, which uses Mallet region topics
counts file as input, to replace "load_word_topics" and "get_topics" in the future.

"load_word_topics" uses Mallet state file, which can be huge if Mallet is run with
a lot of regions and cells as input.

Mallet state files are quite big (even several hunderds of GB) and take quite a bit
of time to be written to disk when Mallet is run with a lot of regions and cells.
Getting count values for each region-topic pair is quite memory intensive in the
current code too.
Mallet region topics counts files are much smaller and don't require much post
processing and time to be written to disk.
…pics_count_matrix".

Rename "create_regions_topics_frequency_matrix" to "create_regions_topics_count_matrix"
and keep the code for creating the frequency matrix in "get_topics" as both matrices
are needed.
Allow creating a Mallet serialized corpus file directly from a sparse matrix.
This roughly replaces LDAMallet.convert_corpus_to_mallet_corpus_file and
allows this new method to be used independently of the Mallet topic modeling
itself. As generating the Mallet serialized corpus only needs to be done once
before all topic modelings, being able to run it independently is a huge plus.
Expose creation of Mallet corpus file from pycistopic CLI interface:

    pycistopic topic_modeling create_mallet_corpus

Usage:

  Create binary accessibility matrix in Matrix Market format:

    import pycisTopic.fragments
    import scipy

    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        "fragments.tsv.gz",
        "consensus_regions.bed",
        "cbs.tsv"
    )

    # Create binary matrix:
    binary_matrix = counts_fragments_matrix.copy()
    binary_matrix.data.fill(1)

    # Write binary matrix in Matrix Market format.
    scipy.io.mmwrite("binary_accessibility.mtx", binary_matrix)

  Create Mallet corpus file from binary accessibility matrix in Matrix Market format:

    $ pycistopic topic_modeling create_mallet_corpus -i "binary_accessibility.mtx" -o "corpus.mallet"
…rpus_file".

Add some basic sanity checking to "convert_binary_matrix_to_mallet_corpus_file".
Pass the correct alpha and eta to the loglikelihood function so it is the
same as the one that was actually used during topic modeling.
Remove Mallet text corpus after conversion to serialized corpus file.
Add LDAMalletFilenames class to automatically generate filenames
which will be used later in the new LDAMallet class.
Use --word-topic-counts-file instead of --output-state file when running Mallet.
The former directly generates the counts we need, with minimal post-processing.
The state file could also take ages to write and read for datasets with a lot
of regions, and could increase the runtime of topic modeling by 24 hours.

Changed `LDAMallet.train` to `LDAMallet.run_mallet_topic_modeling` and made
it a static method.

Reading of the output produced by Mallet will be added in further commits.
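
For illustration, the corresponding Mallet invocation could look like this
(placeholder filenames and topic count; both flags are standard Mallet
train-topics options):

    mallet train-topics \
      --input corpus.mallet \
      --num-topics 50 \
      --word-topic-counts-file word_topic_counts.txt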
ghuls added 21 commits August 27, 2024 15:01
Add "--verbose" parameter to `pycistopic topic_modeling` subcommands.
Update `pycistopic topic_modeling mallet` CLI code to use the new LDAMallet class.
Move "create_mallet_corpus" argument parsing code above "mallet"
argument parsing code, so when both get moved behind a new "mallet"
subcommand, the diff will be less noisy.
Create `pycistopic topic_modeling mallet` subparser and move Mallet related
subcommands under it:
  - `pycistopic topic_modeling create_mallet_corpus` => `pycistopic topic_modeling mallet create_corpus`
  - `pycistopic topic_modeling mallet` => `pycistopic topic_modeling mallet run`
Rework `run_cgs_model_mallet` to `calculate_model_evaluation_stats`
but start from precalculated Mallet output.
Removing the Mallet run from this function allows easier parallelization
of Mallet topic modeling.
Add `pycistopic topic_modeling mallet stats` subcommand, using
`calculate_model_evaluation_stats`.
Rename `binary_matrix` to `binary_accessibility_matrix`.
Also retrieve the Ensembl gene ID in `get_tss_annotation_from_ensembl`.
Use a dictionary for the schema, as required by Polars 1.xx.
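
A minimal sketch (hypothetical columns) of passing the schema as a dictionary:

    import polars as pl

    # Schema given as a dictionary mapping column names to dtypes.
    df = pl.DataFrame(
        {"chrom": ["chr1"], "start": [100]},
        schema={"chrom": pl.Utf8, "start": pl.Int64},
    )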
Add docstring to loglikelihood function to explain why we use that version.
Use `infer_schema=False` instead of `infer_schema_length=0` (polars >= 1.2.0).
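
For example (hypothetical filename), reading every column as strings without
schema inference:

    import polars as pl

    # Read all columns as strings instead of inferring dtypes
    # (the `infer_schema` keyword requires polars >= 1.2.0).
    df = pl.read_csv("barcodes.tsv", separator="\t", infer_schema=False)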
Add `get_nonzero_row_indices` function, which is an optimized version of:

    np.nonzero(np.count_nonzero(matrix, axis=1))[0]
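
A minimal sketch of what such an optimization could look like with Numba (the
PR's actual implementation may differ): stop scanning a row at its first
nonzero value instead of counting all nonzero values per row:

    import numba
    import numpy as np

    @numba.njit
    def get_nonzero_row_indices_sketch(matrix):
        # Indices of rows with at least one nonzero value; stop scanning
        # a row as soon as its first nonzero element is found.
        idx = np.empty(matrix.shape[0], dtype=np.int64)
        n = 0
        for i in range(matrix.shape[0]):
            for j in range(matrix.shape[1]):
                if matrix[i, j] != 0:
                    idx[n] = i
                    n += 1
                    break
        return idx[:n]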
Fix "*" imports in diff_features.py.
Add `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc` function
and helper functions:
  - calculate_partial_imputed_acc_sums_per_cell_for_requested_regions
  - calculate_per_region_mean_and_dispersion_on_imputed_acc_normalized_chunk

The output of `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`
will be used later as input for a modified `find_highly_variable_features`,
which will no longer need a normalized imputed accessibility matrix as input,
but just a list of region names, the mean of normalized imputed accessibility
per region, and the dispersion of normalized imputed accessibility per region.

The equivalent heavy calculations are now done by `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`,
which replaces calling `impute_accessibility` and `normalize_scores` and
calculating the mean and dispersion of normalized imputed accessibility
per region inside the current `find_highly_variable_features`.

Those calculations happen now in chunks and don't require excessive amounts
of RAM anymore.

For example, just calculating `impute_accessibility` for 1292285 regions x 1483207 cells
would require 7140.361 GiB (1292285 regions x 1483207 cells x 4 bytes per float32).
Calling `normalize_scores` would require the same, and getting the variance of
normalized imputed accessibility per region (needed for the dispersion
calculation) would require another one.

With this new implementation, the total memory usage is constant:

    chunk_size x n_cells x 4 bytes

where chunk_size is the number of regions that are processed at once.

For example, for 1292285 regions x 1483207 cells and different chunk sizes:
  - chunk_size: 20000 ==> 110.508 GiB
  - chunk_size: 2000 ==> 11.051 GiB
  - chunk_size: 1000 ==> 5.525 GiB
  - chunk_size: 500 ==> 2.763 GiB
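
A minimal sketch of the chunked pattern, assuming a hypothetical
`get_normalized_imputed_acc_chunk(start, end)` callback that produces the
normalized imputed accessibility for regions [start, end) as a
(end - start, n_cells) float32 array (the real helper functions differ):

    import numpy as np

    def chunked_region_mean_and_dispersion(
        get_normalized_imputed_acc_chunk, n_regions, n_cells, chunk_size=2000
    ):
        # Only one chunk_size x n_cells block is in memory at a time.
        means = np.zeros(n_regions, dtype=np.float64)
        dispersions = np.zeros(n_regions, dtype=np.float64)
        for start in range(0, n_regions, chunk_size):
            end = min(start + chunk_size, n_regions)
            chunk = get_normalized_imputed_acc_chunk(start, end)
            mean = chunk.mean(axis=1)
            var = chunk.var(axis=1)
            means[start:end] = mean
            # Dispersion as variance over mean (assumed definition; guard
            # against division by zero for all-zero regions).
            dispersions[start:end] = var / np.maximum(mean, 1e-12)
        return means, dispersions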
Use more descriptive variable names in `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`
and related functions, and improve comments and help text.
Remove unneeded `normalize_scores` function, as it is superseded
by `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`.
Update `find_highly_variable_features` to use the output of
`calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`.
Also remove the `save` argument and integrate it into the `plot` argument.
Add/update Numba-optimized versions related to the Wilcoxon test:
  - Add/fix typing of functions.
  - Add dimension checks.
  - Add additional numba optimized functions from `pycistarget/motif_enrichment_dem.py`:
    https://github.com/aertslab/pycistarget/blob/8cd78f110f4eeeb38d8f7006a1100a13d636d984/src/pycistarget/motif_enrichment_dem.py#L19-L124
    (should go in a separate package)
Change "wilcox" to "wilcoxon".
Add a longer explanation for the tss "--no-cache" option, as it might not
have been clear that a crash is likely caused by an invalid or incomplete
cached response from BioMart.
Catch likely BioMart request caching problems and give "--no-cache" hint.
SeppeDeWinter and others added 8 commits December 6, 2024 10:03
Use 1e-45 as pseudocount instead of 1e-100 to resolve rounding to 0
(in float32, 1e-100 underflows to 0.0, while 1e-45 is still representable as a subnormal).
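
A quick check of the underflow behavior, assuming float32 values:

    import numpy as np

    print(np.float32(1e-100))  # 0.0: underflows, so the pseudocount vanishes.
    print(np.float32(1e-45))   # ~1.4e-45: rounds to the nearest subnormal, stays nonzero.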
Add extra directories to ignore to .gitignore.
…s_from_file".

Use "read_fragments_to_pyranges" instead of deprecated "read_fragments_from_file".
…_heatmap".

Fix `variables=None` and `color_dictionary=None` cases in "cell_topic_heatmap" by:
  - sorting the input data only if variables is set.
  - generating a random color mapping if `color_dictionary=None`.
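
A minimal sketch (hypothetical helper, not the PR's exact code) of generating
a random color mapping when `color_dictionary=None`:

    import random

    def random_color_dictionary_sketch(labels, seed=555):
        # Map each label (e.g. cell topic or variable value) to a random hex color.
        rng = random.Random(seed)
        return {label: f"#{rng.randrange(0x1000000):06x}" for label in labels}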
Raise error when target is not "cell" or "region" in "find_clusters".
Fix "create_fragment_matrix_from_fragments" in case last (few)
region IDs do not overlap with any fragments by setting sparse
matrix dimensions. Otherwise the sparse matrix could have to
small dimensions for the number of returned region IDs.
The same could happen for CBs if CBs are given that are not
in the fragments file.
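
A minimal sketch (hypothetical counts) of why setting the shape explicitly
matters when building the sparse matrix with SciPy:

    import numpy as np
    import scipy.sparse

    rows = np.array([0, 1])    # region indices that overlap fragments
    cols = np.array([0, 2])    # cell barcode indices with counts
    data = np.array([3, 1])
    n_regions, n_cbs = 4, 5    # full dimensions, including empty trailing rows/columns

    # Without shape=(n_regions, n_cbs), the dimensions would be inferred as
    # 2 x 3, silently dropping trailing regions/CBs without any fragments.
    matrix = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(n_regions, n_cbs))
    assert matrix.shape == (n_regions, n_cbs)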