Polars 1xx #149
Open
ghuls wants to merge 65 commits into main from polars_1xx
Format all Python files with "ruff format".
Clean up the code of get_barcodes_passing_qc_for_sample.
Update Polars syntax to version 1.0.0+ and fix some type checking.
…polars_series". Skip empty lines when reading barcode file in "read_barcodes_file_to_polars_series".
Add a workaround for very rare cases in QC where the KDE for the duplication ratio cannot be calculated, because all duplication ratio values are 0.0% as the fragment count for each fragment was 1.
Add mypy stub generation and the mypy numpy typing plugin. Stubs can later be generated with: stubgen -o stubs -p pycisTopic
Only keep one copy of each barcode when reading the barcode file in "read_barcodes_file_to_polars_series".
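A minimal sketch of the combined behavior of the two barcode-reading commits above (the filename and column name here are hypothetical, not the actual implementation):

    import polars as pl

    # Read one barcode per line, drop empty lines, and keep only the first
    # occurrence of each barcode.
    barcodes = (
        pl.read_csv(
            "barcodes.tsv",
            has_header=False,
            new_columns=["CB"],
        )
        .filter(pl.col("CB").is_not_null() & (pl.col("CB").str.strip_chars() != ""))
        .unique(maintain_order=True)
        .to_series()
    )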
Change df.columns to df.collect_schema().names() so it will work on lazy Polars dataframes.
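For example (a minimal sketch):

    import polars as pl

    lf = pl.LazyFrame({"chrom": ["chr1", "chr2"], "count": [1, 2]})

    # Accessing `.columns` on a LazyFrame resolves the schema implicitly and
    # warns in Polars 1.0.0+; collect_schema().names() does it explicitly and
    # also works on eager DataFrames.
    column_names = lf.collect_schema().names()
    print(column_names)  # ['chrom', 'count']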
…ader" instead of "has_header". Update "write_csv" usage to Polars syntax 1.0.0+ by using "include_header" instead of "has_header".
Do not use the "by=" keyword in group_by, as Polars 1.0.0+ treats that as renaming the grouped column to "by". Change "sort" and "partition_by" too, so they do not use the "by" keyword, although those cases still work as before.
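For example (data is hypothetical):

    import polars as pl

    df = pl.DataFrame({"chrom": ["chr1", "chr1", "chr2"], "count": [1, 2, 3]})

    # Polars 1.0.0+ interprets group_by(by="chrom") as a named expression
    # (pl.col("chrom").alias("by")), renaming the grouped column to "by".
    # Pass the column(s) positionally instead.
    per_chrom = df.group_by("chrom").agg(pl.col("count").sum())

    # "sort" and "partition_by" take positional columns as well.
    sorted_df = df.sort("chrom", "count")
    per_chrom_dfs = df.partition_by("chrom")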
Reformat TSS profile chained code by adding no-op "clone" calls.
Restrict version numbers for some dependencies.
Expose "engine" option in "pycistopic qc".
Support adding the sample ID to cell barcodes when reading fragment or cell barcode files by adding extra arguments to:
- `read_fragments_to_polars_df`
- `read_barcodes_file_to_polars_series`
Remove unused argument from `get_insert_size_distribution` docstring.
…ing_qc_for_sample". Use greater than or equal for threshold filters in "get_barcodes_passing_qc_for_sample".
Change `pycistopic qc` to `pycistopic qc run`.
Add `pycistopic qc filter` to be able to filter cell barcodes based on QC stats.
…idx". Fix "get_tss_profile" so it works both with "polars" and "polars-u64-idx". Before it was not setting None values to 0.0 when run with "polars-u64-idx".
…h "polars" and "polars-u64-idx". Fix some columns to pl.UInt32 so dataframes have same schema both with "polars" and "polars-u64-idx".
Add "create_fragment_matrix_from_fragments" to directly create a sparse fragment matrix from a fragments file for consensus peaks of interest. The new code uses a lot less memory as it never builds a full dense matrix.

    import pycisTopic.cistopic_class
    import pycisTopic.fragments

    # Create fragment matrix from a fragments file for consensus regions and
    # for cell barcodes selected after pycisTopic QC.
    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        fragments_bed_filename="fragments_GSM7822226_MM_566.tsv.gz",
        regions_bed_filename="consensus_regions.bed",
        barcodes_tsv_filename="cbs_after_pycistopic_qc.tsv",
        blacklist_bed_filename="hg38-blacklist.v2.bed",  # Or None.
    )

    # Define sample ID (project name).
    sample_id = "Sample1"

    # Create cisTopic object from sparse fragment matrix.
    cistopic_obj = pycisTopic.cistopic_class.create_cistopic_object(
        fragment_matrix=counts_fragments_matrix,
        cell_names=cbs,
        region_names=region_ids,
        path_to_fragments={sample_id: "fragments.tsv.gz"},
        project=sample_id,
    )
…d_word_topics" in the future. Add "create_regions_topics_frequency_matrix" function, which uses Mallet region topics counts file as input, to replace "load_word_topics" and "get_topics" in the future. "load_word_topics" uses Mallet state file, which can be huge if Mallet is run with a lot of regions and cells as input. Mallet state files are quite big (even several hunderds of GB) and take quite a bit of time to be written to disk when Mallet is run with a lot of regions and cells. Getting count values for each region-topic pair is quite memory intensive in the current code too. Mallet region topics counts files are much smaller and don't require much post processing and time to be written to disk.
…pics_count_matrix". Rename "create_regions_topics_frequency_matrix" to "create_regions_topics_count_matrix" and keep the code for creating the frequency matrix in "get_topics" as both matrices are needed.
Allow creating a Mallet serialized corpus file directly from a sparse matrix. This roughly replaces LDAMallet.convert_corpus_to_mallet_corpus_file and allows this new method to be used independently of the Mallet topic modeling itself. As generating the Mallet serialized corpus only needs to be done once before all topic modelings, being able to run it independently is a huge plus.
Expose creation of the Mallet corpus file from the pycistopic CLI interface: pycistopic topic_modeling create_mallet_corpus

Usage:

Create a binary accessibility matrix in Matrix Market format:

    import scipy.io

    import pycisTopic.fragments

    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        "fragments.tsv.gz",
        "consensus_regions.bed",
        "cbs.tsv",
    )

    # Create binary matrix.
    binary_matrix = counts_fragments_matrix.copy()
    binary_matrix.data.fill(1)

    # Write binary matrix in Matrix Market format.
    scipy.io.mmwrite("binary_accessibility.mtx", binary_matrix)

Create the Mallet corpus file from the binary accessibility matrix in Matrix Market format:

    $ pycistopic topic_modeling create_mallet_corpus -i "binary_accessibility.mtx" -o "corpus.mallet"
…rpus_file". Add some basic sanity checking to "convert_binary_matrix_to_mallet_corpus_file".
Pass the correct alpha and eta to the loglikelihood function so it is the same as the one that was actually used during topic modeling.
Remove Mallet text corpus after conversion to serialised corpus file.
Add LDAMalletFilenames class to automatically generate filenames which will be used later in the new LDAMallet class.
Use --word-topic-counts-file instead of --output-state file when running Mallet. The former directly generates the counts we need, with minimal post-processing. The state file could also take ages to write and read for datasets with a lot of regions, and could increase the runtime of topic modeling by 24 hours. Changed `LDAMallet.train` to `LDAMallet.run_mallet_topic_modeling` and made it a static method. Reading of the output produced by Mallet will be added in further commits.
Add "--verbose" parameter to `pycistopic topic_modeling` subcommands.
Update `pycistopic topic_modeling mallet` CLI code to use the new LDAMallet class.
Move "create_mallet_corpus" argument parsing code above "mallet" argument parsing code, so when both get moved behind a new "mallet" subcommand, the diff will be less noisy.
Create a `pycistopic topic_modeling mallet` subparser and move Mallet-related subcommands under it:
- `pycistopic topic_modeling create_mallet_corpus` => `pycistopic topic_modeling mallet create_corpus`
- `pycistopic topic_modeling mallet` => `pycistopic topic_modeling mallet run`
Rework `run_cgs_model_mallet` into `calculate_model_evaluation_stats`, which starts from precalculated Mallet output. Removing the running of Mallet from this function allows easier parallelization of Mallet topic modeling.
Add `pycistopic topic_modeling mallet stats` subcommand, using `calculate_model_evaluation_stats`.
Rename `binary_matrix` to `binary_accessibility_matrix`.
Also retrieve the Ensembl gene ID in `get_tss_annotation_from_ensembl`.
Use a dictionary for the schema, as required for Polars 1.xx.
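For example (a minimal sketch; the column names here are hypothetical):

    import polars as pl

    # Schema given as a dictionary mapping column names to dtypes, as the
    # commit describes for Polars 1.xx.
    df = pl.DataFrame(
        {"chrom": ["chr1"], "start": [100], "end": [200]},
        schema={"chrom": pl.Utf8, "start": pl.Int64, "end": pl.Int64},
    )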
Add docstring to loglikelihood function to explain why we use that version.
Use `infer_schema=False` instead of `infer_schema_length=0` (polars >= 1.2.0).
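For example (filename hypothetical):

    import polars as pl

    # polars >= 1.2.0: infer_schema=False replaces infer_schema_length=0 and
    # reads every column as pl.String.
    df = pl.read_csv("fragments.tsv", separator="\t", infer_schema=False)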
Add `get_nonzero_row_indices` function, which is an optimized version of:

    np.nonzero(np.count_nonzero(matrix, axis=1))[0]
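The numpy semantics, for reference:

    import numpy as np

    matrix = np.array([[0, 0], [1, 0], [0, 2]])

    # Indices of rows that contain at least one nonzero element.
    rows_with_nonzero = np.nonzero(np.count_nonzero(matrix, axis=1))[0]
    print(rows_with_nonzero)  # [1 2]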
Fix "*" imports in diff_features.py.
Add `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc` function and helper functions:
- calculate_partial_imputed_acc_sums_per_cell_for_requested_regions
- calculate_per_region_mean_and_dispersion_on_imputed_acc_normalized_chunk

The output of `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc` will be used later as input for a modified `find_highly_variable_features`, which will no longer need a normalized imputed accessibility matrix as input, but just a list of region names, the mean of normalized imputed accessibility per region, and the dispersion of normalized imputed accessibility per region.

The equivalent heavy calculations are now done by `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`, which replaces calling `impute_accessibility` and `normalize_scores` and calculating the per-region mean and dispersion of normalized imputed accessibility inside the current `find_highly_variable_features`. Those calculations now happen in chunks and no longer require excessive amounts of RAM. For example, just calculating `impute_accessibility` for 1292285 regions x 1483207 cells would require 7140.361 GiB (1292285 regions x 1483207 cells x 4 bytes per float32). Calling `normalize_scores` would require the same, and getting the variance of normalized imputed accessibility per region (needed for the dispersion calculation) would require another one.

With this new implementation, the total memory usage is constant: chunk_size x n_cells x 4 bytes, where chunk_size is the number of regions that are processed at once. For example, for 1292285 regions x 1483207 cells and different chunk sizes:
- chunk_size: 20000 ==> 110.508 GiB
- chunk_size: 2000 ==> 11.051 GiB
- chunk_size: 1000 ==> 5.525 GiB
- chunk_size: 500 ==> 2.763 GiB
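A minimal sketch of the chunking idea (not the actual implementation; the chunk provider and the dispersion formula here are assumptions):

    import numpy as np

    def per_region_mean_and_dispersion(get_normalized_chunk, n_regions, chunk_size):
        # get_normalized_chunk(start, end) is a hypothetical callable returning
        # the normalized imputed accessibility for regions [start, end) as a
        # (end - start) x n_cells float32 array, so peak memory stays at
        # chunk_size x n_cells x 4 bytes.
        means, dispersions = [], []
        for start in range(0, n_regions, chunk_size):
            end = min(start + chunk_size, n_regions)
            chunk = get_normalized_chunk(start, end)
            mean = chunk.mean(axis=1)
            var = chunk.var(axis=1)
            means.append(mean)
            # Dispersion as variance over mean (a common definition; the exact
            # formula used by pycisTopic may differ).
            dispersions.append(var / np.maximum(mean, np.float32(1e-12)))
        return np.concatenate(means), np.concatenate(dispersions)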
Use more descriptive variable names in `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc` and related functions, and improve comments and help text.
Remove unneeded `normalize_scores` function, as it is superseded by `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`.
Update `find_highly_variable_features` to use the output of `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`. Also remove the `save` argument and integrate it into the `plot` argument.
Add/update numba-optimized versions related to the Wilcoxon test:
- Add/fix typing of functions.
- Add dimension checks.
- Add additional numba-optimized functions from `pycistarget/motif_enrichment_dem.py`: https://github.com/aertslab/pycistarget/blob/8cd78f110f4eeeb38d8f7006a1100a13d636d984/src/pycistarget/motif_enrichment_dem.py#L19-L124 (should go in a separate package)
Change "wilcox" to "wilcoxon".
Add a longer explanation for the tss "--no-cache" option, as it might not have been super clear that a crash is likely caused by an invalid/incomplete cached response from BioMart.
Catch likely BioMart request caching problems and give "--no-cache" hint.
Use 1e-45 as pseudocount instead of 1e-100 to resolve rounding to 0.
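Likely a float32 representability issue (a quick check):

    import numpy as np

    # 1e-100 underflows to 0.0 in float32, defeating the pseudocount;
    # 1e-45 is still representable (as a subnormal).
    print(np.float32(1e-100))  # 0.0
    print(np.float32(1e-45))   # ~1.4e-45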
Add extra directories to ignore to .gitignore.
…s_from_file". Use "read_fragments_to_pyranges" instead of deprecated "read_fragments_from_file".
…_heatmap". Fix `variables=None` and `color_dictionary=None` cases in "cell_topic_heatmap" by: - sorting the input data only if variables is set. - generating a random color mapping if `color_dictionary=None`.
Raise error when target is not "cell" or "region" in "find_clusters".
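A hypothetical sketch of the added check (helper name invented for illustration):

    def _check_target(target: str) -> None:
        # Validation as described for "find_clusters".
        if target not in ("cell", "region"):
            raise ValueError(f'target must be "cell" or "region", not "{target}".')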
Fix "create_fragment_matrix_from_fragments" in case the last (few) region IDs do not overlap with any fragments, by setting the sparse matrix dimensions. Otherwise the sparse matrix could have too small dimensions for the number of returned region IDs. The same could happen for CBs, if CBs are given that are not in the fragments file.