Polars 1xx #149

Open
wants to merge 65 commits into main
Conversation

ghuls (Member) commented Jul 16, 2024

No description provided.

ghuls added 30 commits June 10, 2024 17:22
Format all Python files with "ruff format".
Clean up code of get_barcodes_passing_qc_for_sample.
Update Polars syntax to the 1.0.0+ version and fix some type checking.
…polars_series".

Skip empty lines when reading barcode file in "read_barcodes_file_to_polars_series".
Add workaround for very rare cases in QC where the KDE for the duplication
ratio cannot be calculated because all duplication ratio values are 0.0%, as
the count for each fragment was 1.
Add mypy stub generation and mypy numpy typing plugin.

Later, stubs can be generated with:

    stubgen -o stubs -p pycisTopic
Only keep one copy of each barcode when reading barcode file in "read_barcodes_file_to_polars_series".
Replace df.columns with df.collect_schema().names() so it also works on lazy Polars DataFrames.
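
A minimal sketch (hypothetical columns) of the lazy-friendly pattern:

    import polars as pl

    lf = pl.LazyFrame({"chrom": ["chr1"], "start": [100]})

    # On a LazyFrame, collect_schema().names() resolves the schema without
    # collecting the data; `lf.columns` is discouraged in Polars 1.0+.
    column_names = lf.collect_schema().names()  # ["chrom", "start"]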
…ader" instead of "has_header".

Update "write_csv" usage to Polars syntax 1.0.0+ by using "include_header" instead of "has_header".
Do not use the "by=" keyword in group_by, as Polars 1.0.0+ treats that
as renaming the grouped column to "by".
Change "sort" and "partition_by" too, to not use the "by" keyword,
although those cases still work as before.
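
A minimal sketch (hypothetical columns) of the changed group_by call:

    import polars as pl

    df = pl.DataFrame({"chrom": ["chr1", "chr1"], "count": [1, 2]})

    # Polars 1.0.0+ would treat group_by(by="chrom") as renaming the grouped
    # column to "by"; pass the column name positionally instead.
    per_chrom = df.group_by("chrom").agg(pl.col("count").sum())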
Reformat TSS profile chained code by adding no-op "clone" calls.
Restrict version numbers for some dependencies.
Expose "engine" option in "pycistopic qc".
Support adding sample ID to cell barcodes when reading fragments or
cell barcode files by adding extra arguments to:
  - `read_fragments_to_polars_df`
  - `read_barcodes_file_to_polars_series`
Remove unused argument from `get_insert_size_distribution` docstring.
…ing_qc_for_sample".

Use greater than or equal for threshold filters in "get_barcodes_passing_qc_for_sample".
Change `pycistopic qc` to `pycistopic qc run`.
Add `pycistopic qc filter` to be able to filter cell barcodes based on QC stats.
…idx".

Fix "get_tss_profile" so it works both with "polars" and "polars-u64-idx".

Previously, None values were not set to 0.0 when run with "polars-u64-idx".
…h "polars" and "polars-u64-idx".

Cast some columns to pl.UInt32 so dataframes have the same schema with
both "polars" and "polars-u64-idx".
Add "create_fragment_matrix_from_fragments" to create directly a
sparse fragment matrix from a fragments file for consensus peaks
of interest. The new code uses a lot less memory as it never builds
a full dense matrix.

    import pycisTopic.cistopic_class
    import pycisTopic.fragments

    # Create fragments matrix for fragment file for consensus regions and
    # for cell barcodes selected after pycistopic QC.
    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        fragments_bed_filename="fragments_GSM7822226_MM_566.tsv.gz",
        regions_bed_filename="consensus_regions.bed",
        barcodes_tsv_filename="cbs_after_pycistopic_qc.tsv",
        blacklist_bed_filename="hg38-blacklist.v2.bed",  # Or None
    )

    # Define sample ID (project name).
    sample_id = "Sample1"

    # Create cisTopic object from sparse fragment matrix.
    cistopic_obj = pycisTopic.cistopic_class.create_cistopic_object(
        fragment_matrix=counts_fragments_matrix,
        cell_names=cbs,
        region_names=region_ids,
        path_to_fragments={sample_id: "fragments.tsv.gz"},
        project=sample_id
    )
…d_word_topics" in the future.

Add "create_regions_topics_frequency_matrix" function, which uses Mallet region topics
counts file as input, to replace "load_word_topics" and "get_topics" in the future.

"load_word_topics" uses Mallet state file, which can be huge if Mallet is run with
a lot of regions and cells as input.

Mallet state files are quite big (even several hunderds of GB) and take quite a bit
of time to be written to disk when Mallet is run with a lot of regions and cells.
Getting count values for each region-topic pair is quite memory intensive in the
current code too.
Mallet region topics counts files are much smaller and don't require much post
processing and time to be written to disk.
…pics_count_matrix".

Rename "create_regions_topics_frequency_matrix" to "create_regions_topics_count_matrix"
and keep the code for creating the frequency matrix in "get_topics" as both matrices
are needed.
Allow creating a Mallet serialized corpus file directly from a sparse matrix.
This roughly replaces LDAMallet.convert_corpus_to_mallet_corpus_file and
allows this new method to be used independently of the Mallet topic modeling
itself. As generating the Mallet serialized corpus only needs to be done once
before all topic modelings, being able to run it independently is a huge plus.
Expose creation of Mallet corpus file from pycistopic CLI interface:

    pycistopic topic_modeling create_mallet_corpus

Usage:

  Create binary accessibility matrix in Matrix Market format:

    import pycisTopic.fragments
    import scipy

    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        "fragments.tsv.gz",
        "consensus_regions.bed",
        "cbs.tsv"
    )

    # Create binary matrix:
    binary_matrix = counts_fragments_matrix.copy()
    binary_matrix.data.fill(1)

    # Write binary matrix in Matrix Market format.
    scipy.io.mmwrite("binary_accessibility.mtx", binary_matrix)

  Create Mallet corpus file from binary accessibility matrix in Matrix Market format:

    $ pycistopic topic_modeling create_mallet_corpus -i "binary_accessibility.mtx" -o "corpus.mallet"
…rpus_file".

Add some basic sanity checking to "convert_binary_matrix_to_mallet_corpus_file".
Pass the correct alpha and eta to the loglikelihood function so it is the
same as the one that was actually used during topic modeling.
Remove Mallet text corpus after conversion to serialized corpus file.
Add LDAMalletFilenames class to automatically generate filenames
which will be used later in the new LDAMallet class.
Use --word-topic-counts-file instead of --output-state file when running Mallet.
The former directly generates the counts we need, with minimal post-processing.
The state file could also take ages to write and read for datasets with a lot
of regions, and could increase the runtime of topic modeling by 24 hours.

Changed `LDAMallet.train` to `LDAMallet.run_mallet_topic_modeling` and made
it a static method.

Reading of the output produced by Mallet will be added in further commits.
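
For illustration, the corresponding Mallet invocation could look like this
(placeholder filenames and topic count; both flags are standard Mallet
train-topics options):

    mallet train-topics \
      --input corpus.mallet \
      --num-topics 50 \
      --word-topic-counts-file word_topic_counts.txt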
ghuls added 21 commits August 27, 2024 15:01
Add "--verbose" parameter to `pycistopic topic_modeling` subcommands.
Update `pycistopic topic_modeling mallet` CLI code to use the new LDAMallet class.
Move "create_mallet_corpus" argument parsing code above "mallet"
argument parsing code, so when both get moved behind a new "mallet"
subcommand, the diff will be less noisy.
Create `pycistopic topic_modeling mallet` subparser and move Mallet related
subcommands under it:
  - `pycistopic topic_modeling create_mallet_corpus` => `pycistopic topic_modeling mallet create_corpus`
  - `pycistopic topic_modeling mallet` => `pycistopic topic_modeling mallet run`
Rework `run_cgs_model_mallet` to `calculate_model_evaluation_stats`
but start from precalculated Mallet output.
Removing the Mallet run from this function allows easier parallelization
of Mallet topic modeling.
Add `pycistopic topic_modeling mallet stats` subcommand, using
`calculate_model_evaluation_stats`.
Rename `binary_matrix` to `binary_accessibility_matrix`.
Also retrieve the Ensembl gene ID in `get_tss_annotation_from_ensembl`.
Use a dictionary for the schema, as required by Polars 1.xx.
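
A minimal sketch (hypothetical columns) of passing the schema as a dictionary:

    import polars as pl

    # Schema given as a dictionary mapping column names to dtypes.
    df = pl.DataFrame(
        {"chrom": ["chr1"], "start": [100]},
        schema={"chrom": pl.Utf8, "start": pl.Int64},
    )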
Add docstring to loglikelihood function to explain why we use that version.
Use `infer_schema=False` instead of `infer_schema_length=0` (polars >= 1.2.0).
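
For example (hypothetical filename), reading every column as strings without
schema inference:

    import polars as pl

    # Read all columns as strings instead of inferring dtypes
    # (the `infer_schema` keyword requires polars >= 1.2.0).
    df = pl.read_csv("barcodes.tsv", separator="\t", infer_schema=False)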
Add `get_nonzero_row_indices` function, which is an optimized version of:

    np.nonzero(np.count_nonzero(matrix, axis=1))[0]
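
A minimal sketch of what such an optimization could look like with Numba (the
PR's actual implementation may differ): stop scanning a row at its first
nonzero value instead of counting all nonzero values per row:

    import numba
    import numpy as np

    @numba.njit
    def get_nonzero_row_indices_sketch(matrix):
        # Indices of rows with at least one nonzero value; stop scanning
        # a row as soon as its first nonzero element is found.
        idx = np.empty(matrix.shape[0], dtype=np.int64)
        n = 0
        for i in range(matrix.shape[0]):
            for j in range(matrix.shape[1]):
                if matrix[i, j] != 0:
                    idx[n] = i
                    n += 1
                    break
        return idx[:n]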
Fix "*" imports in diff_features.py.
Add `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc` function
and helper functions:
  - calculate_partial_imputed_acc_sums_per_cell_for_requested_regions
  - calculate_per_region_mean_and_dispersion_on_imputed_acc_normalized_chunk

The output of `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`
will be used later as input for a modified `find_highly_variable_features`,
which will no longer need a normalized imputed accessibility matrix as input,
but just a list of region names, the mean of normalized imputed accessibility
per region, and the dispersion of normalized imputed accessibility per region.

The equivalent heavy calculations are now done by `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`,
which replaces calling `impute_accessibility` and `normalize_scores` and
calculating the mean and dispersion of normalized imputed accessibility
per region inside the current `find_highly_variable_features`.

Those calculations happen now in chunks and don't require excessive amounts
of RAM anymore.

For example, just calculating `impute_accessibility` for 1292285 regions x 1483207 cells
would require 7140.361 GiB (1292285 regions x 1483207 cells x 4 bytes per float32).
Calling `normalize_scores` would require the same, and getting the variance of
normalized imputed accessibility per region (needed for the dispersion
calculation) would require another one.

With this new implementation, the total memory usage is constant:

    chunk_size x n_cells x 4 bytes

where chunk_size is the number of regions that are processed at once.

For example, for 1292285 regions x 1483207 cells and different chunk sizes:
  - chunk_size: 20000 ==> 110.508 GiB
  - chunk_size: 2000 ==> 11.051 GiB
  - chunk_size: 1000 ==> 5.525 GiB
  - chunk_size: 500 ==> 2.763 GiB
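
A minimal sketch of the chunked pattern, assuming a hypothetical
`get_normalized_imputed_acc_chunk(start, end)` callback that produces the
normalized imputed accessibility for regions [start, end) as a
(end - start, n_cells) float32 array (the real helper functions differ):

    import numpy as np

    def chunked_region_mean_and_dispersion(
        get_normalized_imputed_acc_chunk, n_regions, n_cells, chunk_size=2000
    ):
        # Only one chunk_size x n_cells block is in memory at a time.
        means = np.zeros(n_regions, dtype=np.float64)
        dispersions = np.zeros(n_regions, dtype=np.float64)
        for start in range(0, n_regions, chunk_size):
            end = min(start + chunk_size, n_regions)
            chunk = get_normalized_imputed_acc_chunk(start, end)
            mean = chunk.mean(axis=1)
            var = chunk.var(axis=1)
            means[start:end] = mean
            # Dispersion as variance over mean (assumed definition; guard
            # against division by zero for all-zero regions).
            dispersions[start:end] = var / np.maximum(mean, 1e-12)
        return means, dispersions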
Use more descriptive variable names in `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`
and related functions, and improve comments and help text.
Remove unneeded `normalize_scores` function, as it is superseded
by `calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`.
Update `find_highly_variable_features` to use the output of
`calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc`.
Also remove the `save` argument and integrate it into the `plot` argument.
Add/update Numba-optimized versions related to the Wilcoxon test:
  - Add/fix typing of functions.
  - Add dimension checks.
  - Add additional numba optimized functions from `pycistarget/motif_enrichment_dem.py`:
    https://github.com/aertslab/pycistarget/blob/8cd78f110f4eeeb38d8f7006a1100a13d636d984/src/pycistarget/motif_enrichment_dem.py#L19-L124
    (should go in a separate package)
Change "wilcox" to "wilcoxon".
Add a longer explanation for the tss "--no-cache" option, as it might not
have been clear that a crash is likely caused by an invalid or incomplete
cached response from BioMart.
Catch likely BioMart request caching problems and give "--no-cache" hint.
SeppeDeWinter and others added 8 commits December 6, 2024 10:03
Use 1e-45 as pseudocount instead of 1e-100 to resolve rounding to 0
(in float32, 1e-100 underflows to 0.0, while 1e-45 is still representable as a subnormal).
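
A quick check of the underflow behavior, assuming float32 values:

    import numpy as np

    print(np.float32(1e-100))  # 0.0: underflows, so the pseudocount vanishes.
    print(np.float32(1e-45))   # ~1.4e-45: rounds to the nearest subnormal, stays nonzero.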
Add extra directories to ignore to .gitignore.
…s_from_file".

Use "read_fragments_to_pyranges" instead of deprecated "read_fragments_from_file".
…_heatmap".

Fix `variables=None` and `color_dictionary=None` cases in "cell_topic_heatmap" by:
  - sorting the input data only if variables is set.
  - generating a random color mapping if `color_dictionary=None`.
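
A minimal sketch (hypothetical helper, not the PR's exact code) of generating
a random color mapping when `color_dictionary=None`:

    import random

    def random_color_dictionary_sketch(labels, seed=555):
        # Map each label (e.g. cell topic or variable value) to a random hex color.
        rng = random.Random(seed)
        return {label: f"#{rng.randrange(0x1000000):06x}" for label in labels}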
Raise error when target is not "cell" or "region" in "find_clusters".
Fix "create_fragment_matrix_from_fragments" in case last (few)
region IDs do not overlap with any fragments by setting sparse
matrix dimensions. Otherwise the sparse matrix could have to
small dimensions for the number of returned region IDs.
The same could happen for CBs if CBs are given that are not
in the fragments file.
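
A minimal sketch (hypothetical counts) of why setting the shape explicitly
matters when building the sparse matrix with SciPy:

    import numpy as np
    import scipy.sparse

    rows = np.array([0, 1])    # region indices that overlap fragments
    cols = np.array([0, 2])    # cell barcode indices with counts
    data = np.array([3, 1])
    n_regions, n_cbs = 4, 5    # full dimensions, including empty trailing rows/columns

    # Without shape=(n_regions, n_cbs), the dimensions would be inferred as
    # 2 x 3, silently dropping trailing regions/CBs without any fragments.
    matrix = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(n_regions, n_cbs))
    assert matrix.shape == (n_regions, n_cbs)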