Process jump single cell data #56

Merged

gwaybio merged 10 commits into WayScience:main from gwaybio:process-jump-single-cell

Feb 21, 2024

Member

gwaybio commented Feb 14, 2024

The notebook and associated files load the JUMP single-cell results (KS tests) from https://github.com/WayScience/JUMP-single-cell/tree/main/3.analyze_data and perform three operations:

Briefly explores the data
Outputs the top 10 results per phenotype, per treatment type, per model type for a focused exploration and results reporting
Outputs a wide format phenotype profile per model type

These results are important for the manuscript and for adding to a visualization I started working on in #55
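
To make the second and third operations concrete, here is a minimal sketch on toy data. The columns `phenotype`, `Metadata_model_type`, and `comparison_metric_value` appear in the diff of this PR; the `treatment_type` and `treatment` columns, the toy values, and the choice to rank by the largest metric value are assumptions for illustration, not necessarily what the notebook does.

```python
import pandas as pd

# Toy long-format results (all values are made up for illustration)
toy_df = pd.DataFrame({
    "Metadata_model_type": ["final"] * 4 + ["shuffled"] * 4,
    "treatment_type": ["compound"] * 8,
    "treatment": ["A", "B", "A", "B"] * 2,
    "phenotype": ["pheno_x", "pheno_x", "pheno_y", "pheno_y"] * 2,
    "comparison_metric_value": [0.9, 0.2, 0.4, 0.7, 0.1, 0.3, 0.2, 0.1],
})

# Top N results per phenotype, per treatment type, per model type
# (the PR reports the top 10; N=1 keeps the toy output small)
top_n = 1
top_results_df = (
    toy_df
    .sort_values("comparison_metric_value", ascending=False)
    .groupby(["phenotype", "treatment_type", "Metadata_model_type"], sort=False)
    .head(top_n)
    .reset_index(drop=True)
)

# Wide-format phenotype profile for one model type
wide_final_df = (
    toy_df
    .query("Metadata_model_type == 'final'")
    .pivot(
        index=["treatment_type", "treatment"],
        columns="phenotype",
        values="comparison_metric_value",
    )
    .reset_index()
)

print(top_results_df)
print(wide_final_df)
```

Sorting first and then calling `groupby(...).head(n)` keeps the chain readable and leaves the selected rows in ranked order.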

gwaybio added 3 commits

February 14, 2024 06:37


          add parquet support to env

b6df73b


          add notebook to process jump phenotype profiles

1ba89fa


          add jump results and profiles

095d2ec

gwaybio requested a review from MattsonCam

February 14, 2024 13:41

Member Author

gwaybio commented Feb 14, 2024

@MattsonCam - I think you're the best to review this - you performed the KS test analysis and are familiar with the JUMP dataset. Please review when you are able. Thanks!

gwaybio added 2 commits

February 15, 2024 05:41


          integrate extended jump metadata

793ab43


          add updated jump results

e25b99f

Member Author

gwaybio commented Feb 15, 2024

FYI - I integrated the extended JUMP metadata in the two most recent commits. We probably should mention this in the JUMP-single-cell repo somewhere, but given that the time points and cell lines were all independent plates (and we performed our KS-test analysis per plate) we do not need to rerun any analysis in the JUMP-single-cell repo. Thanks!

gwaybio added 3 commits

February 19, 2024 13:20


          add umap

3a79577


          add umap coordinates and update other files after rerunning

232dc39


          add umap to env

eee6af6

Member Author

gwaybio commented Feb 19, 2024

I added a UMAP fit of phenotypic profile probabilities in the last set of commits. Thanks!

MattsonCam approved these changes

Member

MattsonCam left a comment

Great job @gwaybio! I left some comments. Overall, it LGTM!

3.evaluate_model/scripts/nbconverted/process_jump_phenotype_profiles.py

+              # ## Process JUMP phenotypic profiles
+              #
+              # We applied the AreaShape only class-balanced multiclass elastic net logistic regression model to all single-cell profiles in the JUMP dataset.

Member

MattsonCam Feb 20, 2024

Like the documentation here.

3.evaluate_model/scripts/nbconverted/process_jump_phenotype_profiles.py

+              # ## Process JUMP phenotypic profiles
+              #
+              # We applied the AreaShape only class-balanced multiclass elastic net logistic regression model to all single-cell profiles in the JUMP dataset.
+              #

Member

MattsonCam Feb 20, 2024

Side note: I know it's possible to reference part of another document in LaTeX. Maybe this could also be accomplished in Markdown to reference the README. I'm not sure how well it would apply to the nbconverted Python file; just a thought.

Member Author

gwaybio Feb 21, 2024

Hmm, good point. To be more specific, I will reference the README section in the 3.analyze_data module: https://github.com/WayScience/JUMP-single-cell/tree/main/3.analyze_data#analyze-predicted-probabilities

3.evaluate_model/scripts/nbconverted/process_jump_phenotype_profiles.py Outdated

+              # 2) JUMP additional metadata needed to summarize/groupby results
+              jump_metadta_commit = "a18fd7719c05b638c731142b0d42a92c645e2b33"
+              jump_metadta_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"

Member

MattsonCam Feb 20, 2024

This is nitpicky, but you could also add the missing "a" in "metadta"

Member Author

gwaybio Feb 21, 2024

yes! thanks for catching this

3.evaluate_model/scripts/nbconverted/process_jump_phenotype_profiles.py Outdated

+              jump_metadta_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"
+              jump_metadta_file = "benchmark/output/experiment-metadata.tsv"
+              jump_metadata_full_file = f"{jump_metadta_url}/{jump_metadta_commit}/{jump_metadta_file}"

Member

MattsonCam Feb 20, 2024

This is also nitpicky, but you could combine the strings directly instead of combining the variables in a formatted string

Member Author

gwaybio Feb 21, 2024

Thanks for the suggestion, but I prefer how readable the formatted string is.

3.evaluate_model/scripts/nbconverted/process_jump_phenotype_profiles.py Outdated

Comment on lines 101 to 108

+              # Merge dataframes and retain only informative columns
+              jump_pred_df = (
+                  jump_pred_df
+                  .merge(
+                      jump_metadata_df,
+                      left_on="Metadata_Plate",
+                      right_on="Assay_Plate_Barcode"
+                  )

Member

MattsonCam Feb 20, 2024

If preferred, you could also use the most recent jump probability comparisons to acquire the additional experimental metadata

Member Author

gwaybio Feb 21, 2024

Yes, I think it makes sense to use this version now - thanks for the pointer!

BTW, I took a look at your most recent PR to add this info. Nice work! I would also recommend adding a pointer indicating the provenance of the experiment-metadata.tsv https://github.com/WayScience/JUMP-single-cell/pull/21/files#diff-6f3ca646908f89153386be951563a61449e8f5df46549d224ff98b74d6aab859

You currently include this file in the repo, but someone might not know this file's origin. I'm adding my details below (I'll remove them in the next commit) in case you decide to incorporate them in the JUMP-single-cell repo in some form (maybe in a README as a quick note)

# 2) JUMP additional metadata needed to summarize/groupby results
jump_metadata_commit = "a18fd7719c05b638c731142b0d42a92c645e2b33"

jump_metadata_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"
jump_metadata_file = "benchmark/output/experiment-metadata.tsv"
jump_metadata_full_file = f"{jump_metadata_url}/{jump_metadata_commit}/{jump_metadata_file}"

# Load JUMP metadata for JUMP-CP Pilot
# For an explanation of these metadata columns see: 
# https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/blob/9edd26d60524a62f993d4df40a5d8908206714f5/README.md#batch-and-plate-metadata
jump_metadata_df = (
    pd.read_csv(jump_metadata_full_file, sep="\t")
    .query("Batch == '2020_11_04_CPJUMP1'")
)

print(jump_metadata_df.shape)
jump_metadata_df.head()

Member Author

gwaybio Feb 21, 2024

Also, I noticed that there are 720 additional rows in the new JUMP-single-cell parquet file (compared to the way I add the metadata info in this PR). I couldn't track down why, but it probably has to do with my merge somehow dropping rows. I don't think there's anything to do with this info, just briefly noting here.
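
One possible way to check whether a merge is silently dropping rows is sketched below. It is not a diagnosis of the 720-row difference, whose cause is unknown. The key columns `Metadata_Plate` and `Assay_Plate_Barcode` come from the diff above; the plate barcodes, the `Cell_type` column, and all values are toy assumptions. The sketch relies on the fact that `DataFrame.merge` defaults to an inner join, which discards rows without a matching key.

```python
import pandas as pd

# Toy predictions: one plate ("BR003") has no metadata entry
pred_df = pd.DataFrame({
    "Metadata_Plate": ["BR001", "BR001", "BR002", "BR003"],
    "comparison_metric_value": [0.1, 0.2, 0.3, 0.4],
})

# Toy plate-level metadata (hypothetical columns and values)
metadata_df = pd.DataFrame({
    "Assay_Plate_Barcode": ["BR001", "BR002"],
    "Cell_type": ["A549", "U2OS"],
})

# A left join keeps every prediction row; indicator=True adds a
# "_merge" column that records whether each row found a match.
# validate="many_to_one" raises if a plate barcode is duplicated in
# the metadata, which would otherwise multiply rows.
merged_df = pred_df.merge(
    metadata_df,
    left_on="Metadata_Plate",
    right_on="Assay_Plate_Barcode",
    how="left",
    indicator=True,
    validate="many_to_one",
)

# Rows that an inner join would have dropped
unmatched_df = merged_df.loc[merged_df["_merge"] == "left_only"]
print(unmatched_df["Metadata_Plate"].value_counts())
```

Comparing `len(merged_df)` against `len(pred_df)` and listing the unmatched plates usually narrows down where the rows went.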

3.evaluate_model/scripts/nbconverted/process_jump_phenotype_profiles.py

Comment on lines +195 to +218

+              jump_wide_final_df = (
+                  jump_pred_df
+                  .query("Metadata_model_type == 'final'")
+                  .drop(columns=["p_value"])
+                  .pivot(index=metadata_columns, columns="phenotype", values="comparison_metric_value")
+                  .reset_index()
+              )
+              jump_wide_final_df.to_csv(final_jump_phenotype_file, sep="\t", index=False)
+              print(jump_wide_final_df.shape)
+              jump_wide_final_df.head()
+              # In[14]:
+              jump_wide_shuffled_df = (
+                  jump_pred_df
+                  .query("Metadata_model_type == 'shuffled'")
+                  .drop(columns=["p_value"])
+                  .pivot(index=metadata_columns, columns="phenotype", values="comparison_metric_value")
+                  .reset_index()
+              )

Member

MattsonCam Feb 20, 2024

Could also consider creating a function here

Member Author

gwaybio Feb 21, 2024

I generally avoid functions for pandas chaining if possible. Personally, when I'm reading code, I like to see the pandas chain directly in front of me rather than having to reference a function defined earlier (or even imported from a different file). If I were performing this operation more times (let's say over 3 times), then I would more strongly consider a function. However, since I'm only doing this twice, I will keep as is. Thanks!

3.evaluate_model/scripts/nbconverted/process_jump_phenotype_profiles.py Outdated

Comment on lines 241 to 267

+              # Initialize UMAP
+              umap_fit = umap.UMAP(random_state=umap_random_seed, n_components=umap_n_components)
+              # Fit UMAP and convert to pandas DataFrame
+              embeddings = pd.DataFrame(
+                  umap_fit.fit_transform(jump_wide_final_df.loc[:, feature_columns]),
+                  columns=[f"UMAP{x}" for x in range(0, umap_n_components)],
+              )
+              # Combine with metadata
+              umap_with_metadata_df = pd.concat([jump_wide_final_df.loc[:, metadata_columns], embeddings], axis=1).assign(model_type="final")
+              # In[17]:
+              # Initialize UMAP
+              umap_fit = umap.UMAP(random_state=umap_random_seed, n_components=umap_n_components)
+              # Fit UMAP and convert to pandas DataFrame
+              embeddings = pd.DataFrame(
+                  umap_fit.fit_transform(jump_wide_shuffled_df.loc[:, feature_columns]),
+                  columns=[f"UMAP{x}" for x in range(0, umap_n_components)],
+              )
+              # Combine with metadata
+              umap_shuffled_with_metadata_df = pd.concat([jump_wide_shuffled_df.loc[:, metadata_columns], embeddings], axis=1).assign(model_type="shuffled")

Member

MattsonCam Feb 20, 2024

Could create a function here as well

Member Author

gwaybio Feb 21, 2024

Thanks! For this one I will create a function :) There is a lot of redundant code here which could easily be made into a function. Also, the inner workings of this function are less important to know about than in the pandas chaining example.
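
A rough sketch of what such a function might look like follows; it is not the function that was actually committed. The name `embed_with_metadata` and its signature are hypothetical. So that the sketch runs without `umap-learn` installed, it accepts any object with a `fit_transform` method and demonstrates with a trivial stand-in reducer that is clearly not UMAP; in the notebook one would presumably pass `umap.UMAP(random_state=umap_random_seed, n_components=umap_n_components)` as in the diff.

```python
import numpy as np
import pandas as pd


class FirstColumnsReducer:
    """Stand-in reducer (NOT UMAP): keeps the first n_components columns."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit_transform(self, X):
        return np.asarray(X)[:, : self.n_components]


def embed_with_metadata(
    profile_df, feature_columns, metadata_columns, model_type, reducer, n_components
):
    """Fit the reducer on the feature columns and attach metadata columns."""
    embeddings = pd.DataFrame(
        reducer.fit_transform(profile_df.loc[:, feature_columns]),
        columns=[f"UMAP{x}" for x in range(n_components)],
        index=profile_df.index,  # keep rows aligned for the concat below
    )
    return pd.concat(
        [profile_df.loc[:, metadata_columns], embeddings], axis=1
    ).assign(model_type=model_type)


# Toy wide-format profile (hypothetical column names and values)
toy_wide_df = pd.DataFrame({
    "treatment": ["A", "B", "C"],
    "pheno_1": [0.1, 0.5, 0.9],
    "pheno_2": [0.2, 0.4, 0.6],
    "pheno_3": [0.3, 0.3, 0.3],
})

result_df = embed_with_metadata(
    profile_df=toy_wide_df,
    feature_columns=["pheno_1", "pheno_2", "pheno_3"],
    metadata_columns=["treatment"],
    model_type="final",
    reducer=FirstColumnsReducer(n_components=2),
    n_components=2,
)
print(result_df)
```

Calling the function once for the final model and once for the shuffled model replaces the two near-identical cells quoted above.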

phenotypic_profiling_env.yml

Comment on lines +9 to +11

+                - conda-forge::umap-learn
+                - conda-forge::fastparquet
+                - conda-forge::pyarrow

Member

MattsonCam Feb 20, 2024

I'm not sure how this would affect the environment now or in the future, but you could consider adding version limits

Member Author

gwaybio Feb 21, 2024

yes, this is a good point... this environment file is super fragile as is. Given the project's scope (proof of concept), type (analysis repository), and relatively uncertain future plans, I think the cost of solidifying the environment outweighs the benefits. If anything changes, we can revisit this decision in the future (which will be painful 😵 )
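
If the decision is revisited, one low-effort starting point could be to record the versions installed in a working environment and turn them into pinned entries. The sketch below uses only the standard library. The helper names are hypothetical, and it assumes the installed distribution names match the conda package names, which is not always the case.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def conda_pin_lines(versions, channel="conda-forge"):
    """Format 'channel::package=version' entries; leave unpinned if unknown."""
    lines = []
    for name, ver in versions.items():
        spec = f"{channel}::{name}" if ver is None else f"{channel}::{name}={ver}"
        lines.append(f"  - {spec}")
    return lines


# The three packages added to the environment file in this PR
versions = installed_versions(["umap-learn", "fastparquet", "pyarrow"])
print("\n".join(conda_pin_lines(versions)))
```

The printed lines can be pasted into the environment file and loosened by hand where an exact pin is too strict.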

gwaybio added 2 commits

February 21, 2024 06:29


          response to PR comments

b50bb6c


          rerun notebook to update jump datasets

9645f53

also add comparison across cell types

Member Author

gwaybio commented Feb 21, 2024

Thanks for the thorough review @MattsonCam ! I've addressed all of your comments. I've also added a new output file that summarizes the replicate information (using mean) and pivots the pandas dataframe to compare results across cell types and time points (a minor update). Merging now!
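
For reference, the replicate summary described here can be sketched as a groupby mean followed by a pivot. The column names `Cell_type` and `Time`, the toy values, and the flattened output column naming are assumptions; the actual output file may be organized differently.

```python
import pandas as pd

# Toy long-format results with two replicates per condition (made-up values)
toy_df = pd.DataFrame({
    "treatment": ["A"] * 4 + ["B"] * 4,
    "Cell_type": ["A549", "A549", "U2OS", "U2OS"] * 2,
    "Time": [24] * 8,
    "phenotype": ["pheno_x"] * 8,
    "comparison_metric_value": [0.2, 0.4, 0.6, 0.8, 0.1, 0.3, 0.5, 0.7],
})

# Average replicates, then spread cell type and time point across columns
summary_df = (
    toy_df
    .groupby(["treatment", "phenotype", "Cell_type", "Time"], as_index=False)
    ["comparison_metric_value"]
    .mean()
    .pivot(
        index=["treatment", "phenotype"],
        columns=["Cell_type", "Time"],
        values="comparison_metric_value",
    )
)

# Flatten the two-level column index into single names such as "A549_24"
summary_df.columns = [f"{cell}_{time}" for cell, time in summary_df.columns]
summary_df = summary_df.reset_index()
print(summary_df)
```

Each row then holds one treatment and phenotype, with one column per cell type and time point, which makes side-by-side comparison straightforward.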

gwaybio merged commit b40e626 into WayScience:main

gwaybio deleted the process-jump-single-cell branch

February 21, 2024 13:32
