Process JUMP single-cell data #56
Conversation
@MattsonCam - I think you're the best to review this - you performed the KS test analysis and are familiar with the JUMP dataset. Please review when you are able. Thanks!
FYI - I integrated the extended JUMP metadata in the two most recent commits. We probably should mention this in the JUMP-single-cell repo somewhere, but given that the time points and cell lines were all independent plates (and we performed our KS-test analysis per plate) we do not need to rerun any analysis in the JUMP-single-cell repo. Thanks!
I added a UMAP fit of phenotypic profile probabilities in the last set of commits. Thanks!
Great job @gwaybio! I left some comments. Overall, it LGTM!
# ## Process JUMP phenotypic profiles
#
# We applied the AreaShape only class-balanced multiclass elastic net logistic regression model to all single-cell profiles in the JUMP dataset.
I like the documentation here.
Side note: I know it's possible to reference part of another document in LaTeX. Maybe this could also be accomplished in markdown to reference the README. Not sure how well it would apply to the nbconverted Python file, just a thought.
Hmm, good point. To be more specific, I will reference the README section in the 3.analyze_data module: https://github.com/WayScience/JUMP-single-cell/tree/main/3.analyze_data#analyze-predicted-probabilities
# 2) JUMP additional metadata needed to summarize/groupby results
jump_metadta_commit = "a18fd7719c05b638c731142b0d42a92c645e2b33"

jump_metadta_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"
This is nitpicky, but you could also add the missing "a" in "metadta".
yes! thanks for catching this
jump_metadta_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"
jump_metadta_file = "benchmark/output/experiment-metadata.tsv"

jump_metadata_full_file = f"{jump_metadta_url}/{jump_metadta_commit}/{jump_metadta_file}"
This is also nitpicky, but you could combine the strings directly instead of combining the variables in an f-string.
Thanks for the suggestion, but I prefer how readable the formatted string is.
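For illustration, the two options being discussed look roughly like this (a sketch only; the join variant is just one rendering of the reviewer's suggestion and is not in the PR, and the variable names are shown without the typo discussed above):

# Option kept in the PR: an f-string composes the full URL from the three variables
jump_metadata_full_file = f"{jump_metadata_url}/{jump_metadata_commit}/{jump_metadata_file}"

# Suggested alternative (sketch): join the string pieces directly
jump_metadata_full_file = "/".join([jump_metadata_url, jump_metadata_commit, jump_metadata_file])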
# Merge dataframes and retain only informative columns
jump_pred_df = (
    jump_pred_df
    .merge(
        jump_metadata_df,
        left_on="Metadata_Plate",
        right_on="Assay_Plate_Barcode"
    )
If preferred, you could also use the most recent JUMP probability comparisons to acquire the additional experimental metadata.
Yes, I think it makes sense to use this version now - thanks for the pointer!
BTW, I took a look at your most recent PR to add this info. Nice work! I would also recommend adding a pointer indicating the provenance of the experiment-metadata.tsv file:
https://github.com/WayScience/JUMP-single-cell/pull/21/files#diff-6f3ca646908f89153386be951563a61449e8f5df46549d224ff98b74d6aab859
You currently include this file in the repo, but someone might not know this file's origin. I'm adding my details below (I'll remove them in the next commit) in case you decide to incorporate it in the JUMP-single-cell repo in some form (maybe in a README as a quick note).
import pandas as pd

# 2) JUMP additional metadata needed to summarize/groupby results
jump_metadata_commit = "a18fd7719c05b638c731142b0d42a92c645e2b33"
jump_metadata_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"
jump_metadata_file = "benchmark/output/experiment-metadata.tsv"
jump_metadata_full_file = f"{jump_metadata_url}/{jump_metadata_commit}/{jump_metadata_file}"

# Load JUMP metadata for JUMP-CP Pilot
# For an explanation of these metadata columns see:
# https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/blob/9edd26d60524a62f993d4df40a5d8908206714f5/README.md#batch-and-plate-metadata
jump_metadata_df = (
    pd.read_csv(jump_metadata_full_file, sep="\t")
    .query("Batch == '2020_11_04_CPJUMP1'")
)
print(jump_metadata_df.shape)
jump_metadata_df.head()
Also, I noticed that there are 720 additional rows in the new JUMP-single-cell parquet file (compared to the way I add the metadata info in this PR). I couldn't track down why, but it probably has to do with my merge somehow dropping rows. I don't think there's anything to do about this, just briefly noting it here.
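One way to track down rows gained or lost in a merge like this is pandas' merge indicator; a minimal sketch using the same variable names as the snippet above (the diagnostic itself is only a suggestion, not part of the PR):

# An outer merge with indicator=True labels each row as 'both', 'left_only', or 'right_only',
# which shows which side contributes the row-count difference
check_df = jump_pred_df.merge(
    jump_metadata_df,
    left_on="Metadata_Plate",
    right_on="Assay_Plate_Barcode",
    how="outer",
    indicator=True,
)
print(check_df["_merge"].value_counts())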
jump_wide_final_df = (
    jump_pred_df
    .query("Metadata_model_type == 'final'")
    .drop(columns=["p_value"])
    .pivot(index=metadata_columns, columns="phenotype", values="comparison_metric_value")
    .reset_index()
)

jump_wide_final_df.to_csv(final_jump_phenotype_file, sep="\t", index=False)

print(jump_wide_final_df.shape)
jump_wide_final_df.head()


# In[14]:


jump_wide_shuffled_df = (
    jump_pred_df
    .query("Metadata_model_type == 'shuffled'")
    .drop(columns=["p_value"])
    .pivot(index=metadata_columns, columns="phenotype", values="comparison_metric_value")
    .reset_index()
)
Could also consider creating a function here
I generally avoid functions for pandas chaining if possible. Personally, when I'm reading code, I like to see the pandas chain directly in front of me rather than having to reference a function defined earlier (or even imported from a different file). If I were performing this operation more times (let's say over 3 times), then I would more strongly consider a function. However, since I'm only doing this twice, I will keep it as is. Thanks!
# Initialize UMAP
umap_fit = umap.UMAP(random_state=umap_random_seed, n_components=umap_n_components)

# Fit UMAP and convert to pandas DataFrame
embeddings = pd.DataFrame(
    umap_fit.fit_transform(jump_wide_final_df.loc[:, feature_columns]),
    columns=[f"UMAP{x}" for x in range(0, umap_n_components)],
)

# Combine with metadata
umap_with_metadata_df = pd.concat([jump_wide_final_df.loc[:, metadata_columns], embeddings], axis=1).assign(model_type="final")


# In[17]:


# Initialize UMAP
umap_fit = umap.UMAP(random_state=umap_random_seed, n_components=umap_n_components)

# Fit UMAP and convert to pandas DataFrame
embeddings = pd.DataFrame(
    umap_fit.fit_transform(jump_wide_shuffled_df.loc[:, feature_columns]),
    columns=[f"UMAP{x}" for x in range(0, umap_n_components)],
)

# Combine with metadata
umap_shuffled_with_metadata_df = pd.concat([jump_wide_shuffled_df.loc[:, metadata_columns], embeddings], axis=1).assign(model_type="shuffled")
Could create a function here as well
Thanks! For this one I will create a function :) There is a lot of redundant code here which could easily be made into a function. Also, the inner workings of this function are less important to know about than with the pandas chaining example.
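A minimal sketch of what that helper might look like (the function name and signature are hypothetical, not taken from the PR):

import pandas as pd
import umap


def fit_umap_embeddings(df, feature_columns, metadata_columns, model_type, random_state, n_components=2):
    # Fit UMAP on the feature columns and return the embeddings joined back to the metadata
    umap_fit = umap.UMAP(random_state=random_state, n_components=n_components)
    embeddings = pd.DataFrame(
        umap_fit.fit_transform(df.loc[:, feature_columns]),
        columns=[f"UMAP{x}" for x in range(0, n_components)],
    )
    return pd.concat([df.loc[:, metadata_columns], embeddings], axis=1).assign(model_type=model_type)


# Example usage for the two wide dataframes above
umap_with_metadata_df = fit_umap_embeddings(jump_wide_final_df, feature_columns, metadata_columns, "final", umap_random_seed)
umap_shuffled_with_metadata_df = fit_umap_embeddings(jump_wide_shuffled_df, feature_columns, metadata_columns, "shuffled", umap_random_seed)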
- conda-forge::umap-learn
- conda-forge::fastparquet
- conda-forge::pyarrow
Not sure how this would affect the environment now or in the future, but you could consider adding version limits.
yes, this is a good point... this environment file is super fragile as is. Given the project's scope (proof of concept), type (analysis repository), and relatively uncertain future plans, I think the cost of solidifying the environment outweighs the benefits. If anything changes, we can revisit this decision in the future (which will be painful 😵 )
also add comparison across cell types
Thanks for the thorough review @MattsonCam! I've addressed all of your comments. I've also added a new output file that summarizes the replicate information (using the mean) and pivots the pandas dataframe to compare results across cell types and time points (a minor update). Merging now!
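A rough sketch of what that summarize-and-pivot step might look like (the Cell_type and Time column names are assumptions based on the JUMP experiment metadata; the exact columns and output name are not copied from the new commit):

# Average the comparison metric across replicates, then pivot so each
# cell type / time point combination becomes its own column per phenotype
jump_summary_df = (
    jump_pred_df
    .groupby(["Cell_type", "Time", "phenotype"], as_index=False)["comparison_metric_value"]
    .mean()
    .pivot(index="phenotype", columns=["Cell_type", "Time"], values="comparison_metric_value")
    .reset_index()
)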
The notebook and associated files load the JUMP single-cell results (KS tests) from https://github.com/WayScience/JUMP-single-cell/tree/main/3.analyze_data and perform three operations:
These results are important for the manuscript and for adding to a visualization I started working on in #55