Process jump single cell data #56

Merged
merged 10 commits into from
Feb 21, 2024

gwaybio
Member

@gwaybio gwaybio commented Feb 14, 2024

The notebook and associated files load the JUMP single-cell results (KS tests) from https://github.com/WayScience/JUMP-single-cell/tree/main/3.analyze_data and perform three operations:

  1. Briefly explores the data
  2. Outputs the top 10 results per phenotype, per treatment type, per model type for a focused exploration and results reporting
  3. Outputs a wide format phenotype profile per model type

These results are important for the manuscript and for adding to a visualization I started working on in #55
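
For readers skimming this thread, the "top 10 results per phenotype, per treatment type, per model type" step can be sketched as below. This is a minimal illustration on toy data, not the notebook's code: the `treatment` and `treatment_type` column names and all values are hypothetical, and ranking by the largest `comparison_metric_value` is an assumption about what "top" means.

```python
import pandas as pd

# Toy long-format results; every value and the treatment columns are hypothetical
results = pd.DataFrame(
    {
        "phenotype": ["apoptosis"] * 4 + ["interphase"] * 3,
        "treatment_type": ["compound"] * 7,
        "Metadata_model_type": ["final"] * 7,
        "treatment": ["t1", "t2", "t3", "t4", "t1", "t2", "t3"],
        "comparison_metric_value": [0.1, 0.9, 0.5, 0.7, 0.3, 0.2, 0.8],
    }
)

top_n = 2  # the PR description uses 10; 2 keeps the toy output small

# Sort once, then keep the first top_n rows within each group
top_results = (
    results
    .sort_values("comparison_metric_value", ascending=False)
    .groupby(["phenotype", "treatment_type", "Metadata_model_type"])
    .head(top_n)
    .reset_index(drop=True)
)

print(top_results)
```

Sorting before `groupby(...).head()` keeps the chain short and preserves the global ranking order in the output.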

@gwaybio gwaybio requested a review from MattsonCam February 14, 2024 13:41
@gwaybio
Member Author

gwaybio commented Feb 14, 2024

@MattsonCam - I think you're the best to review this - you performed the KS test analysis and are familiar with the JUMP dataset. Please review when you are able. Thanks!

@gwaybio
Member Author

gwaybio commented Feb 15, 2024

FYI - I integrated the extended JUMP metadata in the two most recent commits. We probably should mention this in the JUMP-single-cell repo somewhere, but given that the time points and cell lines were all independent plates (and we performed our KS-test analysis per plate) we do not need to rerun any analysis in the JUMP-single-cell repo. Thanks!
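
The reasoning in this comment (a per-plate test is unaffected by plate-level metadata attached afterwards) can be sketched as follows. This is only an illustration under stated assumptions: the `condition` and `probability` columns, the plate names, and the treated-versus-control comparison are hypothetical, and the KS statistic is computed with a small NumPy helper rather than whichever routine the JUMP-single-cell analysis used.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Toy single-cell probabilities; column names other than Metadata_Plate are hypothetical
cells = pd.DataFrame(
    {
        "Metadata_Plate": ["plate_A"] * 6 + ["plate_B"] * 6,
        "condition": (["control"] * 3 + ["treated"] * 3) * 2,
        "probability": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.4, 0.5, 0.6, 0.4, 0.5, 0.6],
    }
)


def ks_for_plate(plate_df):
    treated = plate_df.loc[plate_df["condition"] == "treated", "probability"]
    control = plate_df.loc[plate_df["condition"] == "control", "probability"]
    return ks_statistic(treated, control)


# Each plate is tested independently of every other plate
per_plate = {
    plate: ks_for_plate(plate_df)
    for plate, plate_df in cells.groupby("Metadata_Plate")
}
per_plate_df = pd.DataFrame(
    {"Metadata_Plate": list(per_plate), "ks_statistic": list(per_plate.values())}
)

# Plate-level metadata (hypothetical values) is attached after the test,
# so it annotates the results without changing them
plate_metadata = pd.DataFrame(
    {
        "Assay_Plate_Barcode": ["plate_A", "plate_B"],
        "Cell_type": ["A549", "U2OS"],
        "Time": [24, 48],
    }
)
annotated = per_plate_df.merge(
    plate_metadata, left_on="Metadata_Plate", right_on="Assay_Plate_Barcode"
)

print(annotated)
```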

@gwaybio
Member Author

gwaybio commented Feb 19, 2024

I added a UMAP fit of phenotypic profile probabilities in the last set of commits. Thanks!

Member

@MattsonCam MattsonCam left a comment

Great job @gwaybio! I left some comments. Overall, it LGTM!


# ## Process JUMP phenotypic profiles
#
# We applied the AreaShape only class-balanced multiclass elastic net logistic regression model to all single-cell profiles in the JUMP dataset.

Member

Like the documentation here.

# ## Process JUMP phenotypic profiles
#
# We applied the AreaShape only class-balanced multiclass elastic net logistic regression model to all single-cell profiles in the JUMP dataset.
#

Member

Side note: I know it's possible to reference part of another document in LaTeX. Maybe this could also be accomplished in Markdown to reference the README. Not sure how well it would apply to the nbconverted Python file, just a thought.

Member Author

Hmm, good point. To be more specific, I will reference the README section in the 3.analyze_data module: https://github.com/WayScience/JUMP-single-cell/tree/main/3.analyze_data#analyze-predicted-probabilities

# 2) JUMP additional metadata needed to summarize/groupby results
jump_metadta_commit = "a18fd7719c05b638c731142b0d42a92c645e2b33"

jump_metadta_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"

Member

This is nit picky, but you could also add the "a" in metadta

Member Author

yes! thanks for catching this

jump_metadta_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"
jump_metadta_file = "benchmark/output/experiment-metadata.tsv"

jump_metadata_full_file = f"{jump_metadta_url}/{jump_metadta_commit}/{jump_metadta_file}"

Member

This is also nit picky, but you could combine the strings directly instead of combining the variables in a formatted string

Member Author

Thanks for the suggestion, but I prefer how readable the formatted string is.

Comment on lines 101 to 108
# Merge dataframes and retain only informative columns
jump_pred_df = (
    jump_pred_df
    .merge(
        jump_metadata_df,
        left_on="Metadata_Plate",
        right_on="Assay_Plate_Barcode"
    )

Member

If preferred, you could also use the most recent jump probability comparisons to acquire the additional experimental metadata

Member Author

Yes, I think it makes sense to use this version now - thanks for the pointer!

BTW, I took a look at your most recent PR to add this info. Nice work! I would also recommend adding a pointer indicating the provenance of the experiment-metadata.tsv https://github.com/WayScience/JUMP-single-cell/pull/21/files#diff-6f3ca646908f89153386be951563a61449e8f5df46549d224ff98b74d6aab859

You currently include this file in the repo, but someone might not know this file's origin. I'm adding my details below (I'll remove them in the next commit) in case you decide to incorporate them in the JUMP-single-cell repo in some form (maybe in a README as a quick note).

import pandas as pd

# 2) JUMP additional metadata needed to summarize/groupby results
jump_metadata_commit = "a18fd7719c05b638c731142b0d42a92c645e2b33"

jump_metadata_url = "https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/raw"
jump_metadata_file = "benchmark/output/experiment-metadata.tsv"

# Build the commit-pinned URL to the raw metadata file
jump_metadata_full_file = f"{jump_metadata_url}/{jump_metadata_commit}/{jump_metadata_file}"

# Load JUMP metadata for JUMP-CP Pilot
# For an explanation of these metadata columns see:
# https://github.com/jump-cellpainting/2023_Chandrasekaran_submitted/blob/9edd26d60524a62f993d4df40a5d8908206714f5/README.md#batch-and-plate-metadata
jump_metadata_df = (
    pd.read_csv(jump_metadata_full_file, sep="\t")
    .query("Batch == '2020_11_04_CPJUMP1'")
)

print(jump_metadata_df.shape)
jump_metadata_df.head()

Member Author

@gwaybio gwaybio Feb 21, 2024

Also, I noticed that there are 720 additional rows in the new JUMP-single-cell parquet file (compared to the way I add the metadata info in this PR). I couldn't track down why, but it probably has to do with my merge somehow dropping rows. I don't think there's anything to do with this info, just briefly noting here.
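
One way to check whether a merge is dropping rows is to rerun it as a left join with pandas' `indicator=True` option and inspect the unmatched rows. The sketch below uses toy data: the plate names, the `Cell_type` values, and the idea that unmatched plate barcodes explain the 720-row difference are all assumptions, not findings from this PR.

```python
import pandas as pd

# Toy stand-ins for the prediction and metadata tables (hypothetical values)
predictions = pd.DataFrame(
    {
        "Metadata_Plate": ["plate_1", "plate_2", "plate_3", "plate_3"],
        "comparison_metric_value": [0.1, 0.2, 0.3, 0.4],
    }
)
metadata = pd.DataFrame(
    {
        "Assay_Plate_Barcode": ["plate_1", "plate_2"],
        "Cell_type": ["A549", "U2OS"],
    }
)

# A default (inner) merge silently drops predictions without a metadata match
inner = predictions.merge(
    metadata, left_on="Metadata_Plate", right_on="Assay_Plate_Barcode"
)

# A left merge with indicator=True keeps every prediction row and labels its match status
audit = predictions.merge(
    metadata,
    left_on="Metadata_Plate",
    right_on="Assay_Plate_Barcode",
    how="left",
    indicator=True,
)
dropped = audit.loc[audit["_merge"] == "left_only"]

n_dropped = len(dropped)
dropped_plates = sorted(dropped["Metadata_Plate"].unique())

print(f"Rows lost by the inner merge: {len(predictions) - len(inner)}")
print(f"Plates without metadata: {dropped_plates}")
```

If the dropped rows all share a plate or batch, a filter such as the `Batch` query on the metadata is a likely place to look.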

Comment on lines +195 to +218
jump_wide_final_df = (
    jump_pred_df
    .query("Metadata_model_type == 'final'")
    .drop(columns=["p_value"])
    .pivot(index=metadata_columns, columns="phenotype", values="comparison_metric_value")
    .reset_index()
)

jump_wide_final_df.to_csv(final_jump_phenotype_file, sep="\t", index=False)

print(jump_wide_final_df.shape)
jump_wide_final_df.head()


# In[14]:


jump_wide_shuffled_df = (
    jump_pred_df
    .query("Metadata_model_type == 'shuffled'")
    .drop(columns=["p_value"])
    .pivot(index=metadata_columns, columns="phenotype", values="comparison_metric_value")
    .reset_index()
)

Member

Could also consider creating a function here

Member Author

I generally avoid functions for pandas chaining if possible. Personally, when I'm reading code, I like to see the pandas chain directly in front of me rather than having to reference a function defined earlier (or even imported from a different file). If I were performing this operation more times (let's say over 3 times), then I would more strongly consider a function. However, since I'm only doing this twice, I will keep as is. Thanks!

Comment on lines 241 to 267
# Initialize UMAP
umap_fit = umap.UMAP(random_state=umap_random_seed, n_components=umap_n_components)

# Fit UMAP and convert to pandas DataFrame
embeddings = pd.DataFrame(
    umap_fit.fit_transform(jump_wide_final_df.loc[:, feature_columns]),
    columns=[f"UMAP{x}" for x in range(0, umap_n_components)],
)

# Combine with metadata
umap_with_metadata_df = pd.concat([jump_wide_final_df.loc[:, metadata_columns], embeddings], axis=1).assign(model_type="final")


# In[17]:


# Initialize UMAP
umap_fit = umap.UMAP(random_state=umap_random_seed, n_components=umap_n_components)

# Fit UMAP and convert to pandas DataFrame
embeddings = pd.DataFrame(
    umap_fit.fit_transform(jump_wide_shuffled_df.loc[:, feature_columns]),
    columns=[f"UMAP{x}" for x in range(0, umap_n_components)],
)

# Combine with metadata
umap_shuffled_with_metadata_df = pd.concat([jump_wide_shuffled_df.loc[:, metadata_columns], embeddings], axis=1).assign(model_type="shuffled")

Member

Could create a function here as well

Member Author

Thanks! For this one I will create a function :) There is a lot of redundant code here which could easily be made into a function. Also, the inner workings of this function are less important to know about than with the pandas chaining example.
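
A possible shape for such a helper is sketched below. It is not the function that was eventually committed: the name `embed_with_metadata`, its signature, and the choice to pass the reducer in as an argument are assumptions made for illustration. Any object with a `fit_transform` method works, so the demo uses a hypothetical stand-in reducer and runs without umap-learn installed.

```python
import pandas as pd


def embed_with_metadata(df, metadata_columns, feature_columns, reducer, model_type):
    """Fit `reducer` on the feature columns and return metadata plus embedding columns."""
    coordinates = reducer.fit_transform(df.loc[:, feature_columns])
    embeddings = pd.DataFrame(
        coordinates,
        columns=[f"UMAP{x}" for x in range(coordinates.shape[1])],
        index=df.index,  # reuse the input index so concat cannot misalign rows
    )
    return pd.concat(
        [df.loc[:, metadata_columns], embeddings], axis=1
    ).assign(model_type=model_type)


class FirstTwoColumns:
    """Hypothetical stand-in reducer for the demo; not a real embedding."""

    def fit_transform(self, X):
        return X.to_numpy()[:, :2]


# Toy wide-format profile (hypothetical values)
demo_df = pd.DataFrame(
    {
        "Metadata_Plate": ["plate_1", "plate_2"],
        "apoptosis": [0.1, 0.2],
        "interphase": [0.3, 0.4],
        "prometaphase": [0.5, 0.6],
    }
)

demo_embedding = embed_with_metadata(
    demo_df,
    metadata_columns=["Metadata_Plate"],
    feature_columns=["apoptosis", "interphase", "prometaphase"],
    reducer=FirstTwoColumns(),
    model_type="final",
)
print(demo_embedding)

# With umap-learn installed, the two notebook cells above would reduce to calls like:
# umap_with_metadata_df = embed_with_metadata(
#     jump_wide_final_df, metadata_columns, feature_columns,
#     umap.UMAP(random_state=umap_random_seed, n_components=umap_n_components),
#     "final",
# )
```

Passing a fresh reducer per call mirrors the original cells, which initialize a new `umap.UMAP` object for the final and shuffled models.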

Comment on lines +9 to +11
- conda-forge::umap-learn
- conda-forge::fastparquet
- conda-forge::pyarrow

Member

Not sure how this would affect the environment now as well as in the future, but could consider adding version limits

Member Author

yes, this is a good point... this environment file is super fragile as is. Given the project's scope (proof of concept), type (analysis repository), and relatively uncertain future plans, I think the cost of solidifying the environment outweighs the benefits. If anything changes, we can revisit this decision in the future (which will be painful 😵 )

@gwaybio
Member Author

gwaybio commented Feb 21, 2024

Thanks for the thorough review @MattsonCam ! I've addressed all of your comments. I've also added a new output file that summarizes the replicate information (using mean) and pivots the pandas dataframe to compare results across cell types and time points (a minor update). Merging now!
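
The replicate summary described here (mean across replicates, then a pivot to compare cell types and time points) might look roughly like the sketch below. The column names `treatment`, `Cell_type`, and `Time`, the single `apoptosis` phenotype column, and all values are assumptions for illustration; the actual output file may be organized differently.

```python
import pandas as pd

# Toy replicate-level profile: two replicates per cell type and time point (hypothetical values)
profiles = pd.DataFrame(
    {
        "treatment": ["t1"] * 8,
        "Cell_type": ["A549"] * 4 + ["U2OS"] * 4,
        "Time": [24, 24, 48, 48] * 2,
        "apoptosis": [0.25, 0.75, 0.5, 1.0, 0.0, 0.5, 0.25, 0.25],
    }
)

# Collapse replicates to their mean
replicate_mean = (
    profiles
    .groupby(["treatment", "Cell_type", "Time"], as_index=False)["apoptosis"]
    .mean()
)

# One row per treatment, one column per (cell type, time point) combination
comparison = replicate_mean.pivot(
    index="treatment", columns=["Cell_type", "Time"], values="apoptosis"
)

print(comparison)
```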

@gwaybio gwaybio merged commit b40e626 into WayScience:main Feb 21, 2024
@gwaybio gwaybio deleted the process-jump-single-cell branch February 21, 2024 13:32