Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding several publication ready figures [Figure 2, Supplementary Figs 2 and 3] #38

Merged
merged 7 commits into from
Sep 26, 2023

Conversation

gwaybio
Copy link
Member

@gwaybio gwaybio commented Sep 21, 2023

Sorry for the long PR - it generates all figures (main and supplementary) for the first results subsection. I also needed to calculate all pairwise correlations between cells in order to plot pairwise correlation density curves.

Main figure

main_figure_2_umap_and_correlation

Supplementary Figures

supplementary_umap

supplementary_pairwise_correlations

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Member

@jenna-tomkinson jenna-tomkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I left a bunch of questions for you to address prior to merging. This is mainly for my understanding and for the understanding of the audience viewing the code.

Let me know if I need to clarify any questions!


# Output to file
output_file = f"{output_basename}_{feature_space}.tsv.gz"
cp_tidy_corr_df.to_csv(output_file, sep="\t", index=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any of these outputs. Do you recommend not putting these in a GitHub repo? I think I have some PRs where I include the intermediate CSV files like these (which make them look huge). I am wondering what the best practice would be here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great discussion starter. Thank you!

I tend to think of including data based on three variables:

  1. Size
  2. Importance
  3. Reproducibility

For Size, there are some strict limits and thresholds to move from git to git-lfs to figshare/other. Another variable is Importance. Super important data need to be somewhere no matter the size. The last variable is reproducibility; is my analysis going to fail if I don't have this data.

There are also tradeoffs between these variables. For example, unimportant data don't belong anywhere, unless it is critical to reproducibility and it's small-ish.

I view this data as medium-ish size (~150MB) of relatively low importance that is not super critical to reproducibility because we have a notebook that can generate this data.

I probably should make a note in the figure generation notebook to make sure that a user run this notebook prior to generating the figure!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh okay! I will need to be better about this practice then. When generating figures for Durbin lab I put the small CSV intermediate files in the PR. I agree with everything you stated here so I will make sure to be better about this and make sure that most important files are added to the repo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did you use to decide the phenotypes to display in these plots? Were these the phenotypes with the highest number of labels or was there a different metric you used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As well for section B, what does that plot tell us if cell pairs are not the same phenotype? Was the pairwise correlation between single cells so we do expect that there is a large density of cells that would be both, for example, Apoptosis and then also cells that are not both the same?

If so, it is interesting how interphase is the only phenotype with all models that doesn't have that much separation between distributions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did you use to decide the phenotypes to display in these plots? Were these the phenotypes with the highest number of labels or was there a different metric you used?

Great question! I picked ones that were interesting to me :) We can always go back and refine this later easily. Do you have any phenotypes in mind you think we should focus on?

what does that plot tell us if cell pairs are not the same phenotype?

It tells us that on average, for most phenotypes, cells of different phenotypes have lower correlation.

Was the pairwise correlation between single cells so we do expect that there is a large density of cells that would be both, for example, Apoptosis and then also cells that are not both the same?

I'm not sure I understand this question. The curves are showing pairwise comparisons (so two cells; every cell compared to every other cell). A cell comparison can be either Apoptosis Cell vs. Different Apoptosis Cell or Apoptosis Cell vs. Not Apoptosis Cell.

If so, it is interesting how interphase is the only phenotype with all models that doesn't have that much separation between distributions.

Yes, this is interesting! What do you think it means?

Embedding_Value = "d"
)
) %>%
dplyr::select(!...1) %>%#7570b3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused what is going on in lines 30-36. Are you able to explain what is going on? Also I see in line 30 you have a code comment that looks like a hex code, is this being used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you able to explain what is going on?

I sure hope so! :)

I will add more specific comments to a new commit, but essentially I am wrangling the data in a certain way to prepare for plotting.

hex code, is this being used?

Oops! Great catch. Will remove

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d015e8b

dplyr::select(!...1) %>%#7570b3
tidyr::pivot_wider(names_from = UMAP_Embedding, values_from = Embedding_Value) %>%
dplyr::mutate(Mitocheck_Plot_Label = if_else(
Mitocheck_Phenotypic_Class %in% focus_phenotypes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is "focus_phenotypes" defined in this code? I do not see it and it could be because I am missing it somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see now, it comes from source("themes.r"). I would suggest to add a code comment to clarify what this file does/contains. It is probably standard practice for someone more familiar with R, but it might be good for the general audience (and myself haha).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's standard practice, but it's not great standard practice! i will add a note near the library() load function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d015e8b

library(ggplot2)
library(dplyr)

focus_phenotypes <- c(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda goes along with my question for the main figure, but does it make sense to include in this file a short code comment on why these phenotypes are focused on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would be ideal. However, given our current manuscript state, I recommend that we skip adding a comment here for now, since our focus may shift slightly. We will definitely describe our rationale in the manuscript.

"Other" = "grey"
)

focus_phenotype_labels <- c(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective, this seems a bit unnecessary since you aren't changing the labels for any of phenotype names except for adding other. Is this the only way that you can add the other label? If so then this makes sense, just seems like extra lines of code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is annoying extra code, I agree, but it is required for customizing colors in plots the way that I am.

    + scale_color_manual(
        "Phenotype",
        values = focus_phenotype_colors,
        labels = focus_phenotype_labels
    )

@gwaybio
Copy link
Member Author

gwaybio commented Sep 26, 2023

Please feel free to continue discussions in each thread, but I will merge for now!

@gwaybio gwaybio merged commit 0918a7b into WayScience:main Sep 26, 2023
@gwaybio gwaybio deleted the add-fig2 branch September 26, 2023 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants