-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding several publication ready figures [Figure 2, Supplementary Figs 2 and 3] #38
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I left a bunch of questions for you to address prior to merging. This is mainly for my understanding and for the understanding of the audience viewing the code.
Let me know if I need to clarify any questions!
|
||
# Output to file | ||
output_file = f"{output_basename}_{feature_space}.tsv.gz" | ||
cp_tidy_corr_df.to_csv(output_file, sep="\t", index=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any of these outputs. Do you recommend not putting these in a GitHub repo? I think I have some PRs where I include the intermediate CSV files like these (which make them look huge). I am wondering what the best practice would be here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great discussion starter. Thank you!
I tend to think of including data based on three variables:
- Size
- Importance
- Reproducibility
For Size, there are some strict limits and thresholds to move from git to git-lfs to figshare/other. Another variable is Importance. Super important data need to be somewhere no matter the size. The last variable is reproducibility; is my analysis going to fail if I don't have this data.
There are also tradeoffs between these variables. For example, unimportant data don't belong anywhere, unless it is critical to reproducibility and it's small-ish.
I view this data as medium-ish size (~150MB) of relatively low importance that is not super critical to reproducibility because we have a notebook that can generate this data.
I probably should make a note in the figure generation notebook to make sure that a user run this notebook prior to generating the figure!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh okay! I will need to be better about this practice then. When generating figures for Durbin lab I put the small CSV intermediate files in the PR. I agree with everything you stated here so I will make sure to be better about this and make sure that most important files are added to the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did you use to decide the phenotypes to display in these plots? Were these the phenotypes with the highest number of labels or was there a different metric you used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As well for section B, what does that plot tell us if cell pairs are not the same phenotype? Was the pairwise correlation between single cells so we do expect that there is a large density of cells that would be both, for example, Apoptosis and then also cells that are not both the same?
If so, it is interesting how interphase is the only phenotype with all models that doesn't have that much separation between distributions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did you use to decide the phenotypes to display in these plots? Were these the phenotypes with the highest number of labels or was there a different metric you used?
Great question! I picked ones that were interesting to me :) We can always go back and refine this later easily. Do you have any phenotypes in mind you think we should focus on?
what does that plot tell us if cell pairs are not the same phenotype?
It tells us that on average, for most phenotypes, cells of different phenotypes have lower correlation.
Was the pairwise correlation between single cells so we do expect that there is a large density of cells that would be both, for example, Apoptosis and then also cells that are not both the same?
I'm not sure I understand this question. The curves are showing pairwise comparisons (so two cells; every cell compared to every other cell). A cell comparison can be either Apoptosis Cell vs. Different Apoptosis Cell or Apoptosis Cell vs. Not Apoptosis Cell.
If so, it is interesting how interphase is the only phenotype with all models that doesn't have that much separation between distributions.
Yes, this is interesting! What do you think it means?
Embedding_Value = "d" | ||
) | ||
) %>% | ||
dplyr::select(!...1) %>%#7570b3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit confused what is going on in lines 30-36. Are you able to explain what is going on? Also I see in line 30 you have a code comment that looks like a hex code, is this being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you able to explain what is going on?
I sure hope so! :)
I will add more specific comments to a new commit, but essentially I am wrangling the data in a certain way to prepare for plotting.
hex code, is this being used?
Oops! Great catch. Will remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in d015e8b
dplyr::select(!...1) %>%#7570b3 | ||
tidyr::pivot_wider(names_from = UMAP_Embedding, values_from = Embedding_Value) %>% | ||
dplyr::mutate(Mitocheck_Plot_Label = if_else( | ||
Mitocheck_Phenotypic_Class %in% focus_phenotypes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is "focus_phenotypes" defined in this code? I do not see it and it could be because I am missing it somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see now, it comes from source("themes.r")
. I would suggest to add a code comment to clarify what this file does/contains. It is probably standard practice for someone more familiar with R, but it might be good for the general audience (and myself haha).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's standard practice, but it's not great standard practice! i will add a note near the library() load function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in d015e8b
library(ggplot2) | ||
library(dplyr) | ||
|
||
focus_phenotypes <- c( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kinda goes along with my question for the main figure, but does it make sense to include in this file a short code comment on why these phenotypes are focused on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this would be ideal. However, given our current manuscript state, I recommend that we skip adding a comment here for now, since our focus may shift slightly. We will definitely describe our rationale in the manuscript.
"Other" = "grey" | ||
) | ||
|
||
focus_phenotype_labels <- c( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective, this seems a bit unnecessary since you aren't changing the labels for any of phenotype names except for adding other
. Is this the only way that you can add the other
label? If so then this makes sense, just seems like extra lines of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is annoying extra code, I agree, but it is required for customizing colors in plots the way that I am.
+ scale_color_manual(
"Phenotype",
values = focus_phenotype_colors,
labels = focus_phenotype_labels
)
Please feel free to continue discussions in each thread, but I will merge for now! |
Sorry for the long PR - it generates all figures (main and supplementary) for the first results subsection. I also needed to calculate all pairwise correlations between cells in order to plot pairwise correlation density curves.
Main figure
Supplementary Figures