[Response to Reviewers] Add Silhouette analysis #68
LGTM! Really nice, clean, and simple PR!
Looks like this analysis shows that CellProfiler has the most heterogeneity, since it has 6 phenotypes with the top positive silhouette score. Very interesting results!
# In[2]:

np.random.seed(1234)
Recommend random seed 0 to be consistent with the Way Lab standard in other projects.
I will stick with 1234.
# For consistent Silhouette input space dimensionality
n_pca_components = 50
Where does 50 come from? I know in UMAP we use 2 components; what is the difference when it comes to PCA?
Also, heads up: in the PR comment you say the number of components is 40, but in here it is 50. Recommend confirming which one is correct/most appropriate.
ah! Great catch.
In my experience, 50 is more than enough to capture the majority of the variance in the dataset, which is what we're aiming for. It's more or less an arbitrary number.
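One way to sanity-check a choice like 50 is to look at the cumulative explained variance of the leading principal components. The sketch below is not the code from this PR: it uses a hypothetical synthetic feature matrix (`X`, built from 10 latent factors plus noise) and a hypothetical helper `cumulative_explained_variance`, and it computes PCA through a plain NumPy SVD rather than any particular library wrapper.

```python
import numpy as np

rng = np.random.default_rng(1234)

# Hypothetical stand-in for a cells x features matrix:
# 500 "cells", 200 "features", driven by 10 latent factors plus small noise.
latent = rng.normal(size=(500, 10))
loadings = rng.normal(size=(10, 200))
X = latent @ loadings + 0.1 * rng.normal(size=(500, 200))


def cumulative_explained_variance(X, n_components):
    """Cumulative fraction of variance captured by the first n_components PCs."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered data; their squares are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variance = singular_values ** 2
    ratio = variance / variance.sum()
    return np.cumsum(ratio)[:n_components]


cum = cumulative_explained_variance(X, 50)
print(f"Variance captured by 50 components: {cum[-1]:.3f}")
```

On real data the curve will flatten more gradually than on this toy matrix, so the point is only to show how the "majority of the variance" claim could be checked, not what the answer is for this dataset.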
output_silhouette_results = pathlib.Path(
    eval_path, "silhouette_score_results.tsv"
)
output_silhouette_samples_results = pathlib.Path(
    eval_path, "silhouette_score_results_per_sample.tsv"
)
Recommend making these compressed TSVs to save space; that might also be the standard convention in this repo.
They are super tiny, so I will stick with TSV so they can be rendered on GitHub.
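For context on the trade-off discussed here, a small sketch comparing a plain TSV with its gzip-compressed form. The rows are made-up placeholders (not results from this PR), and `to_tsv_bytes` is a hypothetical helper; only the Python standard library is used.

```python
import csv
import gzip
import io

# Hypothetical placeholder rows, not real silhouette results.
rows = [["phenotype", "silhouette_score"]] + [
    [f"phenotype_{i}", f"{i / 100:.4f}"] for i in range(20)
]


def to_tsv_bytes(rows):
    """Serialize rows as tab-separated text and return UTF-8 bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


plain = to_tsv_bytes(rows)
compressed = gzip.compress(plain)

print(f"plain TSV: {len(plain)} bytes, gzip: {len(compressed)} bytes")
```

For files of a few hundred bytes the absolute saving is small, while a plain `.tsv` remains directly viewable in the GitHub web interface, which is the reasoning given in the reply above.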
Thanks for the review @jenna-tomkinson - I caught a couple things too, which I addressed in the recent commits. Merging now!
This PR is in response to the following reviewer comment:
We think this is a good idea, and therefore performed the following analysis:
n_components=50
)
We interpret the Silhouette score as a measure of how well cells of a given phenotype cluster together relative to cells of other phenotypes. A positive score means cells of a given phenotype are, on average, more similar to other cells of the same phenotype than to all other cells. A score of 1 indicates complete separation of a phenotype from the other phenotypes.
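To make the interpretation concrete, here is a minimal pure-Python sketch of the standard per-sample silhouette definition, s = (b - a) / max(a, b), where a is the mean distance from a sample to the other samples with the same label and b is the mean distance to the samples of the nearest other label. This is an illustration under stated assumptions, not the code used in this PR: the function names, the toy 2D points, and the labels "A"/"B" are all hypothetical, and it assumes at least two distinct labels.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_samples(points, labels):
    """Per-sample silhouette scores; assumes at least two distinct labels."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Collect distances from this sample to every other sample, by label.
        dists = {}
        for j, (q, other_label) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other_label, []).append(euclidean(p, q))
        if label not in dists:
            # Only member of its label: the score is conventionally 0.
            scores.append(0.0)
            continue
        a = sum(dists[label]) / len(dists[label])
        b = min(sum(d) / len(d) for k, d in dists.items() if k != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy data: two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

well_separated = silhouette_samples(points, ["A", "A", "B", "B"])
mislabeled = silhouette_samples(points, ["A", "B", "A", "B"])

print("labels match the geometry:", [round(s, 3) for s in well_separated])
print("labels cut across the geometry:", [round(s, 3) for s in mislabeled])
```

When the labels match the geometry, every score is close to 1; when the labels cut across the two groups, every score is negative, meaning each point sits closer to the other label than to its own.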
New supplementary figure