Adding several publication ready figures [Figure 2, Supplementary Figs 2 and 3] #38
Merged
Changes from 6 of 7 commits. Commit list:
ce230a0 add figure 2 umap and pairwise correlation (gwaybio)
61403fe add notebook for calculating pairwise correlations (gwaybio)
8db9a32 add notebook to generate supplementary figures (gwaybio)
66181ca remove pycytominer import (gwaybio)
fa3ae1c add ggplot themes.r (gwaybio)
adfe7ca add new line (gwaybio)
d015e8b respond to Jenna PR comments (gwaybio)
@@ -0,0 +1,305 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "df964982-435d-4eed-83db-94493bf1faeb",
   "metadata": {},
   "source": [
    "## Explore data\n",
    "\n",
    "- Calculate pairwise correlations between single-cells"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b101d99f-846a-45c8-83c5-fb8814370d5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pathlib\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "import sys\n",
    "sys.path.append(\"../utils\")\n",
    "from split_utils import get_features_data\n",
    "from train_utils import get_X_y_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "872251ee-75f0-4f4d-811d-16d7daed449f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_tidy_corr_matrix(data_array, labels):\n",
    "    # Calculate the pairwise correlation matrix\n",
    "    correlation_matrix = np.corrcoef(data_array, rowvar=True)\n",
    "    \n",
    "    # Convert the correlation matrix to a DataFrame for easier manipulation\n",
    "    df_corr = pd.DataFrame(correlation_matrix)\n",
    "    \n",
    "    # Melt the correlation matrix\n",
    "    melted_corr = df_corr.stack().reset_index()\n",
    "    melted_corr.columns = [\"Row_ID\", \"Pairwise_Row_ID\", \"Correlation\"]\n",
    "    \n",
    "    # Filter out the lower triangle including diagonal\n",
    "    melted_corr = melted_corr[melted_corr[\"Row_ID\"] < melted_corr[\"Pairwise_Row_ID\"]]\n",
    "    \n",
    "    # Add labels for the rows and columns\n",
    "    melted_corr[\"Row_Label\"] = melted_corr[\"Row_ID\"].apply(lambda x: labels[x])\n",
    "    melted_corr[\"Pairwise_Row_Label\"] = melted_corr[\"Pairwise_Row_ID\"].apply(lambda x: labels[x])\n",
    "    \n",
    "    # Reorder columns\n",
    "    melted_corr = melted_corr[[\"Row_ID\", \"Pairwise_Row_ID\", \"Correlation\", \"Row_Label\", \"Pairwise_Row_Label\"]]\n",
    "    \n",
    "    return melted_corr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ebb5d605-12b0-4ec1-b4ed-9285be35c584",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set constants\n",
    "feature_spaces = [\"CP\", \"DP\", \"CP_and_DP\"]\n",
    "\n",
    "output_dir = \"data\"\n",
    "output_basename = pathlib.Path(output_dir, \"pairwise_correlations\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "44f96cef-8718-4a74-b262-41d94c4fca9b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2862, 1450)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Mitocheck_Phenotypic_Class</th>\n",
       "      <th>Cell_UUID</th>\n",
       "      <th>Location_Center_X</th>\n",
       "      <th>Location_Center_Y</th>\n",
       "      <th>Metadata_Plate</th>\n",
       "      <th>Metadata_Well</th>\n",
       "      <th>Metadata_Frame</th>\n",
       "      <th>Metadata_Site</th>\n",
       "      <th>Metadata_Plate_Map_Name</th>\n",
       "      <th>Metadata_DNA</th>\n",
       "      <th>...</th>\n",
       "      <th>DP__efficientnet_1270</th>\n",
       "      <th>DP__efficientnet_1271</th>\n",
       "      <th>DP__efficientnet_1272</th>\n",
       "      <th>DP__efficientnet_1273</th>\n",
       "      <th>DP__efficientnet_1274</th>\n",
       "      <th>DP__efficientnet_1275</th>\n",
       "      <th>DP__efficientnet_1276</th>\n",
       "      <th>DP__efficientnet_1277</th>\n",
       "      <th>DP__efficientnet_1278</th>\n",
       "      <th>DP__efficientnet_1279</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Large</td>\n",
       "      <td>21da27ab-873a-41f4-ab98-49170cae9a2d</td>\n",
       "      <td>397</td>\n",
       "      <td>618</td>\n",
       "      <td>LT0010_27</td>\n",
       "      <td>173</td>\n",
       "      <td>83</td>\n",
       "      <td>1</td>\n",
       "      <td>LT0010_27_173</td>\n",
       "      <td>LT0010_27/LT0010_27_173_83.tif</td>\n",
       "      <td>...</td>\n",
       "      <td>1.526493</td>\n",
       "      <td>-0.388909</td>\n",
       "      <td>-0.715202</td>\n",
       "      <td>-0.939279</td>\n",
       "      <td>-0.077689</td>\n",
       "      <td>1.965509</td>\n",
       "      <td>18.685819</td>\n",
       "      <td>0.061676</td>\n",
       "      <td>2.641369</td>\n",
       "      <td>-0.086854</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Large</td>\n",
       "      <td>82f7949b-4ea2-45c8-8dd9-7854caf49077</td>\n",
       "      <td>359</td>\n",
       "      <td>584</td>\n",
       "      <td>LT0010_27</td>\n",
       "      <td>173</td>\n",
       "      <td>83</td>\n",
       "      <td>1</td>\n",
       "      <td>LT0010_27_173</td>\n",
       "      <td>LT0010_27/LT0010_27_173_83.tif</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.482883</td>\n",
       "      <td>-1.354858</td>\n",
       "      <td>-0.856680</td>\n",
       "      <td>-0.934949</td>\n",
       "      <td>0.725091</td>\n",
       "      <td>2.255450</td>\n",
       "      <td>-0.565433</td>\n",
       "      <td>1.628086</td>\n",
       "      <td>-0.605625</td>\n",
       "      <td>-0.748135</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Large</td>\n",
       "      <td>cec7234f-fe35-4411-aded-f8112bb31219</td>\n",
       "      <td>383</td>\n",
       "      <td>685</td>\n",
       "      <td>LT0010_27</td>\n",
       "      <td>173</td>\n",
       "      <td>83</td>\n",
       "      <td>1</td>\n",
       "      <td>LT0010_27_173</td>\n",
       "      <td>LT0010_27/LT0010_27_173_83.tif</td>\n",
       "      <td>...</td>\n",
       "      <td>0.888706</td>\n",
       "      <td>1.350431</td>\n",
       "      <td>-0.648841</td>\n",
       "      <td>0.264205</td>\n",
       "      <td>0.131341</td>\n",
       "      <td>0.678315</td>\n",
       "      <td>0.171044</td>\n",
       "      <td>0.342206</td>\n",
       "      <td>-0.581597</td>\n",
       "      <td>0.505556</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 1450 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       " Mitocheck_Phenotypic_Class Cell_UUID \\\n",
       "0 Large 21da27ab-873a-41f4-ab98-49170cae9a2d \n",
       "1 Large 82f7949b-4ea2-45c8-8dd9-7854caf49077 \n",
       "2 Large cec7234f-fe35-4411-aded-f8112bb31219 \n",
       "\n",
       " Location_Center_X Location_Center_Y Metadata_Plate Metadata_Well \\\n",
       "0 397 618 LT0010_27 173 \n",
       "1 359 584 LT0010_27 173 \n",
       "2 383 685 LT0010_27 173 \n",
       "\n",
       " Metadata_Frame Metadata_Site Metadata_Plate_Map_Name \\\n",
       "0 83 1 LT0010_27_173 \n",
       "1 83 1 LT0010_27_173 \n",
       "2 83 1 LT0010_27_173 \n",
       "\n",
       " Metadata_DNA ... DP__efficientnet_1270 \\\n",
       "0 LT0010_27/LT0010_27_173_83.tif ... 1.526493 \n",
       "1 LT0010_27/LT0010_27_173_83.tif ... -0.482883 \n",
       "2 LT0010_27/LT0010_27_173_83.tif ... 0.888706 \n",
       "\n",
       " DP__efficientnet_1271 DP__efficientnet_1272 DP__efficientnet_1273 \\\n",
       "0 -0.388909 -0.715202 -0.939279 \n",
       "1 -1.354858 -0.856680 -0.934949 \n",
       "2 1.350431 -0.648841 0.264205 \n",
       "\n",
       " DP__efficientnet_1274 DP__efficientnet_1275 DP__efficientnet_1276 \\\n",
       "0 -0.077689 1.965509 18.685819 \n",
       "1 0.725091 2.255450 -0.565433 \n",
       "2 0.131341 0.678315 0.171044 \n",
       "\n",
       " DP__efficientnet_1277 DP__efficientnet_1278 DP__efficientnet_1279 \n",
       "0 0.061676 2.641369 -0.086854 \n",
       "1 1.628086 -0.605625 -0.748135 \n",
       "2 0.342206 -0.581597 0.505556 \n",
       "\n",
       "[3 rows x 1450 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load x (features) and y (labels) dataframes\n",
    "labeled_data_path = pathlib.Path(\"../0.download_data/data/labeled_data.csv.gz\")\n",
    "labeled_data = get_features_data(labeled_data_path)\n",
    "\n",
    "print(labeled_data.shape)\n",
    "labeled_data.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7011a3c5-dd70-43df-b14a-c82b45e181f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "for feature_space in feature_spaces:\n",
    "    # Get specific feature sets\n",
    "    cp_feature_df, cp_label_df = get_X_y_data(labeled_data, dataset=feature_space)\n",
    "\n",
    "    # Calculate pairwise correlations between nuclei\n",
    "    cp_tidy_corr_df = create_tidy_corr_matrix(cp_feature_df, cp_label_df)\n",
    "\n",
    "    # Output to file\n",
    "    output_file = f\"{output_basename}_{feature_space}.tsv.gz\"\n",
    "    cp_tidy_corr_df.to_csv(output_file, sep=\"\\t\", index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
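A brief note on what the correlation cell above is doing (a sketch with made-up dimensions, not data from this PR): with `rowvar=True`, `np.corrcoef` treats each row, i.e. each single cell, as one variable, so an `(n_cells, n_features)` input yields an `(n_cells, n_cells)` matrix, and the `Row_ID < Pairwise_Row_ID` filter keeps each unordered pair of cells exactly once.

```python
import numpy as np
import pandas as pd

# Illustrative dimensions only: 5 cells, 10 features.
features = np.random.default_rng(1).normal(size=(5, 10))

corr = np.corrcoef(features, rowvar=True)  # shape (5, 5): cell-by-cell correlations
pairs = pd.DataFrame(corr).stack().reset_index()
pairs.columns = ["Row_ID", "Pairwise_Row_ID", "Correlation"]

# Keeping Row_ID < Pairwise_Row_ID drops the diagonal and mirrored duplicates
upper = pairs[pairs["Row_ID"] < pairs["Pairwise_Row_ID"]]
assert len(upper) == 5 * 4 // 2  # 10 unique cell pairs
```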
@@ -0,0 +1,83 @@
#!/usr/bin/env python
# coding: utf-8

# ## Explore data
#
# - Calculate pairwise correlations between single-cells

# In[1]:


import pathlib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import sys
sys.path.append("../utils")
from split_utils import get_features_data
from train_utils import get_X_y_data


# In[2]:


def create_tidy_corr_matrix(data_array, labels):
    # Calculate the pairwise correlation matrix
    correlation_matrix = np.corrcoef(data_array, rowvar=True)

    # Convert the correlation matrix to a DataFrame for easier manipulation
    df_corr = pd.DataFrame(correlation_matrix)

    # Melt the correlation matrix
    melted_corr = df_corr.stack().reset_index()
    melted_corr.columns = ["Row_ID", "Pairwise_Row_ID", "Correlation"]

    # Filter out the lower triangle including diagonal
    melted_corr = melted_corr[melted_corr["Row_ID"] < melted_corr["Pairwise_Row_ID"]]

    # Add labels for the rows and columns
    melted_corr["Row_Label"] = melted_corr["Row_ID"].apply(lambda x: labels[x])
    melted_corr["Pairwise_Row_Label"] = melted_corr["Pairwise_Row_ID"].apply(lambda x: labels[x])

    # Reorder columns
    melted_corr = melted_corr[["Row_ID", "Pairwise_Row_ID", "Correlation", "Row_Label", "Pairwise_Row_Label"]]

    return melted_corr


# In[3]:


# Set constants
feature_spaces = ["CP", "DP", "CP_and_DP"]

output_dir = "data"
output_basename = pathlib.Path(output_dir, "pairwise_correlations")


# In[4]:


# load x (features) and y (labels) dataframes
labeled_data_path = pathlib.Path("../0.download_data/data/labeled_data.csv.gz")
labeled_data = get_features_data(labeled_data_path)

print(labeled_data.shape)
labeled_data.head(3)


# In[5]:


for feature_space in feature_spaces:
    # Get specific feature sets
    cp_feature_df, cp_label_df = get_X_y_data(labeled_data, dataset=feature_space)

    # Calculate pairwise correlations between nuclei
    cp_tidy_corr_df = create_tidy_corr_matrix(cp_feature_df, cp_label_df)

    # Output to file
    output_file = f"{output_basename}_{feature_space}.tsv.gz"
    cp_tidy_corr_df.to_csv(output_file, sep="\t", index=False)
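As a quick illustration of the tidy output, here is a minimal sketch on made-up data, assuming `create_tidy_corr_matrix` from the script above is in scope (the toy array and phenotype labels are invented for the example):

```python
import numpy as np
import pandas as pd

# Toy input: 4 "cells" with 6 features each, plus made-up phenotype labels
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(4, 6))
toy_labels = pd.Series(["Large", "Large", "Interphase", "Prometaphase"])

tidy = create_tidy_corr_matrix(toy_features, toy_labels)

# Expect 6 rows (4 choose 2 cell pairs) with columns:
# Row_ID, Pairwise_Row_ID, Correlation, Row_Label, Pairwise_Row_Label
print(tidy)
```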
Review comment:
I don't see any of these outputs. Do you recommend not putting these in a GitHub repo? I think I have some PRs where I include the intermediate CSV files like these (which make them look huge). I am wondering what the best practice would be here.
Author reply:
This is a great discussion starter. Thank you!
I tend to think about including data based on three variables: size, importance, and reproducibility.
- Size: there are strict limits and thresholds that move data from git to git-lfs to figshare or another external host.
- Importance: super important data need to live somewhere, no matter the size.
- Reproducibility: will my analysis fail if I don't have this data?

There are also tradeoffs between these variables. For example, unimportant data don't belong anywhere, unless they are critical to reproducibility and small-ish.
I view this data as medium-ish in size (~150 MB), of relatively low importance, and not super critical to reproducibility, because we have a notebook that can regenerate it.
I probably should add a note in the figure generation notebook to make sure a user runs this notebook before generating the figures (something like the check sketched below)!
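One possible form for that note is a guard at the top of the figure notebook that fails early when the pairwise correlation files are missing. This is only a sketch: the file names follow the `output_basename` pattern above, but the relative location of the `data` directory from the figure notebook is an assumption.

```python
import pathlib

# Hypothetical guard: confirm the pairwise correlation files written by the
# notebook above exist before building the figure.
corr_dir = pathlib.Path("data")  # assumed to match the output_dir used above
feature_spaces = ["CP", "DP", "CP_and_DP"]

missing = [
    str(corr_dir / f"pairwise_correlations_{fs}.tsv.gz")
    for fs in feature_spaces
    if not (corr_dir / f"pairwise_correlations_{fs}.tsv.gz").exists()
]
if missing:
    raise FileNotFoundError(
        "Run the pairwise correlation notebook first; missing: " + ", ".join(missing)
    )
```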
Reviewer follow-up:
Ahh okay! I will need to be better about this practice then. When generating figures for the Durbin lab, I put the small intermediate CSV files in the PR. I agree with everything you stated here, so I will be more deliberate about this and make sure the most important files are added to the repo.