Adding several publication ready figures [Figure 2, Supplementary Figs 2 and 3] #38

Merged 7 commits on Sep 26, 2023
Changes from 6 commits
305 changes: 305 additions & 0 deletions 1.split_data/explore_data.ipynb
@@ -0,0 +1,305 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "df964982-435d-4eed-83db-94493bf1faeb",
"metadata": {},
"source": [
"## Explore data\n",
"\n",
"- Calculate pairwise correlations between single-cells"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b101d99f-846a-45c8-83c5-fb8814370d5f",
"metadata": {},
"outputs": [],
"source": [
"import pathlib\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"import sys\n",
"sys.path.append(\"../utils\")\n",
"from split_utils import get_features_data\n",
"from train_utils import get_X_y_data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "872251ee-75f0-4f4d-811d-16d7daed449f",
"metadata": {},
"outputs": [],
"source": [
"def create_tidy_corr_matrix(data_array, labels):\n",
" # Calculate the pairwise correlation matrix\n",
" correlation_matrix = np.corrcoef(data_array, rowvar=True)\n",
" \n",
" # Convert the correlation matrix to a DataFrame for easier manipulation\n",
" df_corr = pd.DataFrame(correlation_matrix)\n",
" \n",
" # Melt the correlation matrix\n",
" melted_corr = df_corr.stack().reset_index()\n",
" melted_corr.columns = [\"Row_ID\", \"Pairwise_Row_ID\", \"Correlation\"]\n",
" \n",
" # Filter out the lower triangle including diagonal\n",
" melted_corr = melted_corr[melted_corr[\"Row_ID\"] < melted_corr[\"Pairwise_Row_ID\"]]\n",
" \n",
" # Add labels for the rows and columns\n",
" melted_corr[\"Row_Label\"] = melted_corr[\"Row_ID\"].apply(lambda x: labels[x])\n",
" melted_corr[\"Pairwise_Row_Label\"] = melted_corr[\"Pairwise_Row_ID\"].apply(lambda x: labels[x])\n",
" \n",
" # Reorder columns\n",
" melted_corr = melted_corr[[\"Row_ID\", \"Pairwise_Row_ID\", \"Correlation\", \"Row_Label\", \"Pairwise_Row_Label\"]]\n",
" \n",
" return melted_corr"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ebb5d605-12b0-4ec1-b4ed-9285be35c584",
"metadata": {},
"outputs": [],
"source": [
"# Set constants\n",
"feature_spaces = [\"CP\", \"DP\", \"CP_and_DP\"]\n",
"\n",
"output_dir = \"data\"\n",
"output_basename = pathlib.Path(output_dir, \"pairwise_correlations\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "44f96cef-8718-4a74-b262-41d94c4fca9b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2862, 1450)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Mitocheck_Phenotypic_Class</th>\n",
" <th>Cell_UUID</th>\n",
" <th>Location_Center_X</th>\n",
" <th>Location_Center_Y</th>\n",
" <th>Metadata_Plate</th>\n",
" <th>Metadata_Well</th>\n",
" <th>Metadata_Frame</th>\n",
" <th>Metadata_Site</th>\n",
" <th>Metadata_Plate_Map_Name</th>\n",
" <th>Metadata_DNA</th>\n",
" <th>...</th>\n",
" <th>DP__efficientnet_1270</th>\n",
" <th>DP__efficientnet_1271</th>\n",
" <th>DP__efficientnet_1272</th>\n",
" <th>DP__efficientnet_1273</th>\n",
" <th>DP__efficientnet_1274</th>\n",
" <th>DP__efficientnet_1275</th>\n",
" <th>DP__efficientnet_1276</th>\n",
" <th>DP__efficientnet_1277</th>\n",
" <th>DP__efficientnet_1278</th>\n",
" <th>DP__efficientnet_1279</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Large</td>\n",
" <td>21da27ab-873a-41f4-ab98-49170cae9a2d</td>\n",
" <td>397</td>\n",
" <td>618</td>\n",
" <td>LT0010_27</td>\n",
" <td>173</td>\n",
" <td>83</td>\n",
" <td>1</td>\n",
" <td>LT0010_27_173</td>\n",
" <td>LT0010_27/LT0010_27_173_83.tif</td>\n",
" <td>...</td>\n",
" <td>1.526493</td>\n",
" <td>-0.388909</td>\n",
" <td>-0.715202</td>\n",
" <td>-0.939279</td>\n",
" <td>-0.077689</td>\n",
" <td>1.965509</td>\n",
" <td>18.685819</td>\n",
" <td>0.061676</td>\n",
" <td>2.641369</td>\n",
" <td>-0.086854</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Large</td>\n",
" <td>82f7949b-4ea2-45c8-8dd9-7854caf49077</td>\n",
" <td>359</td>\n",
" <td>584</td>\n",
" <td>LT0010_27</td>\n",
" <td>173</td>\n",
" <td>83</td>\n",
" <td>1</td>\n",
" <td>LT0010_27_173</td>\n",
" <td>LT0010_27/LT0010_27_173_83.tif</td>\n",
" <td>...</td>\n",
" <td>-0.482883</td>\n",
" <td>-1.354858</td>\n",
" <td>-0.856680</td>\n",
" <td>-0.934949</td>\n",
" <td>0.725091</td>\n",
" <td>2.255450</td>\n",
" <td>-0.565433</td>\n",
" <td>1.628086</td>\n",
" <td>-0.605625</td>\n",
" <td>-0.748135</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Large</td>\n",
" <td>cec7234f-fe35-4411-aded-f8112bb31219</td>\n",
" <td>383</td>\n",
" <td>685</td>\n",
" <td>LT0010_27</td>\n",
" <td>173</td>\n",
" <td>83</td>\n",
" <td>1</td>\n",
" <td>LT0010_27_173</td>\n",
" <td>LT0010_27/LT0010_27_173_83.tif</td>\n",
" <td>...</td>\n",
" <td>0.888706</td>\n",
" <td>1.350431</td>\n",
" <td>-0.648841</td>\n",
" <td>0.264205</td>\n",
" <td>0.131341</td>\n",
" <td>0.678315</td>\n",
" <td>0.171044</td>\n",
" <td>0.342206</td>\n",
" <td>-0.581597</td>\n",
" <td>0.505556</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 1450 columns</p>\n",
"</div>"
],
"text/plain": [
" Mitocheck_Phenotypic_Class Cell_UUID \\\n",
"0 Large 21da27ab-873a-41f4-ab98-49170cae9a2d \n",
"1 Large 82f7949b-4ea2-45c8-8dd9-7854caf49077 \n",
"2 Large cec7234f-fe35-4411-aded-f8112bb31219 \n",
"\n",
" Location_Center_X Location_Center_Y Metadata_Plate Metadata_Well \\\n",
"0 397 618 LT0010_27 173 \n",
"1 359 584 LT0010_27 173 \n",
"2 383 685 LT0010_27 173 \n",
"\n",
" Metadata_Frame Metadata_Site Metadata_Plate_Map_Name \\\n",
"0 83 1 LT0010_27_173 \n",
"1 83 1 LT0010_27_173 \n",
"2 83 1 LT0010_27_173 \n",
"\n",
" Metadata_DNA ... DP__efficientnet_1270 \\\n",
"0 LT0010_27/LT0010_27_173_83.tif ... 1.526493 \n",
"1 LT0010_27/LT0010_27_173_83.tif ... -0.482883 \n",
"2 LT0010_27/LT0010_27_173_83.tif ... 0.888706 \n",
"\n",
" DP__efficientnet_1271 DP__efficientnet_1272 DP__efficientnet_1273 \\\n",
"0 -0.388909 -0.715202 -0.939279 \n",
"1 -1.354858 -0.856680 -0.934949 \n",
"2 1.350431 -0.648841 0.264205 \n",
"\n",
" DP__efficientnet_1274 DP__efficientnet_1275 DP__efficientnet_1276 \\\n",
"0 -0.077689 1.965509 18.685819 \n",
"1 0.725091 2.255450 -0.565433 \n",
"2 0.131341 0.678315 0.171044 \n",
"\n",
" DP__efficientnet_1277 DP__efficientnet_1278 DP__efficientnet_1279 \n",
"0 0.061676 2.641369 -0.086854 \n",
"1 1.628086 -0.605625 -0.748135 \n",
"2 0.342206 -0.581597 0.505556 \n",
"\n",
"[3 rows x 1450 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load x (features) and y (labels) dataframes\n",
"labeled_data_path = pathlib.Path(\"../0.download_data/data/labeled_data.csv.gz\")\n",
"labeled_data = get_features_data(labeled_data_path)\n",
"\n",
"print(labeled_data.shape)\n",
"labeled_data.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7011a3c5-dd70-43df-b14a-c82b45e181f2",
"metadata": {},
"outputs": [],
"source": [
"for feature_space in feature_spaces:\n",
" # Get specific feature sets\n",
" cp_feature_df, cp_label_df = get_X_y_data(labeled_data, dataset=feature_space)\n",
"\n",
" # Calculate pairwise correlations between nuclei\n",
" cp_tidy_corr_df = create_tidy_corr_matrix(cp_feature_df, cp_label_df)\n",
"\n",
" # Output to file\n",
" output_file = f\"{output_basename}_{feature_space}.tsv.gz\"\n",
" cp_tidy_corr_df.to_csv(output_file, sep=\"\\t\", index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
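A quick note on the `np.corrcoef(data_array, rowvar=True)` call in the notebook above: with `rowvar=True`, each row of the input is treated as a variable, so for single-cell features the result is an `n_cells x n_cells` correlation matrix. A minimal sketch with hypothetical values:

```python
import numpy as np

# Three hypothetical "cells", each with three feature values
cells = np.array([
    [0.1, 0.4, 0.9],  # cell A
    [0.2, 0.8, 1.8],  # cell B: scaled copy of A -> correlation with A is 1.0
    [0.9, 0.6, 0.1],  # cell C: 1.0 - A -> correlation with A is -1.0
])

# rowvar=True: rows are variables, so the output is 3 x 3 (cell-by-cell)
corr = np.corrcoef(cells, rowvar=True)
print(corr.shape)  # (3, 3)
```

Passing `rowvar=False` instead would correlate the feature columns, yielding a 3 x 3 matrix here by coincidence but an `n_features x n_features` matrix in general, which is not what the notebook intends.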
83 changes: 83 additions & 0 deletions 1.split_data/scripts/nbconverted/explore_data.py
@@ -0,0 +1,83 @@
#!/usr/bin/env python
# coding: utf-8

# ## Explore data
#
# - Calculate pairwise correlations between single-cells

# In[1]:


import pathlib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import sys
sys.path.append("../utils")
from split_utils import get_features_data
from train_utils import get_X_y_data


# In[2]:


def create_tidy_corr_matrix(data_array, labels):
# Calculate the pairwise correlation matrix
correlation_matrix = np.corrcoef(data_array, rowvar=True)

# Convert the correlation matrix to a DataFrame for easier manipulation
df_corr = pd.DataFrame(correlation_matrix)

# Melt the correlation matrix
melted_corr = df_corr.stack().reset_index()
melted_corr.columns = ["Row_ID", "Pairwise_Row_ID", "Correlation"]

# Filter out the lower triangle including diagonal
melted_corr = melted_corr[melted_corr["Row_ID"] < melted_corr["Pairwise_Row_ID"]]

# Add labels for the rows and columns
melted_corr["Row_Label"] = melted_corr["Row_ID"].apply(lambda x: labels[x])
melted_corr["Pairwise_Row_Label"] = melted_corr["Pairwise_Row_ID"].apply(lambda x: labels[x])

# Reorder columns
melted_corr = melted_corr[["Row_ID", "Pairwise_Row_ID", "Correlation", "Row_Label", "Pairwise_Row_Label"]]

return melted_corr


# In[3]:


# Set constants
feature_spaces = ["CP", "DP", "CP_and_DP"]

output_dir = "data"
output_basename = pathlib.Path(output_dir, "pairwise_correlations")


# In[4]:


# load x (features) and y (labels) dataframes
labeled_data_path = pathlib.Path("../0.download_data/data/labeled_data.csv.gz")
labeled_data = get_features_data(labeled_data_path)

print(labeled_data.shape)
labeled_data.head(3)


# In[5]:


for feature_space in feature_spaces:
# Get specific feature sets
cp_feature_df, cp_label_df = get_X_y_data(labeled_data, dataset=feature_space)

# Calculate pairwise correlations between nuclei
cp_tidy_corr_df = create_tidy_corr_matrix(cp_feature_df, cp_label_df)

# Output to file
output_file = f"{output_basename}_{feature_space}.tsv.gz"
cp_tidy_corr_df.to_csv(output_file, sep="\t", index=False)
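The `create_tidy_corr_matrix` helper above can be sanity-checked on a tiny synthetic array (values and labels below are hypothetical), confirming that only upper-triangle pairs survive the filter, i.e. each unordered cell pair appears exactly once with no self-correlations:

```python
import numpy as np
import pandas as pd

def create_tidy_corr_matrix(data_array, labels):
    # Pairwise Pearson correlations between rows (cells)
    correlation_matrix = np.corrcoef(data_array, rowvar=True)
    df_corr = pd.DataFrame(correlation_matrix)

    # Melt the square matrix into (row, column, value) triples
    melted_corr = df_corr.stack().reset_index()
    melted_corr.columns = ["Row_ID", "Pairwise_Row_ID", "Correlation"]

    # Keep only the upper triangle (drops the diagonal and duplicate pairs)
    melted_corr = melted_corr[melted_corr["Row_ID"] < melted_corr["Pairwise_Row_ID"]]

    # Attach labels for both members of each pair
    melted_corr["Row_Label"] = melted_corr["Row_ID"].apply(lambda x: labels[x])
    melted_corr["Pairwise_Row_Label"] = melted_corr["Pairwise_Row_ID"].apply(lambda x: labels[x])

    return melted_corr[["Row_ID", "Pairwise_Row_ID", "Correlation", "Row_Label", "Pairwise_Row_Label"]]

# Hypothetical features: row 1 is a scaled copy of row 0, so their correlation is 1.0
features = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 2.0]])
labels = ["Large", "Large", "Interphase"]

tidy = create_tidy_corr_matrix(features, labels)
print(tidy.shape)  # 3 unique pairs x 5 columns
```

For n cells this yields n·(n−1)/2 rows, so the 2,862 cells loaded above produce roughly 4.1 million pairs per feature space, which is why the outputs are written as compressed `.tsv.gz` files.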
Member
I don't see any of these outputs. Do you recommend not putting them in a GitHub repo? I have some PRs where I include intermediate CSV files like these (which makes them look huge). I'm wondering what the best practice is here.

Member Author
This is a great discussion starter. Thank you!

I tend to think of including data based on three variables:

  1. Size
  2. Importance
  3. Reproducibility

For size, there are strict limits and thresholds that dictate moving from git to git-lfs to figshare/other hosting. Another variable is importance: super important data need to live somewhere, no matter the size. The last variable is reproducibility: will my analysis fail if I don't have these data?

There are also tradeoffs between these variables. For example, unimportant data don't belong anywhere, unless they are critical to reproducibility and small-ish.

I view these data as medium-ish in size (~150MB) and of relatively low importance; they are not critical to reproducibility because we have a notebook that can regenerate them.

I should probably add a note in the figure generation notebook to make sure a user runs this notebook before generating the figures!

Member

Ahh okay! I will need to be better about this practice then. When generating figures for the Durbin lab, I put the small intermediate CSV files in the PR. I agree with everything you stated here, so I will be more careful going forward and make sure the most important files are added to the repo.

