Use 2015 data & remove holdout set #5

Merged: 23 commits, Dec 7, 2022
9 changes: 7 additions & 2 deletions 0.download_data/README.md
@@ -1,6 +1,6 @@
# Download Data

In this module, we present our method for downloading nucleus morphology data.
In this module, we present our method for downloading and combining nucleus morphology data.

### Download/Preprocess Data

@@ -10,8 +10,13 @@ Complete instructions for data download and preprocessing can be found at: https

In this repository, all training data is downloaded from a version controlled [mitocheck_data](https://github.com/WayScience/mitocheck_data).

An earlier (2006) and a later (2015) dataset are both downloaded from `mitocheck_data` and combined by checking whether the plate/well/frame/coordinate metadata of any cells in the 2006 and 2015 datasets match.
If all of this location metadata matches, the two entries must refer to the same cell, which is added only once to the final combined dataset.
This combination method avoids repeating cells in the combined dataset, which could bias the final model.
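The matching logic described above can be sketched with pandas: stack both releases and keep only the first occurrence of each plate/well/frame/coordinate combination. This is an illustrative sketch, not the repository's `combine_datasets` implementation, and the column names below are hypothetical stand-ins for the real mitocheck_data metadata columns.

```python
import pandas as pd

# Hypothetical location-metadata columns; the real mitocheck_data
# column names may differ.
KEY_COLS = ["Plate", "Well", "Frame", "Location_Center_X", "Location_Center_Y"]

def combine_without_repeats(df_2006: pd.DataFrame, df_2015: pd.DataFrame) -> pd.DataFrame:
    """Stack both datasets, keeping a cell only once when its
    plate/well/frame/coordinates appear in both releases."""
    combined = pd.concat([df_2006, df_2015], ignore_index=True)
    # Rows with identical location metadata describe the same cell,
    # so keep only the first occurrence.
    return combined.drop_duplicates(subset=KEY_COLS, ignore_index=True)

# Toy example: one cell (Plate 1, Well A1, Frame 0) appears in both inputs.
df_a = pd.DataFrame({"Plate": [1, 1], "Well": ["A1", "A2"], "Frame": [0, 0],
                     "Location_Center_X": [10.0, 20.0],
                     "Location_Center_Y": [5.0, 6.0]})
df_b = pd.DataFrame({"Plate": [1, 2], "Well": ["A1", "B1"], "Frame": [0, 3],
                     "Location_Center_X": [10.0, 30.0],
                     "Location_Center_Y": [5.0, 7.0]})
combined = combine_without_repeats(df_a, df_b)
# The repeated cell is kept once, so the combined frame has 3 rows.
```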

The version of mitocheck_data used is pinned by the hash of a specific commit.
The current hash being used is `19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9` which corresponds to [mitocheck_data/19bfa5b](https://github.com/WayScience/mitocheck_data/tree/19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9).
The current hashes being used are `19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9` for the 2006 dataset and `3ebd0ca7c288f608e9b23987a8ddbabd5476bd8f` for the 2015 dataset.
These correspond to [mitocheck_data/19bfa5b](https://github.com/WayScience/mitocheck_data/tree/19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9) and [mitocheck_data/3ebd0ca](https://github.com/WayScience/mitocheck_data/tree/3ebd0ca7c288f608e9b23987a8ddbabd5476bd8f) respectively.
The `hash` variables can be set in [download_data.ipynb](download_data.ipynb) to change which versions of mitocheck_data are accessed.
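Pinning a full commit hash makes the download reproducible, since the raw-file URL is built from the hash plus the in-repo path. A minimal sketch of that URL construction, using the 2006 hash quoted above (the pandas read is left commented because it requires network access):

```python
# Build a version-pinned raw-file URL for mitocheck_data.
hash_2006 = "19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9"
file_url = (
    "https://raw.github.com/WayScience/mitocheck_data/"
    f"{hash_2006}/3.normalize_data/normalized_data/training_data.csv.gz"
)

# pandas can read the gzipped CSV straight from this URL (network required):
# import pandas as pd
# training_data = pd.read_csv(file_url, compression="gzip", index_col=0)
print(file_url)
```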

## Step 1: Download Data
Binary file modified 0.download_data/data/training_data.csv.gz
Binary file not shown.
2,639 changes: 2,558 additions & 81 deletions 0.download_data/download_data.ipynb

Large diffs are not rendered by default.

44 changes: 38 additions & 6 deletions 0.download_data/scripts/nbconverted/download_data.py
@@ -9,29 +9,61 @@
import pandas as pd
import pathlib

import sys
sys.path.append("../utils")
from download_utils import combine_datasets


# ### Specify version of mitocheck_data to download from

# In[2]:


hash = "19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9"
file_url = f"https://raw.github.com/WayScience/mitocheck_data/{hash}/3.normalize_data/normalized_data/training_data.csv.gz"
print(file_url)
hash_2006 = "19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9"
file_url_2006 = f"https://raw.github.com/WayScience/mitocheck_data/{hash_2006}/3.normalize_data/normalized_data/training_data.csv.gz"

hash_2015 = "3ebd0ca7c288f608e9b23987a8ddbabd5476bd8f"
file_url_2015 = f"https://raw.github.com/WayScience/mitocheck_data/{hash_2015}/3.normalize_data/normalized_data/training_data.csv.gz"


# ### Load training data from github
# ### Load/combine training data from github

# In[3]:


training_data = pd.read_csv(file_url, compression="gzip", index_col=0)
training_data_2006 = pd.read_csv(file_url_2006, compression="gzip", index_col=0)
# drop the MitoCheck object ID column, as this ID is not present for cells repeated in the 2015 dataset
training_data_2006 = training_data_2006.drop(columns=["Mitocheck_Object_ID"])

training_data_2015 = pd.read_csv(file_url_2015, compression="gzip", index_col=0)


# In[4]:


training_data_2006


# In[5]:


training_data = combine_datasets(training_data_2006, training_data_2015)


# ### Preview dataset

# In[6]:


# remove the UndefinedCondensed class, which has very low representation
training_data = training_data[training_data["Mitocheck_Phenotypic_Class"] != "UndefinedCondensed"]
training_data = training_data.reset_index(drop=True)
training_data


# ### Save training data

# In[4]:
# In[7]:


training_data_save_dir = pathlib.Path("data/")
Expand Down
13 changes: 6 additions & 7 deletions 1.split_data/README.md
@@ -1,17 +1,16 @@
# 1. Split Data

In this module, we split the training data into training, testing, and holdout datasets.
In this module, we split the training data into training and testing datasets.

First, we split the data into training, test, and holdout subsets in [split_data.ipynb](split_data.ipynb).
The `get_representative_images()` function used to create the holdout dataset determines which images to holdout such that all phenotypic classes can be represented in these holdout images.
The test dataset is determined by taking a random number of samples (stratified by phenotypic class) from the dataset after the holdout images are removed.
The training dataset is the subset remaining after holdout/test samples are removed.
Sample indexes associated with training, test, and holdout subsets are stored in [data_split_indexes.tsv](indexes/data_split_indexes.tsv).
Data is split into subsets in [split_data.ipynb](split_data.ipynb).
The testing dataset is determined by randomly sampling 15% (stratified by phenotypic class) of the single-cell dataset.
The training dataset is the subset remaining after the testing samples are removed.
Sample indexes associated with training and testing subsets are stored in [data_split_indexes.tsv](indexes/data_split_indexes.tsv).
Sample indexes are later used to load subsets from [training_data.csv.gz](../0.download_data/data/training_data.csv.gz).
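The stratified 15% split described above can be sketched with scikit-learn's `train_test_split`. This is a plausible sketch under the README's description, not necessarily the notebook's exact call; the class names and feature values below are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy single-cell table: "Mitocheck_Phenotypic_Class" is the label column;
# the class names and feature values here are illustrative only.
data = pd.DataFrame({
    "feature": range(100),
    "Mitocheck_Phenotypic_Class": ["ClassA"] * 60 + ["ClassB"] * 40,
})

# Split the row indexes (not the rows themselves): 15% test,
# stratified so each phenotypic class keeps its overall proportion.
train_idx, test_idx = train_test_split(
    data.index,
    test_size=0.15,
    stratify=data["Mitocheck_Phenotypic_Class"],
    random_state=0,
)
# Only the indexes would be stored (as in data_split_indexes.tsv);
# the feature values are loaded later from training_data.csv.gz.
```

Saving only the indexes keeps the split file small and lets every downstream module reload the exact same subsets from the single downloaded data file.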

## Step 1: Split Data

Use the commands below to create indexes for training, testing, and holdout data subsets:
Use the commands below to create indexes for training and testing data subsets:

```sh
# Make sure you are located in 1.split_data