Use 2015 data & remove holdout set #5

Merged: 23 commits, Dec 7, 2022
9 changes: 7 additions & 2 deletions 0.download_data/README.md
@@ -1,6 +1,6 @@
# Download Data

In this module, we present our method for downloading nucleus morphology data.
In this module, we present our method for downloading and combining nucleus morphology data.

### Download/Preprocess Data

@@ -10,8 +10,13 @@ Complete instructions for data download and preprocessing can be found at: https

In this repository, all training data is downloaded from a version controlled [mitocheck_data](https://github.com/WayScience/mitocheck_data).

An earlier (2006) and a later (2015) dataset are both downloaded from `mitocheck_data` and combined by checking whether the plate/well/frame/coordinate metadata of any cells in the 2006 and 2015 datasets match.
If all of this location metadata matches, the two entries must refer to the same cell, which is added only once to the final combined dataset.
This combination method avoids repeating cells in the combined dataset, which could bias the final model.
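The matching logic described above can be sketched with pandas: stack both releases and keep only the first occurrence of each plate/well/frame/coordinate combination. This is an illustrative sketch, not the repository's `combine_datasets` implementation, and the column names below are hypothetical stand-ins for the real mitocheck_data metadata columns.

```python
import pandas as pd

# Hypothetical location-metadata columns; the real mitocheck_data
# column names may differ.
KEY_COLS = ["Plate", "Well", "Frame", "Location_Center_X", "Location_Center_Y"]

def combine_without_repeats(df_2006: pd.DataFrame, df_2015: pd.DataFrame) -> pd.DataFrame:
    """Stack both datasets, keeping a cell only once when its
    plate/well/frame/coordinates appear in both releases."""
    combined = pd.concat([df_2006, df_2015], ignore_index=True)
    # Rows with identical location metadata describe the same cell,
    # so keep only the first occurrence.
    return combined.drop_duplicates(subset=KEY_COLS, ignore_index=True)

# Toy example: one cell (Plate 1, Well A1, Frame 0) appears in both inputs.
df_a = pd.DataFrame({"Plate": [1, 1], "Well": ["A1", "A2"], "Frame": [0, 0],
                     "Location_Center_X": [10.0, 20.0],
                     "Location_Center_Y": [5.0, 6.0]})
df_b = pd.DataFrame({"Plate": [1, 2], "Well": ["A1", "B1"], "Frame": [0, 3],
                     "Location_Center_X": [10.0, 30.0],
                     "Location_Center_Y": [5.0, 7.0]})
combined = combine_without_repeats(df_a, df_b)
# The repeated cell is kept once, so the combined frame has 3 rows.
```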

The version of mitocheck_data used is pinned by the hash of a specific commit.
The current hash being used is `19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9` which corresponds to [mitocheck_data/19bfa5b](https://github.com/WayScience/mitocheck_data/tree/19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9).
The current hashes being used are `19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9` for the 2006 dataset and `3ebd0ca7c288f608e9b23987a8ddbabd5476bd8f` for the 2015 dataset.
These correspond to [mitocheck_data/19bfa5b](https://github.com/WayScience/mitocheck_data/tree/19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9) and [mitocheck_data/3ebd0ca](https://github.com/WayScience/mitocheck_data/tree/3ebd0ca7c288f608e9b23987a8ddbabd5476bd8f) respectively.
The `hash` variables can be set in [download_data.ipynb](download_data.ipynb) to change which versions of mitocheck_data are accessed.
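Pinning a full commit hash makes the download reproducible, since the raw-file URL is built from the hash plus the in-repo path. A minimal sketch of that URL construction, using the 2006 hash quoted above (the pandas read is left commented because it requires network access):

```python
# Build a version-pinned raw-file URL for mitocheck_data.
hash_2006 = "19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9"
file_url = (
    "https://raw.github.com/WayScience/mitocheck_data/"
    f"{hash_2006}/3.normalize_data/normalized_data/training_data.csv.gz"
)

# pandas can read the gzipped CSV straight from this URL (network required):
# import pandas as pd
# training_data = pd.read_csv(file_url, compression="gzip", index_col=0)
print(file_url)
```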

## Step 1: Download Data
Binary file modified 0.download_data/data/training_data.csv.gz
Binary file not shown.
2,639 changes: 2,558 additions & 81 deletions 0.download_data/download_data.ipynb

Large diffs are not rendered by default.

44 changes: 38 additions & 6 deletions 0.download_data/scripts/nbconverted/download_data.py
@@ -9,29 +9,61 @@
import pandas as pd
import pathlib

import sys
sys.path.append("../utils")
from download_utils import combine_datasets


# ### Specify version of mitocheck_data to download from

# In[2]:


hash = "19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9"
file_url = f"https://raw.github.com/WayScience/mitocheck_data/{hash}/3.normalize_data/normalized_data/training_data.csv.gz"
print(file_url)
hash_2006 = "19bfa5b0959d6b7536f83e7bb85745ba3edf7ff9"
file_url_2006 = f"https://raw.github.com/WayScience/mitocheck_data/{hash_2006}/3.normalize_data/normalized_data/training_data.csv.gz"

hash_2015 = "3ebd0ca7c288f608e9b23987a8ddbabd5476bd8f"
file_url_2015 = f"https://raw.github.com/WayScience/mitocheck_data/{hash_2015}/3.normalize_data/normalized_data/training_data.csv.gz"


# ### Load training data from github
# ### Load/combine training data from github

# In[3]:


training_data = pd.read_csv(file_url, compression="gzip", index_col=0)
training_data_2006 = pd.read_csv(file_url_2006, compression="gzip", index_col=0)
# drop the MitoCheck object ID column, as this ID is not present for cells repeated in the 2015 dataset
training_data_2006 = training_data_2006.drop(columns=["Mitocheck_Object_ID"])

training_data_2015 = pd.read_csv(file_url_2015, compression="gzip", index_col=0)


# In[4]:


training_data_2006


# In[5]:


training_data = combine_datasets(training_data_2006, training_data_2015)


# ### Preview dataset

# In[6]:


# remove the UndefinedCondensed class, which has very low representation
training_data = training_data[training_data["Mitocheck_Phenotypic_Class"] != "UndefinedCondensed"]
training_data = training_data.reset_index(drop=True)
training_data


# ### Save training data

# In[4]:
# In[7]:


training_data_save_dir = pathlib.Path("data/")
Expand Down
13 changes: 6 additions & 7 deletions 1.split_data/README.md
@@ -1,17 +1,16 @@
# 1. Split Data

In this module, we split the training data into training, testing, and holdout datasets.
In this module, we split the training data into training and testing datasets.

First, we split the data into training, test, and holdout subsets in [split_data.ipynb](split_data.ipynb).
The `get_representative_images()` function used to create the holdout dataset determines which images to holdout such that all phenotypic classes can be represented in these holdout images.
The test dataset is determined by taking a random number of samples (stratified by phenotypic class) from the dataset after the holdout images are removed.
The training dataset is the subset remaining after holdout/test samples are removed.
Sample indexes associated with training, test, and holdout subsets are stored in [data_split_indexes.tsv](indexes/data_split_indexes.tsv).
Data is split into subsets in [split_data.ipynb](split_data.ipynb).
The testing dataset is determined by randomly sampling 15% (stratified by phenotypic class) of the single-cell dataset.
The training dataset is the subset remaining after the testing samples are removed.
Sample indexes associated with training and testing subsets are stored in [data_split_indexes.tsv](indexes/data_split_indexes.tsv).
Sample indexes are later used to load subsets from [training_data.csv.gz](../0.download_data/data/training_data.csv.gz).
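The stratified 15% split described above can be sketched with scikit-learn's `train_test_split`. This is a plausible sketch under the README's description, not necessarily the notebook's exact call; the class names and feature values below are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy single-cell table: "Mitocheck_Phenotypic_Class" is the label column;
# the class names and feature values here are illustrative only.
data = pd.DataFrame({
    "feature": range(100),
    "Mitocheck_Phenotypic_Class": ["ClassA"] * 60 + ["ClassB"] * 40,
})

# Split the row indexes (not the rows themselves): 15% test,
# stratified so each phenotypic class keeps its overall proportion.
train_idx, test_idx = train_test_split(
    data.index,
    test_size=0.15,
    stratify=data["Mitocheck_Phenotypic_Class"],
    random_state=0,
)
# Only the indexes would be stored (as in data_split_indexes.tsv);
# the feature values are loaded later from training_data.csv.gz.
```

Saving only the indexes keeps the split file small and lets every downstream module reload the exact same subsets from the single downloaded data file.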

## Step 1: Split Data

Use the commands below to create indexes for training, testing, and holdout data subsets:
Use the commands below to create indexes for training and testing data subsets:

```sh
# Make sure you are located in 1.split_data