This project uses machine learning techniques to try to increase the accuracy of cancer prognosis and survivability estimates, using demographic features as well as disease status, progression and genetics. If successful, it will help improve the quality of life for patients and their loved ones.
This project will attempt to find a solution to the following:
"How can we use Machine Learning to increase the precision of prognostic estimates for cancer patients?"
The motivation for this project is a personal and professional connection to cancer. Like the majority of people, I have been personally, if indirectly, affected by cancer. I also worked in clinical cancer diagnosis and detection for 7+ years and have a deep-seated interest in the field. Cancer prognosis and survivability affect patients, their friends and family, and their quality of life.
Accuracy in prognosis predictions is very important. When a patient is given a prognosis, they begin to map out the remainder of their time. When the prognosis is inaccurate, some are lucky and are given more time, but others are not so lucky and are taken sooner than anticipated. Accurate prognosis can also help the overwhelmed, under-funded and under-staffed healthcare systems around the world. It can help with resource management, allocating precious time, space and treatment resources to those who will benefit the most.
Cancer prognosis affects everyone involved, from patients to families to health care staff, and I hope to develop a model that can better predict this metric to make living with cancer better for everyone.
The data used for this project was downloaded from the cBioPortal for Cancer Genomics: https://www.cbioportal.org/study/summary?id=msk_met_2021 from the MSK MetTropism (MSK, Cell 2021) study. This public site is hosted by the Center for Molecular Oncology at Memorial Sloan Kettering Cancer Center.
Each of the 25775 instances is a unique patient in the study, and each column is a different attribute of this patient and their disease, including:
Column Name | Description |
---|---|
Study ID | ID for the Study where the data is from |
Patient ID | Unique patient identifier |
Sample ID | Unique sample identifier |
Age at Death | Age at which patient died (blank indicates patient is alive at time of study) |
Age at First Mets Dx | Age at which patient was diagnosed with metastatic cancer |
Age at Last Contact | Age at which the study made last contact with the patient |
Age at Sequencing | Age at which the patient's tumour was genetically sequenced |
Age at Surgical Procedure | Age at which patient underwent surgery to remove tumour |
Cancer Type | Type of cancer at diagnosis |
Cancer Type Detailed | Detailed description of cancer type |
Distant Mets: Adrenal Gland | Presence or absence of distant Metastasis at Diagnosis in the Adrenal Gland |
Distant Mets: Biliary tract | Presence or absence of distant Metastasis at Diagnosis in the Biliary tract |
Distant Mets: Bladder/UT | Presence or absence of distant Metastasis at Diagnosis in the Bladder/Urinary Tract |
Distant Mets: Bone | Presence or absence of distant Metastasis at Diagnosis in Bone |
Distant Mets: Bowel | Presence or absence of distant Metastasis at Diagnosis in the Bowel |
Distant Mets: Breast | Presence or absence of distant Metastasis at Diagnosis in the Breast tissue |
Distant Mets: CNS/Brain | Presence or absence of distant Metastasis at Diagnosis in the Central Nervous System (spinal cord)/Brain |
Distant Mets: Distant LN | Presence or absence of distant Metastasis at Diagnosis in distant Lymph Nodes |
Distant Mets: Female Genital | Presence or absence of distant Metastasis at Diagnosis in female genitalia |
Distant Mets: Head and Neck | Presence or absence of distant Metastasis at Diagnosis in the head or neck |
Distant Mets: Intra-Abdominal | Presence or absence of distant Metastasis at Diagnosis in the intra-abdominal area |
Distant Mets: Kidney | Presence or absence of distant Metastasis at Diagnosis in the kidneys |
Distant Mets: Liver | Presence or absence of distant Metastasis at Diagnosis in the liver |
Distant Mets: Lung | Presence or absence of distant Metastasis at Diagnosis in the lungs |
Distant Mets: Male Genital | Presence or absence of distant Metastasis at Diagnosis in the male genitalia |
Distant Mets: Mediastinum | Presence or absence of distant Metastasis at Diagnosis in the mediastinum |
Distant Mets: Ovary | Presence or absence of distant Metastasis at Diagnosis in the ovaries |
Distant Mets: Pleura | Presence or absence of distant Metastasis at Diagnosis in pleural tissue |
Distant Mets: PNS | Presence or absence of distant Metastasis at Diagnosis in the peripheral nervous system |
Distant Mets: Skin | Presence or absence of distant Metastasis at Diagnosis in the skin |
Distant Mets: Unspecified | Presence or absence of distant Metastasis at Diagnosis in unspecified regions |
FGA | Fraction Genome Altered (rounded) |
Fraction Genome Altered | Fraction Genome Altered indicates the fraction of the genome that is copy-number altered. Add the length of all copy-number segments with an absolute value greater than 0.1 and then divide that number by the length of the genome. The resulting number is a fraction. |
Gene Panel | ID of the gene panel used for genetic sequencing |
Metastatic patient | True or False: whether the patient had metastatic disease or not |
Metastatic Site | The anatomic location where the tumour has spread |
Met Count | Number of metastases found |
Met Site Count | Number of different metastatic sites |
MSI Score | Numerical value of the amount of Microsatellite Instability found in the tumour |
MSI Type | Whether the tumour exhibits Microsatellite Instability |
Mutation Count | Number of gene mutations found |
Oncotree Code | The OncoTree is an open-source ontology that was developed at Memorial Sloan Kettering Cancer Center (MSK) for standardizing cancer type diagnosis from a clinical perspective by assigning each diagnosis a unique OncoTree code. |
Organ System | Organ System where the cancer was found |
Overall Survival (Months) | Number of months that patient survived from initial diagnosis |
Overall Survival Status | Survival status: 0:LIVING, 1:DECEASED |
Primary Tumor Site | The organ sub-division where the primary tumour was found |
Race Category | Patient information about race |
Number of Samples Per Patient | Number of samples taken per patient in the study |
Sample coverage | The number of unique sequencing reads that align to a region in a reference genome |
Sample Type | Primary tumour or metastasis |
Sex | Biological sex at birth |
Subtype | Cancer subtype |
Subtype Abbreviation | Cancer subtype abbreviation |
TMB (nonsynonymous) | The number of non-synonymous mutations within coding regions across the genome. Non-synonymous mutations alter coding regions and change the resulting protein into dysfunctional or malformed protein products. |
Tumor Purity | The proportion of tumour cells in the tumour microenvironment (TME) |
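
The Fraction Genome Altered description in the table amounts to a small calculation. The sketch below follows that description literally on made-up numbers. The segment tuples, the toy genome length and the function name `fraction_genome_altered` are all assumptions for illustration, not part of the study's pipeline, and the segment value is assumed to be a copy-number ratio whose absolute value is compared against 0.1.

```python
from typing import List, Tuple

# Hypothetical segment format: (start, end, value).
# "value" is assumed to be the copy-number value whose absolute
# value is compared against the 0.1 threshold.
Segment = Tuple[int, int, float]


def fraction_genome_altered(
    segments: List[Segment], genome_length: int, threshold: float = 0.1
) -> float:
    """Sum the lengths of altered segments and divide by the genome length."""
    altered_length = sum(
        end - start for start, end, value in segments if abs(value) > threshold
    )
    return altered_length / genome_length


# Toy example: a 10,000-base "genome" with two altered segments.
segments = [
    (0, 1000, 0.05),     # below threshold, not counted
    (1000, 3000, 0.45),  # gain, 2000 bases counted
    (3000, 4000, -0.30), # loss, 1000 bases counted
    (4000, 10000, 0.0),  # neutral, not counted
]
fga = fraction_genome_altered(segments, genome_length=10000)
print(fga)  # 0.3
```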
Project progression:
Thus far I have completed the Data Collection, the preliminary Data Cleaning and the initial EDA.
I have performed my first iterations of feature selection and engineering.
I have performed my first iterations of the baseline ML models. I have determined this will be a classification problem, classifying patients into 1 of 4 survival duration categories:
- < 1 year: very poor prognosis
- ≥ 1 and < 2 years: poor prognosis
- ≥ 2 and < 4 years: intermediate prognosis
- ≥ 4 years: good prognosis
Given the constraints of my data, the groupings above gave the best class distribution I could achieve. Ideally, I would have a wider data set so that I could group survival as per industry standards, but this was the best I could do with the data available. For future modelling, I would like to collect data with a much higher variance in survival.
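One way to derive the four categories from the `Overall Survival (Months)` column is `pandas.cut` with left-closed bins. This is a minimal sketch on invented survival values, not the project's actual code; the `prognosis` column name and the short labels are assumptions. It bins on months alone and does not address how patients still living at last contact were handled.

```python
import pandas as pd

# Invented survival durations in months, for illustration only.
df = pd.DataFrame(
    {"Overall Survival (Months)": [3.0, 12.0, 23.9, 24.0, 47.5, 48.0, 80.2]}
)

# Bin edges in months: [0, 12), [12, 24), [24, 48), [48, inf)
bins = [0, 12, 24, 48, float("inf")]
labels = ["very poor", "poor", "intermediate", "good"]

# right=False makes each bin include its lower edge and exclude its upper edge,
# matching "< 1 year", ">= 1 and < 2 years", and so on.
df["prognosis"] = pd.cut(
    df["Overall Survival (Months)"], bins=bins, labels=labels, right=False
)
print(df)
```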
I have fit baseline models for:
- Multiclass Logistic Regression
- SVM-OvA
- SVM-OvO
- K Nearest Neighbors
- Decision Tree Classifier
- Random Forest Classifier
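
The baselines listed above can be fit side by side with scikit-learn. The sketch below uses synthetic data from `make_classification` as a stand-in for the cleaned patient table, so the scores it prints say nothing about the real results. Default hyperparameters, the scaling step and the train/test split are assumptions. XGBoost, which is discussed later in this README, is left out to keep the example to one library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
baselines = {
    "Multiclass Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM-OvA": make_pipeline(StandardScaler(), OneVsRestClassifier(SVC())),
    "SVM-OvO": make_pipeline(StandardScaler(), OneVsOneClassifier(SVC())),
    "K Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(f"{name}: accuracy = {scores[name]:.3f}")
```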
Based on accuracy, F1 score and area under the ROC curve (AUC), I determined my best baseline models were:
- Logistic Regression
- XGBoost

These 2 models performed similarly in the baseline modelling and have good power to distinguish class 1 (very poor prognosis) and class 4 (good prognosis) from the other classes. They have adequate power to distinguish class 3 and perform worst on class 2. This is acceptable and expected: we want the best power to get true positives for class 1 and class 4, as these will impact the patient's quality of life the most.
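The three metrics and the per-class view described above can be computed with scikit-learn as follows. This is a sketch on synthetic data, so the numbers are not the project's results. The macro averaging for F1 and the one-vs-rest setting for multiclass AUC are assumptions, since the README does not state which variants were used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)  # class probabilities, needed for AUC

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr")

# Recall per class is the true positive rate for each prognosis category.
per_class_recall = recall_score(y_test, y_pred, average=None)

print(f"accuracy={accuracy:.3f} macro F1={macro_f1:.3f} AUC (OvR)={auc_ovr:.3f}")
for label, recall in zip(model.classes_, per_class_recall):
    print(f"class {label}: recall = {recall:.3f}")
```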
The next step I performed was to optimize the Logistic Regression and XGBoost models using K-fold cross-validation. I found the best model to be the Logistic Regression classifier. It performed best at predicting patients in the very poor prognosis and good prognosis categories, which can be the hardest to predict and are the most important.
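A common way to tune a model with K-fold cross-validation is a grid search over candidate hyperparameters. The sketch below does this for logistic regression on synthetic data. The grid over `C`, the 5 folds and the `f1_macro` scoring are assumptions chosen for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the real feature matrix.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=4, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)

# Hypothetical grid: inverse regularization strength for the classifier step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1_macro")
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean CV macro F1:", round(search.best_score_, 3))
```

With imbalanced classes, `StratifiedKFold` is a drop-in alternative that keeps the class proportions similar across folds.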
Stay tuned for updates!
Some learnings from the EDA and Feature Engineering (AHA! moments):
- Knowledge of the problem space is very important: I was able to boost my model's performance by removing features I knew were not necessary.
- I originally tried regression modelling, then realized that this problem does not work as a regression model and needs to be a classification model.
The model did well in predicting patients with Very Poor and Good prognosis compared to the other categories. Features such as Age, Metastatic Status, Metastatic Progression (number of metastases and number of metastatic sites), ... contribute the most to predicting prognosis category as defined above.
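One simple way to see which features a logistic regression leans on is to standardize the inputs and rank features by the mean absolute coefficient across classes. The README does not say how feature contributions were assessed here, so this is only an assumed approach. The data is synthetic and the column names are borrowed from the table above purely as labels, so the ranking it prints is meaningless for the real study.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Column names reused as labels only; the values below are synthetic.
feature_names = [
    "Age at Sequencing",
    "Met Count",
    "Met Site Count",
    "Mutation Count",
    "FGA",
    "TMB (nonsynonymous)",
]
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# Standardizing puts coefficients on a comparable scale across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# coef_ has shape (n_classes, n_features); average magnitudes over classes.
importance = pd.Series(
    np.abs(model.coef_).mean(axis=0), index=feature_names
).sort_values(ascending=False)
print(importance)
```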
The trained model had adequate predictive power; however, next steps include:
- Designing a neural network (from scratch or using transfer learning) to get even more powerful predictions.
- Collecting more patient data and further engineering my feature space for more predictive power.
- Gathering a wider feature space to have more variance in the features present.
- Ideally, including more descriptive features that could aid predictive power.
Features I believe would help create an even better model are:
- Tumor size
- Metastases size
- Specific mutated genes (there are a variety of genes that can be mutated that indicate prognosis status)
- Treatment (what treatment the patient is on can drastically change their prognosis)
- Tumor excision status: were the primary or metastatic tumors excised fully, partially, or not at all? If the tumor has been removed, this can change the patient's prognosis.
Moving forward, I would like to continue working on this project to make the model perform even better, and possibly integrate more features and more patient data. I will also attempt to create a Streamlit app and a dashboard to display my project and results thus far.