datapro4hire/weCAN-A-Cancer-Survivability-Predictor

 
 



🧬 weCAN: A Cancer Survivability Predictor 🧬

Machine Learning and Cancer Survivability


🔎 Project Overview

This project uses Machine Learning techniques to try to increase the accuracy of cancer prognosis and survivability estimates, using demographic features as well as disease status, progression and genetic features. If successful, it will help improve the quality of life for patients and their loved ones.

This project will attempt to find a solution to the following:

"How can we use Machine Learning to increase the precision of prognostic estimates for cancer patients?"

📖 Table of Contents

  1. Project Motivation
  2. The Data
  3. Data Dictionary
  4. Project Roadmap
  5. Learnings
  6. Conclusions
  7. Next Steps

(back to top)

💪🏽 Project Motivation

The motivation for this project is a personal and professional connection to cancer. Like the majority of people, I have been personally, if indirectly, affected by cancer. I also worked in clinical cancer diagnosis and detection for 7+ years and have a deep-seated interest in the field. Cancer prognosis and survivability affect patients, their friends and family, and their quality of life.

Accuracy in prognosis predictions is very important. When a patient is given a prognosis, they begin to map out the remainder of their time. When the prognosis is inaccurate, some are lucky and are given more time, but others are not so lucky and are taken sooner than anticipated. Accurate prognosis can also help the overwhelmed, under-funded and under-staffed healthcare systems around the world. It can help with resource management by allocating precious time, space and treatment resources to those who will benefit the most.

Cancer prognosis affects everyone involved, from patients to families to health care staff, and I hope to develop a model that can better predict this metric to make living with cancer better for everyone.

📊 The Data

The data used for this project was downloaded from the cBioPortal for Cancer Genomics: https://www.cbioportal.org/study/summary?id=msk_met_2021 from the MSK MetTropism (MSK, Cell 2021) study.

This public site is hosted by the Center for Molecular Oncology at Memorial Sloan Kettering Cancer Center.

Each of the 25,775 instances is a unique patient in the study, and each column is a different attribute of the patient and their disease, as listed in the data dictionary below.

📖 Data Dictionary

| Column Name | Description |
| --- | --- |
| Study ID | ID of the study the data comes from |
| Patient ID | Unique patient identifier |
| Sample ID | Unique sample identifier |
| Age at Death | Age at which the patient died (blank indicates the patient was alive at the time of the study) |
| Age at First Mets Dx | Age at which the patient was diagnosed with metastatic cancer |
| Age at Last Contact | Age at which the study made last contact with the patient |
| Age at Sequencing | Age at which the patient's tumour was genetically sequenced |
| Age at Surgical Procedure | Age at which the patient underwent surgery to remove the tumour |
| Cancer Type | Type of cancer at diagnosis |
| Cancer Type Detailed | Detailed description of cancer type |
| Distant Mets: Adrenal Gland | Presence or absence of distant metastasis at diagnosis in the adrenal gland |
| Distant Mets: Biliary tract | Presence or absence of distant metastasis at diagnosis in the biliary tract |
| Distant Mets: Bladder/UT | Presence or absence of distant metastasis at diagnosis in the bladder/urinary tract |
| Distant Mets: Bone | Presence or absence of distant metastasis at diagnosis in bone |
| Distant Mets: Bowel | Presence or absence of distant metastasis at diagnosis in the bowel |
| Distant Mets: Breast | Presence or absence of distant metastasis at diagnosis in breast tissue |
| Distant Mets: CNS/Brain | Presence or absence of distant metastasis at diagnosis in the central nervous system (spinal cord)/brain |
| Distant Mets: Distant LN | Presence or absence of distant metastasis at diagnosis in distant lymph nodes |
| Distant Mets: Female Genital | Presence or absence of distant metastasis at diagnosis in female genitalia |
| Distant Mets: Head and Neck | Presence or absence of distant metastasis at diagnosis in the head or neck |
| Distant Mets: Intra-Abdominal | Presence or absence of distant metastasis at diagnosis in the intra-abdominal area |
| Distant Mets: Kidney | Presence or absence of distant metastasis at diagnosis in the kidneys |
| Distant Mets: Liver | Presence or absence of distant metastasis at diagnosis in the liver |
| Distant Mets: Lung | Presence or absence of distant metastasis at diagnosis in the lungs |
| Distant Mets: Male Genital | Presence or absence of distant metastasis at diagnosis in male genitalia |
| Distant Mets: Mediastinum | Presence or absence of distant metastasis at diagnosis in the mediastinum |
| Distant Mets: Ovary | Presence or absence of distant metastasis at diagnosis in the ovaries |
| Distant Mets: Pleura | Presence or absence of distant metastasis at diagnosis in pleural tissue |
| Distant Mets: PNS | Presence or absence of distant metastasis at diagnosis in the peripheral nervous system |
| Distant Mets: Skin | Presence or absence of distant metastasis at diagnosis in the skin |
| Distant Mets: Unspecified | Presence or absence of distant metastasis at diagnosis in unspecified regions |
| FGA | Fraction Genome Altered (rounded) |
| Fraction Genome Altered | The fraction of the genome that is copy-number altered. Add the length of all copy-number segments with an absolute value greater than 0.1, then divide that number by the length of the genome. The resulting number is a fraction. |
| Gene Panel | ID of the gene panel used for genetic sequencing |
| Metastatic patient | True or False: whether the patient had metastatic disease |
| Metastatic Site | The anatomic location where the tumour has spread |
| Met Count | Number of metastases found |
| Met Site Count | Number of different metastatic sites |
| MSI Score | Numerical value of the amount of Microsatellite Instability found in the tumour |
| MSI Type | Whether the tumour exhibits Microsatellite Instability |
| Mutation Count | Number of gene mutations found |
| Oncotree Code | The OncoTree is an open-source ontology that was developed at Memorial Sloan Kettering Cancer Center (MSK) for standardizing cancer type diagnosis from a clinical perspective by assigning each diagnosis a unique OncoTree code. |
| Organ System | Organ system where the cancer was found |
| Overall Survival (Months) | Number of months the patient survived from initial diagnosis |
| Overall Survival Status | Survival status: 0:LIVING, 1:DECEASED |
| Primary Tumor Site | The organ sub-division where the primary tumour was found |
| Race Category | Patient information about race |
| Number of Samples Per Patient | Number of samples taken per patient in the study |
| Sample coverage | The number of unique sequencing reads that align to a region in a reference genome |
| Sample Type | Primary tumour or metastasis |
| Sex | Biological sex at birth |
| Subtype | Cancer subtype |
| Subtype Abbreviation | Cancer subtype abbreviation |
| TMB (nonsynonymous) | The number of non-synonymous mutations within coding regions across the genome. Non-synonymous mutations alter coding regions and change the resulting protein into dysfunctional or malformed protein products. |
| Tumor Purity | The proportion of tumour cells in the tumour microenvironment (TME) |
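
The Fraction Genome Altered description above amounts to a simple ratio, which can be sketched as follows. This is an illustration only: the function name, the `(start, end, value)` segment layout and the toy numbers are assumptions, not code or data from the study, and the 0.1 threshold is the one quoted in the dictionary.

```python
from typing import Iterable, Tuple


def fraction_genome_altered(
    segments: Iterable[Tuple[int, int, float]],
    genome_length: int,
    threshold: float = 0.1,
) -> float:
    """Return the fraction of the genome covered by copy-number-altered segments.

    Each segment is (start, end, value). A segment counts as altered when
    abs(value) is greater than the threshold.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    altered_length = sum(
        end - start for start, end, value in segments if abs(value) > threshold
    )
    return altered_length / genome_length


# Toy example on a hypothetical 1,000-base genome (not real data).
toy_segments = [
    (0, 200, 0.05),    # below the threshold, ignored
    (200, 500, -0.4),  # loss, counted (300 bases)
    (500, 600, 0.3),   # gain, counted (100 bases)
]
fga = fraction_genome_altered(toy_segments, genome_length=1000)
print(fga)  # 0.4
```

Here 400 of the 1,000 bases sit in segments whose absolute value exceeds 0.1, so the fraction is 0.4.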

(back to top)

🚙 Project Roadmap

(back to top)

Project progression:

Thus far I have completed the Data Collection, the preliminary Data Cleaning and the initial EDA.
I have performed my first iterations of feature selection and engineering, and my first iterations of the baseline ML models. I have determined this will be a classification problem, classifying patients into 1 of 4 survival duration categories:

  • < 1 year: very poor prognosis
  • ≥ 1 and < 2 years: poor prognosis
  • ≥ 2 and < 4 years: intermediate prognosis
  • ≥ 4 years: good prognosis
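
The four categories above can be sketched as a small binning function. This is a minimal illustration, not the notebook's actual code: it assumes survival is given in months (as in the Overall Survival (Months) column), that one year is 12 months, and that classes are numbered 1 to 4. The names `BOUNDARIES_MONTHS`, `LABELS` and `prognosis_class` are hypothetical.

```python
from bisect import bisect_right

# Lower bounds (in months) of classes 2, 3 and 4: 1, 2 and 4 years.
BOUNDARIES_MONTHS = [12, 24, 48]
LABELS = {1: "very poor", 2: "poor", 3: "intermediate", 4: "good"}


def prognosis_class(survival_months: float) -> int:
    """Map overall survival in months to a class from 1 (very poor) to 4 (good)."""
    if survival_months < 0:
        raise ValueError("survival_months must be non-negative")
    # bisect_right counts how many boundaries are <= the value, so a value
    # exactly on a boundary (e.g. 12 months) falls into the higher class.
    return bisect_right(BOUNDARIES_MONTHS, survival_months) + 1


for months in (3.5, 12, 30, 48):
    cls = prognosis_class(months)
    print(months, cls, LABELS[cls])
```

With pandas, the same binning could be done with `pd.cut` using `bins=[0, 12, 24, 48, float("inf")]` and `right=False`.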

Given the constraints of my data, the above groupings were what I had to work with to get the best data distribution. Ideally, I would have a wider data set so I could group survival as per industry standards; given what I have to work with, this was the best I could do. For future modelling, I would like to collect data with a much higher variance in survival.

I have fit baseline models for:

  • Multiclass Logistic Regression
  • SVM-OvA
  • SVM-OvO
  • K Nearest Neighbors
  • Decision Tree Classifier
  • Random Forest Classifier
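
As a rough sketch of how such baselines might be set up with scikit-learn: the code below uses a synthetic four-class data set from `make_classification` as a stand-in for the cleaned patient features, and default hyperparameters throughout. The feature scaling step and every setting here are assumptions for illustration, not the notebook's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and 4-class target.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One entry per baseline listed above; scale-sensitive models get a scaler.
models = {
    "Multiclass Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM-OvA": make_pipeline(StandardScaler(), OneVsRestClassifier(SVC())),
    "SVM-OvO": make_pipeline(StandardScaler(), OneVsOneClassifier(SVC())),
    "K Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
}

baseline_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    baseline_accuracy[name] = model.score(X_test, y_test)  # held-out accuracy
    print(f"{name}: {baseline_accuracy[name]:.3f}")
```

Keeping the models in one dictionary makes it easy to add or drop a baseline without touching the evaluation loop.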

Based on accuracy, F1 score and area under the ROC curve (AUC), I determined my best baseline models were:

  • Logistic Regression
  • XGBoost

These two models performed similarly in the baseline modelling and have good power to distinguish class 1 (very poor prognosis) and class 4 (good prognosis) from the other classes. They have adequate power to distinguish class 3 and perform worst on class 2. This is acceptable and expected: we want the best power to get true positives for class 1 and class 4, as these will impact the patient's quality of life the most.
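
The three metrics named above, plus a per-class view of how well each prognosis class is separated from the rest, could be computed along these lines. This is a sketch on synthetic data with a logistic regression only (XGBoost is left out to keep the example dependency-free). The macro averaging and the one-vs-rest AUC are assumptions, since the averaging scheme used in the project is not stated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic stand-in data; labels shifted to 1..4 to mirror the class numbering.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4, random_state=0
)
y = y + 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)  # columns follow clf.classes_

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
macro_auc = roc_auc_score(y_test, y_proba, multi_class="ovr")

# One-vs-rest AUC per class: how well each class is separated from the others.
y_onehot = label_binarize(y_test, classes=clf.classes_)
per_class_auc = {
    int(cls): roc_auc_score(y_onehot[:, i], y_proba[:, i])
    for i, cls in enumerate(clf.classes_)
}

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} macro_auc={macro_auc:.3f}")
for cls, auc in per_class_auc.items():
    print(f"class {cls}: one-vs-rest AUC = {auc:.3f}")
```

The per-class AUC values are what would show a pattern like the one described: stronger separation for classes 1 and 4 than for classes 2 and 3.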

The next step I performed was to optimize the Logistic Regression and XGBoost models using k-fold cross-validation. I found the best model to be the Logistic Regression Classifier. It performed best at predicting patients in the very poor prognosis category and the good prognosis category, which can be the hardest to predict and are the most important.
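
A minimal sketch of tuning a logistic regression with k-fold cross-validation in scikit-learn is shown below. The synthetic data, the grid of `C` values, the five folds and the `f1_macro` scoring are all assumptions for illustration. The search space and scoring used in the project are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and 4-class target.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=4, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean CV macro F1:", round(search.best_score_, 3))
```

Putting the scaler inside the pipeline avoids leaking statistics from the validation fold into training, which matters when comparing models by cross-validated score.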

Stay tuned for updates!

💡 Learnings

Some learnings from the EDA and Feature Engineering (AHA! moments):

  • Knowledge of the problem space is very important: I was able to boost my model's performance by removing features I knew were not necessary.
  • I originally tried regression modelling, then realized the problem does not work as a regression model and needs to be a classification model.

(back to top)

🎬 Conclusions

In conclusion, I was able to create a machine learning model that helps to solve the problem: "How can we use Machine Learning to increase the precision of prognostic estimates for cancer patients?"

The model did well at predicting patients with Very Poor and Good prognosis compared to the other categories. Features such as Age, Metastatic Status, Metastatic Progression (number of metastases and number of metastatic sites), ... contribute the most to predicting the prognosis category as defined above.

(back to top)

⏭️ Next Steps

The trained model had adequate predictive power. Next steps from here include:

  • Designing a neural network (from scratch or using transfer learning) to get even more powerful predictions.
  • Collecting more patient data and further engineering my feature space for more predictive power.
  • Gathering a wider feature space to have more variance in the features present.
  • Ideally including more descriptive features that could aid in predictive power.

Features I believe would help create an even better model are:

  • Tumor size
  • Metastases size
  • Specific mutated genes (there are a variety of genes that can be mutated that indicate prognosis status)
  • Treatment (what treatment the patient is on can drastically change their prognosis)
  • Tumor excision status: were the primary or metastatic tumors excised fully, partially, or not at all? If the tumor has been removed, this can change the patient's prognosis.

Moving forward, I would like to continue working on this project to make the model perform even better, and possibly integrate more features and more patient data. I will also attempt to create a Streamlit app and a dashboard to display my project and results thus far.

(back to top)

About

Building a machine learning model to predict cancer survivability
