HCC-Survival

Introduction

Survival data were collected on patients with liver cancer (hepatocellular carcinoma, or HCC) at a university hospital in Portugal. The response variable is survival at 1 year after initial diagnosis, coded as lives = 1 and dies = 0. The dataset contains demographic, risk-factor, and laboratory data for 165 patients diagnosed with HCC. It is heterogeneous, with 23 quantitative and 26 qualitative predictor variables. Missing values account for 10.2% of the whole dataset, and only 8 patients have complete data in all fields.

The question addressed here is which demographic or clinical variables contribute to a patient’s survival of HCC beyond 1 year. The work begins with an exploratory analysis: finding correlated variables, imputing missing values, and characterizing the distributions via histograms and boxplots. The entire dataset is then analyzed using several predictive models, including logistic regression, k-nearest neighbors (kNN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), Gaussian finite mixture models using the mclust package, random forest, and support vector machines (SVM). Each model was run using three resampling schemes: the validation set approach (VSA), which splits the data into a 50% training set and a 50% test set; leave-one-out cross-validation (LOOCV); and 5-fold cross-validation. From these results, the best performing models (determined by test error rate) are run again on a subset of predictors chosen using forward and backward stepwise selection. These results are also reported, and the best model is chosen.
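The three resampling schemes can be illustrated with a minimal Python sketch. This is not the code used in the analysis (which is written in R); the function names, the seed, and the way folds are formed are assumptions chosen for illustration only.

```python
import random

def validation_split(n, train_fraction=0.5, seed=0):
    """Validation set approach: one random split into train and test indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold_splits(n, k, seed=0):
    """k-fold CV: yield (train, test) index lists; each row is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def loocv_splits(n):
    """LOOCV is k-fold CV with k = n: each row is its own test set."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

n_patients = 165  # number of rows reported for the HCC dataset
train, test = validation_split(n_patients)
five_fold = list(k_fold_splits(n_patients, k=5))
loo = list(loocv_splits(n_patients))

print(len(train), len(test))                 # sizes of the single VSA split
print([len(te) for _, te in five_fold])      # test-fold sizes for 5-fold CV
print(len(loo))                              # number of LOOCV fits
```

With 165 rows, VSA fits each model once, 5-fold CV fits it five times, and LOOCV fits it 165 times, which is why VSA estimates tend to be the noisiest of the three.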

Notes on the analysis models: Since this is survival data, special consideration is required for analysis; namely, survival data is generally not normally distributed. Because LDA and QDA assume normally distributed predictors, this dataset is not ideal for them; however, these models are still run for comparison. Instead, I anticipate that this dataset is better suited to logistic regression, SVM, or non-parametric models such as kNN. Which of these performs best depends on the shape of the decision boundary. If the boundary is linear, logistic regression or a linear SVM can be used. If it is not, kNN or a polynomial SVM would be the model of choice. However, with only 165 rows, training data is limited, which is not optimal for a kNN model. If the decision boundary turns out to be nonlinear, more data collection will therefore be required.

The HCC dataset can be found here.

Exploratory Analysis

Imputation of Missing Values

The plot below shows how many attributes contain missing values and what percentage of each attribute is missing. Three attributes in particular contain more than 40% missing values. This percentage is relatively low compared to other datasets, and therefore no attributes are excluded based on missing values alone.

Percent Missing Values in HCC Survival Dataset
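As a rough illustration of how the per-attribute percentages behind such a plot can be computed, here is a minimal Python sketch. The `"?"` missing-value marker, the column names, and the toy rows are all assumptions made for the example; they are not taken from the HCC data file.

```python
def percent_missing(rows, columns, missing_marker="?"):
    """Return {column: percent of rows whose value equals missing_marker}."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            if value == missing_marker:
                counts[name] += 1
    return {name: 100.0 * counts[name] / len(rows) for name in columns}

# Hypothetical toy rows, not values from the HCC dataset.
columns = ["response", "lab_a", "lab_b"]
rows = [
    ["1", "67", "?"],
    ["0", "62", "?"],
    ["1", "?", "28"],
    ["1", "71", "?"],
    ["0", "58", "90"],
]

missing = percent_missing(rows, columns)
# Attributes above the 40% mark discussed in the text.
heavily_missing = [name for name, pct in missing.items() if pct > 40.0]
# The response column should have no missing values at all.
response_is_complete = missing["response"] == 0.0
print(missing, heavily_missing, response_is_complete)
```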

I also verify that there are no missing values in the response variable, since rows without a known outcome cannot be used to train or evaluate a model.

## [1] "Missing values in response variable (Survival): "

## NULL

Correlation Table

Correlated attributes are reported in the table below using a custom function that reports the most highly correlated pairs (absolute Pearson correlation coefficient greater than 0.7). Direct.Bilirubin, Oxygen.Saturation, Aspartate.transaminase, and Grams.of.Alcohol.per.day are excluded from the analysis because they have more missing values than their counterparts. Surprisingly, the Pearson correlation coefficient for Smoking and Packs.of.cigarets.per.year is only 0.436. Nevertheless, Packs.of.cigarets.per.year is excluded as well, since this attribute is inherently related to Smoking.

Correlated Variables for HCC Survival Dataset

	row	column	cor
1027	Total.Bilirubin	Direct.Bilirubin	0.978
1128	Iron	Oxygen.Saturation	0.783
741	Alanine.transaminase	Aspartate.transaminase	0.728
279	Alcohol	Grams.of.Alcohol.per.day	0.713
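The custom function itself is not shown here, so the following is only a Python sketch of the general idea: compute the Pearson correlation for every pair of numeric columns and keep the pairs whose absolute value exceeds 0.7. The column names and values are hypothetical, and the sketch assumes complete columns (no handling of missing values).

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def high_correlations(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    pairs = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    # Strongest correlations first, as in the table above.
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Hypothetical columns: "b" is roughly 2 * "a"; "c" is unrelated to both.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
correlated = high_correlations(toy)
print(correlated)
```

From each flagged pair, one attribute would then be dropped; the text above drops the one with more missing values.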

Histograms and BoxPlots

The following histograms and boxplots illustrate the distribution of each continuous and categorical predictor variable. Interestingly, at first glance survival does not seem to be affected by the variable Number of Nodules, which is counterintuitive. However, there might be differences in survival based on the variables Leukocytes, Albumin, Gamma Glutamyl Transferase, and Alkaline Phosphatase.

Histograms of Continuous Variables in HCC Survival Dataset

Boxplots of Categorical Variables in HCC Survival Dataset

Analysis using Entire Dataset

Preliminary Findings: Results for all models are reported in table 2. SVM with a linear or polynomial kernel performed best, with 5-fold CV test error rates below 25%. Logistic regression, LDA, and MclustDA with modelType=EDDA also performed well, with LOOCV and 5-fold CV test error rates between 25% and 30%. However, since survival data usually violates the normality assumption, LDA will no longer be considered.

Model Comparison of Test Error Rates (as percent)

Method	VSA	LOOCV	Five.fold.CV
Logistic Regression	31.3	27.3	27.3
kNN	32.5	36.4	37.0
LDA	28.9	27.9	25.5
QDA	39.8	40.0	41.2
MclustDA	49.4	49.7	40.6
MclustDA, Model Type = EDDA	43.0	27.9	29.1
Random Forest	34.9	41.3	40.9
SVM Linear	30.0	26.7	23.0
SVM Radial	40.0	38.2	38.2
SVM Polynomial	30.0	36.4	23.6

Stepwise Selection

Since this dataset has many features, prediction accuracy might be improved by selecting the most relevant ones. A subset of predictors is chosen using forward and backward stepwise selection, and then the best performing models (test error rates below 30%) are run again.
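The 30% cut-off can be checked against the model comparison table with a short Python sketch. The error rates are copied from that table; restricting the filter to the LOOCV and 5-fold CV columns and leaving out LDA (dropped in the preliminary findings) are assumptions about how the shortlist was formed.

```python
# Test error rates (percent) copied from the model comparison table.
error_rates = {
    "Logistic Regression": {"VSA": 31.3, "LOOCV": 27.3, "Five.fold.CV": 27.3},
    "kNN": {"VSA": 32.5, "LOOCV": 36.4, "Five.fold.CV": 37.0},
    "LDA": {"VSA": 28.9, "LOOCV": 27.9, "Five.fold.CV": 25.5},
    "QDA": {"VSA": 39.8, "LOOCV": 40.0, "Five.fold.CV": 41.2},
    "MclustDA": {"VSA": 49.4, "LOOCV": 49.7, "Five.fold.CV": 40.6},
    "MclustDA, Model Type = EDDA": {"VSA": 43.0, "LOOCV": 27.9, "Five.fold.CV": 29.1},
    "Random Forest": {"VSA": 34.9, "LOOCV": 41.3, "Five.fold.CV": 40.9},
    "SVM Linear": {"VSA": 30.0, "LOOCV": 26.7, "Five.fold.CV": 23.0},
    "SVM Radial": {"VSA": 40.0, "LOOCV": 38.2, "Five.fold.CV": 38.2},
    "SVM Polynomial": {"VSA": 30.0, "LOOCV": 36.4, "Five.fold.CV": 23.6},
}

excluded_models = {"LDA"}                 # assumption: dropped for normality reasons
schemes = ("LOOCV", "Five.fold.CV")       # assumption: VSA results are not carried forward
threshold = 30.0

shortlist = [
    (model, scheme)
    for model, rates in error_rates.items()
    if model not in excluded_models
    for scheme in schemes
    if rates[scheme] < threshold
]
for model, scheme in shortlist:
    print(model, scheme, error_rates[model][scheme])
```

Under those assumptions the filter keeps seven model and resampling combinations.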

Forward Stepwise Selection

Forward stepwise selection reduces the original 44 predictors to only 23. The new formula becomes:

Survival = Alcohol + Hepatitis.B.Surface.Antigen + Hepatitis.C.Virus.Antibody + Smoking + Diabetes + Hemochromatosis + Arterial.Hypertension + Nonalcoholic.Steatohepatitis + Splenomegaly + Portal.Hypertension + Portal.Vein.Thrombosis + Age.at.diagnosis + Performance.Status + Encefalopathy.degree + Ascites.degree + AlphaFetoprotein + Haemoglobin + Total.Bilirubin + Alanine.transaminase + Alkaline.phosphatase + Major.dimension.of.nodule + Iron + Ferritin

Backward Stepwise Selection

Backward stepwise selection reduces the original 44 predictors to 22. The new formula becomes:

Survival = Alcohol + Hepatitis.B.Surface.Antigen + Hepatitis.C.Virus.Antibody + Smoking + Diabetes + Hemochromatosis + Arterial.Hypertension + Nonalcoholic.Steatohepatitis + Splenomegaly + Portal.Hypertension + Portal.Vein.Thrombosis + Age.at.diagnosis + Performance.Status + Encefalopathy.degree + Ascites.degree + AlphaFetoprotein + Haemoglobin + Total.Bilirubin + Alanine.transaminase + Alkaline.phosphatase + Major.dimension.of.nodule + Ferritin
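The selection criterion used for the two procedures is not stated, so the sketch below only shows the greedy search itself, in Python, with a pluggable `score` function where lower is better (AIC would be one common choice, but that is an assumption). The predictor names `x1` to `x4`, the `gains` table, and `toy_score` are hypothetical.

```python
def forward_stepwise(candidates, score):
    """Start empty; repeatedly add the predictor that lowers score the most."""
    selected, remaining = [], list(candidates)
    best = score(selected)
    while remaining:
        trials = {p: score(selected + [p]) for p in remaining}
        p_add = min(trials, key=trials.get)
        if trials[p_add] >= best:      # no addition improves the score
            break
        selected.append(p_add)
        remaining.remove(p_add)
        best = trials[p_add]
    return selected, best

def backward_stepwise(candidates, score):
    """Start with all predictors; repeatedly drop the one whose removal helps most."""
    selected = list(candidates)
    best = score(selected)
    while selected:
        trials = {p: score([q for q in selected if q != p]) for p in selected}
        p_drop = min(trials, key=trials.get)
        if trials[p_drop] >= best:     # no removal improves the score
            break
        selected.remove(p_drop)
        best = trials[p_drop]
    return selected, best

# Hypothetical criterion: each predictor explains some "gain" but costs 1.0.
gains = {"x1": 4.0, "x2": 2.5, "x3": 0.5, "x4": 0.25}

def toy_score(subset):
    return 10.0 - sum(gains[p] for p in subset) + 1.0 * len(subset)

forward_result = forward_stepwise(gains, toy_score)
backward_result = backward_stepwise(gains, toy_score)
print(forward_result, backward_result)
```

On this toy criterion both searches agree, but because each is greedy they need not reach the same subset on real data, which is consistent with the 23 versus 22 predictors reported above.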

Analysis using Subset of Predictors

Results for all models using a subset of predictors are reported in table 3. Test error rates fall for nearly every model relative to the full-dataset analysis; the one exception is MclustDA (EDDA) under LOOCV with forward selection, which rises from 27.9% to 28.5%. Forward stepwise selection matches or outperforms backward stepwise selection in all but two cases (MclustDA EDDA with LOOCV and SVM polynomial with 5-fold CV). In general, logistic regression using LOOCV and SVM performed better than the other models, and four of those results had test error rates below 20%. SVM using a polynomial kernel and backward stepwise selection performed best, with a test error rate of 18.2%. However, the polynomial kernel uses degree = 1, which is equivalent to a linear kernel (the difference in results comes from the other parameters of the svm algorithm).

Top Performing Models with Subset of Predictors

Models	Forward_Selection	Backward_Selection
Logistic Regression LOOCV	19.4	20.6
Logistic Regression 5-fold CV	24.8	24.8
MclustDA, Model Type = EDDA LOOCV	28.5	27.3
MclustDA, Model Type = EDDA 5-fold CV	26.7	26.7
SVM Linear LOOCV	20.0	20.6
SVM Linear 5-fold CV	19.4	20.0
SVM Polynomial 5-fold CV	19.4	18.2
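The remark that a degree-1 polynomial kernel is equivalent to a linear kernel can be checked numerically. The Python sketch below assumes the common libsvm-style parameterization, (gamma * u·v + coef0)^degree; the vectors and the gamma and coef0 values are arbitrary illustrations, not the settings used in the analysis.

```python
def linear_kernel(u, v):
    """Plain dot product u . v."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree, gamma=1.0, coef0=0.0):
    """Polynomial kernel in the assumed form (gamma * u.v + coef0) ** degree."""
    return (gamma * linear_kernel(u, v) + coef0) ** degree

u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]

k_lin = linear_kernel(u, v)
# With degree = 1 the kernel is just a scaled and shifted dot product.
k_poly1 = polynomial_kernel(u, v, degree=1, gamma=0.5, coef0=1.0)
k_poly2 = polynomial_kernel(u, v, degree=2, gamma=0.5, coef0=1.0)

print(k_lin, k_poly1, k_poly2)
print(k_poly1 == 0.5 * k_lin + 1.0)   # affine in the linear kernel
```

Because the degree-1 kernel is an affine function of the dot product, the resulting decision boundary is still a hyperplane; gamma and coef0 only rescale and shift the kernel values, which can change the fitted model slightly for a fixed cost parameter.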

Final Conclusion

This analysis attempts to find a relationship between several variables in order to predict a patient's survival of HCC beyond 1 year. We have narrowed down the list of 44 predictor variables to just 22 using backward stepwise selection. The proposed model is:

Survival = Alcohol + Hepatitis.B.Surface.Antigen + Hepatitis.C.Virus.Antibody + Smoking + Diabetes + Hemochromatosis + Arterial.Hypertension + Nonalcoholic.Steatohepatitis + Splenomegaly + Portal.Hypertension + Portal.Vein.Thrombosis + Age.at.diagnosis + Performance.Status + Encefalopathy.degree + Ascites.degree + AlphaFetoprotein + Haemoglobin + Total.Bilirubin + Alanine.transaminase + Alkaline.phosphatase + Major.dimension.of.nodule + Ferritin

There is an indication that the decision boundary is in fact linear, since the best performing models are SVM with a polynomial kernel and degree = 1, SVM with a linear kernel, and logistic regression. Additional data could substantially improve the approximately 20% test error rate, and all three models should be reevaluated at that point to determine which performs best. With that improvement, this data and the associated prediction model could help doctors determine a particular patient’s stage of HCC, and therefore the best course of treatment.
