Machine learning models for estimating tumor purity, trained on TCGA RNA sequencing-based gene expression data. Bulk tumor samples used for high-throughput molecular profiling are often an admixture of cancer cells and non-cancerous cells. The proportion of tumor cells in the admixture is referred to as tumor purity. This mixed composition can confound the analysis and affect the biological interpretation of the results, so accurate prediction of tumor purity is critical.
Only the machine learning models with file sizes of 25 MB or less are uploaded to this repository.
The other models are available at https://doi.org/10.6084/m9.figshare.14045330.v1.
To use the models, log-transformed FPKM values (log2(FPKM+1)) are required. The FPKM values should be calculated through the mRNA analysis pipeline of the GDC. For mRNA quantification, use gencode.v22.annotation.gtf, not v36.
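The transformation described above can be sketched as follows. This is a minimal illustration, assuming the raw FPKM values are already in a pandas DataFrame laid out as samples by genes; the function name `log2_fpkm` and the toy sample and gene labels are hypothetical and not part of this repository.

```python
import numpy as np
import pandas as pd


def log2_fpkm(fpkm: pd.DataFrame) -> pd.DataFrame:
    """Return log2(FPKM + 1) for a sample x gene table of raw FPKM values."""
    # FPKM values are abundances, so negative entries indicate a data problem.
    if (fpkm.to_numpy() < 0).any():
        raise ValueError("FPKM values must be non-negative")
    return np.log2(fpkm + 1)


# Hypothetical toy table: 2 samples x 3 genes (labels are placeholders).
raw = pd.DataFrame(
    [[0.0, 1.0, 3.0], [7.0, 15.0, 0.0]],
    index=["sample_1", "sample_2"],
    columns=["gene_a", "gene_b", "gene_c"],
)
logged = log2_fpkm(raw)
```

The pseudocount of 1 keeps genes with zero expression at zero after the transformation.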
In addition, the genes must be arranged in the same order as in the example data. (The gene lists are uploaded in the GeneList directory.)
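One way to enforce the gene order is to reindex the expression table against the gene list before converting it to an array. The sketch below assumes the gene list has already been read into a Python list; the helper name `align_genes` and the placeholder gene labels are hypothetical, and the format of the files in the GeneList directory is not assumed here.

```python
import pandas as pd


def align_genes(expr: pd.DataFrame, gene_order: list) -> pd.DataFrame:
    """Reorder the columns of a sample x gene table to match gene_order.

    Raises KeyError if any required gene is missing, rather than silently
    filling it in, because the models expect every gene to be present.
    """
    available = set(expr.columns)
    missing = [g for g in gene_order if g not in available]
    if missing:
        raise KeyError(f"{len(missing)} required gene(s) missing, e.g. {missing[:5]}")
    # Selecting by label both reorders the columns and drops extra genes.
    return expr.loc[:, list(gene_order)]


# Hypothetical toy data: columns are in the wrong order and include an extra gene.
expr = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]],
    index=["sample_1"],
    columns=["gene_c", "gene_a", "gene_x", "gene_b"],
)
aligned = align_genes(expr, ["gene_a", "gene_b", "gene_c"])
X_aligned = aligned.to_numpy()
```

Failing loudly on missing genes is a design choice for this sketch: a column shifted by one position would still produce predictions, but they would be meaningless.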
An example IPython notebook (ipynb) file is in the example directory. Please refer to it.
scikit-learn (<= 0.23.2) is recommended to scale input data.
import pandas as pd
import joblib
from keras.models import load_model # when using MLP model
# Load your gene expression data (log2-transformed FPKM) as a NumPy array (samples x genes).
example_data = pd.read_csv('example_data.tsv', sep='\t', index_col='Sample ID')
X = example_data.values
# Data scaling is needed except for RFR model
Scaler = joblib.load('../models/Scaler/Scaler.joblib')
Scaler.clip = False # If you use scikit-learn > 0.23.2
X_scaled = Scaler.transform(X)
# Load model to use
Ridge = joblib.load('../models/Ridge/Ridge.joblib')
RFR = joblib.load('../models/RFR/RFR.joblib')
MLP = load_model('../models/MLP/MLP.h5') # When using the MLP models, use function 'load_model' for loading the model.
# Predict tumor purity
Ridge_purity = Ridge.predict(X_scaled)
RFR_purity = RFR.predict(X) # When using the RFR models, use the unscaled data.
MLP_purity = MLP.predict(X_scaled).reshape(-1) # When using the MLP models, reshaping the array is recommended for easy use.
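To keep the predictions linked to their samples, the flat arrays produced above can be collected into a single table indexed by sample ID. This is only a sketch: the helper `purity_table` is hypothetical, and the arrays below are made-up stand-ins for real model output.

```python
import numpy as np
import pandas as pd


def purity_table(sample_ids, **predictions) -> pd.DataFrame:
    """Collect per-model purity predictions into one sample-indexed table."""
    table = pd.DataFrame(index=pd.Index(sample_ids, name="Sample ID"))
    for model_name, values in predictions.items():
        # Flatten (n, 1) outputs, such as those returned by an MLP, to (n,).
        values = np.asarray(values).reshape(-1)
        if len(values) != len(table):
            raise ValueError(f"{model_name}: expected {len(table)} predictions, got {len(values)}")
        table[model_name] = values
    return table


# Made-up values standing in for Ridge_purity and the raw MLP output.
table = purity_table(
    ["s1", "s2"],
    Ridge=np.array([0.4, 0.7]),
    MLP=np.array([[0.5], [0.8]]),
)
```

With the real data, the sample IDs would come from `example_data.index`, and the result can be written out with `table.to_csv(...)`.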
Koo, Bonil, and Je-Keun Rhee. "Prediction of tumor purity from gene expression data using machine learning." Briefings in Bioinformatics 22.6 (2021): bbab163. (https://doi.org/10.1093/bib/bbab163)
@article{koo2021prediction,
title={Prediction of tumor purity from gene expression data using machine learning},
author={Koo, Bonil and Rhee, Je-Keun},
journal={Briefings in Bioinformatics},
volume={22},
number={6},
pages={bbab163},
year={2021},
publisher={Oxford University Press}
}