Cyclum is a package to tackle cell cycle. It provides methods to recover cell cycle information and remove cell cycle factor from the scRNA-seq data. The methodology is to rely on the circular manifold, instead of the marker genes. We provide an Auto-Encoder based realization at this time, and we are adding Gaussian Process Latent Variable Model soon.
We provide a one-click self-contained demo ships with its dataset, which shows how to start with an expression matrix, then decide the optimal dimensionality, and finally calculate the circular pseudotime.
More examples are available in test/notebooks, where there is a detailed table of contents.
Our paper was published: Liang S, Wang F, Han J, Chen K. Latent periodic process inference from single-cell RNA-seq data. Nat Commun 11(1):1441, 3/2020. e-Pub 3/2020. PMID: 32188848. old preprint was able @ BioRxiv. Documentation for submodules and classes in the Cyclum Python module is available as a website. Explanations of other files in test are available as README.md
in the corresponding folders.
You can install cyclum by running the following commands, in a directory you desire.
conda create -n cyclum python=3.7 pip
conda activate cyclum
git clone https://github.com/KChen-lab/Cyclum.git
cd Cyclum
pip install .
You can then run Jupyter notebook simple using the following command.
jupyter notebook
A browser window will open showing the directory, where you can go to tests/notebokks
to view/run the exmaples.
install_requires=[...]
in setup.py
to install_requires=['keras==2.2.4', 'numpy==1.16.5', 'pandas==0.25.2', 'scikit-learn==0.21.3', 'h5py==2.9.0', 'jupyter==1.0.0', 'matplotlib==3.1.1', 'tensorflow==1.14.0']
.
You can also use cyclum as a portable software, without installing. All the notebooks contains code that add cyclum to the path, so that you can run them directly. However, please make sure the dependencies are fulfilled.
Cyclum was tested on these package versions. Please make sure that you have TensorFlow 1.x. Cyclum is compatible with newer versions shown in the "Latest tested" column.
Software | Version | Latest tested |
---|---|---|
python | 3.7.4 | 3.7.6 |
keras | 2.2.4 | 2.3.1 |
tensorflow | 1.14.0 | 1.15.2 |
numpy | 1.16.5 | 1.18.1 |
pandas | 0.25.2 | 1.0.1 |
scikit-learn | 0.21.3 | 0.22.1 |
h5py | 2.9.0 | 2.10.0 |
jupyter | 1.0.0 | 1.0.0 |
matplotlib | 3.1.1 | 3.1.3 |
We recommend Miniconda to manage the packages. The code should work on packages of newer versions, but in case it fails, you can return to the specific version by, for example, conda install python=3.7.4
.
The code is on Debian GNU/Linux 10 (buster) with both CPU and GPU. The code should run on most mainstream systems (Linux, Mac, Windows) supporting Tensorflow.
If you have labels from other sources such as scanpy.tl.score_genes_cell_cycle
or Seurat::CellCycleScoring
,
you can use Cyclum to refine the result along the inferred pseudotime.
from cyclum.postproc import refine_labels
refine_labels(pseudotime=[4, 5, 1, 2, 3, 6, 7, 8], original_labels=['G1', 'G1', 'G1', 'G1', 'S', 'S', 'G2M', 'G2M'])
It will output ['G1', 'G1', 'G1', 'G1', 'G1', 'S', 'G2M', 'G2M']
as the refined result.
Note that an S
is replaced by G1
as it is surrounded by G1
s.
Although Python is a good data analysis tool in addition to a general programing language, researchers may want to use R, which is more focused on statistics. Cyclum is implemented in python, but in order to help use both languages, we implemented mat2hdf
and hdf2mat
in both Python and R, to help transferring data back and forth rapidly. In general, the correspondence of data structures in R and Python are: unnamed matrices -- 2D numpy.array, named matrices -- pandas.DataFrame, data.frame -- pandas.DataFrame. (Prerequisites: hdf5r
in R, h5py
in python.)
GSEA is a powerful tool to perform downstream gene enrichment analysis. We implemented in R...
mat2txt
, which writes a expression matrix to a GSEA compatible.txt
file (Prerequisite:data.table
, for much faster writing thanwrite.table
),vec2cls
, which writes phenotypes (either discrete, e.g., cell type, or continuous, e.g., pseudotime) to a GSEA compatible.cls
file,mat2cls
, which writes multiple sets of phenotypes (continuous only, e.g., multiple PCs) to a GSEA compatible.cls
file.
We revised almost everything, except for the concept of using sinusoidal function in an autoencoder to find circular biological processes ab initio. The autoencoder is now rewritten using keras, in a more readable way. We hope this will help researchers who want to experiment similar network structures. We also implemented class cyclum.tuning.CyclumAutoTune
, which automatically select the proper number of linear components to help find the "most circular" manifold. The old version is kept in old-version
.