Appendix

Appendix: Dependent data

Get dependent data

ChiLin supports all species listed on the UCSC website, provided the following dependent data are available for the species:

(Must) Genome index for your species; we recommend a bwa index.
(Must) Chromosome length information text file.
(Must) Standard RefSeq files.
(Optional) PhastCons conservation bigWig files.
(Optional) Genome directory containing per-chromosome sequence FASTA files.
(Optional) Union DHS and blacklist regions.

We have packaged all dependent data for hg19, hg38, mm9, mm10.

Data details

The first group is data with large disk usage:

Data Name	Used by	Data Source
genome_index	bwa/bowtie/star	raw fasta indexed files
genome_dir	bwa/bowtie/star	genome fasta files
conservation	conservation_plot.py	wiggle files

Genome version	Raw genome sequence	Masked genome sequence
hg19	hg19_raw	hg19_mask
hg38	hg38_raw	hg38_mask
mm9	mm9_raw	mm9_mask
mm10	mm10_raw	mm10_mask

The second group is small reference files:

Data Name	Used by	Data Source
chrom_len	samtools	UCSC table browser
dhs	bedtools	Union DHS regions from Cistrome DB
velcro	bedtools	blacklist regions
geneTable	bedAnnotate	UCSC table browser
contamination	bwa	Mycoplasma genome index (set by --mapper)

The following sections describe how we generated these reference files. For any species other than hg19/hg38/mm9/mm10, you can obtain the reference files in a similar way.

Mycoplasma genome

Mycoplasma appears to be a major source of contamination, so we recommend downloading the Mycoplasma FASTA and indexing it. The data are available at the mycoplasma genome link.

Then index it with bwa, using the is algorithm:

bwa index -a is mycoplasma.fasta

BWA Index

Download the raw genome sequence data, extract it with tar xvfz, and concatenate the chromosome files with cat *fa > genome.fa. Then index the result:

bwa index -a bwtsw genome.fa
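
For a species without a ready-made chromosome length file, the lengths can also be computed from the concatenated genome FASTA instead of being downloaded. The following is a minimal sketch of that alternative, not the procedure used to build the packaged data. The function name fasta_chrom_len is hypothetical, and the sketch assumes Unix line endings and a two-column, tab-separated output (name, length), which is the layout wigToBigWig accepts as its chromosome sizes file.

```shell
# Hypothetical helper (not part of ChiLin): print "name<TAB>length" for
# every sequence in a FASTA file. Assumes Unix line endings.
fasta_chrom_len() {
    awk '
        /^>/ {
            # A new header: flush the previous sequence, then start counting again.
            if (name != "") printf "%s\t%d\n", name, len
            name = substr($1, 2)   # drop ">" and anything after the first space
            len = 0
            next
        }
        { len += length($0) }      # sequence line: add its length
        END { if (name != "") printf "%s\t%d\n", name, len }
    ' "$1"
}

# Usage:
# fasta_chrom_len genome.fa > chrom_len
```

Comparing the result against the UCSC table browser values for a known assembly is a reasonable sanity check before relying on it.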

UCSC table browser

Using the browser step by step

To get the RefSeq files:
Go to the UCSC table browser.
Select the desired species and assembly, such as hg19
Select group: Genes and Gene Prediction Tracks
Select track: RefSeq Genes
Select table: refGene
Select region: genome
Select output format: all fields from selected table
Enter output file: species.refgene
Hit the 'get output' button
Download the file and remove the header line with the command:

sed 1d species.refgene > sp.refgene
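
Because sed 1d removes the first line unconditionally, running it twice on the same file would drop a gene record. A slightly more defensive sketch deletes the first line only when it starts with "#", as the table browser header does. The function name strip_refgene_header is hypothetical and not part of ChiLin.

```shell
# Hypothetical helper (not part of ChiLin): drop the first line of a
# table browser download only if it is a "#"-prefixed header line.
strip_refgene_header() {
    in_file=$1
    out_file=$2
    sed '1{/^#/d;}' "$in_file" > "$out_file"
}

# Usage (file names taken from the step above):
# strip_refgene_header species.refgene sp.refgene
```

Applied to a file that has already been stripped, this leaves the content unchanged.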

Conservation score

(Optional) Get the PhastCons conservation scores for the most common assemblies (hg19_conserv, hg38_conserv, mm10_conserv, mm9_conserv) and use wigToBigWig to convert them into bigWig format. We provide hg19/mm9 conservation scores on our server; for other species, leave the conservation section of chilin.conf blank. Take hg19 as an example:

wget -r -np -nd --accept=gz http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/phastCons46way/placentalMammals/
for c in chr*wig*gz
do
    # strip the common suffix, e.g. chr1.phastCons46way.placental.wigFix.gz becomes chr1.bw
    bw=${c%phastCons46way.placental.wigFix.gz}bw
    echo "$bw"
    # chrom_len is the path to your reference chromosome length file
    gunzip -c "$c" | wigToBigWig stdin chrom_len "$bw"
done
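
The output names in the loop come from the shell's ${var%suffix} expansion, which removes a trailing suffix only when it matches exactly. Before converting a different assembly or alignment, it can help to preview the names that would be produced. The sketch below does this as a dry run; bw_name and preview_bw_names are hypothetical helper names, and the suffix is the hg19 placental one used above.

```shell
# Hypothetical helper: derive the bigWig name the conversion loop would produce.
bw_name() {
    printf '%s\n' "${1%phastCons46way.placental.wigFix.gz}bw"
}

# Dry run: print each input file and its planned output without converting anything.
preview_bw_names() {
    for c in "$@"
    do
        printf '%s -> %s\n' "$c" "$(bw_name "$c")"
    done
}

# Usage:
# preview_bw_names chr*wig*gz
```

If a file name does not end with the expected suffix, nothing is stripped and "bw" is simply appended to the full name, which is a sign that the suffix in the loop needs adjusting for that download.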
