Appendix
QinQian edited this page Jun 5, 2016
ChiLin supports all species listed on the UCSC website, provided the following dependent data are available for the species:
- (Required) genome index for your species; we recommend a bwa index.
- (Required) chromosome length information text file.
- (Required) standard RefSeq files.
- (Optional) PhastCons conservation bigWig files.
- (Optional) genome directory containing one sequence FASTA file per chromosome.
- (Optional) union DHS and blacklist regions.
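Before running ChiLin on a new species, it can help to confirm that the required files from the list above are actually in place. The following is a minimal sketch, not part of ChiLin itself: the function name `check_reference_files` and the file names in the example call are hypothetical and should be adapted to your own layout.

```shell
# Report which of the given reference files are missing or empty.
# Returns 0 when every file exists and is non-empty, 1 otherwise.
# NOTE: hypothetical helper for illustration, not a ChiLin command.
check_reference_files() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call with assumed file names (bwa index writes <fasta>.bwt among others):
# check_reference_files genome.fa.bwt chrom_len sp.refgene
```

The check only tests for presence and non-zero size; it does not validate the file contents.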
We have packaged all dependent data for hg19, hg38, mm9, mm10.
- First, the data with large disk usage:
Data Name | Used by | Data Source |
---|---|---|
genome_index | bwa/bowtie/star | raw fasta indexed files |
genome_dir | bwa/bowtie/star | genome fasta files |
conservation | conservation_plot.py | wiggle files |

Genome version | Raw genome sequence | Masked genome sequence |
---|---|---|
hg19 | hg19_raw | hg19_mask |
hg38 | hg38_raw | hg38_mask |
mm9 | mm9_raw | mm9_mask |
mm10 | mm10_raw | mm10_mask |
- Second, the small reference files:
Data Name | Used by | Data Source |
---|---|---|
chrom_len | samtools | UCSC table browser |
dhs | bedtools | Union DHS regions from Cistrome DB |
velcro | bedtools | blacklist regions |
geneTable | bedAnnotate | UCSC table browser |
contamination | bwa | Mycoplasma genome index (set by --mapper) |
- The following describes how we generated these reference files. If your species or assembly is not one of hg19/hg38/mm9/mm10, you can obtain the reference files in a similar way.
Mycoplasma appears to be a major source of contamination, so we recommend downloading the Mycoplasma FASTA file for indexing; the data is available from the mycoplasma genome link.
Then index it with bwa, using the is algorithm:
bwa index -a is mycoplasma.fasta
Download the raw genome sequence data, extract the archive with tar xvfz, and concatenate the per-chromosome files with cat *fa > genome.fa. Then index the combined file:
bwa index -a bwtsw genome.fa
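Once genome.fa exists, the chromosome length file can be cross-checked against it. This page takes chrom_len from the UCSC table browser; the sketch below is only an alternative sanity check, assuming a two-column, tab-separated layout (name, length), which is what wigToBigWig expects for the chrom_len argument used later on this page. The function name `fasta_chrom_lengths` is hypothetical.

```shell
# Print "<sequence name><TAB><length>" for every record in a FASTA file.
# NOTE: illustrative helper, not a ChiLin command.
fasta_chrom_lengths() {
    awk '
        /^>/ {
            # Flush the previous record before starting a new one.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # header up to first whitespace, without ">"
            len = 0
            next
        }
        { len += length($0) }      # sequence lines: count bases
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example (assumed file names):
# fasta_chrom_lengths genome.fa > chrom_len.from_fasta
# diff chrom_len.from_fasta chrom_len
```

Files with Windows line endings would be over-counted by one per line, so convert them first if that applies.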
Using the UCSC table browser step by step
- To get the RefSeq files, go to the UCSC table browser.
- Select desired species and assembly, such as hg19
- Select group: Genes and Gene Prediction Tracks
- Select track: RefSeq Genes
- Select table: refGene
- Select region: genome
- Select output format: all fields from selected table
- Enter output file: species.refgene
- Hit the 'get output' button
- Download the file and remove the header line with the command:
sed 1d species.refgene > sp.refgene
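Note that `sed 1d` deletes the first line unconditionally, so running it on a file whose header is already gone drops a gene record. A slightly more defensive sketch, assuming the header line starts with `#` (as table browser output typically does), removes the first line only when it looks like a header. The function name `strip_header` is hypothetical.

```shell
# Delete line 1 only if it begins with "#"; all other lines pass through.
# NOTE: illustrative helper; assumes the header is "#"-prefixed.
strip_header() {
    sed '1{/^#/d;}' "$1"
}

# Example, using the file names from the step above:
# strip_header species.refgene > sp.refgene
```

Because the header test is tied to line 1, applying the function twice gives the same result as applying it once.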
- (Optional) Get PhastCons conservation scores. For the most common assemblies (hg19_conserv, hg38_conserv, mm10_conserv, mm9_conserv), download the wiggle files and convert them to bigWig with wigToBigWig. We provide hg19/mm9 conservation scores on our server; for other species, leave the conservation section of chilin.conf blank. Take hg19 as an example:
wget -r -np -nd --accept=gz http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/phastCons46way/placentalMammals/
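for c in chr*wig*gz
do
    # Strip the shared suffix and append "bw", e.g. chr1.phastCons46way.placental.wigFix.gz -> chr1.bw
    bw="${c%phastCons46way.placental.wigFix.gz}bw"
    echo "$bw"
    gunzip -c "$c" | wigToBigWig stdin chrom_len "$bw" ## chrom_len is where you put your reference chromosome information file
done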
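The output name in the loop above comes from shell suffix removal: `${c%suffix}` strips the shortest match of `suffix` from the end of `$c`. The sketch below isolates that step so the naming can be checked before any conversion is run; the function name `bigwig_name` is hypothetical, and the suffix is the one used for the hg19 46-way placental files in this example.

```shell
# Derive the bigWig file name from a per-chromosome wigFix file name by
# removing the shared suffix and appending "bw".
# NOTE: illustrative helper; the suffix matches the hg19 example only.
bigwig_name() {
    suffix=phastCons46way.placental.wigFix.gz
    echo "${1%"$suffix"}bw"
}

bigwig_name chr1.phastCons46way.placental.wigFix.gz   # prints chr1.bw
```

For another assembly or alignment set, the suffix would need to match that set's file names; a name that does not end in the suffix is left unchanged apart from the appended "bw".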