QinQian edited this page Jun 5, 2016 · 2 revisions

Appendix: Dependent data

Get dependent data

ChiLin supports every species listed on the UCSC website. For each species, it needs the following dependent data:

  • (Required) genome index for your species; we recommend a bwa index.
  • (Required) chromosome length information text file.
  • (Required) standard RefSeq files.
  • (Optional) PhastCons conservation bigWig files.
  • (Optional) genome directory containing per-chromosome sequence fasta files.
  • (Optional) union DHS and blacklist regions.

We have packaged all dependent data for hg19, hg38, mm9, mm10.

Data details

  • First, the data with large disk usage:

Data Name | Used by | Data Source
genome_index | bwa/bowtie/star | raw fasta indexed files
genome_dir | bwa/bowtie/star | genome fasta files
conservation | conservation_plot.py | wiggle files

Genome version | Raw genome sequence | Masked genome sequence
hg19 | hg19_raw | hg19_mask
hg38 | hg38_raw | hg38_mask
mm9 | mm9_raw | mm9_mask
mm10 | mm10_raw | mm10_mask
  • Second, the small reference files:

Data Name | Used by | Data Source
chrom_len | samtools | UCSC table browser
dhs | bedtools | Union DHS regions from Cistrome DB
velcro | bedtools | blacklist regions
geneTable | bedAnnotate | UCSC table browser
contamination | bwa | Mycoplasma genome index (set by --mapper)
  • The following sections describe how we generated these reference files. For a species other than hg19/hg38/mm9/mm10, you can obtain the reference files in a similar way.

Mycoplasma genome

Mycoplasma appears to be a major source of contamination, so we recommend downloading the Mycoplasma fasta and indexing it. The data is available at the mycoplasma genome link.

Then index it with bwa, using the is algorithm (intended for small genomes):

bwa index -a is mycoplasma.fasta

BWA Index

Download the raw genome sequence data, extract it with tar xvfz, and concatenate the chromosome files with cat *fa > genome.fa. Then index the result:

bwa index -a bwtsw genome.fa
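The extract-and-concatenate step can be wrapped in a small helper. This is only a sketch: build_genome_fasta is a hypothetical name (not part of ChiLin), and the tarball name chromFa.tar.gz in the usage comment is an assumption about what you downloaded.

```shell
#!/bin/sh
# Sketch: unpack a tarball of per-chromosome fasta files and
# concatenate them into a single fasta file.
# build_genome_fasta is a hypothetical helper, not a ChiLin command.
build_genome_fasta() {
    tarball=$1   # input archive, e.g. chromFa.tar.gz (assumed name)
    out=$2       # output fasta, e.g. genome.fa
    workdir=$(mktemp -d) || return 1
    tar xzf "$tarball" -C "$workdir" || { rm -rf "$workdir"; return 1; }
    # Concatenate every *.fa file in byte-wise sorted path order.
    find "$workdir" -name '*.fa' | LC_ALL=C sort | xargs cat > "$out"
    rm -rf "$workdir"
}

# Usage (assumes bwa is installed):
#   build_genome_fasta chromFa.tar.gz genome.fa && bwa index -a bwtsw genome.fa
```

Extracting into a temporary directory keeps stray *.fa files in the current directory out of the concatenated genome.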

UCSC table browser

Using the browser, step by step

  • To get the RefSeq files, go to the UCSC table browser.
  • Select desired species and assembly, such as hg19
  • Select group: Genes and Gene Prediction Tracks
  • Select track: RefSeq Genes
  • Select table: refGene
  • Select region: genome
  • Select output format: all fields from selected table
  • Enter output file: species.refgene
  • Hit the 'get output' button
  • Download the file and remove the header line with the command:
sed 1d species.refgene > sp.refgene
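Note that sed 1d removes the first line unconditionally, so running it twice drops a real record. A slightly more defensive sketch is shown below. It assumes the table browser header line starts with '#', and strip_refgene_header is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: drop the first line only when it looks like a header,
# i.e. when it begins with '#' (an assumption about the export format).
# strip_refgene_header is a hypothetical helper, not a ChiLin command.
strip_refgene_header() {
    awk 'NR == 1 && /^#/ { next } { print }' "$1"
}

# Usage:
#   strip_refgene_header species.refgene > sp.refgene
```

Applied to a file that has already been cleaned, this leaves every record in place.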

Conservation score

  • (Optional) Get the PhastCons conservation scores for the most common assemblies (hg19_conserv, hg38_conserv, mm10_conserv, mm9_conserv) and convert them to bigWig with wigToBigWig. We provide hg19/mm9 conservation scores on our server; for other species, leave the conservation section of chilin.conf blank. Taking hg19 as an example:
wget -r -np -nd --accept=gz http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/phastCons46way/placentalMammals/
# chrom_len is the path to your reference chromosome length file
for c in chr*wig*gz
do
    # Strip the long suffix and append "bw", e.g. chr1.phastCons46way.placental.wigFix.gz -> chr1.bw
    bw=${c%phastCons46way.placental.wigFix.gz}bw
    echo "$bw"
    gunzip -c "$c" | wigToBigWig stdin chrom_len "$bw"
done
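The loop above needs a chrom_len file, which wigToBigWig reads as two tab-separated columns: chromosome name and size in bases. This page takes chrom_len from the UCSC table browser. If that is not convenient for your species, the lengths can also be computed from the genome fasta. The sketch below is one possible alternative, not the method used to build the packaged data, and chrom_sizes_from_fasta is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print "name<TAB>length" for every sequence in a fasta file.
# chrom_sizes_from_fasta is a hypothetical helper, not a ChiLin command.
chrom_sizes_from_fasta() {
    awk '
        /^>/ {
            # A new header: emit the previous sequence, then reset.
            if (name != "") printf "%s\t%d\n", name, len
            name = substr($1, 2)   # header text up to the first whitespace
            len = 0
            next
        }
        { len += length($0) }      # sequence line: add its length
        END { if (name != "") printf "%s\t%d\n", name, len }
    ' "$1"
}

# Usage:
#   chrom_sizes_from_fasta genome.fa > chrom_len
```

The sequence names in chrom_len must match the chromosome names used in the wiggle files, so compare a few entries before converting.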