-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathDOCUMENTATION
167 lines (129 loc) · 8.35 KB
/
DOCUMENTATION
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
SHIMMER(1) User Contributed Perl Documentation SHIMMER(1)
NAME
shimmer.pl - call somatic single base changes from matched tumor and normal next gen-
eration sequences.
SYNOPSIS
Tally counts of alleles in two BAM files, and write out data for sites with unexpect-
edly large deviations in allele frequencies, controlling the false discovery rate
(FDR) using the Benjamini-Hochberg procedure:
shimmer.pl [options] <normal_bam_file> <tumor_bam_file> --ref <ref_fasta_file>
Run shimmer.pl -man for a detailed description of options and the output files.
DESCRIPTION
The script creates a randomly named directory (run_shimmer_XXXXXX) in the user's cur-
rent working directory, and reads through the two BAM files provided as options with
samtools mpileup, recording the normal sample's genotype, as well as the counts of the
two most frequently seen alleles at each site. It then uses the R "statmod" library
to calculate p-values with the Fisher's exact test on each site where a minimum
threshold of alternate allele copies are seen (see --min_som_reads option).
Once all p-values have been calculated, shimmer.pl uses the Benjamini-Hochberg proce-
dure to report only changes with a false discovery rate below the specified maximum
FDR (see --max_q option). The single-nucleotide variants are reported in VarSifter
and VCF formats in the files "somatic_diffs.vs" and "somatic_diffs.vcf", respectively.
INPUT
The first and second arguments to shimmer are the paths of two BAM-formatted files of
aligned sequencing reads. These files must be sorted using samtools prior to running
shimmer, and indexed if the --region option wil be used.
The path of a valid, fasta-formatted file for the reference sequence must be passed
with the option --ref. This fasta file must have a corresponding samtools index file
with the same name except for an appended ".fai" if the --region option will be used.
OPTIONS
--region chr or
--region chr:start-end
This option specifies a region as a reference entry (chromosome), optionally fol-
lowed by a position range, and causes the program to limit somatic call to only
that region. By default, the program calls variants in all regions that covered
by reads in both BAM files.
--ref reference_fasta_file
This option specifies the reference file to which the reads in the BAM files were
aligned. It is a required option.
--minqual min_base_quality_score
This option specifies a minimum phred quality score to be required for read bases
to be included in the counts for the Fisher's exact tests. By default, all bases
are included.
--mapqual min_mapping_quality_score
This option specifies a minimum mapping quality score to be required for all reads
to be included in the counts for the Fisher's exact tests. By default, all reads'
bases are included.
--max_q max_acceptable_FDR
This option specifies the maximum FDR level to be set for the Benjamini-Hochberg
procedure for multiple testing correction. A value of 0 will cause shimmer to
report all tests including those with q values equal to 1. (Default=0.05)
--annovardb path_to_annovar_db_directory
This option allows the user to specify an ANNOVAR database directory against
which to annotate variants (see Wang et al, "ANNOVAR: functional annotations of
genetic variants from high-throughput sequencing data". Nucl. Acids Res. 38,
2010).
--buildver <hg18, hg19, etc.>
This option is passed directly to ANNOVAR.
--annovar path_to_annovar
This option allows the user to specify a particular path to the "annotate_vari-
ants.pl" script from ANNOVAR. As a default, if the annovardb option has been
specified (see above), Shimmer will call the first copy of "annotate_varia-
tion.pl" in the user's path.
--outdir <path_to_output_directory>
This option specifies a directory in which to place result files for this run.
If the directory doesn't exist, it will be created. By default, Shimmer will
create a randomly named directory called "run_shimmer_XXXXXX" within the current
working directory (where "XXXXXX" is a random string of length 6).
OUTPUT
The single nucleotide variants (sSNVs) are written in both VarSifter and VCF formats.
The fields in the VarSifter file (see Teer et al., "VarSifter: Visualizing and analyz-
ing exome-scale sequence variation data on a desktop computer". Bioinformatics 28,
2012) are as follows:
Index
A numerical identifier for each variant.
Chr The entry name of the reference sequence in the BAM files and reference fasta
file.
LeftFlank
Position one base to the left of the variant base.
RightFlank
Position one base to the right of the variant base.
ref_allele
Reference base.
var_allele
Variant base.
muttype
Type of mutation. In this version of shimmer.pl, all variants are of type "SNP".
Future versions of shimmer.pl will also call somatic variants of type "INDEL".
normal_covg
Number of reads covering this position in the normal BAM file. If the --minqual
option has been used, only reads with the required base quality at this position
will be counted.
tumor_covg
Number of reads covering this position in the tumor BAM file. If the --minqual
option has been used, only reads with the required base quality at this position
will be counted.
normal_ratio
Ratio of reads with the alternate base to total reads covering this position in
the normal BAM file. If the --minqual option has been used, only reads with the
required base quality at this position will be counted.
tumor_ratio
Ratio of reads with the alternate base to total reads covering this position in
the tumor BAM file. If the --minqual option has been used, only reads with the
required base quality at this position will be counted.
q_value
The expected value of the false discovery rate if all variants with q_value val-
ues higher than this value were excluded.
The VCF file conforms to the standards in VCFv4.0, but doesn't include as much infor-
mation as the VarSifter file does. In particular, for each somatic SNV predicted, it
reports the reference and alternate allele, the q-value (as described above in the
VarSifter file description), and the genotype (always 0/0 for the normal and 0/1 for
the tumor) and depth of coverage for each sample. Suggestions are welcome for how to
best utilize VCF format for these data, as it's a work in progress!
AUTHOR
Nancy F. Hansen - [email protected]
LEGAL
This software/database is "United States Government Work" under the terms of the
United States Copyright Act. It was written as part of the authors' official duties
for the United States Government and thus cannot be copyrighted. This soft-
ware/database is freely available to the public for use without a copyright notice.
Restrictions cannot be placed on its present or future use.
Although all reasonable efforts have been taken to ensure the accuracy and reliability
of the software and data, the National Human Genome Research Institute (NHGRI) and the
U.S. Government does not and cannot warrant the performance or results that may be
obtained by using this software or data. NHGRI and the U.S. Government disclaims all
warranties as to performance, merchantability or fitness for any particular purpose.
In any work or product derived from this material, proper attribution of the authors
as the source of the software or data should be made, using "NHGRI Genome Technology
Branch" as the citation.
perl v5.8.8 2012-10-27 SHIMMER(1)