-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbss-slides.Rmd
462 lines (315 loc) · 12.7 KB
/
bss-slides.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
---
title: "Mass spectrometry and proteomics"
subtitle: "Using R/Bioconductor"
author: "[Laurent Gatto](#laurent-gatto)"
date: "[BSS2019](https://uclouvain-cbio.github.io/BSS2019/) - UCLouvain - 5 July 2019"
output:
xaringan::moon_reader:
css: ["default", "default-fonts", "my.css"]
nature:
countIncrementalSlides: false
---
class: middle
name: cc-by
These slides are available under a **creative common
[CC-BY license](http://creativecommons.org/licenses/by/4.0/)**. You are
free to share (copy and redistribute the material in any medium or
format) and adapt (remix, transform, and build upon the material) for
any purpose, even commercially
<img height="20px" alt="CC-BY" src="./img/cc1.jpg" />.
---
class: middle
## On the menu
**Overall goal** (with [Lieven
Clement](https://statomics.github.io/pages/about.html)): quanitative
proteomics data analysis done right using Bioconductor tools and (some
of) the best statistical methods.
1. Proteomics in Bioconductor
2. How does mass spectrometry-based proteomics work?
3. Quantitative proteomics
4. Introduction to `r BiocStyle::Biocpkg("MSnbase")`
---
class: middle center
![](./Figures/overview0.png)
???
From a 2011 study that compared the expression profiles of 3 cell
lines using RNA-Seq, MS-based proteomics and immunofluorescence
(protein-specific antibodies).
They observed an overall Spearman correlation of 0.63.
**In what ways to these summaries differ?**
Using a common gene-centric identifier, but
- What do we have along the rows, what are our features? Transcripts
on the left. Protein groups on the right.
- How are these intensities produced?
## Take-home message 1
These data tables are less similar than they appear.
---
class: middle, center
## 1. [Proteomics](http://bioconductor.org/packages/release/BiocViews.html#___Proteomics) and [mass spectrometry](http://bioconductor.org/packages/release/BiocViews.html#___MassSpectrometry) packages and [workflow](https://rawgit.com/lgatto/bioc-ms-prot/master/lab.html) in Bioconductor.
---
class: middle, center
## 2. How does mass spectrometry work?
### (applies to proteomics and metabolomics)
---
class: middle
### Overview
![](./Figures/overview1.jpg)
---
### How does MS work?
1. Digestion of proteins into peptides - as will become clear later,
the features we measure in shotgun (or bottom-up) *proteomics* are
peptides, **not** proteins.
--
2. On-line liquid chromatography (LC-MS)
--
3. Mass spectrometry (MS) is a technology that **separates** charged
molecules (ions, peptides) based on their mass to charge ratio
(M/Z).
---
### Chromatography
MS is generally coupled to chromatography (liquid LC, but can also be
gas-based GC). The time an analytes takes to elute from the
chromatography column is the **retention time**.
![A chromatogram, illustrating the total amount of analytes over the retention time.](./Figures/chromatogram.png)
???
- This is an acquisition. There can be one per sample (with muliple
fractions), of samples can be combined/multiplexed and acquired
together.
---
class: middle
An mass spectrometer is composed of three components:
1. The *source*, that ionises the molecules: examples are Matrix-assisted
laser desorption/ionisation (MALDI) or electrospray ionisation
(ESI).
2. The *analyser*, that separates the ions: Time of flight (TOF) or Orbitrap.
3. The *detector* that quantifies the ions.
Ions typically go through that cylce at least twice (MS2, tandem MS,
or MSMS). Before the second cycle, individual *precursor* ions a
selected and broken into *fragment* ions.
---
class: middle center
![Schematics of a mass spectrometer and two rounds of MS.](./Figures/SchematicMS2.png)
???
Before MS:
- Restriction with enzyme, typically trypsine.
- Off-line fractionation.
An mass spectrometer is composed of three components:
1. The *source*, that ionises the molecules: examples are Matrix-assisted
laser desorption/ionisation (MALDI) or electrospray ionisation
(ESI).
2. The *analyser*, that separates the ions: Time of flight (TOF) or Orbitrap.
3. The *detector* that quantifies the ions.
Ions typically go through that cylce at least twice (MS2, tandem MS or
MSMS). Before the second cycle, individual *precursor* ions a
selected, broken into fragment ions.
---
class: middle center
![Separation and detection of ions in a mass spectrometer.](./Figures/mstut.gif)
---
class: middle center
![Parent ions in the MS1 spectrum (left) and two sected fragment ions MS2 spectra (right).](./Figures/MS1-MS2-spectra.png)
???
Highlight
- Semi-stochastic nature of MS
- Data dependent acquisition (DDA)
---
class: middle center
.left-col-50[
![MS1 spectra over retention time.](./Figures/F02-3D-MS1-scans-400-1200-lattice.png)
]
.right-col-50[
![MS2 spectra interleaved between two MS1 spectra.](./Figures/F02-3D-MS1-MS2-scans-100-1200-lattice.png)
]
???
- Please keep this MS space and the semi-stochastic nature of MS
acquisition in mind , as I will come back to it later.
---
### Identification: fragment ions
![Peptide fragment ions.](./Figures/frag.png)
---
### Identification: Peptide-spectrum matching (PSM)
Matching **expected** and *observed* spectra:
```
> MSnbase::calculateFragments("SIGFEGDSIGR")
mz ion type pos z seq
1 88.03931 b1 b 1 1 S
2 201.12337 b2 b 2 1 SI
3 258.14483 b3 b 3 1 SIG
4 405.21324 b4 b 4 1 SIGF
5 534.25583 b5 b 5 1 SIGFE
6 591.27729 b6 b 6 1 SIGFEG
7 706.30423 b7 b 7 1 SIGFEGD
8 793.33626 b8 b 8 1 SIGFEGDS
9 906.42032 b9 b 9 1 SIGFEGDSI
10 963.44178 b10 b 10 1 SIGFEGDSIG
11 175.11895 y1 y 1 1 R
12 232.14041 y2 y 2 1 GR
13 345.22447 y3 y 3 1 IGR
14 432.25650 y4 y 4 1 SIGR
15 547.28344 y5 y 5 1 DSIGR
16 604.30490 y6 y 6 1 GDSIGR
[ reached getOption("max.print") -- omitted 16 rows ]
```
---
class: middle
### Identification: Peptide-spectrum matching (PSM)
Matching *expected* and **observed** spectra:
![Peptide fragment matching](./Figures/annotated-spectrum.png)
???
Performed by **search engines** such as Mascot, MSGF+, Andromeda, ...
---
### Identification: database
![Uniprot human proteome](./Figures/uniprot1.png)
???
## [Human proteome](https://www.uniprot.org/help/human_proteome)
In 2008, a draft of the complete human proteome was released from
UniProtKB/Swiss-Prot: the approximately 20,000 putative human
protein-coding genes were represented by one UniProtKB/Swiss-Prot
entry, tagged with the keyword 'Complete proteome' and later linked to
proteome identifier UP000005640. This UniProtKB/Swiss-Prot H. sapiens
proteome (manually reviewed) can be considered as complete in the
sense that it contains one representative (**canonical**) sequence for
each currently known human gene. Close to 40% of these 20,000 entries
contain manually annotated alternative isoforms representing over
22,000 additional sequences
## [What is the canonical sequence?](https://www.uniprot.org/help/canonical_and_isoforms)
To reduce redundancy, the UniProtKB/Swiss-Prot policy is to describe
all the protein products encoded by one gene in a given species in a
single entry. We choose for each entry a canonical sequence based on
at least one of the following criteria:
1. It is the most prevalent.
2. It is the most similar to orthologous sequences found in other species.
3. By virtue of its length or amino acid composition, it allows the
clearest description of domains, isoforms, polymorphisms,
post-translational modifications, etc.
4. In the absence of any information, we choose the longest sequence.
## Are all isoforms described in one UniProtKB/Swiss-Prot entry?
Whenever possible, all the protein products encoded by one gene in a
given species are described in a single UniProtKB/Swiss-Prot entry,
including isoforms generated by alternative splicing, alternative
promoter usage, and alternative translation initiation (*). However,
some alternative splicing isoforms derived from the same gene share
only a few exons, if any at all, the same for some 'trans-splicing'
events. In these cases, the divergence is obviously too important to
merge all protein sequences into a single entry and the isoforms have
to be described in separate 'external' entries.
## UniProt [downloads](https://www.uniprot.org/downloads)
---
class: middle
### Identification
![Decoy databases and peptide scoring](./Figures/pr-2007-00739d_0004.gif)
From Käll *et al.* [Posterior Error Probabilities and False Discovery
Rates: Two Sides of the Same Coin](https://pubs.acs.org/doi/abs/10.1021/pr700739d).
???
- (global) FDR = B/(A+B)
- (local) fdr = PEP = b/(a+b)
---
### Identification: Protein inference
- Keep only reliable peptides
- From these peptides, infer proteins
- If proteins can't be resolved due to shared peptides, merge them
into **protein groups** of indistinguishable or non-differentiable
proteins.
---
class: middle
![Peptide evidence classes](./Figures/nbt0710-647-F2.gif)
From [Qeli and Ahrens
(2010)](http://www.ncbi.nlm.nih.gov/pubmed/20622826).
---
class: middle, center
## 3. Quantitative proteomics
---
class: middle
| |Label-free |Labelled |
|:---|:----------|:----------|
|MS1 |XIC |SILAC, 15N |
|MS2 |Counting |iTRAQ, TMT |
.left-col-50[
![MS1 spectra over retention time.](./Figures/F02-3D-MS1-scans-400-1200-lattice.png)
]
.right-col-50[
![MS2 spectra interleaved between two MS1 spectra.](./Figures/F02-3D-MS1-MS2-scans-100-1200-lattice.png)
]
---
class: middle center
### Label-free MS2: Spectral counting
![](./Figures/pbase.png)
From the [Pbase](https://bioconductor.org/packages/release/bioc/html/Pbase.html) package.
???
Comments of **spectral counting**:
- easy, but *semi-quantitative*
- **Note depth/coverage and compare to RNA-seq.**
---
class: middle center
### Labelled MS2: Isobaric tagging
![](./Figures/itraq.png)
???
Comments of **isobaric tagging**:
- multiplexing, albeit limited
- limited problem with missing values
- easy to process (fewer parameters to set)
---
class: middle center
### Label-free MS1: extracted ion chromatograms
.left-col-50[
![MS1 spectra over retention time.](./Figures/F02-3D-MS1-scans-400-1200-lattice.png)
]
.right-col-50[
![](./Figures/chrom_peaks.png)
]
Credit: [Johannes Rainer](https://github.com/jotsetung/)
???
Comments on **label-free**:
- Data processing
- Independent runs, NAs
---
class: middle center
### Labelled MS1: SILAC
![](./Figures/Silac.gif)
Credit: Wikimedia Commons.
???
Comments of **isotope labelling**:
- like red/green microarrays
- can be extended to 3, but risk of overlap of light, medium and heavy
peaks
---
class: middle, center
## 4. The `r BiocStyle::Biocpkg("MSnbase")` package
<img src="https://raw.githubusercontent.com/Bioconductor/BiocStickers/master/MSnbase/MSnbase.png" alt="MSnbase" style="width: 150px; padding: 10px"/>
---
class: middle
### Practical/demo
- Files in mass spectrometry and proteomics
- handling raw MS data using MSnExp objects
- loading identification data
- from raw to quantitative data as MSnSet objects
- load spreadsheet to MSnSet
- Quantitative proteomics data processing
- filtering (independent filtering)
- missing values
- log transformation
- normalisation
- summarisation
See the [lab](https://htmlpreview.github.io/?https://github.com/lgatto/bioc-ms-prot/blob/master/bss-lab.html) vignette.
---
class: middle
name: laurent-gatto
.left-col-50[
<img src="./img/lgatto3b.png" width = "180px"/>
### Laurent Gatto
<i class="fas fa-flask"></i> [Computational Biology Group](https://lgatto.github.io/cbio-lab/)<br />
<i class="fas fa-map-marker-alt"></i> de Duve Institute, UCLouvain<br />
<i class="fas fa-envelope"></i> [email protected]<br />
<i class="fas fa-home"></i> https://lgatto.github.io<br />
<i class="fab fa-twitter"></i> [@lgatto](https://twitter.com/lgatt0/)<br />
<i class="fab fa-github"></i> [lgatto](https://github.com/lgatto/)<br />
<img width="20px" align="top" alt="orcid" src="./img/orcid_64x64.png" /> [0000-0002-1520-2268](https://orcid.org/0000-0002-1520-2268)<br />
<img width="20px" align="top" alt="Impact story" src="./img/keybase.png"/> [lgatto](https://keybase.io/lgatto)<br />
<img width="20px" align="top" alt="Google scholar" src="./img/gscholar.png"/> [Google scholar](https://scholar.google.co.uk/citations?user=k5DrB74AAAAJ&hl=en)<br />
<img width="20px" align="top" alt="Impact story" src="./img/impactstory-logo.png"/> [Impact story](https://profiles.impactstory.org/u/0000-0002-1520-2268)<br />
<i class="fas fa-pencil-alt"></i> [dissem.in](https://dissem.in/r/6231/laurent-gatto)<br />
<!-- <i class="fab fa-linkedin"></i> https://www.linkedin.com/in/lgatto/<br /> -->
]
.rigth-col-50[
## Thank you for your attention
]