By default the plot contains the following information:
-
Transmembrane regions (TM) are shown in yellow. Extracellular regions are indicated by the dashed line above the sequences while intracellular regions are indicated by the dashed line bellow the sequence (prediction obtained by get_phobius()).
-
-
Signal peptides (prediction obtained by get_signalp()) are indicated by the thick red line on the N-terminal side.
-
Hydroxyprolines (as predicted by predict_hyp()) are indicated by vertical dark gray lines.
-
-
Arabinogalactan glycomodule spans are indicated by the light grey background (motifs are found using scan_ag()).
-
-
Domains (prediction obtained by get_hmm()) are indicated by rectangles with an appropriate fill as indicated in the legend.
+
Transmembrane regions (TM) are shown in yellow. Extracellular regions are indicated by the dashed line above the sequences while intracellular regions are indicated by the dashed line bellow the sequence. For TM prediction ragp can query Phobius or TMHMM via the functions get_phobius() and get_tmhmm(). The plot_prot() argument tm (can be: “phobius”, “tmhmm” or “none”) instructs which method should be used. Default is “phobius”.
+
Signal peptides are indicated by the thick red line on the N-terminal side. plot_prot() offers options for N-sp prediction, SignalP 4.1 and SignalP-5.0. The plot_prot() argument nsp (can be: “signalp”, “signalp5” or “none”) instructs which method should be used. Default is “signalp5”.
Hydroxyprolines (as predicted by predict_hyp()) are indicated by vertical dark gray lines.
+
Arabinogalactan glycomodule spans are indicated by the light grey background (motifs are found using scan_ag()).
+
Domains are indicated by rectangles with an appropriate fill as indicated in the legend. Domain annotation for the plot can be obtained using get_hmm() or get_cdd() depending on the domain argument (can be: “cdd”, “hmm” or “none”). Default is to use get_cdd().
-
Plotting can take a while since plot_prot() first obtains the appropriate predictions using get_hmm(), get_signalp(), get_phobius(), scan_ag() and predict_hyp() and then generates the plot. To check what actions the function is performing set the argument progress to TRUE.
Plotting can take a while since plot_prot() first obtains the appropriate predictions using and then generates the plot. To check what actions the function is performing set the argument progress to TRUE.
-domains<-get_hmm(sequence =seqs,
+domains<-get_cdd(sequence =seqs,
id =ids)tm<-get_phobius(sequence =seqs,
id =ids)
-nsp<-get_signalp(sequence =seqs,
+nsp<-get_signalp5(sequence =seqs,
id =ids)
-gpi<-get_big_pi(sequence =seqs,
+gpi<-get_netGPI(sequence =seqs,
id =ids)p1_df<-plot_prot(sequence =seqs,
@@ -179,7 +174,30 @@
gpi =gpi)p1_df
-
+
Mixing the two types of input is also a possibility. In the below call the data frames nsp and gpi are used to tell prot_plot() the length of the N-sp and the positions of omega sites while domain = "hmm" and tm = "tmhmm" instruct it to use get_hmm() to obtain domain predictions and get_tmhmm() to obtain TM predictions:
From the above plot it can be observed TMHMM tends to predict TM regions where there are N-sp’s and that get_cdd() and get_hmm() differ in the type of detected domains. For instance the Q9C7T7 is a leucine-rich repeat (LRR) receptor-like protein kinase - the Conserved domain database has an entry for exactly the mentioned class PLN00113 while the Pfam database has more general domain models so it predicts a protein kinase domain and several LRR domains.
+
To plod disordered regions as predicted by Espritz vie the function get_espritz():
The longest predicted disordered region overlaps with the AG motif span in Q9ZQ23
@@ -188,32 +206,35 @@
determining what features to plot: to turn off some elements from being plotted use the appropriate argument, for instance hyp = FALSE will result in hydroxyproline predictions to not be plotted (check full list of options: ?plot_prot):
-
+
p1<-plot_prot(sequence =seqs,
id =ids,
hyp =FALSE, #do not plot hydroxyprolines
ag =FALSE, #do not plot arabinogalactan motifs
- gpi ="predgpi")#use `get_pred_gpi()` to obtain GPI predictions
+ gpi ="predgpi", #use `get_pred_gpi()` to obtain GPI predictions
+ domain ="hmm")#use `get_hmm()` to obtain domain predictionsp1
-
+
feature specific options: passing appropriate function arguments to functions called by plot_prot() such as dim and div for scan_ag() or tprob for predict_hyp() to change the rules of arabinogalactan motif scan and the threshold for decision if the proline is hydroxylated is supported:
-
+
p2<-plot_prot(sequence =seqs,
id =ids,
dim =5, #`scan_ag()` argument, minimum number of dipeptides for a motif
div =5, #`scan_ag()` argument, maximum number of amino acids between dipeptides
- tprob =0.9)#`predict_hyp()` argument, threshold probability for prediction
+ tprob =0.9,
+ domain ="hmm",
+ ievalue =1e-10)#`get_hmm()` argument, remove domains with independent E-value higher than this valuep2
-
+
change feature colors:
-
+
p3<-plot_prot(sequence =seqs,
id =ids,
hyp_col ="chocolate2",
@@ -222,37 +243,38 @@
ag_col ="bisque2",
tm_col ="aquamarine")p3
-
+
-change the order of domain plotting: in the above images it can be observed there are more entries in the legend then can be seen in the plot. Some domains cover other domains. By default the domains with the lowest I-evalue (independent e-value) are plotted on top. To change this the argument dom_sort can be used. This argument can take the following values: "ievalue" (default), "abc" (alphabetically sorted by domain names), "cba" (reverse of "abc").
+change the order of domain plotting: in the above figures where plot_prot() was called with domain = "hmm" it can be observed there are more entries in the legend then can be seen in the plot. Some domains cover other domains. By default the domains with the lowest I-evalue (independent e-value) are plotted on top. To change this the argument dom_sort can be used. This argument can take the following values: "ievalue" (default), "abc" (alphabetically sorted by domain names), "cba" (reverse of "abc").
The y axis is continuous - 1 on the plot corresponds to Q9C7T7, 2 corresponds to Q9ZPR7 and 3 to Q9ZQ23. The x axis represents amino acids in the sequence. This enables adding annotations anywhere on the plot.
change plot ratio: the ratio of the axes is determined automatically by plot_prot() and scales with the longest sequence plotted. To change this use the ratio argument to coord_equal():
diff --git a/docs/articles/pkgdown/PSV_files/header-attrs-2.11/header-attrs.js b/docs/articles/pkgdown/PSV_files/header-attrs-2.11/header-attrs.js
new file mode 100644
index 0000000..dd57d92
--- /dev/null
+++ b/docs/articles/pkgdown/PSV_files/header-attrs-2.11/header-attrs.js
@@ -0,0 +1,12 @@
+// Pandoc 2.9 adds attributes on both header and div. We remove the former (to
+// be compatible with the behavior of Pandoc < 2.8).
+document.addEventListener('DOMContentLoaded', function(e) {
+ var hs = document.querySelectorAll("div.section[class*='level'] > :first-child");
+ var i, h, a;
+ for (i = 0; i < hs.length; i++) {
+ h = hs[i];
+ if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6
+ a = h.attributes;
+ while (a.length > 0) h.removeAttribute(a[0].name);
+ }
+});
diff --git a/docs/articles/pkgdown/analyse.html b/docs/articles/pkgdown/analyse.html
index 13859d4..a701378 100644
--- a/docs/articles/pkgdown/analyse.html
+++ b/docs/articles/pkgdown/analyse.html
@@ -31,7 +31,7 @@
ragp
- 0.3.2.0002
+ 0.3.5.9000
The following section explains how to perform these tasks on a set of 398 Arabidopsis protein sequences obtained in Filtering hydroxyproline rich glycoproteins tutorial (the at_nsp_3hyp data frame).
+
The following section explains how to perform these tasks on a set of 365 Arabidopsis protein sequences obtained in Filtering hydroxyproline rich glycoproteins tutorial (the at_nsp_3hyp data frame).
The column sequence contains the input sequence where only the amino acids in potential AG glycomodules are in uppercase while all other are in lower case. The column AG_aa contains the amino acid count found in potential AG glycomodules, while columns total_length and longest correspond to the matched sequence spans. From the output it can be observed that 100 amino acids were found in AG glycomodules, that the longest span of AG glycomodules is 144 amino acids long (AG glycomodule dipeptides in one span are separated by less then 10 amino acids, since div argument which determines this is set per default to 10) and that it is the only span in the sequence.
To obtain a more detailed data frame the function can be called with simplify = FALSE and tidy = TRUE:
@@ -160,10 +158,10 @@
scaned_at2%>%mutate(sequence =strtrim(sequence, 55))%>%#take the first 55 amino acids from the sequencesfilter(id=="AT2G14890.1")
-#> sequence id
-#> 1 marsfaiavicivliagvtgqAPTSPPTaTPAPPTPTTPpPAaTPpPVsAPpPVt AT2G14890.1
-#> location.start location.end P_pos length AG_aa
-#> 1 22 165 23, 26, .... 144 100
Output obtained in this way contains the start and end location of each AG glycomodul span (in this example there is only one) as well as a vector of proline positions in AG glycomodules.
To incorporate hydroxyproline position knowledge when searching AG glycomodules the sequence output of predict_hyp() can be provided to scan_ag():
@@ -187,10 +185,10 @@
scaned_at3%>%mutate(sequence =strtrim(sequence, 55))%>%#take the first 55 amino acids from the sequencesfilter(id=="AT2G14890.1")
-#> sequence id
-#> 1 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTTOoOAaTOoOVsAOoOVt AT2G14890.1
-#> location.start location.end P_pos length AG_aa
-#> 1 22 165 23, 26, .... 144 98
It is clear almost all prolines (P) were predicted to be hydroxylated (O) so the start and end location as well as the length of the amino acid span containing AG glycomoduls are unchanged for this sequence.
Many of the found motifs in AT2G14890.1 (shown in uppercase) correspond to extensin motifs (PPP+ or OOO+). This might not be desired when searching for AG glycomodules, and scan_ag() offers the argument exclude_ext which enables the function to omit extensin motifs either of the common sequence SP{3,} or SO{3,} (exclude_ext = "yes") or of the form P[3,} or O{3,} (exclude_ext = "all"):
@@ -205,12 +203,12 @@
scaned_at4%>%mutate(sequence =strtrim(sequence, 55))%>%#take the first 55 amino acids from the sequencesfilter(id=="AT2G14890.1")
-#> sequence id
-#> 1 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTtoooaatooovsaooovt AT2G14890.1
-#> 2 marsfaiavicivliagvtgqAOTSOOTaTOAOOTOTtoooaatooovsaooovt AT2G14890.1
-#> location.start location.end P_pos length AG_aa
-#> 1 22 37 23, 26, .... 16 15
-#> 2 114 165 115, 119.... 52 34
When extensin motifs are masked two AG glycomodule spans are found in AT2G14890.1, from position 22 to 37 and from position 114 to 165.
Next we will compare how many sequences are identified as arabinogalactans with and without prediction of proline hydroxylation when extensin motifs are masked:
@@ -220,7 +218,7 @@
Transcript.id,
exclude_ext ="all")sum(scaned_at5$total_length!=0)
-#> [1] 685
+#> [1] 627#using the output of `predict_hyp()scaned_at6<-scan_ag(data =at_hyp$sequence,
@@ -229,7 +227,7 @@
exclude_ext ="all")#> sequence vector contains O, O will be considered instead of Psum(scaned_at6$total_length!=0)
-#> [1] 217
+#> [1] 208
Incorporating hydroxyproline predictions in the search for AGPs reduced the pool of identified sequences by threefold. scan_ag() is capable of identifying potential chimeric and short AGPs without pronounced amino acid bias and with few characteristic motifs. In combination with predict_hyp() the specificity of the found sequences is greatly increased since the majority of sequences unlikely to contain hydroxyprolines are excluded.
Since the function was not provided knowledge on the GPI presence, both possible classes are indicated as in: “1/4” or “2/9.” To remove ambiguities in classes get_gpi argument can be set to "bigpi", "predgpi" or “netgpi” this will run get_big_pi(), get_pred_gpi() or get_netGPI() only on the sequences that belong to one of the HRGP classes:
To compare how many sequences are classified to one of the HRGP classes when proline hydroxylation is not considered:
#when considering only sequences with at least 3 predicted hydroxyprolinsmaab_at%>%filter(maab_class!="0")%>%nrow
-#> [1] 71
+#> [1] 69#when considering all sequences predicted to be secretedmaab_at2<-maab(at_nsp_filtered,
@@ -308,24 +292,22 @@
maab_at2%>%filter(maab_class!="0")%>%#remove maab_class 0 - not HRGPnrow
-#> [1] 72
-
Out of the 2126 sequences in at_nsp_filtered, 72 are predicted to belong to any MAAB class, one less compared to when proline hydroxylation prediction was incorporated. To check which sequence differs between maab_at and maab_at2:
+#> [1] 70
+
Out of the 1890 sequences in at_nsp_filtered, 70 are predicted to belong to any MAAB class, one less compared to when proline hydroxylation prediction was incorporated. To check which sequence differs between maab_at and maab_at2:
The differing sequence belongs to MAAB class 24 - sequences with HRGP bias and low motif coverege (<15%).
Domain annotation and GO enrichment
-
Domain annotation is a fundamental research tool in protein sequence analyses. In many occasions domain annotation has already been performed prior to HRGP mining, either by using hmmer(Eddy 2011), InterProScan(Jones et al. 2014) or some other resource. In such circumstances the annotation can be imported into R and joined with the filtered sequences. When this is not the case ragp offers the function get_hmm() which queries the hmmscan web server (Finn, Clements, and Eddy 2011). Domain annotation is not necessary for AGP prediction although it enhances interpretation, but it is a fundamental part of the MAAB classification scheme. As with other ragp functions the first argument is the data, followed by the names of the columns containing the protein sequences and corresponding identifiers:
+
Domain annotation is a fundamental research tool in protein sequence analyses. In many occasions domain annotation has already been performed prior to HRGP mining, either by using hmmer(Eddy 2011), InterProScan(Jones et al. 2014) or some other resource. In such circumstances the annotation can be imported into R and joined with the filtered sequences. When this is not the case ragp offers the functions get_hmm() which queries the hmmscan web server (Finn, Clements, and Eddy 2011) and get_cdd() which queries the Conserved Domain Database. Domain annotation is not necessary for AGP prediction although it enhances interpretation, but it is a fundamental part of the MAAB classification scheme. As with other ragp functions the first argument is the data, followed by the names of the columns containing the protein sequences and corresponding identifiers:
The resulting data frame resembles the web server’s output, or for those familiar with the hmmer software the domain hits table produced by using the argument --domtblout. First few rows:
head(at_hmm)
-#> id name acc desc clan
-#> 1 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037
-#> 2 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037
-#> 3 AT2G43620.1 Chitin_bind_1 PF00187.19 Chitin recognition protein <NA>
-#> 4 AT2G43620.1 Chitin_bind_1 PF00187.19 Chitin recognition protein <NA>
-#> 5 AT2G43945.1 <NA> <NA> <NA> <NA>
-#> 6 AT2G30933.1 X8 PF07983.13 X8 domain <NA>
-#> align_start align_end model_start model_end ievalue cevalue bitscore reported
-#> 1 94 228 2 155 1.3e-37 1.5e-41 129.9 TRUE
-#> 2 237 283 185 232 1.6e-06 1.8e-10 28.2 TRUE
-#> 3 30 64 5 38 1.9e-08 2.1e-12 34.7 TRUE
-#> 4 164 168 28 32 8.5e+03 9.5e-01 -2.7 FALSE
-#> 5 NA NA NA NA NA NA NA FALSE
-#> 6 81 151 1 75 2.6e-21 1.4e-25 76.1 TRUE
+#> id name acc desc clan align_start align_end model_start model_end
+#> 1 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 94 228 2 155
+#> 2 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 237 283 185 232
+#> 3 AT2G43620.1 Chitin_bind_1 PF00187.19 Chitin recognition protein <NA> 30 64 5 38
+#> 4 AT2G43620.1 Chitin_bind_1 PF00187.19 Chitin recognition protein <NA> 164 168 28 32
+#> 5 AT2G43945.1 <NA> <NA> <NA> <NA> NA NA NA NA
+#> 6 AT2G30933.1 X8 PF07983.13 X8 domain <NA> 81 151 1 75
+#> ievalue cevalue bitscore reported
+#> 1 1.3e-37 1.5e-41 129.9 TRUE
+#> 2 1.6e-06 1.8e-10 28.2 TRUE
+#> 3 1.9e-08 2.1e-12 34.7 TRUE
+#> 4 8.5e+03 9.5e-01 -2.7 FALSE
+#> 5 NA NA NA FALSE
+#> 6 2.6e-21 1.4e-25 76.1 TRUE
To check if any of the sequence with MAAB classes contain domains the output of get_hmm(), in this case the at_hmm data frame can be merged (dplyr::left_join()) with the output of maab() - the maab_at data frame. After some data cleaning we can plot the frequency of domains using ggplot2::geom_col().
maab_at%>%
@@ -396,32 +378,25 @@
go_enriched<-pfam2go(at_hmm)head(go_enriched)#first few rows
-#> id name acc desc clan align_start
-#> 1 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 94
-#> 2 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 94
-#> 3 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 94
-#> 4 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 237
-#> 5 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 237
-#> 6 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 237
-#> align_end model_start model_end ievalue cevalue bitscore reported
-#> 1 228 2 155 1.3e-37 1.5e-41 130 TRUE
-#> 2 228 2 155 1.3e-37 1.5e-41 130 TRUE
-#> 3 228 2 155 1.3e-37 1.5e-41 130 TRUE
-#> 4 283 185 232 1.6e-06 1.8e-10 28 TRUE
-#> 5 283 185 232 1.6e-06 1.8e-10 28 TRUE
-#> 6 283 185 232 1.6e-06 1.8e-10 28 TRUE
-#> Pfam_name GO_name GO_acc
-#> 1 Glyco_hydro_19 GO:chitin catabolic process GO:0006032
-#> 2 Glyco_hydro_19 GO:cell wall macromolecule catabolic process GO:0016998
-#> 3 Glyco_hydro_19 GO:chitinase activity GO:0004568
-#> 4 Glyco_hydro_19 GO:chitin catabolic process GO:0006032
-#> 5 Glyco_hydro_19 GO:cell wall macromolecule catabolic process GO:0016998
-#> 6 Glyco_hydro_19 GO:chitinase activity GO:0004568
+#> id name acc desc clan align_start align_end model_start model_end ievalue
+#> 1 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 94 228 2 155 1.3e-37
+#> 2 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 94 228 2 155 1.3e-37
+#> 3 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 94 228 2 155 1.3e-37
+#> 4 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 237 283 185 232 1.6e-06
+#> 5 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 237 283 185 232 1.6e-06
+#> 6 AT2G43620.1 Glyco_hydro_19 PF00182.19 Chitinase class I CL0037 237 283 185 232 1.6e-06
+#> cevalue bitscore reported Pfam_name GO_name GO_acc
+#> 1 1.5e-41 130 TRUE Glyco_hydro_19 GO:chitin catabolic process GO:0006032
+#> 2 1.5e-41 130 TRUE Glyco_hydro_19 GO:cell wall macromolecule catabolic process GO:0016998
+#> 3 1.5e-41 130 TRUE Glyco_hydro_19 GO:chitinase activity GO:0004568
+#> 4 1.8e-10 28 TRUE Glyco_hydro_19 GO:chitin catabolic process GO:0006032
+#> 5 1.8e-10 28 TRUE Glyco_hydro_19 GO:cell wall macromolecule catabolic process GO:0016998
+#> 6 1.8e-10 28 TRUE Glyco_hydro_19 GO:chitinase activity GO:0004568
GPI prediction
-
HRGPs, and especially AGPs are often linked to the membrane by a glycosylphosphatidylinositol (GPI) anchor (Ellis et al. 2010). Attachment of the GPI moiety to the carboxyl terminus (omega-site) of the polypeptide occurs after proteolytic cleavage of a C-terminal propeptide. In order to predict the presence of the omega sites in the filtered sequences two functions are available: get_big_pi() which queries the big-PI Plant Predictor(Eisenhaber et al. 2003) server, and get_pred_gpi() which queries the PredGPI(Pierleoni, Martelli, and Casadio 2008) server and get_netGPI() which queries the NetGPI server.
+
HRGPs, and especially AGPs are often linked to the membrane by a glycosylphosphatidylinositol (GPI) anchor (Ellis et al. 2010). Attachment of the GPI moiety to the carboxyl terminus (omega-site) of the polypeptide occurs after proteolytic cleavage of a C-terminal propeptide. In order to predict the presence of the omega sites in the filtered sequences two functions are available: get_big_pi() which queries the big-PI Plant Predictor(Eisenhaber et al. 2003) server, and get_pred_gpi() which queries the PredGPI(Pierleoni, Martelli, and Casadio 2008) server and get_netGPI() which queries the NetGPI server (Gíslason et al. 2021).
A key structural feature of HRGPs, based on abundance of disorder-promoting residues, is that they are intrinsically disordered proteins. To identify potential disordered regions in proteins ragp contains get_espritz() which queries ESpritz(Walsh et al. 2012) web server. Three models trained on different data sets are available and can be selected via the argument model:
‘X-Ray’ - based on missing atoms from the Protein Data Bank (PDB) X-ray solved structures. If this option is chosen then the predictors with short disorder options are executed.
@@ -485,18 +456,18 @@
Finn, Robert D., Jody Clements, and Sean R. Eddy. 2011. “HMMER Web Server: Interactive Sequence Similarity Searching.”Nucleic Acids Research 39 (Web Server issue): W29–37. https://doi.org/10.1093/nar/gkr367.
+
+Gíslason, Magnús Halldór, Henrik Nielsen, José Juan Almagro Armenteros, and Alexander Rosenberg Johansen. 2021. “Prediction of GPI-Anchored Proteins with Pointer Neural Networks.”Current Research in Biotechnology 3: 6–13. https://doi.org/https://doi.org/10.1016/j.crbiot.2021.01.001.
+
Johnson, Kim L., Andrew M. Cassin, Andrew Lonsdale, Antony Bacic, Monika S. Doblin, and Carolyn J. Schultz. 2017. “Pipeline to IdentifyHydroxyproline-RichGlycoproteins.”Plant Physiology 174 (2): 886–903. https://doi.org/10.1104/pp.17.00294.
diff --git a/docs/articles/pkgdown/analyse_files/header-attrs-2.11/header-attrs.js b/docs/articles/pkgdown/analyse_files/header-attrs-2.11/header-attrs.js
new file mode 100644
index 0000000..dd57d92
--- /dev/null
+++ b/docs/articles/pkgdown/analyse_files/header-attrs-2.11/header-attrs.js
@@ -0,0 +1,12 @@
+// Pandoc 2.9 adds attributes on both header and div. We remove the former (to
+// be compatible with the behavior of Pandoc < 2.8).
+document.addEventListener('DOMContentLoaded', function(e) {
+ var hs = document.querySelectorAll("div.section[class*='level'] > :first-child");
+ var i, h, a;
+ for (i = 0; i < hs.length; i++) {
+ h = hs[i];
+ if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6
+ a = h.attributes;
+ while (a.length > 0) h.removeAttribute(a[0].name);
+ }
+});
diff --git a/docs/articles/pkgdown/filter.html b/docs/articles/pkgdown/filter.html
index fe0d496..189d82c 100644
--- a/docs/articles/pkgdown/filter.html
+++ b/docs/articles/pkgdown/filter.html
@@ -31,7 +31,7 @@
ragp
- 0.3.2.0002
+ 0.3.5.9000
Predicting N-terminal secretory signal sequences (N-sp)
-
Hydroxyproline rich glycoproteins (HRGPs) are secreted proteins, therefore they are expected to contain a secretory signal sequence on the N-terminus (N-sp). ragp incorporates N-sp prediction by querying SignalP 4.1, TargetP 1.1(Emanuelsson et al. 2000) and Phobius(Käll, Krogh, and Sonnhammer 2007) in an efficient manner via the functions: get_signalp(), get_targetp() and get_phobius().
Hydroxyproline rich glycoproteins (HRGPs) are secreted proteins, therefore they are expected to contain a secretory signal sequence on the N-terminus (N-sp). ragp incorporates N-sp prediction by querying SignalP 5(armenteros_signalp5?), SignalP 4.1, TargetP 1.1(Emanuelsson et al. 2000) and Phobius(Käll, Krogh, and Sonnhammer 2007) via the functions: get_signalp5(), get_signalp(), get_targetp() and get_phobius().
+
In the ragp versions prior 0.3.5 we advised a majority vote between SignalP 4.1, TargetP 1.1 and Phobius should be used to determine which sequences should be flagged as secreted. Since the newest SignalP (version 5) utilizes a more sophisticated prediction method and a larger training set compared to the older methods we trust it provides a more accurate prediction of secreted protein sequences compared to the before mentioned majority vote.
-nsp_signalp<-get_signalp(data =at_nsp, #data frame name
- sequence =sequence, #column containing the sequences
- id =Transcript.id)#column containing protein identifiers
-
The predictions for the 2700 sequences contained in at_nsp data frame should be available in around 1 minute. The returned object nsp_signalp is a data frame resembling the web servers output:
+nsp_signalp<-get_signalp5(data =at_nsp, #data frame name
+ sequence =sequence, #column containing the sequences
+ id =Transcript.id)#column containing protein identifiers
+
The predictions for the 2700 sequences contained in at_nsp data frame should be available after several minutes. The returned object nsp_signalp is a data frame resembling the web servers output:
The three servers do not agree completely. The disagreement is even higher when all Arabidopsis protein sequences are checked:
-
-
-
-
Therefore we recommend using the majority vote as the first filtering step:
-
-bind_cols(nsp_signalp, nsp_targetp, nsp_phobius)%>%#bind the prediction data frames
- select(is.phobius, is.targetp, is.signalp, id)%>%#select the logical columns from each, as well as the id
- mutate(vote =rowSums(.[,1:3]))%>%#calculate the majority vote
- filter(vote>=2)%>%#filter by majority vote
- pull(id)->id_nsp#pull the filtered ids
+nsp_signalp%>%
+ filter(is.signalp)%>%#filter rows where is.signalp column is TRUE
+ pull(id)->id_nsp#pull the id column into a vectorat_nsp_filtered<-filter(at_nsp,
Transcript.id%in%id_nsp)#filter at_nsp by pulled ids
-
This will create at_nsp_filtered data frame with 2126 sequences.
+
This will create at_nsp_filtered data frame with 1890 sequences.
Predicting proline hydroxylation
The key feature of HRGPs is the presence of hydroxyprolines (Hyp, O) which represent glycosylation sites (Showalter et al. 2010). While many HRGPs can be found based on biased amino acid composition, or the presence of certain amino acid motifs, there exists an abundance of chimeric proteins comprised from specific domains and HRGP motifs which are much harder to identify based on the mentioned features. In an attempt to identify these types of sequences ragp incorporates a model specifically built to predict the probability of proline hydroxylation in plant proteins. This model is incorporated in the predict_hyp() function and can be utilized to filter potential HRGPs.
-
+
at_hyp<-predict_hyp(data =at_nsp_filtered,
sequence =sequence,
id =Transcript.id)
The object returned by predict_hyp is a list with two elements prediction and sequence. The prediction element is a data frame containing the probability of hydroxylation for each P in each sequence along with the amino acid span used for prediction. First few rows of this element:
#> id substr P_pos prob HYP
#> 1 AT2G43600.1 FSQNCMDTSCPGLKECCSRWG 31 0.015 No
@@ -207,13 +175,13 @@
#> 6 AT2G43600.1 YYGAGKHLGLPLLKDPDLVSR 194 0.015 No
The sequence (at_hyp$sequence) element is a data frame with the same sequences as provided in the function call in which prolines (P) are replaced by hydroxyprolines (O) at positions predicted to be hydroxylated. This element can be useful as input to scan_ag() as explained in HRGP analysis tutorial.
The predictions are based on a probability threshold of 0.224 which offers the best trade-off between specificity and sensitivity. To increase specificity at the cost of sensitivity the threshold can be increased using the tprob argument:
-
+
at_hyp2<-predict_hyp(data =at_nsp_filtered,
sequence =sequence,
id =Transcript.id,
tprob =0.6)
The at_hyp object can be used to filter the sequences that contain more than a certain number of hydroxyprolines. The default threshold will be used. To filter sequences with three or more O:
-
+
at_hyp$prediction%>%#the prediction elementgroup_by(id)%>%#group by idsummarise(n =sum(HYP=="Yes"))%>%#calculate the number of hydroxyprolines per sequence
@@ -222,7 +190,7 @@
at_nsp_filtered%>%filter(Transcript.id%in%at_3hyp)->at_nsp_3hyp#filter sequences from `at_nsp_filtered`
-
There are 398 sequences that satisfy this condition.
+
There are 365 sequences that satisfy this condition.
After the filtering step the sequences can be analyzed by scan_ag() to identify arabinogalactan motifs, by maab() to classify prototypical HRGPs, by get_big_pi() and get_pred_gpi() to predict GPI anchor sites in sequences and by get_hmm() and get_espritz() to identify domains and disordered regions. This is explained in HRGP analysis tutorial.
diff --git a/docs/articles/pkgdown/filter_files/header-attrs-2.11/header-attrs.js b/docs/articles/pkgdown/filter_files/header-attrs-2.11/header-attrs.js
new file mode 100644
index 0000000..dd57d92
--- /dev/null
+++ b/docs/articles/pkgdown/filter_files/header-attrs-2.11/header-attrs.js
@@ -0,0 +1,12 @@
+// Pandoc 2.9 adds attributes on both header and div. We remove the former (to
+// be compatible with the behavior of Pandoc < 2.8).
+document.addEventListener('DOMContentLoaded', function(e) {
+ var hs = document.querySelectorAll("div.section[class*='level'] > :first-child");
+ var i, h, a;
+ for (i = 0; i < hs.length; i++) {
+ h = hs[i];
+ if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6
+ a = h.attributes;
+ while (a.length > 0) h.removeAttribute(a[0].name);
+ }
+});
diff --git a/docs/articles/pkgdown/start.html b/docs/articles/pkgdown/start.html
index 41bcaca..b9508eb 100644
--- a/docs/articles/pkgdown/start.html
+++ b/docs/articles/pkgdown/start.html
@@ -31,7 +31,7 @@
ragp
- 0.3.2.0002
+ 0.3.5.9000
Most ragp functions require single letter protein sequences and the corresponding identifiers as input. These can be provided in the form of basic R data types such as vectors or data frames. Additionally ragp imports the seqinr package for the manipulation of .FASTA files, so the input objects can be a list of SeqFastaAA objects returned by the seqinr::read.fasta(). The location of a .FASTA file is also possible as a type of input.
+
Most ragp functions require single letter protein sequences and the corresponding identifiers as input. These can be provided in the form of basic R data types such as vectors or data frames. Additionally ragp imports the seqinr package for the manipulation of .FASTA files, so the input objects can be a list of SeqFastaAA objects returned by the seqinr::read.fasta(). The location of a .FASTA file is also possible as a type of input. As of ragp version 0.3.5 objects of class AAStringSet are also supported.
Input options will be illustrated using scan_ag() function:
provide a character vector of protein sequences to the sequence argument and a character vector of protein identifiers to the id argument:
@@ -168,8 +168,14 @@
input5<-scan_ag(data ="at_nsp.fasta")#file at_nsp.fasta is in the working directory
The only exceptions to this design are plotting functions plot_prot() and plot_signalp() which require protein sequences to be supplied in the form of string vectors (input1 in the above example) and pfam2go() which does not take sequences as input.
+
The only exceptions to this design are the plotting function plot_prot() which requires protein sequences to be supplied in the form of string vectors (input1 in the above example) and pfam2go() which does not take sequences as input.
diff --git a/docs/articles/pkgdown/start_files/header-attrs-2.11/header-attrs.js b/docs/articles/pkgdown/start_files/header-attrs-2.11/header-attrs.js
new file mode 100644
index 0000000..dd57d92
--- /dev/null
+++ b/docs/articles/pkgdown/start_files/header-attrs-2.11/header-attrs.js
@@ -0,0 +1,12 @@
+// Pandoc 2.9 adds attributes on both header and div. We remove the former (to
+// be compatible with the behavior of Pandoc < 2.8).
+document.addEventListener('DOMContentLoaded', function(e) {
+ var hs = document.querySelectorAll("div.section[class*='level'] > :first-child");
+ var i, h, a;
+ for (i = 0; i < hs.length; i++) {
+ h = hs[i];
+ if (!/^h[1-6]$/i.test(h.tagName)) continue; // it should be a header h1-h6
+ a = h.attributes;
+ while (a.length > 0) h.removeAttribute(a[0].name);
+ }
+});
diff --git a/docs/articles/ragp-vignette.html b/docs/articles/ragp-vignette.html
index f25b964..ece4e4d 100644
--- a/docs/articles/ragp-vignette.html
+++ b/docs/articles/ragp-vignette.html
@@ -5,20 +5,20 @@
-ragp vignette • ragp
-
-
-
-
-
-
-
+Introduction to ragp • ragp
+
+
+
+
+
+
+
-
+
This vignette provides an introduction into the functions and workflow in ragp R package. ragp is designed for mining and analysis of plant Hydroxyproline rich glycoproteins but it can also be used for protein annotation and for obtaining predictions for several protein features based on sequence (secretory signals, transmembrane regions, domains and glycosylphosphatidylinositol attachment sites). Additionally ragp provides tools for visualization of the mentioned attributes.
-
##Introduction
+
+
+What is ragp?
Hydroxyproline rich glycoproteins (HRGPs) are one of the most complex families of macromolecules found in plants, due to the diversity of glycans decorating the protein backbone, as well as the heterogeneity of the protein backbones. While this diversity is responsible for the wide array of physiological functions associated with HRGPs, it hinders attempts for homology based identification. Current approaches, based on identifying sequences with characteristic motifs and biased amino acid composition, are limited to prototypical sequences.
-
ragp is an R package for mining and analyses of HRGPs, with emphasis on arabinogalactan protein sequences (AGPs). The ragp filtering pipeline exploits one of the HRGPs key features, the presence of hydroxyprolines which represent glycosylation sites. Main package features include prediction of proline hydroxylation sites, amino acid motif and bias analyses, efficient communication with web servers for prediction of N-terminal signal peptides, glycosylphosphatidylinositol modification sites and protein disorder, as well as the ability to annotate sequences through hmmscan and subsequent GO enrichment, based on predicted pfam domains.
+
ragp is an R package for mining and analyses of HRGPs, with emphasis on arabinogalactan protein sequences (AGPs). The ragp filtering pipeline exploits one of the HRGPs key features, the presence of hydroxyprolines which represent glycosylation sites. Main package features include prediction of proline hydroxylation sites, amino acid motif and bias analyses, efficient communication with web servers for prediction of N-terminal signal peptides and glycosylphosphatidylinositol modification sites, as well as the ability to annotate sequences through CDD or hmmscan and subsequent GO enrichment, based on predicted Pfam domains.
The workflow in ragp is illustrated with the following diagram (ragp functions to be used for each of the tasks are boxed grey):
The filtering layer:
-
predict the presence of secretory signals (N-sp) and filter sequences containing them. Since several prediction algorithms are utilized (functions get_signalp, get_targetp and get_phobius) this decision is reached by a majority vote.
+
predict the presence of secretory signals (N-sp) and filter sequences containing them. Several prediction algorithms are available. Currently recommended is to use get_signalp5() which queries SignalP5(Almagro Armenteros et al. 2019).
-
predict proline hydroxylation (function predict_hyp) and filter sequences containing at least three potential hydroxyprolines.
+
predict proline hydroxylation (predict_hyp()) and filter sequences containing several (for example three or more) potential hydroxyprolines.
The analysis layer:
-
find localized clusters of characteristic arabinogalactan motifs (AG glycomodules) to identify potential AGPs (function scan_ag) among the hydroxyproline containing sequences.
-
perform motif and amino acid bias (MAAB, (Johnson et al. 2017)) classification scheme (function maab) to classify hydroxyproline containing sequences into 23 distinct HRGP groups.
+
find localized clusters of characteristic arabinogalactan motifs (AG glycomodules) to identify potential AGPs (scan_ag()) among the hydroxyproline containing sequences.
+
perform motif and amino acid bias (MAAB, (Johnson et al. 2017)) classification scheme (maab()) to classify hydroxyproline containing sequences.
+
annotate domains using Pfam (get_hmm()) and enrich with gene ontology (GO) terms (pfam2go()). Alternatively annotate domains using CDD.
-
annotate domains using Pfam (function get_hmm) and enrich with gene ontology (GO) terms (function pfam2go).
+
Additionally ragp provides tools for visualization of the mentioned attributes via plot_prot().
+
+
+Installation
+
There are several ways to install R packages hosted on git-hub, however the simplest is to use devtools::install_github() which will perform all the required steps automatically.
Most ragp functions require single letter protein sequences and the corresponding identifiers as input. These can be provided in the form of basic R data types such as vectors or data frames. Additionally ragp imports the seqinr package for the manipulation of .FASTA files, so the input objects can be a list of SeqFastaAA objects returned by the seqinr::read.fasta(). The location of a .FASTA file is also possible as a type of input. As of ragp version 0.3.5 objects of class AAStringSet are also supported.
+
Input options will be illustrated using scan_ag() function:
+
+
provide a character vector of protein sequences to the sequence argument and a character vector of protein identifiers to the id argument:
-
Additionally ragp provides tools for visualization of the mentioned attributes via the function plot_prot.
-
##Data import
-
To proceed with this guide the following R packages are required:
The examples will be performed on a set of 2706 Arabidopsis protein sequences included in ragp package as the at_nsp data frame which contains four columns:
ragp functions require single letter protein sequences and the corresponding identifiers as input. These can be provided in the form of basic R data types such as vectors or data frames. Additionally ragp imports the seqinr package for the manipulation of .FASTA files, so the input objects can be a list of SeqFastaAA objects returned by the seqinr function read.fasta. The location of a .FASTA file is also possible as a type of input.
-
Input options will be illustrated using scan_ag function:
-
data(at_nsp) #a data frame of 2700 Arabidopsis protein sequences
+
provide a data.frame to data argument, and names of columns containing the protein sequences and corresponding identifiers to sequence and id arguments:
and lastly the location of a .FASTA file to be analyzed as string:
-
input5 <- scan_ag(data = "at_nsp.fasta")
-
#Filtering HRGPs
-
The filtering layer in ragp consists of two sequential tasks:
-
-
predict the presence of secretory signals (N-sp) and filter sequences containing them.
-
-
predict proline hydroxylation and filter sequences containing at least several potential hydroxyprolines.
-
-
The following section explains how to perform these tasks.
-
##Predicting N-terminal secretory signal sequences (N-sp)
-
HRGPs are secreted proteins, therefore they are expected to contain a secretory signal sequence on the N-terminus (N-sp). ragp incorporates N-sp prediction by querying SignalP, TargetP(Emanuelsson et al. 2000) and Phobius(Käll, Krogh, and Sonnhammer 2007) in an efficient manner via the functions: get_signalp, get_targetp and get_phobius.
The predictions for the 2700 sequences contained in at_nsp data frame should be available in around 1 minute. When handling large FASTA files with many sequences it is recommended to split them into smaller files each containing 10000 - 20000 sequences, using the ragp function split_fasta and to obtain predictions in a loop. The returned object nsp_signalp is a data frame resembling the web servers output:
This will create at_nsp2 data frame with 2126 sequences.
-
##Predicting proline hydroxylation
-
The key feature of HRGPs is the presence of hydroxyprolines (Hyp, O) which represent glycosylation sites (Showalter et al. 2010). While many HRGPs can be found based on biased amino acid composition, or the presence of certain amino acid motifs, there exists an abundance of chimeric proteins comprised from specific domains and HRGP motifs which are much harder to identify based on the mentioned features. In an attempt to identify these types of sequences ragp incorporates a model specifically built to predict the probability of proline hydroxylation in plant proteins. This model is incorporated in the predict_hyp function and can be utilized to filter potential HRGPs.
The object returned by predict_hyp is a list with two elements prediction and sequence. The prediction element is a data frame containing the probability of hydroxylation for each P in each sequence along with the amino acid span used for prediction. First few rows of this element:
-
head(at_hyp$prediction)
-#> id substr P_pos prob HYP
-#> 1 AT2G43600.1 FSQNCMDTSCPGLKECCSRWG 31 0.01492346 No
-#> 2 AT2G43600.1 EYCGFFCFSGPCNIKGKSYGY 58 0.01603031 No
-#> 3 AT2G43600.1 YGYDYNVDAGPRGKIETVITS 76 0.01827260 No
-#> 4 AT2G43600.1 ERYCSKSKKYPCEPGKNYYGR 163 0.01534502 No
-#> 5 AT2G43600.1 CSKSKKYPCEPGKNYYGRGLL 166 0.01449253 No
-#> 6 AT2G43600.1 YYGAGKHLGLPLLKDPDLVSR 194 0.01456114 No
-
These predictions are based on a probability threshold of 0.224 which offers the best trade-off between specificity and sensitivity. To increase specificity at the cost of sensitivity the threshold can be increased using the tprob argument:
The at_hyp object can be used to filter the sequences that contain more than a certain number of hydroxyprolines. The default threshold will be used. To filter sequences with three or more O:
To check which of the filtered sequences are potential arabinogalactans the function scan_ag can be used. This function scans the protein sequences for AG glycomodule clusters, by constructing regular expressions based on user input.
-
By default scan_ag searches for at least three dipeptides PA, PS, PT, PG, PV, AP, SP, TP, GP and VP which are separated by maximally 10 amino acids. The minimal number of dipeptides to be considered can be changed by the argument dim while the maximum number of separating amino acids can be tweaked using the argument div. To consider only PA, PS, PT, AP, SP and TP dipeptides the type argument can be changed to "conservative". In cases when the dipeptides are ambiguous, such as SPT, PTP etc. all three amino acids will be counted as one dipeptide, while cases such as PTPA are considered as two dipeptides separated by zero amino acids. When the sequence object from predict_hyp function is passed to scan_ag only prolines predicted to be hydroxylated are considered when searching for dipeptides:
The output object scaned_At can be a list, or a data frame depending on the simplify/tidy arguments. The default output is a data frame one row per sequences with five columns. Here are the 55 N-terminal amino acids from the Classical arabinogalactan protein 9 (AT2G14890.1) as an example:
The column “sequence” contains the input sequence where only the amino acids in potential AG glycomodules are in uppercase while all other are in lower case. The column “AG_aa” contains the amino acid count found in potential AG glycomodules, while columns “total_length” and “longest” correspond to the matched sequence spans.
-
True AG glycomodules contain hydroxyproline instead of proline in the dipeptides: OA, OS, OT, AO, SO and TO (and in some cases OG, OV, GO and VO) but the above output considers P instead of O since most of the time the positions of O are unknown. However if the sequence argument to scan_ag contains O’s at some positions, scan_ag will consider only them. To do this the sequence element of predict_hyp function output can be used:
With default settings scan_ag identifies 234 sequences. To compare it with the number of sequences that would be filtered if the proline hydroxylation prediction was omitted the function can be run on at_nsp2 data:
Thus the prediction of hydroxyprolines greatly reduced the pool of potential arabinogalactans. scan_ag is capable of identifying potential chimeric and short AGPs without pronounced amino acid bias and with few characteristic motifs. In combination with predict_hyp the specificity of the found sequences is greatly increased since the majority of sequences unlikely to contain hydroxyprolines are excluded.
-
##Motif and amino acid bias classification - MAAB
-
The MAAB pipeline is explained in detail in Johnson et al. (2017). In short, after removal of sequences without a predicted N-sp and removal of sequences with predicted domains the sequences are first evaluated for amino acid bias. Sequences with > 45% of PAST (AGP bias) or PSKY (EXT bias) or PVKY (proline-rich proteins (PRP) bias) are grouped further into 24 distinct groups (23 of which represent HRGPs) based on amino acid bias, presence of GPIs and the following motifs (as regular expression):
+input4 <- scan_ag(data = At_seq_fas)
-
EXT: the SPn motif SP{3,5} and the tyr motifs [FY].Y, KHY, VY[HKDE], V.Y and YY.
-
-
PRP: PPV.[KT], PPV[QK] and KKPCPP.
-
-
AGP: [AVTG]P{1,3} and [ASVTG]P{1,2}.
+
provide the location of a .FASTA file to be analyzed as string:
-
The motifs are counted in a specific order ext > tyr > prp > agp, and overlapping motifs are not counted twice. Based on the final motif count and the overall coverage of the motifs HRGPs are clustered in the mentioned 24 groups. This classification can be performed using the ragp function maab:
Of the 398 sequences in at_nsp_3hyp, 71 are predicted to belong to any of the HRGP classes (1 - 24). The output of the maab function is a data frame containing columns: id - sequence identifier; ext_sp, ext_tyr, prp and agp - corresponding motif count; past_percent, pvyk_percent, psky_percent, p_percent - percentage of the corresponding amino acids in sequence; coverage - the motif coverage; maab_class - the MAAB class. The first several hits:
Since the function was not provided knowledge on the GPI presence, both possible classes are indicated as in: “1/4” or “2/9”. GPI presence can be indicated with the argument gpi which accepts a Boolean vector of the same length as the sequences, like the is.bigpi/is.gpi columns from the outputs of get_big_pi/get_pred_gpi functions. Another way to remove ambiguities in classes if GPI presence is unknown is to set get_gpi argument to "bigpi" or "predgpi":
This will run get_big_pi or get_pred_gpi only on the sequences that belong to one of the HRGP classes thus resolving class ambiguities that depend on GPI knowledge.
-
##Domain annotation and GPI prediction
-
Domain annotation is a fundamental research tool in protein sequence analyses. In many occasions domain annotation has already been performed prior to HRGP mining, either by using hmmer(Eddy 2011), InterProScan(Jones et al. 2014) or some other resource. In such circumstances the annotation can be imported into R and joined with the filtered sequences. When this is not the case ragp offers the function get_hmm which queries the hmmscan web server (Finn, Clements, and Eddy 2011). Domain annotation is not necessary for AGP prediction although it enhances interpretation, but it is a fundamental part of the MAAB classification scheme. As with other ragp functions the first argument is the data, followed by the names of the columns containing the protein sequences and corresponding identifiers:
The resulting data frame resembles the web server’s output, or for those familiar with the hmmer software the domain hits table produced by using the argument --domtblout. First few rows with several columns omitted:
HRGPs, and especially AGPs are often linked to the membrane by a glycosylphosphatidylinositol (GPI) anchor (Ellis et al. 2010). Attachment of the GPI moiety to the carboxyl terminus (omega-site) of the polypeptide occurs after proteolytic cleavage of a C-terminal propeptide. In order to predict the presence of the omega sites in the filtered sequences two functions are available: get_big_pi which queries the big-PI Plant Predictor(Eisenhaber et al. 2003) server and get_pred_gpi which queries the PredGPI(Pierleoni, Martelli, and Casadio 2008) server.
This function provides detailed per sequence output when the argument simplify is set to FALSE, by default simplify is set to TRUE and the resulting data frame has one row per sequence:
A key structural feature of HRGPs, based on abundance of disorder-promoting residues, is that they are intrinsically disordered proteins. To identify potential disordered regions in proteins ragp contains get_espritz function which queries ESpritz(Walsh et al. 2012) web server. Three models trained on different data sets are available and can be selected via the argument model:
+
input5 <- scan_ag(data = "at_nsp.fasta") #file at_nsp.fasta is in the working directory
-
‘X-Ray’ - based on missing atoms from the Protein Data Bank (PDB) X-ray solved structures. If this option is chosen then the predictors with short disorder options are executed.
-
-
‘Disprot’ - contains longer disorder segments compared to x-ray. In particular, disprot is a manually curated database which is often based on functional attributes of the disordered region was used for this definition. Disorder residues are defined if the disprot curators consider the residue to be disordered at least once. All other residues are considered ordered. If this option is chosen then the predictors with long disorder options are executed.
-
-
‘NMR’ - based on NMR mobility. NMR flexibility is calculated using the Mobi server(Martin, Walsh, and Tosatto 2010) optimized to replicate the ordered-disordered NMR definition used in CASP8.
-These models provide quite different predictions. To query ESpritz:
The output of the function depends on the simplify argument. If simplify = TRUE (default) the output is a data.frame (one row per disordered region) with columns start and end which indicate the start and end of the predicted disordered region as well as the column id which is a protein identifier. When simplify = FALSE the output contains predicted probabilities of disorder for each amino acid.
-
#Protein structure diagram
-
All of the above mentioned protein features can be visualized using the function plot_prot. It accepts protein sequences in the form of strings, calls the above mentioned functions to obtain sequence features and returns a ggplot object:
-
ind <- c(20, 23, 34, 52, 80, 127, 345) #some indexes which will be used to filter sequences
-p1 <- plot_prot(sequence = at_nsp$sequence[ind],
- id = at_nsp$Transcript.id[ind],
- dom_sort = "ievalue") #to plot domains by independent e-value (lower on top)
-
-p1
-
-
The diagram contains the following elements:
-
-
Transmembrane regions (TM) are shown in yellow, extracellular regions are indicated by the dashed line above the sequences while intracellular regions are indicated by the dashed line bellow the sequence (as predicted by Phobius).
-
Signal peptides (as predicted by SignalP) are indicated by the thick red line on the N-terminal side.
-
GPI attachment sites (as predicted by Big Pi or PredGPI depending on the argument gpi) are indicated by blue rhomboids.
-
Hydroxyprolines (as predicted by predict_hyp) are indicated by vertical dark gray lines.
-
AG glycomodul spans are indicated by the light grey background.
-
Domains (as predicted by hmmscan) are indicated by rectangles with an appropriate fill as indicated in the legend.
-
-
To change the order of domain plotting:
-
p1 <- plot_prot(sequence = at_nsp$sequence[ind],
- id = at_nsp$Transcript.id[ind],
- dom_sort = "cba") #to plot domains by name, first on top, also check "abc"
-
-p1
-
-
To change the axis ratio:
-
p1 +
- coord_equal(ratio = 50)
-
-
To change the domain colors use another fill palette:
-
p1 +
- scale_fill_brewer(type = "div")
-
-
To change feature colors set them as arguments (hyp_col, gpi_col, nsp_col, ag_col and tm_col) to plot_prot:
Add annotations to the plot (y axis is continuous):
-
p1 +
- annotate(geom = "text",
- y = 6,
- x = 100,
- label = "look at all these TMs") +
- annotate(geom = "point",
- y = 4,
- x = c(237, 240, 298),
- shape = 21,
- fill = "dodgerblue",
- size = 3)
-
-
##Acknowledgements
-
This software was developed with funding from the Ministry of Education, Science and Technological Development of the Republic of Serbia (Projects TR31019 and OI173024).
Eisenhaber, Birgit, Michael Wildpaner, Carolyn J. Schultz, Georg H. H. Borner, Paul Dupree, and Frank Eisenhaber. 2003. “Glycosylphosphatidylinositol Lipid Anchoring of Plant Proteins. Sensitive Prediction from Sequence- and Genome-Wide Studies for Arabidopsis and Rice.” Plant Physiology 133 (4): 1691–1701. https://doi.org/10.1104/pp.103.023580.
-
-
-
Ellis, Miriam, Jack Egelund, Carolyn J. Schultz, and Antony Bacic. 2010. “Arabinogalactan-Proteins: Key Regulators at the Cell Surface?” Plant Physiology 153 (2): 403–19. https://doi.org/10.1104/pp.110.156000.
-
-
-
Emanuelsson, O., H. Nielsen, S. Brunak, and G. von Heijne. 2000. “Predicting Subcellular Localization of Proteins Based on Their N-Terminal Amino Acid Sequence.” Journal of Molecular Biology 300 (4): 1005–16. https://doi.org/10.1006/jmbi.2000.3903.
-
-
-
Finn, Robert D., Jody Clements, and Sean R. Eddy. 2011. “HMMER Web Server: Interactive Sequence Similarity Searching.” Nucleic Acids Research 39 (Web Server issue): W29–W37. https://doi.org/10.1093/nar/gkr367.
+
dat <- Biostrings::readAAStringSet("at_nsp.fasta") #file at_nsp.fasta is in the working directory
+input6 <- scan_ag(data = dat)
The only exceptions to this design are the plotting function plot_prot() which requires protein sequences to be supplied in the form of string vectors (input1 in the above example) and pfam2go() which does not take sequences as input.
-
-
Johnson, Kim L., Andrew M. Cassin, Andrew Lonsdale, Antony Bacic, Monika S. Doblin, and Carolyn J. Schultz. 2017. “Pipeline to Identify Hydroxyproline-Rich Glycoproteins.” Plant Physiology 174 (2): 886–903. https://doi.org/10.1104/pp.17.00294.
-
-
Jones, Philip, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam, et al. 2014. “InterProScan 5: Genome-Scale Protein Function Classification.” Bioinformatics 30 (9): 1236–40. https://doi.org/10.1093/bioinformatics/btu031.
-
-
-
Käll, Lukas, Anders Krogh, and Erik L. L. Sonnhammer. 2007. “Advantages of Combined Transmembrane Topology and Signal Peptide Prediction–the Phobius Web Server.” Nucleic Acids Research 35 (Web Server issue): W429–432. https://doi.org/10.1093/nar/gkm256.
+
+
+Further reading
+
All ragp functions return basic R data structures such as data frames, lists of vectors and lists of data frames, making them convenient for manipulation to anyone familiar with R. An especially effective way to manipulate these objects is by utilizing the tidyverse collection of packages, especially dplyr and ggplot2. Several dplyr functions that will be especially handy for data wrangling are:
Examples on usage of these functions on objects returned by ragp functions are provided in HRGP filtering and HRGP analysis tutorials. Additionally there are extensive examples on the internet on usage of the mentioned functions.
+
Obtaining pretty visualizations is usually the goal of the above mentioned data manipulations. The golden standard of R graphics at present is the ggplot2 package and we recommend it to graphically summarize the data. Additionally ragp contains plot_prot() function which is a wrapper for ggplot2, and while plot_prot() can be used without knowing ggplot2 syntax, to tweak the plot style at least a basic knowledge of ggplot2 is required. Examples are provided in protein sequence visualization tutorial.
-
-
Martin, Alberto J. M., Ian Walsh, and Silvio C. E. Tosatto. 2010. “MOBI: A Web Server to Define and Visualize Structural Mobility in NMR Protein Ensembles.” Bioinformatics 26 (22): 2916–7. https://doi.org/10.1093/bioinformatics/btq537.
+
+
+Acknowledgements
+
This software was developed with funding from the Ministry of Education, Science and Technological Development of the Republic of Serbia (Projects TR31019 and OI173024).
-
-
Pierleoni, Andrea, Pier Luigi Martelli, and Rita Casadio. 2008. “PredGPI: A GPI-Anchor Predictor.” BMC Bioinformatics 9 (September): 392. https://doi.org/10.1186/1471-2105-9-392.
+
+
+References
+
+
+Almagro Armenteros, José Juan, Konstantinos D. Tsirigos, Casper Kaae Sønderby, Thomas Nordahl Petersen, Ole Winther, Søren Brunak, Gunnar von Heijne, and Henrik Nielsen. 2019. “SignalP 5.0 Improves Signal Peptide Predictions Using Deep Neural Networks.”Nature Biotechnology 37: 420–23. https://doi.org/0.1038/s41587-019-0036-z.
-
-
Showalter, Allan M., Brian Keppler, Jens Lichtenberg, Dazhang Gu, and Lonnie R. Welch. 2010. “A Bioinformatics Approach to the Identification, Classification, and Analysis of Hydroxyproline-Rich Glycoproteins.” Plant Physiology 153 (2): 485–513. https://doi.org/10.1104/pp.110.156554.
+
+Johnson, Kim L., Andrew M. Cassin, Andrew Lonsdale, Antony Bacic, Monika S. Doblin, and Carolyn J. Schultz. 2017. “Pipeline to IdentifyHydroxyproline-RichGlycoproteins.”Plant Physiology 174 (2): 886–903. https://doi.org/10.1104/pp.17.00294.
-
-
Walsh, Ian, Alberto J. M. Martin, Tomàs Di Domenico, and Silvio C. E. Tosatto. 2012. “ESpritz: Accurate and Fast Prediction of Protein Disorder.” Bioinformatics 28 (4): 503–9. https://doi.org/10.1093/bioinformatics/btr682.
Hydroxyproline rich glycoproteins (HRGPs) are one of the most complex families of macromolecules found in plants, due to the diversity of glycans decorating the protein backbone, as well as the heterogeneity of the protein backbones. While this diversity is responsible for the wide array of physiological functions associated with HRGPs, it hinders attempts for homology based identification. Current approaches, based on identifying sequences with characteristic motifs and biased amino acid composition, are limited to prototypical sequences.
-
ragp is an R package for mining and analyses of HRGPs, with emphasis on arabinogalactan protein sequences (AGPs). The ragp filtering pipeline exploits one of the HRGPs key features, the presence of hydroxyprolines which represent glycosylation sites. Main package features include prediction of proline hydroxylation sites, amino acid motif and bias analyses, efficient communication with web servers for prediction of N-terminal signal peptides and glycosylphosphatidylinositol modification sites, as well as the ability to annotate sequences through hmmscan and subsequent GO enrichment, based on predicted Pfam domains.
+
ragp is an R package for mining and analyses of HRGPs, with emphasis on arabinogalactan protein sequences (AGPs). The ragp filtering pipeline exploits one of the HRGPs key features, the presence of hydroxyprolines which represent glycosylation sites. Main package features include prediction of proline hydroxylation sites, amino acid motif and bias analyses, efficient communication with web servers for prediction of N-terminal signal peptides and glycosylphosphatidylinositol modification sites, as well as the ability to annotate sequences through CDD or hmmscan and subsequent GO enrichment, based on predicted Pfam domains.
The workflow in ragp is illustrated with the following diagram (ragp functions to be used for each of the tasks are boxed grey):
The filtering layer:
-
predict the presence of secretory signals (N-sp) and filter sequences containing them. Since several prediction algorithms are utilized (get_signalp(), get_targetp() and get_phobius()) this decision is reached by a majority vote.
+
predict the presence of secretory signals (N-sp) and filter sequences containing them. Several prediction algorithms are available. Currently recommended is to use get_signalp5() which queries SignalP5 (Almagro Armenteros et al. 2019).
-
predict proline hydroxylation (predict_hyp()) and filter sequences containing at least three potential hydroxyprolines.
+
predict proline hydroxylation (predict_hyp()) and filter sequences containing several (for example three or more) potential hydroxyprolines.
The analysis layer:
find localized clusters of characteristic arabinogalactan motifs (AG glycomodules) to identify potential AGPs (scan_ag()) among the hydroxyproline containing sequences.
perform motif and amino acid bias (MAAB, (Johnson et al. 2017)) classification scheme (maab()) to classify hydroxyproline containing sequences.
-
annotate domains using Pfam (get_hmm()) and enrich with gene ontology (GO) terms (pfam2go()).
+
annotate domains using Pfam (get_hmm()) and enrich with gene ontology (GO) terms (pfam2go()). Alternatively annotate domains using CDD.
Additionally ragp provides tools for visualization of the mentioned attributes via plot_prot().
@@ -179,6 +180,9 @@
References
+
+Almagro Armenteros, José Juan, Konstantinos D. Tsirigos, Casper Kaae Sønderby, Thomas Nordahl Petersen, Ole Winther, Søren Brunak, Gunnar von Heijne, and Henrik Nielsen. 2019. “SignalP 5.0 Improves Signal Peptide Predictions Using Deep Neural Networks.” Nature Biotechnology 37: 420–23. https://doi.org/0.1038/s41587-019-0036-z.
+
Johnson, Kim L., Andrew M. Cassin, Andrew Lonsdale, Antony Bacic, Monika S. Doblin, and Carolyn J. Schultz. 2017. “Pipeline to Identify Hydroxyproline-Rich Glycoproteins.” Plant Physiology 174 (2): 886–903. https://doi.org/10.1104/pp.17.00294.
+plot_prot()nsp argument can now be "signalp", "signalp5" or "none". Default is "signalp5". This argument determines if get_signalp() or get_signalp5() are used for N-sp prediction. Data.frame input is accepted as well.
+
+plot_prot()tm argument can now be "phobius", "tmhmm" or "none". Default is "phobius". This argument determines if get_phobius() or get_tmhmm() are used for TM prediction. Data.frame input is accepted as well.
+
+plot_prot()domain argument can now be "cdd", "hmm" or "none". Default is "cdd". This argument determines if get_hmm() or get_cdd() are used for domain annotation. Data.frame input is accepted as well.
+
all get_* and scan_* functions, as well as maab() now work with AAStringSet class objects. #5
+
+
+
+
+Bug Fixes and Improvements
-get_espritz() has been restored since Espritz server is again available at http://old.protein.bio.unipd.it/espritz/.
+get_signalp() output contains an additional column sp.length - integer, length of the predicted signal peptide. This column is a copy of Ymax.pos. Potentially BREAKING CHANGE
-plot_prot() has new arguments gpi_size - controls the size of the gpi symbol and gpi_shape- controls the shape of the gpi symbol.
+get_phobius() output contains an additional column sp.length - integer, length of the predicted signal peptide. Potentially BREAKING CHANGE
-plot_prot() has new argument hyp_scan - which is a logical and determines if scan_ag() (when ag = TRUE) should scan for arabinogalactan motifs containing only predicted hydroxyprolines. This argument changes the previous default behavior of plot_prot() which was equivalent to hyp_scan = FALSE.
+get_phobius() the name of the output column “Name” has been changed to “id”. BREAKING CHANGE
+
plot_prot() has new arguments gpi_size - controls the size of the gpi symbol and gpi_shape- controls the shape of the gpi symbol.
+
plot_prot() has new argument hyp_scan - which is a logical and determines if scan_ag() (when ag = TRUE) should scan for arabinogalactan motifs containing only predicted hydroxyprolines. This argument changes the previous default behavior of plot_prot() which was equivalent to hyp_scan = FALSE.
get_big_pi() output column is.bigpi has been renamed to is.gpi to bring the output in line with other gpi predicting functions. BREAKING CHANGE.
@@ -194,9 +255,9 @@
maab() argument get_gpi has an additional option "netgpi". When set, the function will query NetGPI web server to resolve ambiguities in maab classes depending on GPI-anchoring predictions.
@@ -207,9 +268,9 @@
ragp 0.3.0.0002
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
maab() now correctly performs when only a single protein sequence is provided as an argument.
get_hmm() receives additional numeric arguments: ievalue and bitscore. These arguments are used to filter sequences with lower or equal ievalue and higher or equal bitscore in the output. This is useful when used from plot_prot() to avoid plotting weakly identified domains.
@@ -219,9 +280,9 @@
ragp 0.3.0.0001
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
get_big_pi() and get_pred_gpi() now return NA in respective is.bigpi and is.gpi columns when the servers are unable to make a prediction due to non-amino acid letters or length of the sequence.
maab() now correctly does not resolve class ambiguities when the logical vector provided as gpi argument contains NA values. Previously it returned NA as maab_class. This is useful when maab() is called with get_gpi = 'predgpi' or get_gpi = 'bigpi' arguments, and the corresponding servers are unable to make a prediction for a sequence due to non-amino acid letters or length of the sequence.
@@ -231,17 +292,17 @@
ragp 0.3.0.0000
-
+
-New Features
+New Features
predict_hyp() internal model is updated to 2nd version (‘V2’). Predictions are around 25% faster compared to the first version. The performance in terms of accuracy is similar based on the test set used. ‘V2’ was created using a more streamlined manner and is the default model. The old model (‘V1’) is still available using the version argument to predict_hyp().
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
predict_hyp() now checks if all provided ids are unique. Previously non unique ids caused an error at the end of computation.
predict_hyp()sequence output has changed for sequences containing non amino acid letters. Previously NA was returned for such sequences. At present all “P”" for which the probability is higher then the defined threshold (tprob argument) are changed to “O”" and all others are left as “P”.
@@ -251,9 +312,9 @@
ragp 0.2.1.9100
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
maab() now correctly outputs when there are no MAAB classes found and get_gpi argument is set to "predgpi" or "bigpi". Previously this caused an error due to stringr::write.fasta() attempting to write a file with no sequences.
get_targetp() and get_signalp() gain additional arguments progress and attempts. progress is a logical value determining whether to show the progress bar, (default TRUE). attempts is an integer value determining the number of repeated attempts if server unresponsive (default 2). These functions now return finished queues if server becomes unresponsive.
Completely rewrote the code for scan_ag() which is now simplified and easier to read. The function is now about 20% slower.
fixed a bug in scan_ag() when argument exclude_ext = "all" which resulted in detection of unwanted amino acids in certain sequence arrangements. The bug occurred only if AG glycomoduls were detected, it did not introduce any. If you used this function with exclude_ext = "all" it is advisable to rerun these analyses again.
@@ -310,17 +371,17 @@
ragp 0.1.0.0004
-
+
-New Features
+New Features
New get_espritz() queries ESpritz web server for predictions on protein disordered regions.
plot_prot() gains an additional argument disorder, logical indicating should the predicted disordered regions be plotted. Defaults to FALSE.
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
fixed a bug in function plot_prot() introduced in 0.1.0.0003 which prevented GPIs to be plotted when gpi = "bigpi".
@@ -329,16 +390,16 @@
ragp 0.1.0.0003
-
+
-New Features
+New Features
New get_pred_gpi() queries PredGPI web server for predictions on GPI presence and omega site location.
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
maab() argument get_gpi has been changed and now accepts strings as input: get_gpi = c("bigpi", "predgpi", "none") which indicate whether to query Big Pi or PredGPI server or not to resolve class ambiguities.
plot_prot() argument gpi has been changed and now accepts strings as input gpi = c("bigpi", "predgpi", "none") which indicate whether to query Big Pi or PredGPI server or not to plot GPI positions.
@@ -350,9 +411,9 @@
ragp 0.1.0.0002
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
Removed rvest dependency
@@ -361,16 +422,16 @@
ragp 0.1.0.0001
-
+
-New Features
+New Features
Added vignette.
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
When using plot_prot() with argument dom_sort = "ievalue" the domains with the lowest independent e-value will now be correctly plotted on top.
The ratio of x and y axes in plot_prot() output is now calculated based on sequence length and should provide more consistent diagrams.
@@ -381,21 +442,21 @@
ragp 0.1.0.0000
-
+
-New Features
+New Features
New plot_prot() returns a ggplot2 diagram of protein structure based on hmmscan domain annotation and several types of predictions.
New get_signalp(), get_targetp() and get_phobius() replace deprecated get_signalp_file(), get_targetp_file() and get_phobius_file(). These functions accept R objects as well as FASTA files as input (#5).
-
New plot_signalp() takes one single letter protein sequence and returns a detailed SignalP prediction along with a plot.
+
New plot_signalp() takes one single letter protein sequence and returns a detailed SignalP prediction along with a plot.
New scan_nglc() detects motifs for N-glycosylation on asparagine residues.
maab() gain an additional logical argument get_gpi (default FALSE); if TRUEget_big_pi will be called on all sequences belonging to one of the HRGP classes thus resolving class ambiguities that depend on GPI knowledge.
Added a NEWS.md file to track changes to the package.
-
+
-Bug Fixes and Improvements
+Bug Fixes and Improvements
Significantly improved the speed of get_big_pi(). Removed the verbose argument. Added a progress bar (#3).
get_hmm() gains new arguments: timeout - time in seconds to wait for the server response (default is 10s). attempts - number of attempts if server unresponsive (default is 2 times) (#3).
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
Conserved Domain Database is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.
+
+
+
get_cdd(data, ...)
+
+# S3 method for character
+get_cdd(
+ data,
+ output =c("short", "standard", "full"),
+ splitter =100L,
+ progress =FALSE,
+ evalue =NULL,
+ bitscore =NULL,
+ ...
+)
+
+# S3 method for data.frame
+get_cdd(data, sequence, id, ...)
+
+# S3 method for list
+get_cdd(data, ...)
+
+# S3 method for default
+get_cdd(data =NULL, sequence, id, ...)
+
+# S3 method for AAStringSet
+get_cdd(data, ...)
+
+
Arguments
+
+
+
+
data
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
+
+
+
...
+
currently no additional arguments are accepted apart the ones documented bellow.
+
+
+
output
+
One of "short", "standard" or "full". Default is "short". "Short" returns the highest scoring hit, for each region of the query sequence; "standard" returns the best-scoring hit from each source database, for each region of the query sequence; "full" returns the complete set of hits.
+
+
+
splitter
+
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Default is 100. Change only in case of a server side error. Accepted values are in range of 1 to 4000.
+
+
+
progress
+
Boolean, whether to show the progress bar, at default set to FALSE.
+
+
+
evalue
+
Numeric, all sequences with E-value lower or equal to this value will be retained in the function output. Used to filter out low similarity matches. If set some queried sequences might be discarded from the output. Suggested values: 1e-2 - 1e-5.
+
+
+
bitscore
+
Numeric, all sequences with bitscore greater or equal to this value will be retained in the function output. Used to filter out low similarity. If set some queried sequences might be discarded from the output. Suggested values: 10 - 20.
+
+
+
sequence
+
A vector of strings representing protein amino acid sequences, or the appropriate column name if a data.frame is supplied to data argument. If .fasta file path, or list with elements of class "SeqFastaAA" provided to data, this should be left blank.
+
+
+
id
+
A vector of strings representing protein identifiers, or the appropriate column name if a data.frame is supplied to data argument. If .fasta file path, or list with elements of class "SeqFastaAA" provided to data, this should be left blank.
Character, CD-Search results can include hit types that represent various confidence levels (specific hits, non-specific hits) and domain model scope (superfamilies, multi-domains).
+
pssm.id
Character, unique identifier for a domain models position-specific scoring matrix (PSSM).
+
align_start
Integer, start of the alignement in the query protein sequence.
+
align_end
Integer, end of the alignement in the query protein sequence.
+
evalue
Numeric, the expect value, or E-value, indicates the statistical significance of the hit as the likelihood the hit was found by chance.
+
bitscore
Numeric, this values is derived from the raw alignment score in which the statistical properties of the scoring system used have been taken into account.
+
acc
Character, the accession number of the hit, which can either be a domain model or a superfamily cluster.
+
desc
Character, the short name of a conserved domain, which concisely defines the domain.
+
incomplete
Character, if the hit to a conserved domain is partial (i.e., if the alignment found by RPS-BLAST omitted more than 20 percent of the CDs extent at either the n- or c-terminus or both), this column will be populated with one of the following values: N - incomplete at the N-terminus, C - incomplete at the C-terminus, NC - incomplete at both the N-terminus and C-terminus. If the hit to a conserved domain is complete, then this column will be populated with a dash (-).
+
superfamily
Character, the short name of a conserved domain, which concisely defines the domain.
+
definition
Character, this column is populated only for domain models that are specific or non-specific hits, and it lists the accession number of the superfamily to which the domain model belongs.
+
+
+
+
Note
+
+
This function creates temporary files in the working directory.
+
References
+
+
Lu S, Wang J, Chitsaz F, Derbyshire MK, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Marchler GH, Song JS, Thanki N, Yamashita RA, Yang M, Zhang D, Zheng C, Lanczycki CJ, Marchler-Bauer A (2020) CDD/SPARCLE: the conserved domain database in 2020, Nucleic Acids Res. 48(D1):D265-D268. doi: https://doi.org/10.1093/nar/gkz991
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
...
@@ -171,7 +174,7 @@
Arg
splitter
-
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Defaults to 250. Change only in case of a server side error. Accepted values are in range of 1 to 5000.
+
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Defaults to 2500. Change only in case of a server side error. Accepted values are in range of 1 to 5000.
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
...
@@ -189,12 +192,13 @@
SourceValue
A data frame with columns:
-
Name
Character, name of the submitted sequence.
+
id
Character, name of the submitted sequence.
tm
Integer, the number of predicted transmembrane segments.
sp
Character, Y/0 indicator if a signal peptide was predicted or not.
prediction
Character string, predicted topology of the protein.
cut_site
Integer, first amino acid after removal of the signal peptide
is.phobius
Logical, did Phobius predict the presence of a signal peptide
+
sp.length
Integer, length of the predicted signal peptide.
@@ -219,7 +223,7 @@
Examp
sequence,
Transcript.id)phobius_pred
-
#> Name tm sp prediction cut_site is.phobius
+
#> id tm sp prediction cut_site is.phobius
#> 1 ATCG00660.1 0 Y n17-24c29/30o 30 TRUE
#> 2 AT2G43600.1 0 Y n6-17c22/23o 23 TRUE
#> 3 AT2G28410.1 0 Y n6-17c22/23o 23 TRUE
@@ -239,7 +243,28 @@
Examp
#> 17 AT2G37390.1 0 Y n6-16c24/25o 25 TRUE
#> 18 AT2G37390.2 0 Y n6-16c24/25o 25 TRUE
#> 19 AT2G35250.1 0 Y n5-16c21/22o 22 TRUE
-#> 20 AT2G38390.1 0 Y n7-22c29/30o 30 TRUE
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences.Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
...
@@ -213,11 +215,7 @@
Arg
splitter
-
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Defaults to 500. Change only in case of a server side error. Accepted values are in range of 1 to 2000.
-
-
-
sleep
-
A numeric indicating the pause in seconds between POST and GET server calls, at default set to 1s. Decreasing is not recommended.
+
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Default is 1000. Change only in case of a server side error. Accepted values are in range of 1 to 2000.
attempts
@@ -256,6 +254,7 @@
Value
Dmaxcut
Numeric, as from input, Dcut_noTM if SignalP-noTM network used and Dcut_TM if SignalP-TM network used
Networks.used
Character, which network was used for the prediction: SignalP-noTM or SignalP-TM
is.signalp
Logical, did SignalP predict the presence of a signal peptide
+
sp.length
Integer, length of the predicted signal peptide.
@@ -267,7 +266,7 @@
R
Petersen TN. Brunak S. Heijne G. Nielsen H. (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods 8: 785-786
The SignalP 5.0 server predicts the presence of signal peptides and the location of their cleavage sites in proteins from Archaea, Gram-positive Bacteria, Gram-negative Bacteria and Eukarya.
+
+
+
get_signalp5(data, ...)
+
+# S3 method for character
+get_signalp5(
+ data,
+ org_type =c("euk", "gram-", "gram+", "archea"),
+ splitter =2500L,
+ attempts =2,
+ progress =FALSE,
+ ...
+)
+
+# S3 method for data.frame
+get_signalp5(data, sequence, id, ...)
+
+# S3 method for list
+get_signalp5(data, ...)
+
+# S3 method for default
+get_signalp5(data =NULL, sequence, id, ...)
+
+# S3 method for AAStringSet
+get_signalp5(data, ...)
+
+
Arguments
+
+
+
+
data
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
+
+
+
...
+
currently no additional arguments are accepted apart the ones documented bellow.
+
+
+
org_type
+
One of "euk", "gram-", "gram+" or "archea". Default is "euk". Are the protein sequences from Eukarya, Gram-negative Bacteria, Gram-positive Bacteria or Archaea.
+
+
+
splitter
+
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Default is 2500. Change only in case of a server side error. Accepted values are in range of 1 to 5000.
+
+
+
attempts
+
Integer, number of attempts if server unresponsive, at default set to 2.
+
+
+
progress
+
Boolean, whether to show messages of the job id for each batch. Default is FALSE.
+
+
+
sequence
+
A vector of strings representing protein amino acid sequences, or the appropriate column name if a data.frame is supplied to data argument. If .fasta file path, or list with elements of class "SeqFastaAA" provided to data, this should be left blank.
+
+
+
id
+
A vector of strings representing protein identifiers, or the appropriate column name if a data.frame is supplied to data argument. If .fasta file path, or list with elements of class "SeqFastaAA" provided to data, this should be left blank.
Integer, The type of signal peptide predicted: Sec/SPI, Tat/SPI, Sec/SPII or Other if no signal peptide predicted
+
SP.Sec.SPI
Numeric, marginal probability that the protein contains a Sec N-terminal signal peptide (Sec/SPI).
+
Other
Numeric, the probability that the sequence does not have any kind of signal peptid.
+
CS_pos
Character, cleavage site position.
+
Pr
Numeric, probability of the predicted cleavage site position.
+
cleave.site
Numeric,llocal amino acid sequence arround the predicted cleavage site.
+
is.signalp
Logical, did SignalP5 predict the presence of a signal peptide.
+
sp.length
Integer, length of the predicted signal peptide.
+
+
+
+
if org_type is one of "gram-", "gram+" or "archea" the returned data frame will have two additional columns between `SP.Sec.SPI` and `Other`:
+
+
TAT.Tat.SPI
Numeric, marginal probability that the protein contains a Tat N-terminal signal peptide (Tat/SPI).
+
LIPO.Sec.SPII
Numeric, marginal probability that the protein contains a Lipoprotein N-terminal signal peptide (Sec/SPII).
+
+
+
+
Note
+
+
This function creates temporary files in the working directory. If something goes wrong during communication with the server and progress was set to TRUE, predictions can be obtained using `file.path("http://www.cbs.dtu.dk/services/SignalP-5.0/tmp", jobid, "output_protein_type.txt")` eg `read.delim(file.path(...), header = TRUE, skip = 1)`.
+
References
+
+
Almagro Armenteros JJ, Tsirigos KD, Sønderby CK, Petersen TN, Winther O, Brunak S, von Heijne G, Nielsen H. (2019) SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature Biotechnology, 37:420-423, doi:10.1038/s41587-019-0036-z
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
...
@@ -208,11 +210,7 @@
Arg
splitter
-
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Defaults to 200. Change only in case of a server side error. Accepted values are in range of 1 to 2000.
-
-
-
sleep
-
A numeric indicating the pause in seconds between server calls, at default set to 1
+
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Defaults to 1000. Change only in case of a server side error. Accepted values are in range of 1 to 2000.
attempts
@@ -259,7 +257,7 @@
R
Emanuelsson O, Nielsen H, Brunak S,von Heijne G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol.300: 1005-1016
TMHMM server offers prediction of transmembrane helices in proteins
+
+
+
get_tmhmm(data, ...)
+
+# S3 method for character
+get_tmhmm(data, splitter =2500L, attempts =2, progress =FALSE, ...)
+
+# S3 method for data.frame
+get_tmhmm(data, sequence, id, ...)
+
+# S3 method for list
+get_tmhmm(data, ...)
+
+# S3 method for default
+get_tmhmm(data =NULL, sequence, id, ...)
+
+# S3 method for AAStringSet
+get_tmhmm(data, ...)
+
+
Arguments
+
+
+
+
data
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
+
+
+
...
+
currently no additional arguments are accepted apart the ones documented bellow.
+
+
+
splitter
+
An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Default is 2500. Change only in case of a server side error. Accepted values are in range of 1 to 10000.
+
+
+
attempts
+
Integer, number of attempts if server unresponsive, at default set to 2.
+
+
+
progress
+
Boolean, whether to show messages of the job id for each batch. Default is FALSE
+
+
+
sequence
+
A vector of strings representing protein amino acid sequences, or the appropriate column name if a data.frame is supplied to data argument. If .fasta file path, or list with elements of class "SeqFastaAA" provided to data, this should be left blank.
+
+
+
id
+
A vector of strings representing protein identifiers, or the appropriate column name if a data.frame is supplied to data argument. If .fasta file path, or list with elements of class "SeqFastaAA" provided to data, this should be left blank.
Numeric, the expected number of amino acids in transmembrane helices.
+
First60
Numeric, the expected number of amino acids in transmembrane helices in the first 60 amino acids of the protein.
+
tm
Integer, the number of predicted transmembrane segments.
+
prediction
Character string, predicted topology of the protein.
+
+
+
Note
+
+
This function creates temporary files in the working directory. If something goes wrong during communication with the server and progress was set to TRUE, predictions can be obtained using the web address `paste("https://services.healthtech.dtu.dk/cgi-bin/webface2.cgi?jobid=", jobid, "&wait=20", sep = "")`.
+
References
+
+
Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305(3):567-80.
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
A string indicating if get_big_pi (gpi = "bigpi"), get_pred_gpi (gpi = "predgpi") or get_netGPI (gpi = "netgpi") should be called when predicting omega sites. To turn off omega site prediction use gpi = "none". At default set to "bigpi". Alternatively the output data frame of the mentioned functions (called with simplify = TRUE) can be supplied.
+
A string indicating if get_big_pi (gpi = "bigpi"), get_pred_gpi (gpi = "predgpi") or get_netGPI (gpi = "netgpi") should be called when predicting omega sites. To turn off omega site prediction use gpi = "none". At default set to "netgpi". Alternatively the output data frame of the mentioned functions (called with simplify = TRUE) can be supplied.
nsp
-
Boolean, should the N-terminal signal peptide predictions obtained using get_signalp be plotted. Alternatively the output data frame from get_signalp can be supplied.
+
A string indicating if get_signalp5 (nsp = "signalp5") or get_signalp (nsp = "signalp") should be used to obtain N-sp predictions. Alternatively a data frame containing three columns: a character column "id" indicating the protein id as from input, a logical column "is.signalp" and an integer column "sp.length". See get_signalp5 or get_signalp for details.
ag
@@ -216,11 +216,11 @@
Arg
tm
-
Boolean, should the transmembrane region predictions obtained using get_phobius be plotted. Alternatively the output data frame from get_phobius can be supplied.
+
A string indicating if get_phobius (tm = "phobius") or get_tmhmm (tm = "tmhmm") should be used to obtain transmembrane region predictions. Alternatively a data frame with two columns: a character column "id" indicating the protein id as from input and a "prediction" column containing the topology of the transmembrane regions (example "42o81-101i108-126o"). To turn off tm prediction use tm = "none".
domain
-
Boolean, should the domain predictions obtained using get_hmm be plotted. Alternatively the output data frame from get_hmm can be supplied.
+
A string indicating if get_cdd (domain = "cdd") or get_hmm (domain = "hmm") should be used to obtain domain annotation. Alternatively a data frame with five columns: a character column "id" indicating the protein id as from input, a character column "acc" indicating the accession of the domain hit, a character column "desc" indicating the description of the domain hit, a numeric column "align_start" indicating the start of the domain hit, a numeric column "align_end" indicating the end the domain hit.
Examp
id =Transcript.id,
sequence =sequence)
-hmm<-get_hmm(data =at_nsp[ind,],
+hmm<-get_hmm(data =at_nsp[ind,], #default is to use get_cdd()
id =Transcript.id,
- sequence =sequence)
+ sequence =sequence)gpi<-get_netGPI(data =at_nsp[ind,],
id =Transcript.id,
@@ -297,8 +296,7 @@
Plots the SignalP prediction for one protein sequence using base graphics. SignalP 4.1 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.
String representing a protein amino acid sequence.
-
-
-
id
-
String representing a protein identifier.
-
-
-
org_type
-
One of c("euk", "gram-", "gram+"), defaults to "euk". Which model should be used for prediction.
-
-
-
Dcut_type
-
One of c("default", "sensitive", "user"), defaults to "default". The default cutoff values for SignalP 4 are chosen to optimize the performance measured as Matthews Correlation Coefficient (MCC). This results in a lower sensitivity (true positive rate) than SignalP 3.0 had. Setting this argument to "sensitive" will yield the same sensitivity as SignalP 3.0. This will make the false positive rate slightly higher, but still better than that of SignalP 3.0.
-
-
-
Dcut_noTM
-
A numeric value, with range 0 - 1, defaults to 0.45. For experimenting with cutoff values.
-
-
-
Dcut_TM
-
A numeric value, with range 0 - 1, defaults to 0.5. For experimenting with cutoff values.
-
-
-
method
-
One of c("best", "notm"), defaults to "best". Signalp 4.1 contains two types of neural networks. SignalP-TM has been trained with sequences containing transmembrane segments in the data set, while SignalP-noTM has been trained without those sequences. Per default, SignalP 4.1 uses SignalP-TM as a preprocessor to determine whether to use SignalP-TM or SignalP-noTM in the final prediction (if 4 or more positions are predicted to be in a transmembrane state, SignalP-TM is used, otherwise SignalP-noTM). An exception is Gram-positive bacteria, where SignalP-TM is used always. If you are confident that there are no transmembrane segments in your data, you can get a slightly better performance by choosing "Input sequences do not include TM regions", which will tell SignalP 4.1 to use SignalP-noTM always.
-
-
-
c.score.col
-
Plotting color of the C-score line. At default set to: '#ff0000'.
-
-
-
s.score.col
-
Plotting color of the S-score line. At default set to: '#728fcc'.
-
-
-
y.score.col
-
Plotting color of the Y-score line. At default set to: '#728fcc'.
-
-
-
t.col
-
Plotting color of the threshold line. At default set to: '#551a8b'.
-
-
-
main
-
Title of the plot.
-
-
-
sleep
-
A numeric indicating the pause in seconds between POST and GET server calls, at default set to 5s. Decreasing is not recommended.
Predict hydroxyproline positions in plant proteins based on primary structur
tprob =ifelse(version=="V1", 0.3, 0.224),
split =1,
...
-)
+)
+
+# S3 method for AAStringSet
+predict_hyp(data, ...)
Arguments
data
-
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from read.fasta call. Should be left blank if vectors are provided to sequence and id arguments.
+
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class SeqFastaAA resulting from read.fasta call. Alternatively an AAStringSet object. Should be left blank if vectors are provided to sequence and id arguments.
diff --git a/man/figures/README-SignalP3-1.png b/man/figures/README-SignalP3-1.png
deleted file mode 100644
index fb48f83..0000000
Binary files a/man/figures/README-SignalP3-1.png and /dev/null differ
diff --git a/man/figures/README-plot_prot-1.png b/man/figures/README-plot_prot-1.png
deleted file mode 100644
index 9ba168f..0000000
Binary files a/man/figures/README-plot_prot-1.png and /dev/null differ
diff --git a/man/figures/README-plot_prot-2.svg b/man/figures/README-plot_prot-2.svg
index 8ee5b47..a6ec3ae 100644
--- a/man/figures/README-plot_prot-2.svg
+++ b/man/figures/README-plot_prot-2.svg
@@ -1,702 +1,987 @@
-
-
-
-image/svg+xmlQ9FLL2
-Q9LS14
-Q9S7I8
-Q9M2Z2
-Q9FIN5
-200
-400
-600
-residue
-id
-domain
-Anaphase−promoting complex subunit 4 WD40 domain (PF12894.6)
-Eukaryotic translation initiation factor eIF2A (PF08662.10)
-Leucine Rich Repeat (PF00560.32)
-Leucine rich repeat (PF13855.5)
-Leucine rich repeat N−terminal domain (PF08263.11)
-Plant invertase/pectin methylesterase inhibitor (PF04043.14)
-Protein kinase domain (PF00069.24)
-Protein of unknown function (DUF3675) (PF12428.7)
-Protein tyrosine kinase (PF07714.16)
-RING−variant domain (PF12906.6)
-WD domain, G−beta repeat (PF00400.31)
-feature
-N−sp
-TM
-
\ No newline at end of file
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/man/get_signalp.Rd b/man/get_signalp.Rd
index bed8dd5..530b6df 100644
--- a/man/get_signalp.Rd
+++ b/man/get_signalp.Rd
@@ -23,8 +23,7 @@ get_signalp(data, ...)
method = c("best", "notm"),
minlen = NULL,
trunc = 70L,
- splitter = 500L,
- sleep = 3,
+ splitter = 1000L,
attempts = 2,
progress = FALSE,
...
@@ -57,9 +56,7 @@ get_signalp(data, ...)
\item{trunc}{An integer value corresponding to the N-terminal truncation of input sequence, at default set to 70. By default, the predictor truncates each sequence to max. 70 residues before submitting it to the neural networks. If you want to predict extremely long signal peptides, you can try a higher value, or disable truncation completely by entering 0 (zero).}
-\item{splitter}{An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Defaults to 500. Change only in case of a server side error. Accepted values are in range of 1 to 2000.}
-
-\item{sleep}{A numeric indicating the pause in seconds between POST and GET server calls, at default set to 1s. Decreasing is not recommended.}
+\item{splitter}{An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Default is 1000. Change only in case of a server side error. Accepted values are in range of 1 to 2000.}
\item{attempts}{Integer, number of attempts if server unresponsive, at default set to 2.}
@@ -106,5 +103,5 @@ signalp_pred
Petersen TN. Brunak S. Heijne G. Nielsen H. (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods 8: 785-786
}
\seealso{
-\code{\link[ragp]{get_phobius}} \code{\link[ragp]{get_targetp}}
+\code{\link[ragp]{get_signalp5}} \code{\link[ragp]{get_phobius}} \code{\link[ragp]{get_targetp}}
}
diff --git a/man/get_signalp5.Rd b/man/get_signalp5.Rd
index ea8fc49..321777c 100644
--- a/man/get_signalp5.Rd
+++ b/man/get_signalp5.Rd
@@ -9,7 +9,7 @@
\alias{get_signalp5.AAStringSet}
\title{Query SignalP 5.0 web server.}
\source{
-\url{http://www.cbs.dtu.dk/services/SignalP/}
+\url{https://services.healthtech.dtu.dk/service.php?SignalP-5.0}
}
\usage{
get_signalp5(data, ...)
diff --git a/man/get_targetp.Rd b/man/get_targetp.Rd
index 9abfd55..1de3540 100644
--- a/man/get_targetp.Rd
+++ b/man/get_targetp.Rd
@@ -22,8 +22,7 @@ get_targetp(data, ...)
pcut = NULL,
scut = NULL,
ocut = NULL,
- splitter = 200,
- sleep = 3,
+ splitter = 1000,
attempts = 2,
progress = FALSE,
...
@@ -54,9 +53,7 @@ get_targetp(data, ...)
\item{ocut}{A numeric value, with range 0 - 1, defaults to 0 (cutoff = "winner_takes_all"). User specified cutoff for "other" (not with mTP, cTP, SP).}
-\item{splitter}{An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Defaults to 200. Change only in case of a server side error. Accepted values are in range of 1 to 2000.}
-
-\item{sleep}{A numeric indicating the pause in seconds between server calls, at default set to 1}
+\item{splitter}{An integer indicating the number of sequences to be in each .fasta file that is to be sent to the server. Defaults to 1000. Change only in case of a server side error. Accepted values are in range of 1 to 2000.}
\item{attempts}{Integer, number of attempts if server unresponsive, at default set to 2.}
@@ -100,5 +97,5 @@ targetp_pred
Emanuelsson O, Nielsen H, Brunak S,von Heijne G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol.300: 1005-1016
}
\seealso{
-\code{\link[ragp]{get_signalp}}
+\code{\link[ragp]{get_signalp}} \code{\link[ragp]{get_signalp5}}
}
diff --git a/man/plot_prot.Rd b/man/plot_prot.Rd
index 6ca80e8..392c999 100644
--- a/man/plot_prot.Rd
+++ b/man/plot_prot.Rd
@@ -79,8 +79,7 @@ library(ragp)
library(ggplot2)
ind <- c(23, 5, 80, 81, 345)
pred <- plot_prot(sequence = at_nsp$sequence[ind],
- id = at_nsp$Transcript.id[ind],
- bitscore = 30) #passed to get_hmm
+ id = at_nsp$Transcript.id[ind])
pred +
theme(legend.position = "bottom",
legend.direction = "vertical")
@@ -90,9 +89,9 @@ nsp <- get_signalp(data = at_nsp[ind,],
id = Transcript.id,
sequence = sequence)
-hmm <- get_hmm(data = at_nsp[ind,],
+hmm <- get_hmm(data = at_nsp[ind,], #default is to use get_cdd()
id = Transcript.id,
- sequence = sequence)
+ sequence = sequence)
gpi <- get_netGPI(data = at_nsp[ind,],
id = Transcript.id,
@@ -112,8 +111,7 @@ pred2 <- plot_prot(sequence = at_nsp$sequence[ind],
nsp = nsp,
gpi = gpi,
domain = hmm,
- disorder = disorder,
- bitscore = 30)
+ disorder = disorder)
pred2 +
@@ -124,5 +122,5 @@ pred2 +
}
\seealso{
-\code{\link[ragp]{get_signalp5}} \code{\link[ragp]{get_signalp}} \code{\link[ragp]{get_phobius}} \code{\link[ragp]{get_tmhmm}} \code{\link[ragp]{get_hmm}} \code{\link[ragp]{get_espritz}} \code{\link[ragp]{predict_hyp}} \code{\link[ragp]{scan_ag}}
+\code{\link[ragp]{get_signalp5}} \code{\link[ragp]{get_signalp}} \code{\link[ragp]{get_phobius}} \code{\link[ragp]{get_tmhmm}} \code{\link[ragp]{get_hmm}} \code{\link[ragp]{get_cdd}} \code{\link[ragp]{get_espritz}} \code{\link[ragp]{predict_hyp}} \code{\link[ragp]{scan_ag}}
}
diff --git a/man/plot_signalp.Rd b/man/plot_signalp.Rd
deleted file mode 100644
index 8aa4a13..0000000
--- a/man/plot_signalp.Rd
+++ /dev/null
@@ -1,74 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/plot_signalp.R
-\name{plot_signalp}
-\alias{plot_signalp}
-\title{Plotting SignalP prediction.}
-\source{
-\url{http://www.cbs.dtu.dk/services/SignalP-4.1/}
-}
-\usage{
-plot_signalp(
- sequence,
- id,
- org_type = c("euk", "gram-", "gram+"),
- Dcut_type = c("default", "sensitive", "user"),
- Dcut_noTM = 0.45,
- Dcut_TM = 0.5,
- method = c("best", "notm"),
- c.score.col = "#ff0000",
- s.score.col = "#59a454",
- y.score.col = "#728fcc",
- t.col = "#551a8b",
- main = NULL,
- sleep = 5L
-)
-}
-\arguments{
-\item{sequence}{String representing a protein amino acid sequence.}
-
-\item{id}{String representing a protein identifier.}
-
-\item{org_type}{One of c("euk", "gram-", "gram+"), defaults to "euk". Which model should be used for prediction.}
-
-\item{Dcut_type}{One of c("default", "sensitive", "user"), defaults to "default". The default cutoff values for SignalP 4 are chosen to optimize the performance measured as Matthews Correlation Coefficient (MCC). This results in a lower sensitivity (true positive rate) than SignalP 3.0 had. Setting this argument to "sensitive" will yield the same sensitivity as SignalP 3.0. This will make the false positive rate slightly higher, but still better than that of SignalP 3.0.}
-
-\item{Dcut_noTM}{A numeric value, with range 0 - 1, defaults to 0.45. For experimenting with cutoff values.}
-
-\item{Dcut_TM}{A numeric value, with range 0 - 1, defaults to 0.5. For experimenting with cutoff values.}
-
-\item{method}{One of c("best", "notm"), defaults to "best". Signalp 4.1 contains two types of neural networks. SignalP-TM has been trained with sequences containing transmembrane segments in the data set, while SignalP-noTM has been trained without those sequences. Per default, SignalP 4.1 uses SignalP-TM as a preprocessor to determine whether to use SignalP-TM or SignalP-noTM in the final prediction (if 4 or more positions are predicted to be in a transmembrane state, SignalP-TM is used, otherwise SignalP-noTM). An exception is Gram-positive bacteria, where SignalP-TM is used always. If you are confident that there are no transmembrane segments in your data, you can get a slightly better performance by choosing "Input sequences do not include TM regions", which will tell SignalP 4.1 to use SignalP-noTM always.}
-
-\item{c.score.col}{Plotting color of the C-score line. At default set to: '#ff0000'.}
-
-\item{s.score.col}{Plotting color of the S-score line. At default set to: '#728fcc'.}
-
-\item{y.score.col}{Plotting color of the Y-score line. At default set to: '#728fcc'.}
-
-\item{t.col}{Plotting color of the threshold line. At default set to: '#551a8b'.}
-
-\item{main}{Title of the plot.}
-
-\item{sleep}{A numeric indicating the pause in seconds between POST and GET server calls, at default set to 5s. Decreasing is not recommended.}
-}
-\value{
-A list with two elements:
-\describe{
- \item{prediction}{Data frame with the prediction results.}
- \item{plot}{Data frame with values used for plotting.}
-}
-}
-\description{
-Plots the SignalP prediction for one protein sequence using base graphics. SignalP 4.1 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.
-}
-\examples{
-library(ragp)
-pred <- plot_signalp(sequence = at_nsp$sequence[5],
- id = at_nsp$Transcript.id[5])
-
-}
-\references{
-Petersen TN. Brunak S. Heijne G. Nielsen H. (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature Methods 8: 785-786
-}
-\seealso{
-\code{\link[ragp]{get_signalp}}
-}
diff --git a/vignettes/bibtex.bib b/vignettes/bibtex.bib
index e927460..f7e58d1 100644
--- a/vignettes/bibtex.bib
+++ b/vignettes/bibtex.bib
@@ -344,4 +344,30 @@ @article{hopp_prediction_1981
pmcid = {PMC319665},
keywords = {Amino Acid Sequence, Animals, Epitopes, Myoglobin, Oligopeptides, Protein Conformation, Proteins, Structure-Activity Relationship},
pages = {3824--3828}
+}
+
+@article{armenteros_signalp5_2019,
+ title = {SignalP 5.0 improves signal peptide predictions using deep neural networks},
+ volume = {37},
+ url = {https://doi.org/10.1038/s41587-019-0036-z},
+ doi = {0.1038/s41587-019-0036-z},
+ abstract = {Signal peptides (SPs) are short amino acid sequences in the amino terminus of many newly synthesized proteins that target proteins into, or across, membranes. Bioinformatic tools can predict SPs from amino acid sequences, but most cannot distinguish between various types of signal peptides. We present a deep neural network-based approach that improves SP prediction across all domains of life and distinguishes between three types of prokaryotic SPs},
+ journal = {Nature Biotechnology},
+ author = {Almagro Armenteros, José Juan and Tsirigos, Konstantinos D. and Sønderby, Casper Kaae and Petersen, Thomas Nordahl and Winther, Ole and Brunak, Søren and von Heijne, Gunnar and Nielsen, Henrik},
+ year = {2019},
+ pages = {420--423}
+}
+
+@article{gislason_netgpi_2021,
+ title = {Prediction of GPI-anchored proteins with pointer neural networks},
+ journal = {Current Research in Biotechnology},
+ volume = {3},
+ pages = {6--13},
+ year = {2021},
+ issn = {2590-2628},
+ doi = {https://doi.org/10.1016/j.crbiot.2021.01.001},
+ url = {https://www.sciencedirect.com/science/article/pii/S2590262821000010},
+ author = {Magnús Halldór Gíslason and Henrik Nielsen and José Juan {Almagro Armenteros} and Alexander Rosenberg Johansen},
+ keywords = {Glycosylphosphatidylinositol, Lipid anchored proteins, Post-translational modification, Protein sorting, Prediction, Neural networks},
+ abstract = {GPI-anchors constitute a very important post-translational modification, linking many proteins to the outer face of the plasma membrane in eukaryotic cells. Since experimental validation of GPI-anchoring signals is slow and costly, computational approaches for predicting them from amino acid sequences are needed. However, the most recent GPI predictor is more than a decade old and considerable progress has been made in machine learning since then. We present a new dataset and a novel method, NetGPI, for GPI signal prediction. NetGPI is based on recurrent neural networks, incorporating an attention mechanism that simultaneously detects GPI-anchoring signals and points out the location of their ω-sites. The performance of NetGPI is superior to existing methods with regards to discrimination between GPI-anchored proteins and other secretory proteins and approximate (±1 position) placement of the ω-site. NetGPI is available at: https://services.healthtech.dtu.dk/service.php?NetGPI. The code repository is available at: https://github.com/mhgislason/netgpi-1.1.}
}
\ No newline at end of file
diff --git a/vignettes/ragp-vignette.Rmd b/vignettes/ragp-vignette.Rmd
index 19eb1ed..161ce24 100644
--- a/vignettes/ragp-vignette.Rmd
+++ b/vignettes/ragp-vignette.Rmd
@@ -1,12 +1,12 @@
---
-title: "ragp vignette"
+title: "Introduction to ragp"
author:
- Milan Dragićević
output: rmarkdown::html_vignette
abstract: "This vignette provides an introduction into the functions and workflow in `ragp` R package. `ragp` is designed for mining and analysis of plant Hydroxyproline rich glycoproteins but it can also be used for protein annotation and for obtaining predictions for several protein features based on sequence (secretory signals, transmembrane regions, domains and glycosylphosphatidylinositol attachment sites). Additionally `ragp` provides tools for visualization of the mentioned attributes."
bibliography: bibtex.bib
vignette: >
- %\VignetteIndexEntry{Vignette Title}
+ %\VignetteIndexEntry{Introduction to ragp}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
@@ -29,13 +29,14 @@ knitr::knit_theme$set("edit-matlab")
options(width = 120)
```
-##Introduction
+What is ragp?
+=============
Hydroxyproline rich glycoproteins (HRGPs) are one of the most complex families of macromolecules found in plants, due to the diversity of glycans decorating the protein backbone, as well as the heterogeneity of the protein backbones. While this diversity is responsible for the wide array of physiological functions associated with HRGPs, it hinders attempts for homology based identification. Current approaches, based on identifying sequences with characteristic motifs and biased amino acid composition, are limited to prototypical sequences.
-`ragp` is an R package for mining and analyses of HRGPs, with emphasis on arabinogalactan protein sequences (AGPs). The `ragp` filtering pipeline exploits one of the HRGPs key features, the presence of hydroxyprolines which represent glycosylation sites. Main package features include prediction of proline hydroxylation sites, amino acid motif and bias analyses, efficient communication with web servers for prediction of N-terminal signal peptides, glycosylphosphatidylinositol modification sites and protein disorder, as well as the ability to annotate sequences through hmmscan and subsequent GO enrichment, based on predicted pfam domains.
+`ragp` is an R package for mining and analyses of HRGPs, with emphasis on arabinogalactan protein sequences (AGPs). The `ragp` filtering pipeline exploits one of the HRGPs key features, the presence of hydroxyprolines which represent glycosylation sites. Main package features include prediction of proline hydroxylation sites, amino acid motif and bias analyses, efficient communication with web servers for prediction of N-terminal signal peptides and glycosylphosphatidylinositol modification sites, as well as the ability to annotate sequences through [CDD](https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi) or [hmmscan](https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan) and subsequent [GO](http://www.geneontology.org/) enrichment, based on predicted [Pfam](https://pfam.xfam.org/) domains.
-The workflow in `ragp` is illustrated with the following diagram (`ragp` functions to be used for each of the tasks are boxed grey):
+The workflow in `ragp` is illustrated with the following diagram (`ragp` functions to be used for each of the tasks are boxed grey):
```{r echo = FALSE, out.width = "60%", fig.align = "center"}
knitr::include_graphics("ragp_flow_chart.svg")
@@ -43,541 +44,155 @@ knitr::include_graphics("ragp_flow_chart.svg")
The filtering layer:
-* predict the presence of secretory signals (N-sp) and filter sequences containing them. Since several prediction algorithms are utilized (functions `get_signalp`, `get_targetp` and `get_phobius`) this decision is reached by a majority vote.
-* predict proline hydroxylation (function `predict_hyp`) and filter sequences containing at least three potential hydroxyprolines.
+* predict the presence of secretory signals (N-sp) and filter sequences containing them. Several prediction algorithms are available. Currently recommended is to use `get_signalp5()` which queries [SignalP5](https://services.healthtech.dtu.dk/service.php?SignalP-5.0) [@armenteros_signalp5_2019].
+* predict proline hydroxylation (`predict_hyp()`) and filter sequences containing several (for example three or more) potential hydroxyprolines.
The analysis layer:
-* find localized clusters of characteristic arabinogalactan motifs (AG glycomodules) to identify potential AGPs (function `scan_ag`) among the hydroxyproline containing sequences.
-* perform motif and amino acid bias (MAAB, [@johnson_pipeline_2017]) classification scheme (function `maab`) to classify hydroxyproline containing sequences into 23 distinct HRGP groups.
-* annotate domains using [Pfam](https://pfam.xfam.org/) (function `get_hmm`) and enrich with gene ontology (GO) terms (function `pfam2go`).
-* predict the presence of potential glycosylphosphatidylinositol attachment sites (functions `get_bit_pi` and `get_pred_gpi`).
-* predict disordered regions in proteins (function `get_espritz`).
+* find localized clusters of characteristic arabinogalactan motifs (AG glycomodules) to identify potential AGPs (`scan_ag()`) among the hydroxyproline containing sequences.
+* perform motif and amino acid bias (MAAB, [@johnson_pipeline_2017]) classification scheme (`maab()`) to classify hydroxyproline containing sequences.
+* annotate domains using [Pfam](https://pfam.xfam.org/) (`get_hmm()`) and enrich with gene ontology (GO) terms (`pfam2go()`). Alternatively annotate domains using [CDD](https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi).
+* predict the presence of potential glycosylphosphatidylinositol attachment sites (`get_big_pi()`, `get_pred_gpi()` and `get_netGPI()`).
+* predict disordered regions in proteins (`get_espritz()`).
+* predict transmembrane regions in proteins (`get_phobius()` and `get_tmhmm()`)
-Additionally `ragp` provides tools for visualization of the mentioned attributes via the function `plot_prot`.
+Additionally `ragp` provides tools for visualization of the mentioned attributes via `plot_prot()`.
-##Data import
-To proceed with this guide the following `R` packages are required:
+Installation
+------------
-```{r libraries}
-library(dplyr)
-library(seqinr)
-library(eulerr)
-library(ragp)
-library(ggplot2)
+There are several ways to install R packages hosted on git-hub, however the simplest is to use `devtools::install_github()` which will perform all the required steps automatically.
+
+To install `ragp` run:
+
+```{r gh-installation1, warning = FALSE, message = FALSE, eval = FALSE}
+#install.packages("devtools") #if it is not installed on your system
+devtools::install_github("missuse/ragp")
```
-The examples will be performed on a set of 2706 *Arabidopsis* protein sequences included in `ragp` package as the `at_nsp` data frame which contains four columns:
-```{r data}
-library(ragp)
-data(at_nsp)
-summary(at_nsp)
+alternatively run:
+
+```{r gh-installation2, warning=FALSE, message=FALSE, eval = FALSE}
+# install.packages("devtools")
+devtools::install_github("missuse/ragp",
+ build_vignettes = TRUE)
```
-`ragp` functions require single letter protein sequences and the corresponding identifiers as input. These can be provided in the form of basic `R` data types such as vectors or data frames. Additionally `ragp` imports the [`seqinr`](https://cran.r-project.org/web/packages/seqinr/index.html) package for the manipulation of `.FASTA` files, so the input objects can be a list of `SeqFastaAA` objects returned by the [`seqinr`](https://cran.r-project.org/web/packages/seqinr/index.html) function `read.fasta`. The location of a `.FASTA` file is also possible as a type of input.
+to build vignettes which can be viewed by:
+
+```{r gh-installation3, warning=FALSE, message=FALSE, eval = FALSE}
+browseVignettes("ragp")
+```
+
+
+Data import
+===========
+
+Inputs
+------
-Input options will be illustrated using `scan_ag` function:
+Most `ragp` functions require single letter protein sequences and the corresponding identifiers as input. These can be provided in the form of basic `R` data types such as vectors or data frames. Additionally `ragp` imports the [`seqinr`](https://cran.r-project.org/web/packages/seqinr/index.html) package for the manipulation of `.FASTA` files, so the input objects can be a list of [`SeqFastaAA`](https://search.r-project.org/CRAN/refmans/seqinr/html/SeqFastaAA.html) objects returned by the `seqinr::read.fasta()`. The location of a `.FASTA` file is also possible as a type of input. As of ragp version 0.3.5 objects of class [`AAStringSet`](https://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/Biostrings/html/XStringSet-class.html) are also supported.
+Input options will be illustrated using `scan_ag()` function:
+
+* provide a `character` vector of protein sequences to the `sequence` argument and a `character` vector of protein identifiers to the `id` argument:
```{r input 1}
-data(at_nsp) #a data frame of 2700 Arabidopsis protein sequences
+library(ragp)
+data(at_nsp) #a data frame of 2700 Arabidopsis sequences
input1 <- scan_ag(sequence = at_nsp$sequence,
id = at_nsp$Transcript.id)
```
-equivalent to:
-```{r input 2, eval = FALSE}
+
+* provide a `data.frame` to `data` argument, and names of columns containing the protein sequences and corresponding identifiers to `sequence` and `id` arguments:
+
+```{r input 2}
input2 <- scan_ag(data = at_nsp,
sequence = "sequence",
id = "Transcript.id")
```
-same as without quotation:
-```{r input 3, eval = FALSE}
+
+quoting column names is not necessary:
+
+```{r input 3}
input3 <- scan_ag(data = at_nsp,
sequence = sequence,
id = Transcript.id)
```
-A list of `SeqFastaAA` objects is also a viable input:
-```{r input 4, eval = FALSE}
+
+* provide a list of `SeqFastaAA` objects to `data` argument:
+
+```{r input 4}
+library(seqinr) #to create a fasta file with protein sequences
+
#write a FASTA file
seqinr::write.fasta(sequence = strsplit(at_nsp$sequence, ""),
name = at_nsp$Transcript.id, file = "at_nsp.fasta")
-#read a FASTA file
+#read a FASTA file to a list of SeqFastaAA objects
At_seq_fas <- read.fasta("at_nsp.fasta",
seqtype = "AA",
as.string = TRUE)
input4 <- scan_ag(data = At_seq_fas)
```
-and lastly the location of a `.FASTA` file to be analyzed as string:
-```{r input 5, eval = FALSE}
-input5 <- scan_ag(data = "at_nsp.fasta")
-```
-```{r hidden, echo = FALSE}
-vigfiles <- readRDS("vig_files.rds")
-id_nsp <- vigfiles$id_nsp
-at_hmm <- vigfiles$at_hmm
-nsp_signalp <- vigfiles$nsp_signalp
-at_gpi <- vigfiles$at_gpi
-at_gpi_pred <- vigfiles$at_gpi_pred
-at_nsp2 <- dplyr::filter(at_nsp, Transcript.id %in% id_nsp)
+* provide the location of a `.FASTA` file to be analyzed as string:
-```
-
-#Filtering HRGPs
-
-The filtering layer in `ragp` consists of two sequential tasks:
-
-* predict the presence of secretory signals (N-sp) and filter sequences containing them.
-* predict proline hydroxylation and filter sequences containing at least several potential hydroxyprolines.
-
-The following section explains how to perform these tasks.
-
-##Predicting N-terminal secretory signal sequences (N-sp)
-
-HRGPs are secreted proteins, therefore they are expected to contain a secretory signal sequence on the N-terminus (N-sp). `ragp` incorporates N-sp prediction by querying [SignalP](http://www.cbs.dtu.dk/services/SignalP/), [TargetP](http://www.cbs.dtu.dk/services/TargetP/) [@emanuelsson_predicting_2000] and [Phobius](http://phobius.sbc.su.se/) [@kall_advantages_2007] in an efficient manner via the functions: `get_signalp`, `get_targetp` and `get_phobius`.
-
-To query [SignalP](http://www.cbs.dtu.dk/services/SignalP/) predictions:
-
-```{r SignalP1, eval = FALSE}
-nsp_signalp <- get_signalp(at_nsp,
- sequence,
- Transcript.id)
-```
-
-The predictions for the 2700 sequences contained in `at_nsp` data frame should be available in around 1 minute. When handling large FASTA files with many sequences it is recommended to split them into smaller files each containing 10000 - 20000 sequences, using the `ragp` function `split_fasta` and to obtain predictions in a loop. The returned object `nsp_signalp` is a data frame resembling the web servers output:
-
-```{r SignalP2}
-head(nsp_signalp[,-c(1,10)]) #omitting columns "id" and "Networks.used"
+```{r input 5}
+input5 <- scan_ag(data = "at_nsp.fasta") #file at_nsp.fasta is in the working directory
```
-similarly [TargetP](http://www.cbs.dtu.dk/services/TargetP/) and [Phobius](http://phobius.sbc.su.se/) can be queried by:
-
-```{r TargetP1, eval = FALSE}
-nsp_targetp <- get_targetp(at_nsp,
- sequence,
- Transcript.id)
+* provide an [`AAStringSet`](https://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/Biostrings/html/XStringSet-class.html) object:
-nsp_phobius <- get_phobius(at_nsp,
- sequence,
- Transcript.id)
+```{r input 6}
+dat <- Biostrings::readAAStringSet("at_nsp.fasta") #file at_nsp.fasta is in the working directory
+input6 <- scan_ag(data = dat)
```
-The concordance of the predictions will be checked with Euler diagrams:
+All of the outputs are equal:
-```{r euler, eval = FALSE}
-set.seed(5)
-bind_cols(nsp_signalp, nsp_targetp, nsp_phobius) %>%
- select(is.phobius, is.targetp, is.signalp) %>%
- euler(shape = "ellipse", input = "disjoint") %>%
- plot(quantities = T)
-```
+```{r equal}
+all.equal(input1,
+ input2)
-```{r echo = FALSE, out.width = "40%", fig.align = "center"}
-knitr::include_graphics("eulerr.svg")
-```
+all.equal(input1,
+ input3)
-The three servers do not agree completely therefore we recommend using the majority vote to predict if the protein is secreted:
+all.equal(input1,
+ input4)
-```{r euler2, eval = FALSE}
-bind_cols(nsp_signalp, nsp_targetp, nsp_phobius) %>%
- select(is.phobius, is.targetp, is.signalp, id) %>%
- mutate(vote = rowSums(.[,1:3])) %>%
- filter(vote >= 2) %>%
- pull(id) -> id_nsp
+all.equal(input1,
+ input5)
-at_nsp2 <- filter(at_nsp, Transcript.id %in% id_nsp)
+all.equal(input1,
+ input6)
```
+The only exceptions to this design are the plotting function `plot_prot()` which requires protein sequences to be supplied in the form of string vectors (input1 in the above example) and `pfam2go()` which does not take sequences as input.
-This will create `at_nsp2` data frame with `r nrow(at_nsp2)` sequences.
+Further reading
+===============
-##Predicting proline hydroxylation
+All `ragp` functions return basic R data structures such as data frames, lists of vectors and lists of data frames, making them convenient for manipulation to anyone familiar with R. An especially effective way to manipulate these objects is by utilizing the [`tidyverse`](https://www.tidyverse.org/) collection of packages, especially [`dplyr`](https://dplyr.tidyverse.org/) and [`ggplot2`](https://ggplot2.tidyverse.org/). Several [`dplyr`](https://dplyr.tidyverse.org/) functions that will be especially handy for data wrangling are:
-The key feature of HRGPs is the presence of hydroxyprolines (Hyp, O) which represent glycosylation sites [@showalter_bioinformatics_2010]. While many HRGPs can be found based on biased amino acid composition, or the presence of certain amino acid motifs, there exists an abundance of chimeric proteins comprised from specific domains and HRGP motifs which are much harder to identify based on the mentioned features. In an attempt to identify these types of sequences `ragp` incorporates a model specifically built to predict the probability of proline hydroxylation in plant proteins. This model is incorporated in the `predict_hyp` function and can be utilized to filter potential HRGPs.
+* `dplyr::left_join()`
+* `dplyr::mutate()`
+* `dplyr::group_by()`
+* `dplyr::summarise()`
+* `dplyr::filter()`
+* `dplyr::distinct()`
-```{r predict 1}
-at_hyp <- predict_hyp(data = at_nsp2,
- sequence = sequence,
- id = Transcript.id)
-```
-
-The object returned by `predict_hyp` is a list with two elements `prediction` and `sequence`. The `prediction` element is a data frame containing the probability of hydroxylation for each P in each sequence along with the amino acid span used for prediction. First few rows of this element:
-```{r filter hyp show}
-head(at_hyp$prediction)
-```
-
-These predictions are based on a probability threshold of 0.224 which offers the best trade-off between specificity and sensitivity. To increase specificity at the cost of sensitivity the threshold can be increased using the `tprob` argument:
-
-```{r predict 2, eval = FALSE}
-at_hyp2 <- predict_hyp(data = at_nsp2,
- sequence = sequence,
- id = Transcript.id,
- tprob = 0.6)
-```
-The `at_hyp` object can be used to filter the sequences that contain more than a certain number of hydroxyprolines. The default threshold will be used. To filter sequences with three or more O:
+Examples on usage of these functions on objects returned by `ragp` functions are provided in [HRGP filtering](https://missuse.github.io/ragp/articles/pkgdown/filter) and [HRGP analysis](https://missuse.github.io/ragp/articles/pkgdown/analyse) tutorials. Additionally there are extensive examples on the internet on usage of the mentioned functions.
-```{r filter hyp three or more}
-at_hyp$prediction %>%
- group_by(id) %>%
- summarise(n = sum(HYP == "Yes")) %>%
- filter(n >= 3) %>%
- pull(id) -> at_3hyp
+Obtaining pretty visualizations is usually the goal of the above mentioned data manipulations. The golden standard of R graphics at present is the [ggplot2](https://ggplot2.tidyverse.org/) package and we recommend it to graphically summarize the data. Additionally `ragp` contains `plot_prot()` function which is a wrapper for `ggplot2`, and while `plot_prot()` can be used without knowing `ggplot2` syntax, to tweak the plot style at least a basic knowledge of `ggplot2` is required. Examples are provided in [protein sequence visualization](https://missuse.github.io/ragp/articles/pkgdown/PSV) tutorial.
-at_nsp2 %>%
- filter(Transcript.id %in% at_3hyp) -> at_nsp_3hyp
-```
-
-There are `r nrow(at_nsp_3hyp)` sequences that satisfy this condition.
-
-#Analyzing HRGPs
-
-The analysis layer in `ragp` consists of:
-
-* Arabinogalactan identification by finding localized clusters of characteristic motifs (AG glycomodules)
-* Hydroxyproline rich glycoprotein classification by the motif and amino acid bias scheme (MAAB, [@johnson_pipeline_2017]).
-* Domain annotation using [Pfam](https://pfam.xfam.org/) data base and enrichment with gene ontology (GO) terms.
-* Glycosylphosphatidylinositol attachment sites prediction.
-* Identification of disordered regions in proteins.
-
-##Arabinogalactan identification
-
-To check which of the filtered sequences are potential arabinogalactans the function `scan_ag` can be used. This function scans the protein sequences for AG glycomodule clusters, by constructing regular expressions based on user input.
-
-By default `scan_ag` searches for at least three dipeptides PA, PS, PT, PG, PV, AP, SP, TP, GP and VP which are separated by maximally 10 amino acids. The minimal number of dipeptides to be considered can be changed by the argument `dim` while the maximum number of separating amino acids can be tweaked using the argument `div`. To consider only PA, PS, PT, AP, SP and TP dipeptides the `type` argument can be changed to `"conservative"`. In cases when the dipeptides are ambiguous, such as SPT, PTP etc. all three amino acids will be counted as one dipeptide, while cases such as PTPA are considered as two dipeptides separated by zero amino acids. When the `sequence` object from `predict_hyp` function is passed to `scan_ag` only prolines predicted to be hydroxylated are considered when searching for dipeptides:
-
-```{r scan_ag}
-scaned_at <- scan_ag(data = at_nsp_3hyp,
- sequence,
- Transcript.id)
-```
-
-The output object `scaned_At` can be a list, or a data frame depending on the `simplify`/`tidy` arguments. The default output is a data frame one row per sequences with five columns. Here are the 55 N-terminal amino acids from the [Classical arabinogalactan protein 9](http://www.uniprot.org/uniprot/Q9C5S0) (AT2G14890.1) as an example:
-
-```{r scan_ag2}
-scaned_at %>%
- mutate(sequence = strtrim(sequence, 55)) %>%
- select(-3) %>% #omitting 3rd column
- filter(id == "AT2G14890.1")
-```
-
-The column "sequence" contains the input sequence where only the amino acids in potential AG glycomodules are in uppercase while all other are in lower case. The column "AG_aa" contains the amino acid count found in potential AG glycomodules, while columns "total_length" and "longest" correspond to the matched sequence spans.
-
-True AG glycomodules contain hydroxyproline instead of proline in the dipeptides: OA, OS, OT, AO, SO and TO (and in some cases OG, OV, GO and VO) but the above output considers P instead of O since most of the time the positions of O are unknown. However if the sequence argument to `scan_ag` contains O's at some positions, `scan_ag` will consider only them. To do this the `sequence` element of `predict_hyp` function output can be used:
-
-```{r scan_ag3}
-scaned_at <- scan_ag(data = at_hyp$sequence,
- sequence,
- id)
-```
-
-The 55 N-terminal amino acids from the [Classical arabinogalactan protein 9](http://www.uniprot.org/uniprot/Q9C5S0):
-
-```{r scan_ag4}
-scaned_at %>%
- mutate(sequence = strtrim(sequence, 55)) %>%
- select(-3) %>%
- filter(id == "AT2G14890.1")
-```
-
-Extensin motifs in the form of SPPP+/SOOO+ are also detected by `scan_ag` if in correct context, to avoid this add `exclude_ext = "yes"` as an argument:
-
-```{r scan_ag5}
-scaned_at <- scan_ag(data = at_hyp$sequence,
- sequence,
- id,
- exclude_ext = "yes")
-
-scaned_at %>%
- mutate(sequence = strtrim(sequence, 55)) %>%
- select(-3) %>%
- filter(id == "AT2G14890.1")
-```
-
-to switch of all PPP+/OOO+ from being detected:
-
-```{r scan_ag6}
-scaned_at <- scan_ag(data = at_hyp$sequence,
- sequence,
- id,
- exclude_ext = "all")
-
-scaned_at %>%
- mutate(sequence = strtrim(sequence, 55)) %>%
- select(-3) %>%
- filter(id == "AT2G14890.1")
-```
-
-To disregard VP, PV, GP, PG and the corresponding hydroxyproline dipeptides:
-
-```{r scan_ag7}
-scaned_at <- scan_ag(data = at_hyp$sequence,
- sequence,
- id,
- type = "conservative")
-
-scaned_at %>%
- mutate(sequence = strtrim(sequence, 55)) %>%
- select(-3) %>%
- filter(id == "AT2G14890.1")
-```
-
-```{r scan_ag8, echo = FALSE}
-scaned_at <- scan_ag(data = at_hyp$sequence,
- sequence,
- id)
-
-```
-
-With default settings `scan_ag` identifies `r sum(scaned_at$total_length != 0)` sequences. To compare it with the number of sequences that would be filtered if the proline hydroxylation prediction was omitted the function can be run on `at_nsp2` data:
-
-```{r scan_ag9}
-scaned_at_nsp2 <- scan_ag(data = at_nsp2,
- sequence,
- Transcript.id)
-
-sum(scaned_at_nsp2$total_length != 0)
-```
-
-Thus the prediction of hydroxyprolines greatly reduced the pool of potential arabinogalactans.
-`scan_ag` is capable of identifying potential chimeric and short AGPs without pronounced amino acid bias and with few characteristic motifs. In combination with `predict_hyp` the specificity of the found sequences is greatly increased since the majority of sequences unlikely to contain hydroxyprolines are excluded.
-
-##Motif and amino acid bias classification - MAAB
-
-The MAAB pipeline is explained in detail in @johnson_pipeline_2017. In short, after removal of sequences without a predicted N-sp and removal of sequences with predicted domains the sequences are first evaluated for amino acid bias. Sequences with > 45% of PAST (AGP bias) or PSKY (EXT bias) or PVKY (proline-rich proteins (PRP) bias) are grouped further into 24 distinct groups (23 of which represent HRGPs) based on amino acid bias, presence of GPIs and the following motifs (as regular expression):
-
-* EXT: the SPn motif `SP{3,5}` and the tyr motifs `[FY].Y`, `KHY`, `VY[HKDE]`, `V.Y` and `YY`.
-* PRP: `PPV.[KT]`, `PPV[QK]` and `KKPCPP`.
-* AGP: `[AVTG]P{1,3}` and `[ASVTG]P{1,2}`.
-
-The motifs are counted in a specific order ext > tyr > prp > agp, and overlapping motifs are not counted twice. Based on the final motif count and the overall coverage of the motifs HRGPs are clustered in the mentioned 24 groups. This classification can be performed using the `ragp` function `maab`:
-
-```{r maab 1}
-maab_at <- maab(at_nsp_3hyp,
- sequence,
- Transcript.id)
-```
-
-Of the `r nrow(at_nsp_3hyp)` sequences in `at_nsp_3hyp`, `r sum(maab_at$maab_class != "0")` are predicted to belong to any of the HRGP classes (1 - 24). The output of the `maab` function is a data frame containing columns: `id` - sequence identifier; `ext_sp`, `ext_tyr`, `prp` and `agp` - corresponding motif count; `past_percent`, `pvyk_percent`, `psky_percent`, `p_percent` - percentage of the corresponding amino acids in sequence; `coverage` - the motif coverage; `maab_class` - the MAAB class. The first several hits:
-
-```{r maab 2}
-maab_at %>%
- filter(maab_class != "0") %>%
- select(-ends_with("percent")) %>% #omitting columns ending with "percent"
- head
-```
-
-Since the function was not provided knowledge on the GPI presence, both possible classes are indicated as in: "1/4" or "2/9". GPI presence can be indicated with the argument `gpi` which accepts a Boolean vector of the same length as the sequences, like the `is.bigpi`/`is.gpi` columns from the outputs of `get_big_pi`/`get_pred_gpi` functions. Another way to remove ambiguities in classes if GPI presence is unknown is to set `get_gpi` argument to `"bigpi"` or `"predgpi"`:
-
-```{r maab 4, results ="hide", eval = TRUE}
-maab_at <- maab(at_nsp_3hyp,
- sequence,
- Transcript.id,
- get_gpi = "bigpi")
-```
-```{r maab 5}
-maab_at %>%
- filter(maab_class != "0") %>%
- select(-ends_with("percent")) %>%
- head
-```
-
-This will run `get_big_pi` or `get_pred_gpi` only on the sequences that belong to one of the HRGP classes thus resolving class ambiguities that depend on GPI knowledge.
-
-##Domain annotation and GPI prediction
-
-Domain annotation is a fundamental research tool in protein sequence analyses. In many occasions domain annotation has already been performed prior to HRGP mining, either by using [hmmer](http://hmmer.org/) [@eddy_accelerated_2011], [InterProScan](https://www.ebi.ac.uk/interpro/interproscan.html) [@jones_interproscan_2014] or some other resource. In such circumstances the annotation can be imported into `R` and joined with the filtered sequences. When this is not the case `ragp` offers the function `get_hmm` which queries the [hmmscan](https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan) web server [@finn_hmmer_2011]. Domain annotation is not necessary for AGP prediction although it enhances interpretation, but it is a fundamental part of the MAAB classification scheme. As with other `ragp` functions the first argument is the `data`, followed by the names of the columns containing the protein sequences and corresponding identifiers:
-
-```{r hmmscan 1, eval = FALSE}
-at_hmm <- get_hmm(at_nsp_3hyp,
- sequence,
- Transcript.id)
-```
-
-The resulting data frame resembles the web server's output, or for those familiar with the [`hmmer`](http://hmmer.org/) software the domain hits table produced by using the argument `--domtblout`. First few rows with several columns omitted:
-
-```{r hmmscan 2}
-head(at_hmm[,-c(4, 5, 8, 9, 11:13)])
-```
-
-HRGPs, and especially AGPs are often linked to the membrane by a glycosylphosphatidylinositol (GPI) anchor [@ellis_arabinogalactan-proteins:_2010]. Attachment of the GPI moiety to the carboxyl terminus (omega-site) of the polypeptide occurs after proteolytic cleavage of a C-terminal propeptide. In order to predict the presence of the omega sites in the filtered sequences two functions are available: `get_big_pi` which queries the [big-PI Plant Predictor](http://mendel.imp.ac.at/gpi/plant_server.html) [@eisenhaber_glycosylphosphatidylinositol_2003] server and `get_pred_gpi` which queries the [PredGPI](http://gpcr.biocomp.unibo.it/predgpi/) [@pierleoni_predgpi_2008] server.
-
-To query [big-PI Plant Predictor](http://mendel.imp.ac.at/gpi/plant_server.html):
-
-```{r big_pi, eval = FALSE}
-at_gpi <- get_big_pi(at_nsp_3hyp,
- sequence,
- Transcript.id)
-```
-
-This function provides detailed per sequence output when the argument `simplify` is set to `FALSE`, by default `simplify` is set to `TRUE` and the resulting data frame has one row per sequence:
-
-```{r big_pi2}
-head(at_gpi)
-```
-
-similarly to query [PredGPI](http://gpcr.biocomp.unibo.it/predgpi/):
-
-```{r pred_gpi, eval = FALSE}
-at_gpi_pred <- get_pred_gpi(at_nsp_3hyp,
- sequence,
- Transcript.id)
-```
-
-```{r pred_gpi2}
-head(at_gpi_pred)
-```
-
-##Disorder prediction
-
-A key structural feature of HRGPs, based on abundance of disorder-promoting residues, is that they are intrinsically disordered proteins. To identify potential disordered regions in proteins `ragp` contains `get_espritz` function which queries [ESpritz](http://protein.bio.unipd.it/espritz/) [@walsh_espritz_2012] web server. Three models trained on different data sets are available and can be selected via the argument `model`:
-
-* 'X-Ray' - based on missing atoms from the [Protein Data Bank](http://www.wwpdb.org/) (PDB) X-ray solved structures. If this option is chosen then the predictors with short disorder options are executed.
-* 'Disprot' - contains longer disorder segments compared to x-ray. In particular, [disprot](http://www.disprot.org/) is a manually curated database which is often based on functional attributes of the disordered region was used for this definition. Disorder residues are defined if the disprot curators consider the residue to be disordered at least once. All other residues are considered ordered. If this option is chosen then the predictors with long disorder options are executed.
-* 'NMR' - based on NMR mobility. NMR flexibility is calculated using the [Mobi server](http://protein.bio.unipd.it/mobi/) [@martin_mobi_2010] optimized to replicate the ordered-disordered NMR definition used in CASP8.
-These models provide quite different predictions. To query ESpritz:
-
-```{r espritz, eval = FALSE}
-at_espritz <- get_espritz(at_nsp_3hyp,
- sequence,
- Transcript.id,
- model = "NMR")
-```
-
-The output of the function depends on the `simplify` argument. If `simplify = TRUE` (default) the output is a `data.frame` (one row per disordered region) with columns `start` and `end` which indicate the start and end of the predicted disordered region as well as the column `id` which is a protein identifier. When `simplify = FALSE` the output contains predicted probabilities of disorder for each amino acid.
-
-#Protein structure diagram
-
-All of the above mentioned protein features can be visualized using the function `plot_prot`. It accepts protein sequences in the form of strings, calls the above mentioned functions to obtain sequence features and returns a `ggplot` object:
-
-```{r plot_prot1, eval = FALSE}
-ind <- c(20, 23, 34, 52, 80, 127, 345) #some indexes which will be used to filter sequences
-p1 <- plot_prot(sequence = at_nsp$sequence[ind],
- id = at_nsp$Transcript.id[ind],
- dom_sort = "ievalue") #to plot domains by independent e-value (lower on top)
-
-p1
-```
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p1_evalue.svg")
-```
-
-The diagram contains the following elements:
-
-* Transmembrane regions (TM) are shown in yellow, extracellular regions are indicated by the dashed line above the sequences while intracellular regions are indicated by the dashed line bellow the sequence (as predicted by Phobius).
-* Signal peptides (as predicted by SignalP) are indicated by the thick red line on the N-terminal side.
-* GPI attachment sites (as predicted by Big Pi or PredGPI depending on the argument `gpi`) are indicated by blue rhomboids.
-* Hydroxyprolines (as predicted by `predict_hyp`) are indicated by vertical dark gray lines.
-* AG glycomodul spans are indicated by the light grey background.
-* Domains (as predicted by hmmscan) are indicated by rectangles with an appropriate fill as indicated in the legend.
-
-To change the order of domain plotting:
-
-```{r plot_prot2, eval = FALSE}
-p1 <- plot_prot(sequence = at_nsp$sequence[ind],
- id = at_nsp$Transcript.id[ind],
- dom_sort = "cba") #to plot domains by name, first on top, also check "abc"
-
-p1
-```
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p1.svg")
-```
-
-To change the axis ratio:
-```{r plot_prot3, eval = FALSE}
-p1 +
- coord_equal(ratio = 50)
-```
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p1_coord.svg")
-```
-
-To change the domain colors use another fill palette:
-```{r plot_prot4, eval = FALSE}
-p1 +
- scale_fill_brewer(type = "div")
-```
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p1_fill.svg")
-```
-
-To change feature colors set them as arguments (`hyp_col`, `gpi_col`, `nsp_col`, `ag_col` and `tm_col`) to `plot_prot`:
-
-```{r plot_prot5, eval = FALSE}
-p3 <- plot_prot(sequence = at_nsp$sequence[ind],
- id = at_nsp$Transcript.id[ind],
- hyp_col = "brown",
- ag_col = "orange",
- dom_sort = "cba")
-p3
-```
-
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p3.svg")
-```
-
-or use another color palette:
-
-```{r plot_prot7, eval = FALSE}
-p1 +
- scale_fill_brewer(type = "div")+
- scale_color_brewer(type = "qual")
-```
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p1_fill_color.svg")
-```
-
-display disordered regions as predicted by ESpritz and choose predGPI to predict GPI positions:
-
-```{r plot_prot_disorder, eval = FALSE}
-p2 <- plot_prot(at_nsp$sequence[ind],
- at_nsp$Transcript.id[ind],
- gpi = "predgpi",
- disorder = TRUE)
-p2
-```
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p2_dis.svg")
-```
-
-Choose which features to plot when calling `plot_prot`:
-
-```{r plot_prot8, eval = FALSE}
-p3 <- plot_prot(sequence = at_nsp$sequence[ind],
- id = at_nsp$Transcript.id[ind],
- hyp = FALSE,
- ag = FALSE,
- dom_sort = "cba")
-p3
-```
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p2.svg")
-```
-
-Add annotations to the plot (y axis is continuous):
-
-```{r plot_prot9, eval = FALSE}
-p1 +
- annotate(geom = "text",
- y = 6,
- x = 100,
- label = "look at all these TMs") +
- annotate(geom = "point",
- y = 4,
- x = c(237, 240, 298),
- shape = 21,
- fill = "dodgerblue",
- size = 3)
-```
-```{r echo = FALSE, out.width = "100%", fig.align = "center"}
-knitr::include_graphics("p1_anno.svg")
-```
-##Acknowledgements
+Acknowledgements
+================
This software was developed with funding from the Ministry of Education, Science and Technological Development of the Republic of Serbia (Projects TR31019 and OI173024).
-##References
\ No newline at end of file
+References
+==========
\ No newline at end of file
diff --git a/vignettes/vig_files.rds b/vignettes/vig_files.rds
index c9b217b..6211554 100644
Binary files a/vignettes/vig_files.rds and b/vignettes/vig_files.rds differ