---
title: "Introduction to R and interactive programming"
author: "Marc, Sean, Bronson and Debbie"
date: "11/11/2015"
output: html_document
---
Welcome to the second course in our series, 'Computational Biology for Biologists'. This course introduces you to high-level interactive programming languages, and specifically to R. In this class we will introduce the language as a tool that you can use to explore data. The course after this one will attempt to teach you enough to get started writing your own scripts.
# A quick review
Before we begin, let's have a quick review of what we learned last week by doing something that will help us out this week. First log into sideswiper by using the ssh command:
```{bash, eval=FALSE}
ssh username@sideswiper
```
Now copy the following tarball down to your home directory with the copy command:
```{bash, eval=FALSE}
cp /tools/sampledata/course2Files.tar.gz .
```
Remember that in Linux `.` means 'current directory'. Since you just logged in, your current directory should be your home directory. You can check this by looking for the `~` in your prompt.
Now unpack that tarball so that the files are available for use during today's coursework:
```{bash, eval=FALSE}
tar -zxvf course2Files.tar.gz
```
This will create a folder called course2Files, but for today we want those files to be in our home directory, so let's move them out of there:
```{bash, eval=FALSE}
mv course2Files/* .
```
If this all seems like we could have made it 'easier' for you, please remember that we are trying to teach you how to do this stuff. ;)
# Introduction to R
### Opening RStudio
We will be using a program called __RStudio__ to interface with R. RStudio is currently hosted on the Sideswiper machine until a more permanent home can be found for it. You access RStudio from a web browser and log in using your SCH credentials. Behind the scenes, you will be using the R instance that is built onto Sideswiper.
***
#### <span style="color:blue">__Exercise R0:__</span>
* Make sure that you are either in the Childrens network already (not just the PDN) OR use a Gemalto key to connect to a machine in the Childrens network.
* Open a browser window. Firefox and Chrome work great. Internet Explorer 9 has some issues.
* In the address bar, type http://sideswiper:8787
* Enter your SCH userid and password
***
### Getting familiar with RStudio
First let's get you used to RStudio. RStudio is one of the most popular tools for working with R today. It normally has four main panels: 'Console', 'Source', 'Environment', and 'Files'. Let's take a minute and look at each of these:
#### Console panel:
This panel is your actual R console. This panel is basically what you would see if you were to run R at the command line without any of the RStudio bells and whistles.
#### Source panel:
The Source panel is where you can edit scripts, markdown documents and so on. Its primary function is to be a text editor. You can use any text editor that you like to write R, but this one is nice because it comes pre-configured with sensible defaults for source code highlighting, which makes your source code easier to read. Also convenient: if you highlight a command and then press CTRL-ENTER, it will send the highlighted text directly to the console to be executed. If this panel is not open yet, you can activate it by opening a file.
#### Environment panel:
This panel is where you will see important details about your R session. Any local variables will be shown here, and there is also a tab to see your command history. This can be convenient if you are debugging or need to look at what is in a local object.
#### Files panel:
The Files panel has many tabs. The Files tab lets you look at the files in the current working directory. The Plots tab is where RStudio will draw plots that you generate. The Packages tab lets you browse installed packages and their helpful vignettes, and the Help tab is where RStudio will render manual pages that you open. The Viewer tab has a similar role to the Plots tab, except that it's usually used for looking at interactive Shiny widgets when using RStudio.
### What is R?
R is a free and open source implementation of another language called 'S', which was the original 'statistical' language. Today it is more popular and widely used than ever for doing statistics and data modeling. If you want to learn a language to do data analysis or science, you should consider learning R. The R language is often called a "high level" language, which is another way of saying that it's more human friendly than a lot of "low level" languages. And R already has a large existing collection of friendly and powerful software packages that you can harness to do useful work.
Also important: R is written entirely as an open source piece of software. This is very good for scientific computing since it means that you don't have to blindly trust anyone else that R will work as advertised. If for some reason you find something in the software that you think is a bug, you can find and bother the people who authored it or even explore the source code for yourself. This is important because while all software has bugs, only open source software gives you easy access to discover and fix them. All R packages have a DESCRIPTION file that lists who wrote the package and who maintains it.
#### Object oriented programming
R, like many other languages, has the concept of "objects". Objects are like variables in algebra, i.e. they are merely symbolic variables that can be used to represent values or other bits of information. That information can be simple, such as a numerical value, a string of text, or a logical (computer talk for `TRUE` or `FALSE`). Objects can also represent much more complicated data structures such as lists, tables, and high dimensional matrices. We will discuss these in more detail later in the course. For now, let's stick with some fairly simple objects.
#### Sample objects
Some R objects are always available for demonstration. An example of this is a simple variable called `pi`:
```{r}
pi
```
This is handy. You can use the `pi` variable in any operation where you want to use the value of pi.
```{r}
pi * 2
```
What if you wanted to create your own variables?
#### Assignment operators
Before I explain assignment, let's start with: what is an operator? In R an operator is a way of succinctly indicating a specific kind of operation. For those who already know what a function is, an operator is just a special kind of function. For those who don't know what a function is: just hang on a few more minutes and we will be explaining those too. :) As an example of an operator, the `+` operator works like this:
```{r}
1 + 2
```
As you use R you will learn that there are many other operators that you can leverage, and for most of them the way you use them is very straightforward. Here we are going to learn about the assignment operators.
Now let's consider the case where we want to assign a simple variable. In R there are actually __two__ different symbols used as assignment operators. I will demonstrate each of them below:
```{r}
foo <- "yes"
bar = "no"
```
Now let's look at the value of `foo`:
```{r}
foo
```
Now you might be wondering: why would I have more than one way to assign values? What is the difference between `=` and `<-`? Hold that thought, because the section on calling functions should shed some light on this...
#### Vectors
Let's look at another built-in object.
```{r}
LETTERS
```
Here you can see that the object is storing more than one element, namely each letter of the alphabet. This type of structure is called a __vector__. A vector is basically a special type of list. It has the constraint that every element of the list be of the same "type" (e.g. numeric, string, logical, etc).
Compare the output from `LETTERS` and `pi`. One thing you will notice is that `pi` is a numeric object, whereas `LETTERS` is a character vector (as indicated by the quotes around each letter). You will also notice that in each case, there is a report (in brackets) at the start of each line to indicate which element number is at the start of each line. Why is this necessary for `pi` when it only has a single element?
#### This is because in R: Vectors are atomic
Therefore `pi` is also a vector. Vectors can theoretically hold any number of elements, including only one or even zero. So `pi` is a vector of length one, similar to a list with only one element. In R, vectors are 'atomic'. What does that mean? It means that in R there is no such thing as an object container that is only able to hold a single value. The smallest or simplest data container in R is still just a short vector.
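A quick way to convince yourself of this is to ask R how long a single value is. The sketch below uses only base R; `character(0)` is just one way to make an empty character vector for illustration.

```{r}
# A single number is still a vector of length one
length(pi)
# So it can be indexed like any other vector
pi[1]
# A vector can even have zero elements
length(character(0))
```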
#### Creating vectors
There are two common ways to create vectors. If you are creating a numeric vector, one simple way is to use the `:` operator. You specify the first number and the last number, and R will fill in the rest in increments of one.
```{r}
numbers <- 1:10
numbers
```
To create vectors quickly, you can also use the `seq` or `rep` functions.
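As a minimal sketch of those two functions (the object names `evens` and `repeated` are made up for this example):

```{r}
# seq() counts from one number to another in steps of 'by'
evens <- seq(from = 2, to = 10, by = 2)
evens
# rep() repeats a value (or a whole vector) a number of times
repeated <- rep('A', times = 3)
repeated
```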
The second way to create vectors is by using the __concatenate__ function, `c`. It technically just combines vectors rather than creating them. But given that single elements are just vectors of length 1, it amounts to the same thing. Use this to create character vectors and vectors of other types:
```{r}
string <- c('The', 'lazy', 'brown', 'fox')
string
```
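Because every element of a vector has to be of the same type, it is worth seeing what `c` does when it is handed a mixture. This is a small illustration (the name `mixed` is made up here): rather than failing, R converts everything to the most flexible type present.

```{r}
# Mixing types in c() does not fail; everything becomes a character string
mixed <- c(1, 'a', TRUE)
mixed
class(mixed)
```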
### Calling functions
Now let's talk about basic R functions. Languages like R use functions as a shortcut for doing specific sets of instructions over and over again. And R has no shortage of helpful functions right out of the box. But before you start using them there are a few things you might want to know. This section will attempt to set you up for success.
One of the simplest functions in R is the help function. The help function only requires one argument, and that is the topic you would like to look up. Let's call it now to look up the man page for another function called `grepl`:
```{r, eval=FALSE}
help('grepl')
```
If you look at this help page, you will notice that the `grepl` function can take several different arguments, but it always requires at least two. You know this because the first two arguments have no default values in the 'Usage' statement. Reading further in the manual page, you should notice that its role is to return a logical vector indicating which elements of the second argument matched the pattern given as the first argument. But in terms of passing these arguments to the function, R is pretty flexible about how it will interpret your attempts to call a function.
#### Arguments can be passed in "in order"
The quickest way to get R to do this is to just pass in the correct arguments in the correct order like this:
```{r}
grepl('A', LETTERS)
```
But this assumes that you remember the exact order that the arguments were designated in the original function, and that you pass them in in that exact same order. This method is faster in the sense that it requires less typing.
#### Alternatively, arguments can be passed in by name
Sometimes instead you may wish to be more explicit and actually refer to the function arguments by name. In that case, you can call the function like this:
```{r}
grepl(x=LETTERS, pattern='A')
```
Notice how I did not have to list the arguments in order this time? This is a good approach if you are unsure about the order, but it is also a good idea for writing code that is more robust over time. This method requires more typing, but in the event that someone changes the function you are calling, this method is more likely to keep working. So this method represents a more robust way to write scripts and functions.
Also please notice the use of the `=` sign to indicate argument names. This is the reason why a lot of people prefer to type `<-` for assignment operations instead of `=`. Because if you use `<-` instead of `=`, it means that you can do an assignment right in the middle of a function call (which is sometimes convenient).
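Here is a small sketch of what an assignment in the middle of a function call looks like. The variable name `y` is made up for the example; `mean` is a base R function whose first argument is called `x`.

```{r}
# `<-` inside a call creates `y` AND passes its value as the first argument
mean(y <- c(1, 2, 3))
y
# With `=` the same text would instead mean "the argument named y",
# which mean() does not have, so no variable `y` would be created.
```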
#### You can also use a mixture of named and un-named arguments.
But once you start using named arguments in your function call, you had better stick with it. This is because R can only guess the un-named arguments if they are still in the correct order left to right. R is pretty smart, but it can't actually read your mind.
```{r}
grepl('A', x=LETTERS)
```
#### The special argument: '`...`'
This is a special argument which indicates that a function can take multiple comma separated values (written in the function call _as if_ they were actually separate un-named arguments) _when in fact_ they will be treated by the function as if they were actually a single argument. The function will require that any arguments given after the use of a `...` argument be named since otherwise R will have no way to know when you meant to stop passing in values that apply to the `...`. To really understand how `...` works, you should do exercise R1.
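As a warm-up before the exercise, here is a minimal illustration using `max`, a base R function whose usage is `max(..., na.rm = FALSE)`.

```{r}
# Every un-named value is collected by `...`
max(3, 7, 5)
# na.rm comes after `...`, so it has to be given by name
max(3, NA, 5, na.rm = TRUE)
```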
***
#### <span style="color:blue">__Exercise R1:__</span>
##### <span style="color:blue">__Exercise R1 part 1:__</span>
First use the paste function to append a letter to a number. The output should look like this:
```{r, echo=FALSE}
paste("A", "1", sep="")
```
Be sure to use the `help` function in order to learn how to use the `paste` function. Be aware that in R, when you type a character value directly, you need to wrap it in quotes so that R treats it as a string rather than as the name of an object. You might also wonder what an argument called "..." means. That is a special argument that just means you can pass a lot of values in (instead of just one). In the case of paste it means that unless you use a named argument (the other arguments are all named), paste will assume that your arguments are all character vectors that you want pasted together. So to paste two characters (a letter and a number) together, you should include at least two character vectors (of length one each).
##### <span style="color:blue">__Exercise R1 part 2:__</span>
Now create a vector of names so that each name combines a month with the 1st day of that month. The final vector should look like this:
```{r, echo=FALSE}
paste(month.name, "1",sep='_')
```
To do this, you will want a character vector that has all the months. R already has such a vector; it's called `month.name`. Using that vector and the `paste` function, create the vector described above.
##### <span style="color:blue">__Exercise R1 part 3:__</span>
Now pass in two character vectors of unequal length. Notice how paste handles the fact that you have passed in two different vectors by reusing the elements of the shorter vector... This behavior is called "recycling".
***
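Recycling is not specific to `paste`; arithmetic on vectors of unequal length behaves the same way. A minimal sketch (the name `recycled` is made up for this example):

```{r}
# The shorter vector c(10, 20) is reused three times to match 1:6
recycled <- 1:6 + c(10, 20)
recycled
```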
### Using external libraries
Part of what makes R so powerful is the way that it leverages the code of an entire community of other software engineers. It can do this because R makes it very straightforward to make or use external package libraries. An external package library is just a collection of R functions and objects that do something useful. To use an external package, all you need to do is load it with the library command. Here is how you can load the stats4 package:
```{r}
library("stats4")
```
Now that you have loaded a package, let's talk about the search path. The search path is how R keeps track of everything that you have loaded into memory. You can look at it by just calling the `search` function.
```{r}
search()
```
This will print all the libraries/packages that your current R session has loaded as a single character vector. This is an important thing to pay attention to since every time you add a new package to the search path, it adds another set of objects, functions and symbols for R to look at every time you ask it to do something. So your code will run faster if you load fewer things onto that path.
To see the symbols in any element of the search path, you can use the 'list objects' function, `ls`. The `ls` function can either take an index value like this:
```{r}
ls(2)
```
Or you can just look at the exact element by passing in a character string explicitly like this:
```{r}
ls("package:stats4")
```
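Going the other way is also possible: given a symbol, you can ask which entry of the search path provides it. This sketch uses the base functions `search` and `find`, and assumes the stats4 package (which ships with R) is the one you loaded; `mle` is one of the functions it provides.

```{r}
library("stats4")
# Is the package on the search path?
"package:stats4" %in% search()
# find() reports which search path entries define a symbol
find("mle")
```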
### Atomic vectors and other fun data structures
Because R is a data centric statistical language it has a lot of fun ways to represent data.
#### Vectors and introducing the `class` and `length` functions
So we already talked about how vectors are the most basic data structure in R. What we didn't spend as much time on is that they come in different flavors or classes. You can see what class an object is by passing it to the class() function like this:
```{r}
class(LETTERS)
```
```{r}
class(pi)
```
The most common types of vectors are: character, numeric, logical, and integer. Another important function is length(). You can always use length to see how many elements are in a vector. For example:
```{r}
length(LETTERS)
```
There are 26 letters in the alphabet.
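To see each of the common flavors side by side, here is a short sketch using values typed directly at the console:

```{r}
class('a')     # character
class(1.5)     # numeric
class(TRUE)    # logical
class(1L)      # integer (the L suffix asks for an integer)
class(1:3)     # integer: the `:` operator produces integers
```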
#### Lists
Sometimes you want to put more than just one flavor of data into a vector-like object. For those times you need a much more flexible kind of container. That is what lists are for. Let's look at an example of a list:
```{r}
lst <- list(a=1, b='foo')
lst
```
And this allows us to put pretty much anything into our list. So in this very simple example, we have placed a character vector and a numeric vector. But we could also have placed a large range of different object types. And as with vectors, you have a length function you can call:
```{r}
length(lst)
```
***
#### <span style="color:blue">__Exercise R2:__</span>
Make the list `lst` from the example above and then make a second list `lst2` that contains both `lst` and the LETTERS vector. How long is this list?
***
#### data.frames, matrices, dim
Sometimes you need to represent 'square' data. We have all seen Excel sheets and tables before. For this kind of data, R has matrices and data.frames. A matrix is usually used when you have simple 'square' data like a grid of numbers. Let's have a quick look at an example of a matrix as constructed from a numeric vector:
```{r}
numVec <- c(1,2,3,11,12,13)
numVec
mat <- matrix(numVec, nrow = 2, ncol = 3, byrow = TRUE)
mat
```
Once you have a matrix you can see how big it is with the dimensions function: `dim`.
```{r}
dim(mat)
```
You can also get the length of a matrix, but note that the result of length makes a lot less sense for matrices since it just represents the length if you collapsed the whole thing into a single vector:
```{r}
length(mat)
```
You will likely see matrices a lot in R. But even more commonly you will see __data.frames__. data.frames are usually used when you have 'square' data whose columns hold different types of values. They are popular because this happens all the time. Let's look at an example so that you can see what I mean by the columns being different from each other:
```{r}
df <- data.frame(number = 1:4, letter = c('A','B','C','D'))
df
```
There are some obvious differences between a data.frame and a matrix. But there are some more subtle ones as well. For example, look at what happens if I look at the `dim` and the `length`:
```{r}
dim(df)
length(df)
```
The results for dim look like 'square' data, but the length is now the number of columns. In this sense (and in others) a data.frame object is actually more like the 'square' version of a `list`, while a matrix is more like the 'square' version of a `vector`.
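You can check that analogy directly with the base functions `is.list` and `is.numeric`. The sketch below rebuilds the same `df` and `mat` objects used above so that it can be run on its own.

```{r}
df <- data.frame(number = 1:4, letter = c('A','B','C','D'))
mat <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol = 3, byrow = TRUE)
# A data.frame is stored as a list of equal-length columns
is.list(df)
names(df)
# A matrix is a single vector with dimensions attached
is.list(mat)
is.numeric(mat)
```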
#### S4 objects
Sometimes your data is not simple and does not fit into a simple container. For these instances R has what is called the "S4 class system", which allows for the creation of custom data containers. In this course there will not be time to learn the mechanisms for defining these, but many such containers exist and it is often convenient to use them. Here is an example of one of these in action that I will load up using the `data` function:
```{r, eval=TRUE, echo=FALSE, message=FALSE}
## Secret chunk to silently load this package :)
library(Biobase)
```
```{r}
library(Biobase)
data(sample.ExpressionSet)
sample.ExpressionSet
```
This is a complex object that was originally created to hold microarray data but has since been adapted for many other uses. It has several main components that are probably best described by looking at the man page.
```{r, eval=FALSE}
?'ExpressionSet'
```
### Getting data into R
#### read.table
One of the most common tools for reading data into R is the read.table function. And actually, it's a family of commands. If you look at the man page you will see a whole set of related commands.
```{r, eval=FALSE}
?read.table
```
The `read.table` command and its friends are a good introduction to how you can load a popular kind of data format into R: tabular data.
```{r}
filename = 'refFlat.txt.gz'
foo = read.table(file=filename, nrows = 3)
foo
```
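If you want to experiment with `read.table` without a file on disk, its `text` argument accepts the data as a string. The following is only a sketch with made-up data (the name `small` is hypothetical); `sep = "\t"` tells R that the columns are separated by tabs.

```{r}
# `text` lets read.table parse a string instead of a file (handy for demos)
small <- read.table(text = "1\tA\n2\tB", sep = "\t")
small
dim(small)
```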
#### readLines
Sometimes you need to get data in and your file may not be as nicely formatted. `readLines` will let you get the data in by reading a file one line at a time. `readLines` is also worth showing because it introduces connection objects. Its first argument is a connection (if you pass a file path as a character string instead, R opens a connection to that file for you). Here is an example of how that looks with an explicit connection:
```{r}
con <- file(filename)
bar = readLines(con)
close(con)  # release the connection once the lines have been read
```
Having said all that: mostly you won't need to use readLines. But it's there if you need it, and it's _very_ fast.
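Here is a self-contained sketch of the same idea that does not depend on the course files. It writes a few made-up lines to a temporary file and reads back only the first two using the `n` argument; the names `tmp`, `tmp_con` and `first_two` are hypothetical.

```{r}
# Write three lines to a temporary file so the example is self-contained
tmp <- tempfile()
writeLines(c('first', 'second', 'third'), tmp)
# Open a connection, read only the first two lines, then close it
tmp_con <- file(tmp, open = 'r')
first_two <- readLines(tmp_con, n = 2)
close(tmp_con)
first_two
```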
#### Other options: Databases, URIs, REST, HDF5 and many more
R also has a ton of alternate tools for reading in all manner of other file formats as R objects. This topic could almost become its own course. Some popular options include databases, URIs, REST APIs, and HDF5 files (just to name a few). Many of these capabilities are not baked in to base R but are instead available via open source packages. Here is a very simplified example of how you can use the DBI interface, which is useful for extracting data from a database file.
```{r}
library(org.Hs.eg.db)
con <- org.Hs.eg_dbconn()
dbGetQuery(con, "SELECT * FROM gene_info limit 3")
```
We won't go into how those commands work in the context of this course, but I wanted you to see an example of how other packages can allow you to interface with wildly different data sources after typing just a couple lines of R code...
***
#### <span style="color:blue">__Exercise R3:__</span>
Use read.table to read in the file `genesFile.txt`. Be sure to capture the results of this into a local variable called `myTable`. What kind of object is `myTable`? How do you know? How can you read in the table so that the column headers are understood?
***
### Data object management
Once you have an object in R, there are a TON of fun ways that you can manipulate it.
#### Data coercion
The 1st type of manipulation is coercion (sometimes called casting). Coercion methods allow you to take data in one format and convert it into another. For example, a lot of data can be coerced to a list using the `as.list` function. Think of our data.frame object from before and imagine making each column into an element of a list-like object instead. You can do that like this:
```{r}
as.list(df)
```
Or: you could take that same data.frame and make it into a matrix like this:
```{r}
as.matrix(df)
```
Or you can take it all the way down to a simple character vector by first casting it to a matrix and then casting that matrix to a vector:
```{r}
as.vector(as.matrix(df))
```
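Coercion also works on plain vectors, converting between the basic flavors. A brief sketch using base R's `as.character`, `as.numeric` and `as.integer`:

```{r}
as.character(1:3)        # numbers become text
as.numeric('42') + 1     # text that looks like a number becomes a number
as.integer(TRUE)         # logicals become 1 (TRUE) or 0 (FALSE)
```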
#### single bracket subsetting operator `[`
Among the most powerful features of R are its subsetting operators, and the most common of these is the single bracket operator. For most R objects, if you apply the single bracket operator, you will get a piece of whatever was in the original object. Let's look at some simple examples to illustrate:
You can subset by supplying index values:
```{r}
LETTERS[1]
LETTERS[1:3]
LETTERS[c(1,3)]
LETTERS[c(4,2)]
```
Notice that in the last case, I can even use the subset to also "re-order" the values.
Or you can subset by logical position:
```{r}
shortLetters <- LETTERS[1:4]
shortLetters[c(FALSE, TRUE, TRUE, FALSE)]
```
But logical vectors do not allow me to "re-order" the vector, only to subset it.
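In practice you rarely type the TRUE/FALSE values by hand; they usually come from a test on the vector itself. A minimal sketch (the name `big` is made up for this example):

```{r}
numbers <- 1:10
# The comparison produces one TRUE/FALSE per element
numbers > 7
# Using that logical vector inside [ keeps only the TRUE positions
big <- numbers[numbers > 7]
big
```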
And: you can also subset by name (_if_ there are names):
```{r}
names(shortLetters) <- c('foo','bar','baz','bob')
names(shortLetters)
shortLetters
shortLetters['foo']
shortLetters[c('bob','bar')]
```
Be sure to notice how, when I want to indicate several names or indices, I have to supply them as a single vector built with `c`. Notice that just like using an index to subset, you can use names to "re-order" the vector as you are extracting values.
#### match(), %in%, unique()
Another very common thing to want to do with data objects is to match elements. There are a couple of methods for doing this. One is the `match` function. For each value in its first argument ('x'), `match` returns the index of the first matching element in its second argument ('table'), or NA if there is no match. So for example:
```{r}
match(shortLetters, c('C','A'))
```
Notice though that the index refers to the position in the 'table' argument and not the position in 'x'.
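Because `match` returns positions, its result can be used directly as an index. The sketch below looks values up in the built-in `LETTERS` vector; the name `idx` is made up for this example.

```{r}
# Where do 'C' and 'A' sit in LETTERS?
idx <- match(c('C', 'A'), LETTERS)
idx
# Those positions can be fed straight back into [
LETTERS[idx]
# A value that is absent from the table gives NA
match('a', LETTERS)
```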
This brings us to the `%in%` operator. The `%in%` operator lets you get a simple logical vector indicating whether or not the values in 'x' match the values in 'table'.
```{r}
shortLetters %in% c('C','A')
```
Between these two methods, `%in%` is usually easier for subsetting, since it does not require you to filter out NA values when something doesn't match. `match` is potentially more useful if you need to return things in an order that matches your table. But pay attention: `match` only reports the position of the 1st match in 'table', so if a value occurs several times in the vector you are searching, it will only tell you about the first occurrence. This is another reason why the `%in%` operator is usually better if your aim is just to subset the data, since it flags every element that matches. In other words:
```{r}
c(shortLetters,shortLetters) %in% c('C','A')
```
Please also take note that you can reverse the meaning of a vector of TRUE/FALSE values by using the negation operator. For example, compare
```{r}
shortLetters %in% c('C','A')
```
To:
```{r}
!shortLetters %in% c('C','A')
```
Another useful function is the `unique` function. You can use `unique` whenever you have a vector or list-like object that needs to be filtered down so that it only contains one instance of each element. So just compare:
```{r}
c(shortLetters,shortLetters)
unique(c(shortLetters,shortLetters))
```
***
#### <span style="color:blue">__Exercise R4:__</span>
* Use the `month.abb` vector with the `grepl` function to determine which elements start with the letter 'J'.
* Now subset the `month.abb` vector so that you only see those months.
* Now use `%in%` instead of `grepl` and look for every month __except__ Jun. (Hint: the negation operator is `!`.)
***
#### double bracket subsetting operator `[[`
In addition to the single bracket subsetting operator, some objects also support a double bracket subsetting operator. This kind of operator is normally only used with list-like objects such as `data.frame`s and `list`s. The difference between the single and double bracket operator is in the type of data that is returned. While the single bracket operator will try to return the same type of object as it was used on, the double bracket will return the type of thing that is 'in' the container. Let's look again at our list example:
Single bracket
```{r}
lst[1]
class(lst[1])
```
Double bracket
```{r}
lst[[1]]
class(lst[[1]])
```
If you look closely, you will notice that these two operators are not returning the same thing. A useful analogy is to think of your list or data.frame as a cargo train. In this analogy, the single bracket will always return a smaller train, but the double bracket will return the contents of a particular train car.
Now let's look at our data.frame example:
Single bracket
```{r}
df[1]
class(df[1])
```
Double bracket
```{r}
df[[1]]
class(df[[1]])
```
Notice how in both cases, the subset operator gets you the contents of the same 'train car'. But: only the second one actually strips off the train car and gives you just what is inside it.
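The distinction matters as soon as you try to compute with the result. This sketch rebuilds `lst` from above and uses base R's `tryCatch` to capture the error instead of stopping; the message text and the name `result` are made up for illustration.

```{r}
lst <- list(a = 1, b = 'foo')
# The contents of the first 'car' is a number, so arithmetic works
lst[[1]] + 1
# The single bracket result is still a list, so the same arithmetic fails
result <- tryCatch(lst[1] + 1, error = function(e) 'error: cannot add to a list')
result
```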
#### Using the single bracket operator to subset square data
Something that we didn't mention above is that for 'square' data, the single bracket operator can take another argument. This means that you can extract both rows and columns at the same time. What is confusing, though, is that when you use the second argument, the first argument now applies to rows while the second argument applies to columns. Let's look at some examples:
The whole object (matrix)
```{r}
mat
```
The 1st row, 1st column (matrix)
```{r}
mat[1,1]
```
1st two rows, 2nd column (matrix)
```{r}
mat[1:2,2]
```
2nd row, columns 1:3 (matrix)
```{r}
mat[2, 1:3]
```
Now let's look at these same operations but with data.frame objects:
Whole object (data.frame)
```{r}
df
```
1st row, 1st column (data.frame)
```{r}
df[1,1]
```
1st three rows, 1st column (data.frame)
```{r}
df[1:3,1]
```
Third row, columns 1:2 (data.frame)
```{r}
df[3,1:2]
```
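One more case is worth sketching: if you leave one of the two arguments empty (but keep the comma), R returns everything along that dimension. The objects `demoMat` and `demoDf` below are hypothetical stand-ins made up for this illustration, not the `mat` and `df` used above.

```{r}
# hypothetical 2 x 3 matrix, filled column by column
demoMat <- matrix(1:6, nrow = 2)
demoMat[, 2]      # all rows, 2nd column
demoMat[1, ]      # 1st row, all columns

# hypothetical data.frame with two columns
demoDf <- data.frame(id = 1:3, name = c("a", "b", "c"))
demoDf[, "name"]  # columns can also be selected by name
demoDf[2, ]       # 2nd row, all columns (still a data.frame)
```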
#### When R gets too smart for its own good (automatic casting):
So as you can see, you can use the subset operators in R to slice data in all sorts of useful ways. But there is another thing that was happening for a lot of these subsets that you might not have noticed. Let's take a quick look at that:
```{r}
class(df)
class(df[3,1:2])
class(df[1,1])
```
Notice how in that last case the class changed from a data.frame to a simple integer vector? That's because R is assuming that if you are working interactively you might find that cast convenient. However, in many cases it will NOT be convenient to have your data objects automatically changing types on you. So please be aware that this can happen. But how can you avoid it? Well the single bracket subset operator when applied to a data.frame has a third argument that you can use called `drop`. If you set `drop` to be FALSE, then R will stop trying to cast your data for you:
```{r}
class(df[1,1, drop=FALSE])
```
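The `drop` argument works the same way for matrices. Here is a minimal sketch on hypothetical objects (`dropMat` and `dropDf` are made up for this illustration) showing what is returned with and without it:

```{r}
# hypothetical objects, made up for this illustration
dropMat <- matrix(1:6, nrow = 2)
class(dropMat[1, ])                # a plain integer vector
class(dropMat[1, , drop = FALSE])  # still a matrix, with one row
dim(dropMat[1, , drop = FALSE])

dropDf <- data.frame(id = 1:3, score = c(2.5, 3.5, 4.5))
class(dropDf[, 1])                 # integer vector
class(dropDf[, 1, drop = FALSE])   # still a data.frame
```

Setting `drop=FALSE` is mostly worth the extra typing inside functions and scripts, where later code expects a matrix or data.frame no matter how many rows or columns were selected.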
***
#### <span style="color:blue">__Exercise R5:__</span>
##### <span style="color:blue">__Exercise R5 part 1:__</span>
Remember that list (`lst2`) we made in exercise R2? Well go back to that list now and extract out the original list (`lst`) by using the appropriate subsetting operator. Assign that recovered value into a variable called `recoveredLst`. How do you know which kind of operator to choose? How can you verify whether your `recoveredLst` is now the correct thing?
##### <span style="color:blue">__Exercise R5 part 2:__</span>
Now let's look at the `myTable` data.frame object we created when we used the `read.table` function. Using what we just learned about subset operators, extract out a vector of gene symbols. Now make those symbols unique with the `unique` function and assign the resulting value to a variable called `symbols`. If you look at the value for `symbols`, you will notice that it's not a character vector but is instead something called a factor. Convert it to a character vector using `as.character()`. Now that you have done this, do it another way (you should know at least two different ways to subset a data.frame in order to extract a specific column).
Now, using what you understand about subsetting operators, see if you can figure out a convenient way to store the value of `symbols` inside of a third element for `lst2`...
***
#### What are these 'factors' about?
Because R is a statistical language, it is often useful to have categorical data like 'gender'. For data like this, R has a special kind of vector called a `factor`. Don't worry too much about factors today, but be aware that they are out there, and that in versions of R before 4.0.0 character data was assumed by default to be this kind of data when imported (since R 4.0.0 the default is to leave it as character). Because of this, a lot of import functions like `read.table` have a special argument called `stringsAsFactors` that lets you decide how your data will be treated after importing.
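As a minimal sketch of what a factor looks like from the inside, the example below builds one by hand. The `treatment` vector is hypothetical data made up for this illustration.

```{r}
# hypothetical categorical data, made up for this illustration
treatment <- factor(c("control", "treated", "control", "treated", "treated"))

levels(treatment)        # the distinct categories
table(treatment)         # counts per category
as.integer(treatment)    # the integer codes stored underneath
as.character(treatment)  # back to plain character data
```

The integer codes are why converting with `as.character()` matters: a factor prints like text, but it is stored as numbers that point into its `levels`.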
### Getting help
If you forget everything else we teach you today, I want to make sure that you know how and where to look for help. The next set of topics will cover that because R has a large number of ways that new users can get assistance for learning about it.
#### Man pages
The manual page is usually the first place you will look if you are wondering what an R command will do. R man pages will explain both what something is, and how to use it. You can usually pull up a manual page for a command by typing `?` followed by the command. Like this:
```{r, eval=FALSE}
?data.frame
```
However a good habit is to wrap the thing you want help on in quotes like this:
```{r, eval=FALSE}
?'data.frame'
```
Why? Because those quotes will allow you to search for help on strings that have special characters like dashes in them. Why would you want to do that when this is not allowed for function names? Consider the following:
```{r, eval=FALSE}
?'GenomicRanges-class'
```
The above will take you to the help page for the GenomicRanges S4 class. It's a very useful man page, but you can't get there if you leave out the quotes.
Alternatively, you could choose to just use the `help` command. But that is three extra letters of typing (and you would still have to quote the argument)...
##### example()
One of the more useful (and often overlooked) functions for exploring a new function is the `example` function. The `example` function takes advantage of the fact that most man pages have an example section. So if instead of calling `help`, you call `example`, the examples from the corresponding manual page will be run for you in your R session. Here is an example of calling the `example` function:
```{r}
example('as.numeric')
```
#### Vignettes
One step up from manual pages are vignettes. There are two kinds of vignettes for any R package. One is the automatically generated kind, which consists of all the manual pages collated together under a table of contents. This kind of vignette is not especially useful. The second kind of vignette, though, is very useful: it gives an overview of how the various functions and objects were intended to be used. Many of the best R packages will contain this second, more useful kind of vignette. When they are available, these are a great way to see what the package authors' intention was.
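Vignettes can be listed from within R. A minimal sketch, assuming the `grid` package (which ships with R) is installed along with its vignettes; the variable name `gridVignettes` is made up for this illustration:

```{r}
# list the vignettes that come with one package (nothing is opened)
gridVignettes <- vignette(package = "grid")
class(gridVignettes)

# the names you would use to open each vignette
gridVignettes$results[, "Item"]
```

To actually open one of them you would pass its name as the first argument to `vignette`, and `browseVignettes()` opens a browsable index for every installed package.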
#### Workflows
Sometimes communities will organize around a series of packages that work well together (usually not an accident). When this happens you can sometimes find another kind of document to explain how these different packages are best used together. This is called a workflow, and it's a great place to start if you are looking to learn about a new set of tools.
#### Forums
All software has bugs. But open source software often will allow you to write to a community of like-minded users (which often includes the package authors) to get help. There are many valuable forums where you can get questions answered about software. Always search first though. Most of the time you will find that someone else has already asked your question and gotten an answer.
#### Cheat sheets
One of the most challenging things for new users of R is not getting help on commands that they know about. It's usually knowing what the command is called in the first place (so that you can start to look for some help about it). A number of R 'cheat sheets' are available online that can help with this problem. Here is one of the more popular ones:
```
https://cran.r-project.org/doc/contrib/Short-refcard.pdf
```
This cheat sheet is valuable enough for new students that I plan to give you a copy, but there are many, many others on the web that cover a range of topics...
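If you only remember part of a command's name, R itself can also help you find it. Here is a minimal sketch using `apropos`, which searches the names of the objects R can currently see with a regular expression (the variable name `readers` is made up for this illustration):

```{r}
# every visible object whose name starts with "read."
readers <- apropos("^read\\.")
readers

# is the function we used earlier among them?
"read.table" %in% readers
```

For a search through the text of the help pages rather than just the names, there is also `help.search` (or its shortcut `??`).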
***
#### <span style="color:blue">__Exercise R6:__</span>
##### <span style="color:blue">__Exercise R6 part 1:__</span>
Go look up the help pages for the `help.start` function. Now actually use that function to explore some package vignettes and man pages in one simple interface. If this wasn't working for you, you might need to launch a vignette using `openVignette`. Please note that `openVignette` requires that you have loaded the `Biobase` package to use. Or you can look at the man pages using `help`. Find an example package and explore each of these.
##### <span style="color:blue">__Exercise R6 part 2:__</span>
Let's look more closely at the help page for data.frames. Notice the first argument? This is a special argument that allows for multiple comma separated values to be passed in. Because of the nature of this argument, all subsequent arguments will require being named. Now scroll down and look at the other arguments. Which arguments have default values? How can you tell?
***
### Exploring packages
One of the greatest strengths of R is the abundance of software that already exists for it. Here are several places where you can expect to find a lot of software for use with R:
#### CRAN
CRAN alone has thousands of freely available software packages for you to browse. CRAN is one of the oldest of the R repositories and it's also one of the largest and most popular. It's pretty easy to put a package on CRAN though, so a lot of what is there may not be terribly well documented. Also a bummer is that CRAN does not make it easy for package authors to update their packages which means that packages on CRAN may have significant delays before being updated for bugs etc. But: CRAN is a great resource and home to many wonderful packages that are useful for many different things.
#### Bioconductor
Bioconductor is smaller than CRAN with only about 1000 software packages. However all of them have been reviewed, are well documented, and are built and checked nightly to ensure that they are still in good working order. Because the mission of Bioconductor is to help with computational biology, their package repository is limited to packages that relate to that topic. So if you need software that is designed for general statistical use you will usually learn that it has been stashed somewhere else (often on CRAN).
#### github
These days github has become a hotbed for open source software. Partly as a result of this, a lot of package authors are now storing their packages just on github. This is fine except that github will not run regular checks to make sure that such packages work as advertised or do any policing to make sure that packages are well documented.
### Bioconductor
The Bioconductor project is popular in no small part because it has provided a safe haven to help good software authors better serve the broader computational biology community. Packages at Bioconductor are built and checked nightly to make sure that they still work as advertised as R and their dependencies change. This turns out to be hugely important since the entire field of computational biology changes constantly. Another useful advantage for Bioconductor is that everything is documented. Not just at the man page level (as is common for many CRAN packages), but also at the vignette level.
#### work flows
In addition to demanding that all packages at Bioconductor are documented with manual pages and instructional vignettes, there are also high level overviews that show how groups of popular packages can be used in a common 'work flow' to solve specific problems. You can see the work flows on the Bioconductor website here:
```
http://bioconductor.org/help/workflows/
```
***
***
***
#### <span style="color:blue">__Exercise R1:__</span>
##### <span style="color:blue">__Exercise R1 part 1:__</span>
First use the paste function to append a letter to a number. The output should look like this:
```{r, echo=FALSE}
paste("A", "1", sep="")
```
Be sure to use the `help` function in order to learn how to use the `paste` function. Be aware that in R, if you pass a value to an argument that expects a character vector, you will need to use quotes to indicate that it's a character string. You might also wonder what an argument called "..." means. That is a special argument that just means you can pass a lot of values in (instead of just one). In the case of paste it means that unless you use a named argument (the other arguments are all named), paste will assume that your arguments are all character vectors that you want pasted together. So to paste two characters (a letter and a number) together, you should include at least two character vectors (of length one each).
##### <span style="color:blue">__Exercise R1 part 2:__</span>
Now create a vector of names so that each name contains the 1st day of each month. So the final vector should look like this:
```{r, echo=FALSE}
paste(month.name, "1",sep='_')
```
To do this, you will want a character vector that has all the months. R already has one; it's called `month.name`. Using that vector and the `paste` function, create the vector described above.
##### <span style="color:blue">__Exercise R1 part 3:__</span>
Now pass in two character vectors of unequal length. Notice how paste handles the fact that you have passed in two different vectors by reusing the elements of the shorter vector... This behavior is called "recycling".
##### <span style="color:blue">__Answer for Exercise R1:__</span>
###### <span style="color:blue">__Answer for Exercise R1 (part 1):__</span>
To just append one character to one number, you can do it like this:
```{r}
paste("A", "1", sep="")
```
###### <span style="color:blue">__Answer for Exercise R1 (part 2):__</span>
For the months example you can also use the paste function like this, with one longer vector and a single value:
```{r, eval=FALSE}
# paste "1" onto the end of every month name, separated by '_'
paste(month.name, "1", sep='_')
```
###### <span style="color:blue">__Answer for Exercise R1 (part 3):__</span>
To paste two unequal character vectors, you can just enter two vectors with a different number of values in each, like below. Notice how the shorter vector is 'recycled' by the paste function.
```{r}
paste(c("A","B","C"), "1", sep="_")
```
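Recycling is easier to see when the shorter vector has more than one element. A minimal sketch (the variable name `recycled` is made up for this illustration):

```{r}
# 4 letters pasted against 2 numbers: the numbers are reused twice
recycled <- paste(c("A", "B", "C", "D"), 1:2, sep = "_")
recycled
```

The result has the length of the longer vector, and the shorter one simply starts over from its first element when it runs out.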
#### <span style="color:blue">__Exercise R2:__</span>
Make the list `lst` from the example above and then make a second list `lst2` that contains both `lst` and the LETTERS vector.
##### <span style="color:blue">__Answer for Exercise R2:__</span>
```{r}
lst2 <- list(lst, LETTERS)
lst2
length(lst2)
```
#### <span style="color:blue">__Exercise R3:__</span>
Use read.table to read in the file that you unpacked earlier called `genesFile.txt`. Be sure to capture the results of this into a local variable called `myTable`. What kind of object is `myTable`? How do you know? How can you read in the table so that the header is ignored?
##### <span style="color:blue">__Answer for Exercise R3:__</span>
```{r}
myTable <- read.table(file='genesFile.txt', header=TRUE)
class(myTable)
```
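For the last question, one possible reading of 'ignored' is that the header line should be thrown away instead of being used for column names. Here is a minimal sketch of that, using the `text` argument of `read.table` so that no file is needed. The contents of `demoText` are hypothetical and are not the contents of `genesFile.txt`.

```{r}
# hypothetical file contents, supplied as text so no file is needed
demoText <- "id symbol\n1 TP53\n2 BRCA1\n3 EGFR"

# header = TRUE: the first line becomes the column names
withHeader <- read.table(text = demoText, header = TRUE)
names(withHeader)

# skip = 1 with header = FALSE: the first line is thrown away
noHeader <- read.table(text = demoText, header = FALSE, skip = 1)
names(noHeader)
```

Both versions contain the same three rows of data; only the column names differ.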
#### <span style="color:blue">__Exercise R4:__</span>
Use the `month.abb` vector with the `grepl` function to determine which elements start with the letter 'J'. Now subset the `month.abb` vector so that you only see those months. Now do the same thing, but this time use `%in%` instead of `grepl` and look for every month __except__ Jun.
##### <span style="color:blue">__Answer for Exercise R4:__</span>
```{r}
# '^' anchors the pattern to the start of each string
grepl('^J',month.abb)
month.abb[grepl('^J',month.abb)]
month.abb[!month.abb %in% 'Jun']
```
#### <span style="color:blue">__Exercise R5:__</span>
##### <span style="color:blue">__Exercise R5 part 1:__</span>
Remember that list (`lst2`) we made in exercise R2? Well go back to that list now and extract out the original list (`lst`) by using the appropriate subsetting operator. Assign that recovered value into a variable called `recoveredLst`. How do you know which kind of operator to choose? How can you verify whether your `recoveredLst` is now the correct thing?
##### <span style="color:blue">__Exercise R5 part 2:__</span>
Now let's look at the `myTable` data.frame object we created when we used the `read.table` function. Using what we just learned about subset operators, extract out a vector of gene symbols. Now make those symbols unique with the `unique` function and assign the resulting value to a variable called `symbols`. If you look at the value for `symbols`, you will notice that it's not a character vector but is instead something called a factor. Convert it to a character vector using `as.character()`.
Now, using what you understand about subsetting operators, see if you can figure out a convenient way to store the value of `symbols` inside of a third element for `lst2`...
##### <span style="color:blue">__Answers for Exercise R5:__</span>
###### <span style="color:blue">__Answer for Exercise R5 (part 1):__</span>
```{r}
recoveredLst <- lst2[[1]]
class(recoveredLst)
```
###### <span style="color:blue">__Answer for Exercise R5 (part 2):__</span>
At this point you now know at least two different ways to do this:
```{r}
symbols <- unique(myTable[[2]])
symbols <- as.character(symbols)
```
And this:
```{r}
symbols <- unique(myTable[,2])
symbols <- as.character(symbols)
```
And just for fun here is a third way:
```{r}
symbols <- unique(myTable$symbol)
symbols <- as.character(symbols)
```
To append another element into your list, there are many ways to do it. But one very 'R-ish' way is this:
```{r}
lst2[[3]] <- symbols
lst2
```
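As a sketch of one of those other ways, `c()` can also join lists, with one catch that is easy to trip over. The list `demoLst2` is a hypothetical stand-in made up for this illustration, not the `lst2` from the exercise.

```{r}
# hypothetical list, made up for this illustration
demoLst2 <- list(1:3, letters[1:2])

# c() joins two lists; wrapping in list() keeps the vector as one element
viaC <- c(demoLst2, list(c("x", "y")))
length(viaC)

# without list(), each value becomes its own element
flattened <- c(demoLst2, c("x", "y"))
length(flattened)
```

This is the same single versus double bracket distinction seen from the other side: to add one 'train car', what you attach has to be a train (a list) itself.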
#### <span style="color:blue">__Exercise R6:__</span>
##### <span style="color:blue">__Exercise R6 part 1:__</span>
Go look up the help pages for the `help.start` function. Now actually use that function to explore some package vignettes and man pages in one simple interface. If this wasn't working for you, you might need to launch a vignette using `openVignette`. Please note that `openVignette` requires that you have loaded the `Biobase` package to use. Or you can look at the man pages using `help`. Find an example package and explore each of these.
##### <span style="color:blue">__Exercise R6 part 2:__</span>
Let's look more closely at the help page for data.frames. Notice the first argument? This is a special argument that allows for multiple comma separated values to be passed in. Because of the nature of this argument, all subsequent arguments will require being named. Now scroll down and look at the other arguments. Which arguments have default values? How can you tell?
##### <span style="color:blue">__Answer for Exercise R6:__</span>
###### <span style="color:blue">__Answer for Exercise R6 (part 1):__</span>
There is no single correct answer for this question.
###### <span style="color:blue">__Answer for Exercise R6 (part 2):__</span>
This is a tricky question, but you can probably see what the answer is by looking closely at the manual page. You can tell which values have default values by looking at the usage statement. If the usage statement has a value assigned within it, then that is the default for that argument and will be used if an alternative value is not supplied.
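The same information can be inspected from code. Here is a minimal sketch using `formals`, which returns the arguments of a function together with any default values; it uses `read.table` as the example, and the variable name `readDefaults` is made up for this illustration.

```{r}
# formals() returns the arguments of a function along with any defaults
readDefaults <- formals(read.table)
readDefaults$header      # default value of the 'header' argument
readDefaults$sep         # default field separator (an empty string)

# 'file', the first argument, has no default and must always be supplied
names(readDefaults)[1]
```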