-
Notifications
You must be signed in to change notification settings - Fork 0
/
week07.qmd
335 lines (250 loc) · 17.7 KB
/
week07.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
---
title: 'Text manipulation: regular expressions'
author: "Jeremy Van Cleve"
date: 08 10 2024
format:
html:
self-contained: true
---
# Reminders
- **Think about data to use and plots to make for your project talk and report!!!**.
- E-mail me if you want to discuss your ideas.
# Outline for today
- Text basics
- Basic string manipulation using `stringr`
- Regular expressions: basics
- Regex: character classes
- Regex: quantifiers
- Regex: grouping
- Regex: replacing matches
- Regex: practice!
# Text basics
Working with text and strings can feel like the most boring and tedious aspect of data analysis and wrangling (and it often can be!), but it is important and sometimes the most crucial element. Not only are data sets often untidy, but they are also often full of quirks related to how the data was initially entered. For example, *categorical variables*, like a country name, might be mixed with *numerical variables*, such as dates or population sizes, and these must then be separated intelligently in order to be further analyzed in R. **String** analysis tools are crucial for this task. The first set of tools you will use includes functions from the `stringr` package. The second tool, more difficult but arguably the most crucial, is the "regular expression", which is the like a Swiss Army knife for finding patterns in text.
## `stringr`
R comes with some base functions for manipulating text, but these can be inconsistent in terms of their arguments and their outputs. As normal for material in this course, Hadley Wickham comes to the rescue with a package, `stringr` (loaded when you load `tidyverse`) that provides some basic function for manipulating strings and for using regular expressions. All the functions start with `str_`, which makes seeing a list of them easy using the "tab complete" functionality in RStudio. While it is certainly useful to learn the base packages for string manipulation, the hope is that learning a simpler framework with `stringr` can make the concepts easier so that if you need to understand how to work one of the base functions, the task will be easier.
## Encoding
Computers store information in bits and while it can seems intuitive to convert bits to numbers (go from base 2 to base 10), converting bits to text requires some arbitrary choices. How many characters can convert to bits? Which special characters (e.g., ?, !, ^, etc) do you include beyond numbers and digits? Once you settle on which characters to include, you can simply list them in a table with the corresponding binary value that represents them. Initially, since English speakers were the dominant users of computers, this table, or "encoding", was called ASCII (for "American Standard Code for Information Interchange") and encoding only 128 characters (CS challenge: how many bits do you need to encode 128 characters?).
![ASCII chart from 1972](assets/ASCII.png)
Since 128 characters is not enough to encode characters found in many foreign languages (particularly those with non-Roman characters), other encodings exist that do much better. The main encoding in use on the internet now is "Unicode", which contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts. The most common variant of Unicode is "UTF-8", which was designed for backward compatibility with ASCII. Instead of each character being represented by a specific 8-bit code, Unicode gives letter or character a "code point". If you've seen things like `U+0041`, this is a Unicode code point and happens to correspond to `LATIN CAPITAL LETTER A`. UTF-8 does a nice trick where the first 128 characters are stored like ASCII and other characters use more bytes.
Most computer software now uses UTF-8 as the default encoding and so encoding is something users don't have to worry about generally. However, some older software, including older R functions, may read in text expecting ASCII or some limited encoding when text is actually UTF-8. This means that some or all of the text will look mangled, say like this:
> ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ
There are some ways to fix this kind of problem. Dr. Jenny Bryan discusses some of these solutions here <https://stat545.com/character-encoding.html> and more generally gives some examples of how to convert from one encoding to another. For the purpose of learning more about text wrangling in this course, we can assume that everything is in UTF-8 and that we do not have to convert anything. You can often do this in real work though you may occasionally run into problems and need to google/pray to the R gods. For more on encodings than you probably want to know, try this article: <http://kunststube.net/encoding/>.
## Creating strings: quotes and special characters
You've already been creating strings all over the place, but we'll step through things a bit more methodically here.
Creating strings is easy: just use quotes, single or double (N.B. pairs of single and double quotes do the same thing so you can use a pair of one or the other interchangeably)
```{r}
iamstring = "I am very worried about the country right now."
alsostring = 'So very very worried'
print(iamstring)
print(alsostring)
```
Including a quote within a string is easy if you mix the quote styles.
```{r}
print("A 'quote' within a quote")
print('Another "quote" within a quote')
```
Note in the second example how R converted the single quotes defining the string to double quotes and put backslashes before the double quotes inside the string. This is how you can tell R to ignore the special function some characters have. Within a double quoted string, a double quote would mean "end of the string". Putting a backslash before the double quote says "this is just a double quote character." Generally, putting a backslash before a character in a string "*escapes*" the character. Other special "escape characters", which are often characters you can't type with the keyboard easily, include tab, "\t", and the end-of-line character, "\n".
```{r}
cat("a tab separates this\tand\tthis\n")
cat("\n")
cat("creating multiple lines\nis easy with the end-of-line\ncharacter")
```
# Basic string manipulation using `stringr`
The package `stringr` has a number of useful functions for working with strings. First, load `stringr` via `tidyverse.
```{r}
#| message: false
library(tidyverse)
```
### Length
The function `str_length()` not only tells you how a string is,
```{r}
str_length("ugh. so disappointed. so very disappointed.")
```
but it "threads" over vectors and will tell you how long each string is in a vector.
```{r}
whysosad = c("trump", "gingrich", "giuliani")
str_length(whysosad)
```
This "threading" behavior is useful since it can help you avoid writing a `for` loop. It is also very useful for `dplyr` functions like `mutate` since this means that the function will work on whole columns or variables in a data frame.
### Combining strings
Putting strings together is done using `str_c()` where the "c=concatenate" (again, R, use more characters for this word!).
```{r}
str_c("x", "y", "z", "w")
```
To put characters separating the strings, add the `sep` argument:
```{r}
str_c("x", "y", "z", "w", sep = " is a character and ")
```
`str_c` is also "vectorized" and threads over vectors,
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
Combined strings can be collapsed into a single string with the `collapse` argument.
```{r}
str_c(c("x", "y", "z", "w")) # just a vector still!
str_c(c("x", "y", "z", "w"), collapse = ", ") # now, a single string
```
### Detecting a string
Use `str_detect` to see if a string is contained in a vector of strings. Use `str_subset` to return the vector containing only those strings that match
```{r}
# vector of strings from tidyverse
fruit
str_detect(fruit, "fruit")
# get the subset that match
my_fruit <- str_subset(fruit, "fruit")
print(my_fruit)
```
Again, notice that this function is vectorized and thus can work nicely for slicing data with `dplyr`.
### Splitting a string
Passing a delimiter to `str_split()` will split strings or vectors of strings based on the delimiter:
```{r}
str_split(my_fruit, " ") # split on a space
```
### Subsetting a string
You can pull out parts of a string with `str_sub()`, which is also vectorized,
```{r}
x <- c("Trump", "Gingrich", "Giuliani")
str_sub(x, 1, 3)
# negative numbers count backwards from end
str_sub(x, -3, -1)
```
and works for assignment too,
```{r}
str_sub(x, 1, 3) = "XXX"
print(x)
```
### Replacing a substring within a string
Replacing a substring can be done using `str_replace()`:
```{r}
str_replace(my_fruit, "fruit", "XXX")
my_fruit # it didn't modify the original (as usual for most R functions)
```
# Regular expressions: basics
In the previous examples, strings were matched directly and entirely; either the whole string (e.g, "fruit") matched some piece of the string exactly or it didn't match at all. Regular expressions ("regex"" for short) provide a syntax for constructing much more complicated ways to match strings than an exact match. Using a regular expression, you can construct a pattern that will only match dates (no matter the date!) or ones that will only match URLs (no matter the URL!). Needless to say, this can be extremely helpful when you have to modify/edit tons of data, but the drawback is the regular expressions have a steep learning curve.
## Metacharacters
Like the quote characters and backslash, regex have special characters that are meant to represent something other than the character itself. The metacharacters are: `. \\ | ( [ { ^ $ * + ?`. To see how some of them work, you can use the function `str_view()`, which shows visually all places where a regex matches a string.
- `.`: match any character
```{r}
str_view("some characters #*$#(", ".")
```
- `^`: match at the beginning of the string
```{r}
str_view(c("ogres", "are trolls orgres?", "no, they aren't"), "^og")
```
- `$`: match at the end of the string
```{r}
str_view(c("ogres", "are trolls orgres?", "no, they aren't"), "s$")
```
- `|`, `[]`: creating alternatives and character classes; see below
- `*`, `+`, `?`, `{}`: quantifiers; see below
- `()`: create "groups"; see below
# Regex: character classes
Suppose that you want to match any of the characters "a" or "b"? The "|" operator, which is an "or" operator, works for this because it matches either pattern on its sides. For example,
```{r}
str_view("is 'party like it is 1999' the best prince song????", "a|b")
```
What about matching "a", "b", or "c"? A character class is used for this where "[abc]" denotes the class of characters "a", "b", and "c" and will matches any of those characters.
```{r}
str_view("is 'party like it is 1999' the best prince song????", "[abc]")
```
You can create an "inverted" character class that matches all characters except those specified by putting "^" in the class:
```{r}
str_view("is 'party like it is 1999' the best prince song????", "[^abc]")
```
Finally, there are some handy character classes that are built into regex. For example,
- `\d`, `\D`, `\s`, `\S`, and `\w` represent any decimal digit, not a digit, a space character, not a space character, and a ‘word’ character, respectively
```{r}
# we need the extra "\" so that the "\" is actually included in the string!
str_view("is 'party like it is 1999' the best prince song????", "\\d")
str_view("is 'party like it is 1999' the best prince song????", "\\D")
str_view("is 'party like it is 1999' the best prince song????", "\\s")
str_view("is 'party like it is 1999' the best prince song????", "\\S")
str_view("is 'party like it is 1999' the best prince song????", "\\w")
```
- `\b` and `\B`: matches the empty string at either edge of a word, or not at the edge of a word, respectively. The example below shows how this allows you to match the character "h" either at the edge of a word or within a word.
```{r}
str_view("is 'party like it is 1999' the best prince song????", "\\bi")
str_view("is 'party like it is 1999' the best prince song????", "\\Bi")
```
Other built in character classes are specified with "[::]" or ranges such as [a-z]. Some of these include
- `[:digit:]`: same as `\d` and [0-9].
- `[:lower:]`: lower-case letters, equivalent to [a-z].
- `[:upper:]`: upper-case letters, equivalent to [A-Z].
- `[:alpha:]`: alphabetic characters, equivalent to [A-z].
- `[:alnum:]`: alphanumeric characters, equivalent to [A-z0-9].
- `[:blank:]`: blank characters, i.e. space and tab.
- `[:space:]`: space characters: tab, newline, vertical tab, form feed, carriage return, space.
- `[:punct:]`: punctuation characters, ! " # $ % & ’ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~.
Finally, it is important to note that unless otherwise specified, when we put a character like "a" in a regex, the regex will match "a" and not "A". In other words, regex are **case sensitive**:
```{r}
str_view("Is 'party like it is 1999' the BEST Prince song????", "[abc]")
```
We could get this to match by adding additional uppercase characters to the character class:
```{r}
str_view("Is 'party like it is 1999' the BEST Prince song????", "[abcABC]")
```
but sometimes its simpler to tell the regex to match regardless of case:
```{r}
str_view("Is 'party like it is 1999' the BEST Prince song????", regex("[abc]", ignore_case = TRUE))
```
This is simpler especially if you worry there could be weird combinations of upper and lower case letters.
# Regex: quantifiers
One of the most important pieces of a regex pattern is a **quantifier** that denotes how many times each piece of the pattern show match. There are four basic quantifiers:
- `*`: match zero or more times
- `+`: match one or more times
- `?`: match zero or one times
- `{n}`, `{n,}`, or `{n,m}`: match n times, at least n times, or between n and m times.
```{r}
str_view("is 'party like it is 1999' the best prince song????", "\\?+")
str_view("is 'party like it is 1999' the best prince song????", "\\??")
str_view("is 'party like it is 1999' the best prince song????", "\\?{3}")
str_view("is 'party like it is 1999' the best prince song????", "\\?+?")
```
The last example shows how adding the `?` to either `*` or `+` can make the quantifier "not greedy" and it will only matches as few characters as possible to match.
# Regex: grouping and detecting matches.
Eventually you will want to extract pieces of a match. For example, if you try to match four digit years, you may want to know what years the regex matched. In order to extract this piece from a larger pattern, you can "group" the year part of the pattern.
```{r}
str_match("is 'party like it is 1999' the best prince song????", "(\\w+) +(\\d{4})")
```
The above example shows a match for a word followed by a four digit year. Using the `str_match()` function, you get the matched string and then the grouped expressions afterwards in columns two and three.
If you want to a true/false for whether the pattern matched, you can use `str_detect()` instead of `str_match()`
```{r}
str_detect("is 'party like it is 1999' the best prince song????", "(\\w+) +(\\d{4})")
```
and `str_count()` to get the number of matches in the whole string
```{r}
str_count("is 'party like it is 1999' the best prince song????", "\\?{2}")
```
Note here that matches don't overlap; otherwise you would count more than 2 matches in the four exclamation points.
# Regex: replacing matches
Replacing text with regular expression involves supplying a pattern to match and a replacement string. For example, using `str_replace()`,
```{r}
str_replace("is 'party like it is 1999' the best prince song????", "(\\w+) +(\\d{4})", "is 2024")
```
The groups can also be taken advantage of in the replacement where the variables (called "backreferences") "\\n" contain the nth saved or grouped part of the pattern.
```{r}
str_replace("is 'party like it is 1999' the best prince song????", "(\\w+) +(\\d{4})", "was 1899 but \\1 now \\2")
```
# Regex: practice!
There are a couple of nice websites that allow you to easily construct a regex pattern and see how it matches text at the same time. This can be very useful when trying to construct a pattern to do a very specific job.
- <http://regexr.com/>
- <http://regex101.com/>
This is a good places to learn the basics too in a more fun way:
- <https://regexcrossword.com/challenges/tutorial>
# Lab ![](assets/beaker.png)
For reference, here is the RStudio regular expression cheat sheet:
[pdf on github](https://github.com/rstudio/cheatsheets/blob/4cfbc4c7ec5857b524473db219a1c4d402dd5e80/regex.pdf).
### Problems
1. Go to <https://regexcrossword.com/challenges/beginner>. Solve the first four puzzles and submit the answers to them in your qmd file.
2. Write your own 2x2 regex crossword in the style of the puzzles in problem #1. Provide the puzzle and solution.
3. Use GWAS Catalog data for this problem, which you can load like this:
```{r}
#| message: false
gwas = read_tsv("gwas-catalog-v1.0.3.1-studies-r2024-09-22.tsv")
```
Unlike our previous GWAS data, these data are just the studies (not the individual SNPs) but they are all the studies in the database.\
For a description of the columns in this dataset, look here: <https://www.ebi.ac.uk/gwas/docs/fileheaders#_file_headers_for_unpublished_studies>.
How many different studies in the database study something to do with the thyroid? (use `str_detect` and `DISEASE/TRAIT` for this)\
How many of these studies reported no significant association for their `DISEASE/TRAIT`? (use `ASSOCIATION COUNT`)
4. Using the GWAS data above, how many studies use at least 50,000 individuals in their `INITIAL SAMPLE SIZE`?\
Hint: you'll definitely need a regular expression for this and `str_extract` should be useful. Getting all the numbers from `INITIAL SAMPLE SIZE` is tricky so get the first one as a start. If you're interesting in a challenge, get all the numbers and check them.