# Lists and data frames {#sec-lists}
```{r}
#| include: false
library(fontawesome)
#link::auto(type = "plain", keep_pkg_prefix = FALSE)
```
When we have finished this chapter, we should be able to:
::: {.callout-caution icon="false"}
## `r fa("circle-dot", prefer_type = "regular", fill = "red")` Learning objectives
- Create a list using the `list()` function.
- Refer to a list item using its name or index number.
- Create a data frame from equal length vectors using the `tibble()` function.
- Refer to a column of a data frame using the \$ notation.
- Convert variables from character to factor variables.
:::
## Creating a list
In R, a list enables us to organize diverse objects (e.g., 1-D vectors, matrices, even other lists) under a single data structure. There is no requirement for these objects to be associated or related to each other in any way. Essentially, a list can be considered an advanced data type, allowing us to store practically any kind of information within it.
We construct a list using the `list()` function. For example:
```{r}
my_list <- list(1:5, c("apple", "carrot"), c(TRUE, TRUE, FALSE))
my_list
```
This list consists of three elements referred to as "list items" or "items", which are atomic vectors of different types of data (numeric, character, and logical).
We can assign names to the list items:
```{r}
my_list <- list(
  num = 1:5,
  fruits = c("apple", "carrot"),
  TF = c(TRUE, TRUE, FALSE))
my_list
```
We can also confirm that the class of the object is `list`:
```{r}
class(my_list)
```
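Base R also offers a few helpers for inspecting a list. The following is a minimal sketch that uses a hypothetical list, `inspect_list`, created only for this illustration:

```{r}
# hypothetical list, used only for this illustration
inspect_list <- list(num = 1:5, TF = c(TRUE, FALSE))

length(inspect_list)   # number of list items
names(inspect_list)    # names of the list items
str(inspect_list)      # compact overview of each item's type and content
```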
## Subsetting a list
### Subset list and preserve output as a list
We can use the extraction operator `[ ]` to extract one or more list items while preserving the output in list format:
```{r}
my_list[2] # extract the second list item (indexing by position)
class(my_list[2])
```
```{r}
my_list["fruits"] # same as above but using the item's name
```
```{r}
my_list[c(FALSE, TRUE, FALSE)] # same as above but using logical indices (TRUE/FALSE)
```
### Subset list and simplify the output
We can use the `[[ ]]` operator to extract a single list item while simplifying the output:
```{r}
my_list[[2]] # extract the second list item and simplify it to a vector
class(my_list[[2]])
my_list[["fruits"]] # same as above but using the item's name
```
We can also access the content of the list by typing the name of the list followed by a dollar sign `$` followed by the name of the list item:
```{r}
my_list$fruits # extract the fruits item and simplify it to a vector
```
One thing that differentiates the `[[ ]]` operator from the `$` operator is that `[[ ]]` accepts computed indices and names (for example, a name stored in a variable), whereas `$` accepts only a literal name.
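To illustrate this difference, here is a minimal sketch. The list `lookup_list` and the variable `item_name` are hypothetical names introduced only for this example:

```{r}
# hypothetical list, used only for this illustration
lookup_list <- list(num = 1:5, fruits = c("apple", "carrot"))

item_name <- "fruits"       # the name of the item is stored in a variable

lookup_list[[item_name]]    # [[ ]] evaluates the variable and returns the fruits item
lookup_list$item_name       # $ looks for an item literally called "item_name": NULL
```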
::: {.callout-important icon="false"}
## Simplifying vs. preserving subsetting
It's important to understand the difference between simplifying and preserving subsetting. Simplifying subsets returns the simplest possible data structure that can represent the output. Preserving subsets keeps the structure of the output the same as the input.
:::
### Subset list to get individual elements out of a list item
To extract individual elements out of a specific list item, we combine the `[[ ]]` (or \$) operator with the `[ ]` operator:
```{r}
my_list[[2]][2] # using the index
my_list[["fruits"]][2] # using the name of the list item
my_list$fruits[2] # using the $
```
## Unlist a list
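We can turn a list into an atomic vector with `unlist()`. Because an atomic vector holds a single data type, items of different types are coerced to the most flexible one (here, character):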
```{r}
my_unlist <- unlist(my_list)
my_unlist
class(my_unlist)
```
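The type of the result depends on what the list contains. The sketch below compares two hypothetical lists, `same_type` and `mixed_type`, which are not part of the chapter's data:

```{r}
# hypothetical lists, used only for this illustration
same_type  <- list(a = 1:2, b = 3)
mixed_type <- list(a = 1:2, b = "x")

unlist(same_type)           # stays numeric; the names become a1, a2, b
class(unlist(mixed_type))   # the numbers are coerced to character
```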
## Recursive vectors and nested lists
In R, lists are sometimes referred to as **recursive vectors** because they can include other lists within them. These sublists are known as **nested lists**. For example:
```{r}
my_super_list <- list(item1 = 3.14,
                      item2 = list(item2a_num = 5:10,
                                   item2b_char = c("a", "b", "c")))
my_super_list
```
In this example, `item2`, which is the second item of `my_super_list`, is a nested list.
**Subsetting a nested list**
We can access the list items of a nested list by chaining the `[[ ]]` (or \$) operator with the `[ ]` or `[[ ]]` operator. For example:
```{r}
# preserve the output as a list
my_super_list[[2]][1]
class(my_super_list[[2]][1])
# simplify the output
my_super_list[[2]][[1]]
class(my_super_list[[2]][[1]])
# same as above with names
my_super_list[["item2"]][["item2a_num"]]
# same as above with $ operator
my_super_list$item2$item2a_num
```
We can also **extract individual elements** from the list items of a nested list. For example:
```{r}
# extract individual element
my_super_list[[2]][[2]][3]
class(my_super_list[[2]][[2]][3])
```
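As a side note, base R also lets us pass a vector to `[[ ]]`, which is then applied one level at a time through the nesting. The following sketch uses a hypothetical nested list, `nested_demo`, that mirrors the structure above:

```{r}
# hypothetical nested list, used only for this illustration
nested_demo <- list(item1 = 3.14,
                    item2 = list(a_num = 5:10, b_char = c("a", "b", "c")))

# a vector inside [[ ]] is applied step by step through the nesting levels
nested_demo[[c(2, 1)]]                 # same as nested_demo[[2]][[1]]
identical(nested_demo[[c(2, 1)]], nested_demo[[2]][[1]])
nested_demo[[c("item2", "b_char")]]    # names work in the same way
```

Chaining the operators, as in the examples above, is usually easier to read; the vector form is mainly handy when the path is computed.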
## Data frames
A data frame is the most common way of organizing and storing data in R and is generally the preferred data structure for conducting data analysis tasks.
::: {.callout-tip icon="false"}
## Data frame
In R, rectangular data is often referred to as a "data frame" consisting of rows and columns. While all elements within a column must have the same data type (e.g., numeric, character, or logical), it's possible for different columns to have different data types. Therefore, a **data frame** is a special type of list with **equal-length** atomic vectors as its items.
Various disciplines have different terms for the rows and columns in a data frame, such as observations and variables, records and fields, or examples and attributes. In this textbook, we will consistently use the terms **"observations"** and **"variables"**. Data in variables can be either categorical (categorical variables) or numerical (numerical variables) (see also @sec-introduction).
:::
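We can check the "special type of list" idea directly. The sketch below is only an illustration: it uses the base R `data.frame()` function and a hypothetical object, `df_demo`:

```{r}
# hypothetical data frame, used only for this illustration
df_demo <- data.frame(id = 1:3, group = c("a", "b", "c"))

is.list(df_demo)          # a data frame is built on top of a list
length(df_demo)           # number of list items = number of columns
sapply(df_demo, length)   # every column has the same length (the number of rows)
```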
### Creating a data frame with `tibble()`
We will create a small fictional data frame with eight rows based on the following information:
::: content-box-gray
- age: age of the patient (in years)
- smoking: smoking status of the patient (0=non-smoker, 1=smoker)
- ABO: blood type of the patient based on the ABO blood group system (A, B, AB, O)
- bmi: Body Mass Index (BMI) category of the patient (1=underweight, 2=healthy weight, 3=overweight, 4=obesity)
- occupation: occupation of the patient
- adm_date: admission date to the hospital
:::
A data frame can be created using the `data.frame()` function in base R, the `tibble()` function from the `{tibble}` package (part of the `{tidyverse}`), or the [`data.table()`](https://rdatatable.gitlab.io/data.table/reference/data.table.html) function in the `{data.table}` package. Let's try `tibble()`:
```{r}
#| message: false
#| warning: false
library(tidyverse) # load the tidyverse package
library(rstatix)
dat <- tibble(
  age = c(30, 65, 35, 25, 45, 55, 40, 20),
  smoking = c(0, 1, 1, 0, 1, 0, 0, 1),
  ABO = c("A", "O", "O", "O", "B", "O", "A", "A"),
  bmi = c(2, 3, 2, 2, 4, 4, 3, 1),
  occupation = c("Journalist", "Chef", "Doctor", "Teacher",
                 "Lawyer", "Musician", "Pharmacist", "Nurse"),
  adm_date = c("10-09-2023", "10-12-2023", "10-18-2023", "10-27-2023",
               "11-04-2023", "11-09-2024", "11-22-2023", "12-02-2023")
)
dat
```
We can find the **type**, **class** and **dim** for the created object `dat`:
```{r}
#| results: hold
typeof(dat)
class(dat)
dim(dat)
```
The type is a *list*, but the class is `tbl_df` *(tibble)*, which is a "tidy" data frame (tibbles work better in the tidyverse). The dimensions are 8x6 (8 rows and 6 columns).
The `attributes()` function helps us to explore the characteristics/attributes of our tibble:
```{r}
attributes(dat)
```
### Accessing variables in a data frame
In R, we can access variables in a data frame just like items in a list by using their names or indices. For example:
```{r}
#| results: hold
dat[["age"]]
dat[[2]]
```
or by using the **dollar sign (`$`)** :
```{r}
dat$age
```
We can also extract individual elements out of a specific variable as follows:
```{r}
dat$age[2:5]
```
Another easy way of selecting one variable, similar to `$`, is by utilizing the `pull()` function from the `{dplyr}` package. For example:
```{r}
pull(dat, age)
```
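The distinction between preserving and simplifying subsetting applies to data frames as well. The following sketch uses a hypothetical tibble, `tbl_demo`, created only for this illustration:

```{r}
library(tibble)

# hypothetical tibble, used only for this illustration
tbl_demo <- tibble(age = c(30, 65, 35), ABO = c("A", "O", "O"))

class(tbl_demo["age"])     # [ ] preserves the structure: still a tibble
class(tbl_demo[["age"]])   # [[ ]] simplifies: a numeric vector
```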
### Converting to the appropriate data type
It's critical to investigate each column's data type and convert it to the appropriate type for analysis if necessary. We often use the `glimpse()` function to have a quick look at the structure of the data frame:
```{r}
glimpse(dat)
```
Observe the series of three-letter abbreviations in angle brackets (`<dbl>`, `<chr>`). The abbreviations used in tibbles describe the type of data in each column and are presented in @tbl-data_types:
| **Data Type** | **Description** | **Abbreviation** |
|:----------------:|-----------------------------------|:----------------:|
| character | strings: letters, numbers, symbols, and spaces | `<chr>` |
| integer | numerical values: integer numbers | `<int>` |
| double | numerical values: real numbers | `<dbl>` |
| logical | logical data, typically representing `TRUE` or `FALSE` | `<lgl>` |
| date | date (e.g., 2020-10-09) | `<date>` |
| date+time | date plus time (e.g., 2020-10-09 10:03:25 UTC) | `<dttm>` |
| factor | categorical variables with fixed and known set of possible values (e.g., male/female) | `<fct>` |
| ordered factor | categorical variable with ordered fixed and known set of possible values | `<ord>` |
: Tibble abbreviations that describe the type of data in columns of a data frame {#tbl-data_types}
We can convert the categorical variables `smoking`, `ABO`, and `bmi` from `<dbl>`, `<chr>`, `<dbl>` types, respectively, into factors `<fct>` since they have fixed and known values.
- **Variable: smoking** (numeric coded values → factor)
The following code converts the numeric variable representing smoking status into a factor variable with more meaningful labels, and then displays the updated data frame along with the levels of the newly converted factor variable.
```{r}
dat$smoking <- factor(dat$smoking, levels = c(0, 1),
                      labels = c("non-smoker", "smoker"))
dat
levels(dat$smoking)
```
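One point worth keeping in mind is that any value not listed in `levels` is silently turned into `NA`. The sketch below assumes a hypothetical vector, `smoking_codes`, that contains an unexpected code:

```{r}
# hypothetical coded values; the code 2 is not listed in `levels`
smoking_codes <- c(0, 1, 2, 1)

smoking_fct <- factor(smoking_codes, levels = c(0, 1),
                      labels = c("non-smoker", "smoker"))
smoking_fct               # the unlisted code 2 becomes NA
as.integer(smoking_fct)   # internal integer codes (1, 2), not the original 0/1 values
```

Checking for unexpected `NA` values after a conversion is therefore a sensible habit.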
- **Variable: ABO** (chr → factor)
It's important to note that not all potential values may be present in a given dataset. For example, if we tabulate the variable `ABO` (e.g. using the `table()` function) we will get counts of the categories in the data:
```{r}
# create a count table
table(dat$ABO)
```
The blood type "AB" of the ABO blood group system is absent from our data. In such cases, we can use the `factor()` function and supply a vector of all the valid levels:
```{r}
# create a vector containing the blood types A, B, AB, and O
ABO_levels <- c("A", "B", "AB", "O")
dat$ABO <- factor(dat$ABO, levels = ABO_levels)
dat
# show the levels of the ABO variable
levels(dat$ABO)
# create a count table
table(dat$ABO)
```
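If, at a later stage, we want to discard levels that never occur, base R provides `droplevels()`. A minimal sketch with a hypothetical factor, `abo_demo`:

```{r}
# hypothetical blood-type factor without any "AB" values
abo_demo <- factor(c("A", "O", "O", "B"), levels = c("A", "B", "AB", "O"))

table(abo_demo)               # "AB" is listed with a count of 0
table(droplevels(abo_demo))   # droplevels() removes the levels that never occur
```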
- **Variable: bmi** (numeric coded values → ordered factor)
We might have noticed that the categorical variable `bmi` takes numerically coded values (1, 2, 3, 4) in our dataset, so it is recognized as a double `<dbl>` type. We can convert this variable into an ordered factor `<ord>` with levels (1=underweight, 2=healthy, 3=overweight, 4=obesity). Instead of overwriting the existing variable, we prefer to create a new variable `bmi1`, as follows:
```{r}
# create a vector containing the four bmi categories
bmi1_labels <- c("underweight", "healthy", "overweight", "obesity")
# convert the variable to factor
dat$bmi1 <- factor(dat$bmi, levels = c(1, 2, 3, 4),
                   labels = bmi1_labels, ordered = TRUE)
dat$bmi1
dat
```
Now we can use, for example, a comparison operator such as `>` to check whether one element of the ordered factor is larger than another.
```{r}
dat$bmi1[2] > dat$bmi1[6]
```
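To see why the ordering matters for such comparisons, the following sketch contrasts an unordered and an ordered factor. The objects `bmi_unordered` and `bmi_ordered` are hypothetical and are built only for this illustration:

```{r}
# hypothetical BMI categories, used only for this illustration
bmi_levels <- c("underweight", "healthy", "overweight", "obesity")
bmi_values <- c("overweight", "obesity", "healthy")

bmi_unordered <- factor(bmi_values, levels = bmi_levels)
bmi_ordered   <- factor(bmi_values, levels = bmi_levels, ordered = TRUE)

bmi_unordered[1] > bmi_unordered[2]   # NA with a warning: `>` is not meaningful here
bmi_ordered[1] > bmi_ordered[2]       # FALSE: "overweight" ranks below "obesity"
max(bmi_ordered)                      # the highest category present
```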
However, these operators are used far less often on factors than on numeric vectors. Therefore, we typically omit the `ordered = TRUE` argument, especially when we provide the order of categories explicitly in the `levels` argument.
Now, let's merge the "overweight" and "obesity" categories into a single category named "overweight/obesity" within a new variable called `bmi2`:
```{r}
# recode the values
dat$bmi2 <- case_match(dat$bmi,
                       1 ~ "underweight",
                       2 ~ "healthy",
                       c(3, 4) ~ "overweight/obesity")
# set the levels in order
bmi2_levels <- c("underweight", "healthy", "overweight/obesity")
# convert the variable to factor
dat$bmi2 <- factor(dat$bmi2, levels = bmi2_levels, ordered = TRUE)
dat$bmi2
dat
```
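When the variable is already a factor, a base R alternative is to rename the levels directly, because giving two levels the same name merges them. This is only a sketch with a hypothetical factor, `bmi_fct`, not the approach used above:

```{r}
# hypothetical factor, used only for this illustration
bmi_fct <- factor(c("healthy", "overweight", "obesity", "underweight"),
                  levels = c("underweight", "healthy", "overweight", "obesity"))

# giving two levels the same new name merges them into one level
levels(bmi_fct)[levels(bmi_fct) %in% c("overweight", "obesity")] <- "overweight/obesity"

levels(bmi_fct)
table(bmi_fct)
```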
- **Variable: adm_date** (chr → date)
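In R, by default, values of class Date are displayed as YYYY-MM-DD. Our admission dates, such as "10-12-2023", are stored as character strings (assumed to be in month-day-year format), so we convert them with the `mdy()` function from the `{lubridate}` package, which recent versions of the `{tidyverse}` load automatically: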
```{r}
dat$adm_date <- mdy(dat$adm_date)
dat
class(dat$adm_date)
```
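For comparison, the same conversion can be sketched in base R with `as.Date()` and an explicit format string. The vector `date_chr` below is hypothetical and assumes month-day-year strings:

```{r}
# hypothetical date strings in month-day-year format
date_chr <- c("10-12-2023", "10-18-2023")

date_val <- as.Date(date_chr, format = "%m-%d-%Y")
date_val                     # printed as YYYY-MM-DD
class(date_val)
date_val[2] - date_val[1]    # dates support arithmetic: the difference in days
```

Once a variable is stored as a Date, calculations such as the time between two admissions become straightforward.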