# Data import, manipulation, and transformation {#sec-transformation}
```{r}
#| include: false
library(fontawesome)
```
Thus far, our focus has been on data objects generated with functions. However, in real-world scenarios, we often work with externally stored data. In this chapter, we will learn how to import, transform, and manipulate data in R.
When we have finished this chapter, we should be able to:
::: {.callout-caution icon="false"}
## `r fa("circle-dot", prefer_type = "regular", fill = "red")` Learning objectives
- Import data from spreadsheets into R.
- Select rows in a data frame according to filtering conditions.
- Select columns in a data frame.
- Direct the output of one dplyr function to the input of another function with the native "pipe" operator `|>`.
- Add new variables to a data frame that are functions of existing columns with `mutate()`.
:::
## Importing data
Spreadsheets are frequently used to organize structured rectangular data (see also @sec-introduction), available in formats like Excel, CSV, or Google Sheets. Each format offers unique features, from Excel's comprehensive capabilities to CSV's simplicity and Google Sheets' collaborative editing. In this textbook, the data used are stored in Excel files (.xlsx) and generally have the following basic characteristics:
- The first line in the file is a *header* row indicating the names of the columns/variables.
- Each value has its own cell.
- Missing values are represented as empty cells.
For example, @fig-excel illustrates data in a single sheet from an Excel file.
![Screenshot of an Excel file](images/excel.png){#fig-excel width="80%"}
The dataset `shock_index` includes data regarding the characteristics of 30 adult patients with suspected pulmonary embolism. The following variables were recorded:
- **sex:** F for females and M for males
- **age:** age of the patients in years
- **sbp:** systolic blood pressure in mmHg
- **HR:** resting heart rate in beats/min
### Importing data from Excel using the RStudio interface
To import data using the RStudio interface, we navigate to the Environment pane, where we'll find the "Import Dataset" button. Clicking this button will prompt a dropdown menu with options to import data from various file formats such as CSV, Excel, or other common formats. We select "From Excel..." (@fig-import).
![Import data using the RStudio interface](images/import.png){#fig-import width="80%"}
A dialog box then appears, enabling us to navigate our file system and select the desired dataset. Once selected, we confirm the import by clicking "Open". Finally, we proceed by clicking the "Import" button, which imports the selected dataset into our RStudio session.
![The RStudio data import wizard](images/wizard.png){#fig-wizard width="100%"}
Upon successful import, the dataset will appear in the Environment pane, ready for analysis and manipulation using R's powerful data analysis tools and functions.
### Importing data from Excel into R with commands
Alternatively, if the dataset is saved in the subfolder named "data" of our RStudio project (see @sec-rstudio_projects), we can read it with the `read_excel()` function from the `{readxl}` package as follows:
```{r}
#| message: false
library(here)
library(readxl)
dat <- read_excel(here("data", "shock_index.xlsx"))
```
::: {.callout-tip icon="false"}
## `r fa("circle-info", fill = "#1DC5CE")` INFO
The path of a file/directory is its location/address in the file system. There are two kinds of paths:
- absolute paths such as "`C:/My_name/../my_project/data/shock_index.xlsx`" and
- relative paths such as "`data/shock_index.xlsx`".
The `here()` function lets us refer to any subfolder or file within an RStudio Project using *relative paths* from the project root (see also @sec-packages).
:::
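As a quick illustration, the two styles might be used like this (the absolute path below is a made-up example and will differ on every computer):

```{r}
#| eval: false
# hypothetical absolute path: tied to one specific computer and folder layout
dat <- read_excel("C:/some/long/path/my_project/data/shock_index.xlsx")

# relative path built with here(): resolved from the RStudio Project root,
# so the same command works wherever the project folder is located
dat <- read_excel(here("data", "shock_index.xlsx"))
```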
Using `head()` is a convenient way to get a quick overview of our dataset and check for any obvious issues before proceeding with further analysis or data manipulation.
```{r}
head(dat)
```
R displays the **first six** rows of the data frame by default. If we want to view a different number of rows (e.g., the first ten rows), we can specify the desired number as an argument as follows:
```{r}
head(dat, 10)
```
When working with large datasets in R, it's often useful to inspect the last few rows of a data frame. The `tail()` function is used for this purpose:
```{r}
tail(dat)
```
Now, R displays the **last six** rows of the data frame by default.
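Like `head()`, `tail()` also accepts the number of rows to display as a second argument; for example, to view the last three rows:

```{r}
tail(dat, 3)
```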
## Basic data manipulations
### Adding a unique identifier for each observation
A unique identifier for each observation is often useful for various data manipulation and analysis tasks. In R, we can add row numbers starting from 1 to the number of rows in our data frame using the `rowid_to_column()` function from the `{tibble}` package:
```{r}
library(tibble)
dat <- rowid_to_column(dat)
dat
```
### Renaming variables and cleaning names
We can rename columns with the `rename()` function from the `{dplyr}` package, using the syntax `new_name = old_name`. Here we rename `rowid` to `patient_ID` and `HR` to `Heart.Rate`:
```{r}
#| message: false
library(dplyr)
dat <- rename(dat,
patient_ID = rowid,
Heart.Rate = HR)
dat
```
Column names that mix styles (like `Heart.Rate` and `patient_ID`) can be standardized with the `clean_names()` function from the `{janitor}` package, which converts them to a consistent snake_case format:
```{r}
#| message: false
library(janitor)
dat <- clean_names(dat)
dat
```
### Converting to the appropriate data type
We might have noticed that the categorical variable `sex` is coded as F (for females) and M (for males), so it is stored as `character` type. We can use the `factor()` function to encode the variable as a factor:
```{r}
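# encode sex as a factor, mapping F/M to the labels female/male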
dat$sex <- factor(dat$sex, levels = c("F", "M"),
labels = c("female", "male"))
dat
```
## Basic data transformations
### Subsetting observations (rows)
#### *A. Select observations by indexing position within square brackets*
Just as we use square brackets \[row, column\] to select elements from matrices, we employ the same approach to select elements from a data frame. For example, by specifying the row indices \[5:10, \] within the brackets, we can efficiently access the required data for rows 5 through 10:
```{r}
dat1 <- dat[5:10, ]
dat1
```
#### *B. Select observations by indexing with conditions within square brackets*
The `which()` function returns the positions (row indices) for which a condition is `TRUE`; these indices can then be used inside the square brackets. For example, to select patients older than 55 years:
```{r}
dat2 <- dat[which(dat$age > 55), ]
dat2
```
```{r}
dat3 <- dat[which(dat$age > 55 & dat$sex == "female"), ]
dat3
```
#### *C. Select observations using the `subset()` function*
The `subset()` function from base R offers a more readable alternative: we pass the data frame and then the filtering condition:
```{r}
dat4 <- subset(dat, age > 55)
dat4
```
```{r}
dat5 <- subset(dat, age > 55 & sex == "female")
dat5
```
#### *D. Select observations using the `filter()` function from the `{dplyr}` package*
We pass the data frame first and then one or more conditions separated by a comma:
```{r}
dat6 <- filter(dat, age > 55)
dat6
```
If we want to select only female patients with age \> 55:
```{r}
dat7 <- filter(dat, age > 55, sex == "female")
dat7
```
### Reordering observations (rows)
We can also sort observations in ascending or descending order of one or more variables (columns) using the `arrange()` function from `{dplyr}`. When multiple column names are provided, each additional column is used to break ties in the values of the preceding columns. For example, we can arrange the rows of our data by the values of `sbp` (in ascending order, which is the default) as follows:
```{r}
dat8 <- arrange(dat, sbp)
head(dat8, 10)
```
We can use `desc()` to re-arrange in descending order. For example:
```{r}
dat9 <- arrange(dat, desc(sbp))
head(dat9, 10)
```
```{r}
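# sort by sbp first; heart_rate breaks ties among equal sbp values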
dat10 <- arrange(dat, sbp, heart_rate)
head(dat10, 10)
```
### Subsetting variables (columns)
#### *A. Select or exclude variables by indexing their name within square brackets*
We can select only the `sex`, `age`, and `heart_rate` variables from the data frame:
```{r}
dat11 <- dat[, c("sex", "age" , "heart_rate")]
head(dat11)
```
```{r}
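# keep every column except patient_id and sbp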
dat12 <- dat[, !names(dat) %in% c("patient_id", "sbp")]
head(dat12)
```
#### *B. Select or exclude variables by indexing position within square brackets (not recommended)*
By specifying the column indices, such as `c(2, 3, 5)`, within brackets, we can select the data in columns 2, 3, and 5 as follows:
```{r}
dat13 <- dat[, c(2, 3, 5)]
head(dat13)
```
```{r}
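# negative indices exclude columns; here columns 1 and 4 are dropped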
dat14 <- dat[-c(1, 4)]
head(dat14)
```
#### *C. Select variables using the `subset()` function*
```{r}
dat15 <- subset(dat, select = c("sex", "age" , "heart_rate"))
head(dat15)
```
```{r}
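# the select argument also accepts column positions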
dat16 <- subset(dat, select = c(2, 3, 5))
head(dat16)
```
The `select` argument also accepts column positions, so variables can be selected by index as well.
#### *D. Select variables using the `select()` function from the `{dplyr}` package*
In the `select()` function, we pass the data frame first and then the variables separated by commas:
```{r}
dat17 <- select(dat, sex, age, heart_rate)
head(dat17)
```
### Creating new variables
Suppose we want to create a new variable representing the ratio of heart rate to systolic blood pressure (HR/sbp), which we will call shock index. We can achieve this by using the `mutate()` function as follows:
```{r}
dat18 <- mutate(dat, shock_index = round(heart_rate / sbp, digits = 2))
dat18
```
::: {.callout-tip icon="false"}
## `r fa("circle-info", fill = "#1DC5CE")` INFO
By default, the `mutate()` function adds the new variable as the last column in the data frame.
:::
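If we prefer the new column in a different position, recent versions of `{dplyr}` also offer the `.before` and `.after` arguments of `mutate()`; a minimal sketch:

```{r}
#| eval: false
# place shock_index right after the sbp column instead of at the end
mutate(dat, shock_index = round(heart_rate / sbp, digits = 2), .after = sbp)
```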
Next, we want to categorize the new variable, given that normal values of this index range from 0.5 to 0.7 (beats/min per mmHg). The `cut()` function divides a numeric variable into intervals defined by `breaks` and assigns the corresponding `labels`:
```{r}
dat19 <- mutate(dat18,
shock_index2 = cut(shock_index, breaks = c(-Inf, 0.5, 0.7, Inf),
labels = c("low","normal","high"))
)
dat19
```
## Using the native pipe operator in a sequence of commands
Given that each verb in `{dplyr}` specializes in a particular task, answering more complex questions typically involves combining several verbs, a process simplified by the native pipe operator, `|>`. For example, suppose we want to filter the original dataset to include only rows where the shock index category is "low" or "high", using a sequence of dplyr verbs:
```{r}
sub_dat <- dat |>
mutate(shock_index = round(heart_rate / sbp, digits = 2),
shock_index2 = cut(shock_index, breaks = c(-Inf, 0.5, 0.7, Inf),
labels = c("low","normal","high"))
) |>
filter(shock_index2 == "low" | shock_index2 == "high")
```
We would read this sequence as:
1. Begin with the dataset `dat` **then**
2. Use it as input to the `mutate()` function to calculate the `shock_index` variable and create the `shock_index2` categorical variable, **then**
3. Use this output as the input to the `filter()` function to select patients with "low" or "high" value of shock index.
```{r}
print(sub_dat, n = Inf)
```
Alternatively, we can use the `%in%` operator, which provides a concise way to combine multiple `OR` conditions:
```{r}
dat |>
mutate(shock_index = round(heart_rate / sbp, digits = 2),
shock_index2 = cut(shock_index, breaks = c(-Inf, 0.5, 0.7, Inf),
labels = c("low","normal","high"))
) |>
filter(shock_index2 %in% c("low", "high"))
```
## Reshaping data
In a pre-post test study investigating the effects of margarine consumption on blood total cholesterol (TCH) levels, eighteen participants completed a 12-week dietary intervention. Before beginning the specialized diet, baseline measurements of blood total cholesterol (in mmol/L) were recorded for each participant (week 0). Follow-up evaluations took place at the 6-week and 12-week intervals, enabling the assessment of changes in cholesterol levels during the intervention period. The collected data were organized and stored within an Excel spreadsheet, structured as follows:
```{r}
dat_TCH <- read_excel(here("data", "cholesterol.xlsx"))
head(dat_TCH)
```
We observe that the dataset is structured in what is commonly referred to as a "wide" format. The **wide format** arises naturally in a repeated measures design, also known as a longitudinal or within-subject design, in which the same individuals are measured multiple times over a period, with each measurement corresponding to a different time point.
### Ensuring that every row has a unique identifier
Before proceeding with data manipulation and analysis tasks, we create a new column in the data set that assigns a unique identifier to each row using the `rowid_to_column()` function:
```{r}
dat_TCH <- dat_TCH |>
rowid_to_column()
head(dat_TCH)
```
### From wide to long format
Wide format data is not always the most convenient shape for statistical analyses or visualizations; many procedures expect data in **"long" format** instead. In long format, each row represents a single observation, and a separate variable indicates the time point or condition.
Now, let's reshape the dataset from a wide format to a long format.
```{r}
library(tidyr)
dat_TCH2_long <- dat_TCH |>
pivot_longer(
cols = starts_with("week"),
names_to = "week",
values_to = "total_cholesterol"
)
head(dat_TCH2_long)
```
To understand how the reshaping process works, it's helpful to break down the `pivot_longer()` function with its arguments:
The argument `cols = starts_with("week")` specifies which columns in the original data should be pivoted. In this instance, we select the columns that start with the prefix "week", such as "week0", "week6", and "week12".
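For reference, the same selection could also be written by listing the pivoted columns explicitly instead of using `starts_with()`; a sketch assuming the column names mentioned above:

```{r}
#| eval: false
dat_TCH |>
  pivot_longer(
    cols = c(week0, week6, week12),
    names_to = "week",
    values_to = "total_cholesterol"
  )
```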
First, the values within the identifier column (rowid) of the wide format are repeated in the long format as many times as the number of columns being pivoted, which in this case is three (@fig-reshape01).
![Columns in the original data that should be reshaped.](images/reshape01.png){#fig-reshape01 width="90%"}
By using the argument `names_to = "week"`, the names of the pivoted columns (i.e., week0, week6, and week12) become the values of a newly created variable in the long format named `week`. These values are repeated once for each row of the wide format (@fig-reshape02).
![The pivoted column names become the values of the new week variable.](images/reshape02.png){#fig-reshape02 width="90%"}
Finally, by using the argument `values_to = "total_cholesterol"`, the cell values from the pivoted variables are placed into a newly defined variable in the long format with the name `total_cholesterol` (@fig-reshape03).
![The cell values from the pivoted columns are placed into the new total_cholesterol variable.](images/reshape03.png){#fig-reshape03 width="78%"}
### From long to wide format
To reverse the operation and spread the data back into wide format, we use the `pivot_wider()` function: `names_from` indicates the variable whose values become the new column names, and `values_from` indicates the variable that supplies the cell values:
```{r}
dat_TCH2_long |>
pivot_wider(
names_from = week,
values_from = total_cholesterol
)
```