
# Lists and data frames {#sec-lists}

```{r}
#| include: false

library(fontawesome)
#link::auto(type = "plain", keep_pkg_prefix = FALSE)
```

When we have finished this chapter, we should be able to:

::: {.callout-caution icon="false"}
## `r fa("circle-dot", prefer_type = "regular", fill = "red")` Learning objectives

-   Create a list using the `list()` function.
-   Refer to a list item using its name or index number.
-   Create a data frame from equal length vectors using the `tibble()` function.
-   Refer to a column of a data frame using the \$ notation.
-   Convert variables from character to factor variables.
:::

 
## Creating a list

In R, a list enables us to organize diverse objects (e.g., 1-D vectors, matrices, even other lists) under a single data structure. There is no requirement for these objects to be associated or related to each other in any way. Essentially, a list can be considered an advanced data type, allowing us to store practically any kind of information within it.

We construct a list using the `list()` function. For example:

```{r}
my_list <- list(1:5, c("apple", "carrot"), c(TRUE, TRUE, FALSE))
my_list
```

This list consists of three elements referred to as "list items" or "items", which are atomic vectors of different types of data (numeric, character, and logical).

We can assign names to the list items:

```{r}
my_list <- list(
              num = 1:5, 
              fruits = c("apple", "carrot"), 
              TF = c(TRUE, TRUE, FALSE))
my_list
```

We can also confirm that the class of the object is `list`:

```{r}
class(my_list)
```

 
## Subsetting a list

### Subset list and preserve output as a list

We can use the extraction operator `[ ]` to extract one or more list items while preserving the output in list format:

```{r}
my_list[2]    # extract the second list item (indexing by position)

class(my_list[2])
```

```{r}
my_list["fruits"]   # same as above but using the item's name
```

```{r}
my_list[c(FALSE, TRUE, FALSE)]    # same as above but using boolean indices (TRUE/FALSE)
```

 
### Subset list and simplify the output

We can use the `[[ ]]` operator to extract a single list item while simplifying the output:

```{r}
my_list[[2]]   # extract the second list item and simplify it to a vector

class(my_list[[2]])

my_list[["fruits"]]   # same as above but using the item's name
```

We can also access the content of the list by typing the name of the list followed by a dollar sign `$` followed by the name of the list item:

```{r}
my_list$fruits  # extract the "fruits" item and simplify it to a vector
```

One thing that differentiates the `[[ ]]` operator from the `$` is that the `[[ ]]` operator can be used with computed indices and names. The `$` operator can only be used with names.
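
As a minimal sketch of this difference, the example below stores an item name in a variable. The objects `demo_list` and `item_name` are hypothetical names used only for illustration, so the lists created above are left untouched:

```{r}
# a small example list (hypothetical, for illustration only)
demo_list <- list(num = 1:3, fruits = c("apple", "banana"))

item_name <- "fruits"      # the item's name is stored in a variable

demo_list[[item_name]]     # [[ ]] evaluates the variable and returns the "fruits" item
demo_list$item_name        # $ looks for an item literally called "item_name" and returns NULL
```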

::: {.callout-important icon="false"}
## Simplifying vs. preserving subsetting

It's important to understand the difference between simplifying and preserving subsetting. Simplifying subsets returns the simplest possible data structure that can represent the output. Preserving subsets keeps the structure of the output the same as the input.
:::

 
### Subset list to get individual elements out of a list item

To extract individual elements out of a specific list item, combine the `[[ ]]` (or \$) operator with the `[ ]` operator:

```{r}
my_list[[2]][2]          # using the index

my_list[["fruits"]][2]  # using the name of the list item

my_list$fruits[2]       # using the $

```

 
## Unlist a list

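We can turn a list into an atomic vector with `unlist()`. Because an atomic vector holds a single data type, the list items are coerced to a common type (here, character):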

```{r}
my_unlist <- unlist(my_list)
my_unlist
class(my_unlist)
```
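
The resulting type depends on what the list contains. As a small sketch (the list `mixed_list` is a hypothetical example, not part of the data above), a list that holds only numeric and logical items is flattened into a numeric vector:

```{r}
# hypothetical list with numeric and logical items only
mixed_list <- list(a = c(1.5, 2), b = c(TRUE, FALSE))

mixed_vec <- unlist(mixed_list)
mixed_vec          # logical values are coerced to numbers (TRUE = 1, FALSE = 0)
class(mixed_vec)
```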

 
## Recursive vectors and nested lists

In R, lists are sometimes referred to as **recursive vectors** because they can include other lists within them. These sublists are known as **nested lists**. For example:

```{r}
my_super_list <- list(item1 = 3.14,
                      item2 = list(item2a_num = 5:10,
                                   item2b_char = c("a", "b", "c")))

my_super_list
```

In this example, `item2`, which is the second item of `my_super_list`, is a nested list.
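
When a list has several levels, printing it can become hard to read. One option is the base R function `str()`, which gives a compact overview of the structure. The sketch below uses a hypothetical list, `demo_nested`, created only for this illustration:

```{r}
# hypothetical nested list (for illustration only)
demo_nested <- list(x = 1, y = list(y1 = 1:3, y2 = "a"))

str(demo_nested)    # compact, indented overview of the nested structure
```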

 
**Subsetting a nested list**

We can access the list items of a nested list by using the combination of `[[ ]]` (or \$) operator and the `[ ]` operator. For example:

```{r}
# preserve the output as a list
my_super_list[[2]][1]
class(my_super_list[[2]][1])

# simplify the output
my_super_list[[2]][[1]]
class(my_super_list[[2]][[1]])

# same as above with names
my_super_list[["item2"]][["item2a_num"]]


# same as above with $ operator
my_super_list$item2$item2a_num

```

 
We can also **extract individual elements** from the list items of a nested list. For example:

```{r}
# extract individual element
my_super_list[[2]][[2]][3]
class(my_super_list[[2]][[2]][3])
```

 
## Data frames

A data frame is the most common way of organizing and storing data in R and is generally the preferred data structure for conducting data analysis tasks.

::: {.callout-tip icon="false"}
## Data frame

In R, rectangular data is often referred to as a "data frame" consisting of rows and columns. While all elements within a column must have the same data type (e.g., numeric, character, or logical), it's possible for different columns to have different data types. Therefore, a **data frame** is a special type of list with **equal-length** atomic vectors as its items.

Various disciplines have different terms for the rows and columns in a data frame, such as observations and variables, records and fields, or examples and attributes. In this textbook, we will consistently use the terms **"observations"** and **"variables"**. Data in variables can be either categorical (categorical variables) or numerical (numerical variables) (see also @sec-introduction).
:::
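
The claim that a data frame is a list of equal-length vectors can be checked directly. The following sketch uses a small base R data frame, `demo_df`, with made-up values chosen only for illustration:

```{r}
# a small base R data frame (hypothetical values)
demo_df <- data.frame(id = 1:3, group = c("a", "b", "a"))

is.list(demo_df)     # a data frame is built on top of a list
length(demo_df)      # the number of list items equals the number of columns
lengths(demo_df)     # every column has the same length (the number of rows)
```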

### Creating a data frame with `tibble()`

We will create a small fictional data frame with eight rows based on the following information:

::: content-box-gray
-   age: age of the patient (in years)
-   smoking: smoking status of the patient (0=non-smoker, 1=smoker)
-   ABO: blood type of the patient based on the ABO blood group system (A, B, AB, O)
-   bmi: Body Mass Index (BMI) category of the patient (1=underweight, 2=healthy weight, 3=overweight, 4=obesity)
-   occupation: occupation of the patient
-   adm_date: admission date to the hospital
:::

A data frame can be created using the `data.frame()` function in base R, the `tibble()` function from the `{tibble}` package (part of the `{tidyverse}`), or the [`data.table()`](https://rdatatable.gitlab.io/data.table/reference/data.table.html) function in the `{data.table}` package. Let's try `tibble()`:

```{r}
#| message: false
#| warning: false

library(tidyverse)   # load the tidyverse package
library(rstatix)

dat <- tibble(
  age = c(30, 65, 35, 25, 45, 55, 40, 20),
  smoking = c(0, 1, 1, 0, 1, 0, 0, 1),
  ABO = c("A", "O", "O", "O", "B", "O", "A", "A"),
  bmi = c(2, 3, 2, 2, 4, 4, 3, 1),
  occupation = c("Journalist", "Chef", "Doctor", "Teacher",
                  "Lawyer", "Musician", "Pharmacist", "Nurse"),
  adm_date = c("10-09-2023", "10-12-2023", "10-18-2023", "10-27-2023",
               "11-04-2023", "11-09-2024", "11-22-2023", "12-02-2023")
)

dat
```

We can find the **type**, **class** and **dim** for the created object `dat`:

```{r}
#| results: hold

typeof(dat)
class(dat)
dim(dat)
```

The type is a *list*, but the class includes `tbl_df` *(tibble)*, which is a "tidy" data frame (tibbles work better in the tidyverse). The dimensions are 8x6 (8 rows and 6 columns).

The `attributes()` function helps us to explore the characteristics/attributes of our tibble:

```{r}
attributes(dat)
```

 
### Accessing variables in a data frame

In R, we can access variables in a data frame just like items in a list by using their names or indices. For example:

```{r}
#| results: hold

dat[["age"]]   # by name: the variable age
dat[[2]]       # by index: the second variable (smoking)
```

or by using the **dollar sign (`$`)** :

```{r}
dat$age
```

We can also extract individual elements out of a specific variable as follows:

```{r}
dat$age[2:5]
```

Another easy way of selecting one variable, similar to `$`, is by utilizing the `pull()` function from the `{dplyr}` package. For example:

```{r}
pull(dat, age)
```
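
All of the approaches above simplify the output to a vector. As with lists, single brackets `[ ]` preserve the structure instead. The sketch below illustrates this with a hypothetical tibble, `demo_tbl`, created only for this example:

```{r}
library(tibble)

# hypothetical tibble used only for this illustration
demo_tbl <- tibble(age = c(30, 65, 35), sex = c("F", "M", "F"))

demo_tbl["age"]      # [ ] preserves the structure: a one-column tibble
demo_tbl[["age"]]    # [[ ]] simplifies the output: a numeric vector
```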

 
### Converting to the appropriate data type

It's critical to investigate the column's data type and convert it to the appropriate type for analysis if necessary. Often we use the `glimpse()` function in order to have a quick look at the structure of the data frame:

```{r}
glimpse(dat)
```

Observe the series of three-letter abbreviations in angle brackets (`<dbl>`, `<chr>`). The abbreviations used in tibbles describe the type of data in each column and are presented in @tbl-data_types:

| **Data Type**  | **Description**                                                                       | **Abbreviation** |
|:----------------:|-----------------------------------|:----------------:|
|   character    | strings: letters, numbers, symbols, and spaces                                        |     `<chr>`      |
|    integer     | numerical values: integer numbers                                                     |     `<int>`      |
|     double     | numerical values: real numbers                                                        |     `<dbl>`      |
|    logical     | logical data, typically representing `TRUE` or `FALSE`                                |     `<lgl>`      |
|      date      | date (e.g., 2020-10-09)                                                               |     `<date>`     |
|   date+time    | date plus time (e.g., 2020-10-09 10:03:25 UTC)                                        |     `<dttm>`     |
|     factor     | categorical variables with fixed and known set of possible values (e.g., male/female) |     `<fct>`      |
| ordered factor | categorical variable with ordered fixed and known set of possible values              |     `<ord>`      |

: Tibble abbreviations that describe the type of data in columns of a data frame {#tbl-data_types}

 
We can convert the categorical variables `smoking`, `ABO`, and `bmi` from `<dbl>`, `<chr>`, `<dbl>` types, respectively, into factors `<fct>` since they have fixed and known values.

 
-   **Variable: smoking** (numeric coded values → factor)

The following code converts the numeric variable representing smoking status into a factor variable with more meaningful labels, and then displays the updated data frame along with the levels of the newly converted factor variable.

```{r}
dat$smoking <- factor(dat$smoking, levels = c(0, 1), 
                  labels = c("non-smoker", "smoker"))
dat
levels(dat$smoking)
```

 
-   **Variable: ABO** (chr → factor)

It's important to note that not all potential values may be present in a given dataset. For example, if we tabulate the variable `ABO` (e.g. using the `table()` function) we will get counts of the categories in the data:

```{r}
# create a count table
table(dat$ABO)
```

The blood type "AB" of the ABO blood group system is absent from our data. In such cases, we can use the `factor()` function and supply a vector of all the valid levels:

```{r}

# create a vector containing the blood types A, B, AB, and O
ABO_levels <- c("A", "B", "AB", "O")

dat$ABO <- factor(dat$ABO, levels = ABO_levels)
dat
# show the levels of the ABO variable
levels(dat$ABO)
# create a count table
table(dat$ABO)
```
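
Specifying the valid levels also has a side effect worth knowing: any value that does not match one of the levels becomes `NA`. A minimal sketch, assuming a hypothetical vector `abo_raw` in which the blood type "O" was mistyped as the digit "0":

```{r}
# hypothetical vector with a mistyped blood type ("0" instead of "O")
abo_raw <- c("A", "0", "AB")

abo_fct <- factor(abo_raw, levels = c("A", "B", "AB", "O"))
abo_fct              # the unmatched value "0" is converted to NA
sum(is.na(abo_fct))  # count the values that did not match any level
```

Counting the missing values after the conversion is a quick way to detect such data entry errors.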

 
-   **Variable: bmi** (numeric coded values → ordered factor)

We might have noticed that the categorical variable `bmi` takes numerically coded values (1, 2, 3, 4) in our dataset, so it is recognized as a double `<dbl>` type. We can convert this variable into an ordered factor `<ord>` with levels (1=underweight, 2=healthy, 3=overweight, 4=obesity). Instead of overwriting the existing variable, we prefer to create a new variable `bmi1`, as follows:

```{r}
# create a vector containing the four bmi categories
bmi1_labels <- c("underweight", "healthy", "overweight", "obesity")

# convert the variable to factor
dat$bmi1 <- factor(dat$bmi, levels = c(1, 2, 3, 4), 
                   labels = bmi1_labels, ordered = TRUE)
dat$bmi1

dat
```

Now we can use comparison operators such as `>` to check whether one element of the ordered factor is larger than another.

```{r}
dat$bmi1[2] > dat$bmi1[6]
```

However, the use of these operators on factors is much less common compared to numeric vectors. Therefore, we typically omit the `ordered = TRUE` argument, especially when we provide the order of categories explicitly in the `levels` argument.
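
To illustrate what is lost when `ordered = TRUE` is omitted, the sketch below compares the same values stored as an unordered and as an ordered factor. The names `size_levels`, `size_fct`, and `size_ord` are hypothetical and used only for this example:

```{r}
size_levels <- c("small", "medium", "large")

size_fct <- factor(c("small", "large"), levels = size_levels)                  # unordered
size_ord <- factor(c("small", "large"), levels = size_levels, ordered = TRUE)  # ordered

size_ord[1] < size_ord[2]   # TRUE: "small" comes before "large"
size_fct[1] < size_fct[2]   # NA with a warning: '<' is not meaningful for unordered factors
```

In both cases the order of the levels is the same, so functions that rely only on level order (e.g., `table()`) behave identically.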

 
Now, let's merge the "overweight" and "obesity" categories into a single category named "overweight/obesity" within a new variable called `bmi2`:

```{r}
# recode the values
dat$bmi2 <- case_match(dat$bmi, 
                      1 ~ "underweight",
                      2 ~ "healthy",
                      c(3, 4) ~ "overweight/obesity")

# set the levels in order
bmi2_levels <- c("underweight", "healthy", "overweight/obesity")

# convert the variable to factor
dat$bmi2 <- factor(dat$bmi2, levels = bmi2_levels, ordered = TRUE)
dat$bmi2

dat
```

 
-   **Variable: adm_date** (chr → date)

In R, values of class Date are displayed as YYYY-MM-DD by default. Our dates are stored as character strings in month-day-year format (e.g., "10-12-2023"), so we convert them with the `mdy()` function from the `{lubridate}` package:

```{r}
dat$adm_date <- mdy(dat$adm_date)
dat
class(dat$adm_date)
```
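
The same conversion can also be sketched in base R with `as.Date()` by describing the input format explicitly. The vectors `date_chr` and `date_val` below are hypothetical and created only for this illustration:

```{r}
# hypothetical date strings in month-day-year format
date_chr <- c("10-09-2023", "12-02-2023")

# %m = month, %d = day, %Y = four-digit year
date_val <- as.Date(date_chr, format = "%m-%d-%Y")
date_val                    # printed as YYYY-MM-DD
date_val[2] - date_val[1]   # dates support arithmetic (difference in days)
```

Once a variable is stored as a Date rather than as text, such calculations and chronological sorting work as expected.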