update README, edit YAML metadata
reikookamoto committed Sep 16, 2024
1 parent 431fe71 commit 4f2e99b
Showing 10 changed files with 146 additions and 123 deletions.
34 changes: 25 additions & 9 deletions README.md
@@ -2,20 +2,36 @@

Material for workshop at Département de science politique de l'Université de Montréal

## Directory Structure
## Getting Started with the Repository

- Clone or download the repository

- Clone the repository to your local machine using Git

- Alternatively, you can click the green "\<\> Code" button and select "Download ZIP" to download the repository as a ZIP file. After downloading, extract the contents.

- After cloning or downloading, navigate to the project folder on your computer.

- intro-to-dplyr/: Contains files for the first half of the workshop, focusing on data manipulation using dplyr
- Double click on `2024-09-18_intro-to-tidyverse.Rproj` in the project root to open the RStudio Project.

- intro-to-ggplot2/: Contains files for the second half of the workshop, covering basic data visualization with ggplot2
- Before running the code, make sure you have the following R packages installed:

- tidyverse

- here

- RColorBrewer

## Directory Structure

- more-ggplot2/: Includes additional code that we probably won't have time to cover during the workshop but may be helpful for further learning
- `intro-to-dplyr/`: Contains files for the first half of the workshop, focusing on data manipulation using dplyr

## Prerequisites
- `intro-to-ggplot2/`: Contains files for the second half of the workshop, covering basic data visualization with ggplot2

Before running the code, make sure you have the following R packages installed:
- `more-ggplot2/`: Includes additional code that we probably won't have time to cover during the workshop but is helpful for further learning

- tidyverse
## During the workshop

- here
- In each of the above folders, you'll find a `.qmd` file with `_blank` in its name. If you'd like to **code along**, you can use these files, which provide skeletons for you to fill in as we work through the material.

- RColorBrewer
- If you prefer to **follow along without coding**, open the other `.qmd` file that already contains all the code.
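
Note on the README's package list: the three packages it names could be installed with something like the sketch below. This is not part of the commit; it assumes a CRAN mirror is reachable, and the variable names `required`, `installed` and `missing_pkgs` are illustrative only.

```r
# Packages named in the README (tidyverse, here, RColorBrewer)
required <- c("tidyverse", "here", "RColorBrewer")

# Check which of them can already be loaded on this machine
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

# Install only the ones that are not yet available
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Installing only the missing packages avoids re-downloading the tidyverse for participants who already have it.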
45 changes: 10 additions & 35 deletions intro-to-dplyr/intro-to-dplyr.md
@@ -1,18 +1,18 @@
# Introduction to dplyr
Reiko Okamoto
2024-09-13
2024-09-16

## 👋Welcome to the tidyverse

#### ***What is the tidyverse?***
#### *What is the tidyverse?*

The [tidyverse](https://tidyverse.tidyverse.org/) is a collection of R
packages designed for data science. Arguably, two of the most popular
packages in the tidyverse are [dplyr](https://dplyr.tidyverse.org/) for
data manipulation and [ggplot2](https://ggplot2.tidyverse.org/) for data
visualization. These are also two of the packages we are covering today!

***Why learn it?***
#### *Why learn it?*

The skills you gain are not just limited to R. The concepts, like
filtering data and creating plots, are applicable to other languages
@@ -21,7 +21,7 @@ languages later. Additionally, the tidyverse is in tune with open
science practices, helping you create analyses that are more accessible,
transparent, and reproducible.

***Keep in mind…***
#### *Keep in mind…*

There’s no expectation for you to memorize everything. Even experienced
programmers don’t have every function memorized - they’re constantly
@@ -695,22 +695,6 @@ penguins |>
The function creates a new data frame with a single row containing the
summary statistic.

💻Calculate the minimum and maximum of body mass at the same time:

``` r
penguins |>
summarise(min_body_mass = min(body_mass_g, na.rm = TRUE),
max_body_mass = max(body_mass_g, na.rm = TRUE))
```

# A tibble: 1 × 2
min_body_mass max_body_mass
<int> <int>
1 2700 6300

Similar to what we’ve seen in other functions, we can create multiple
summaries in a single step by separating them with commas.

## 7️⃣Group by one or more variables: group_by()

In data analysis, a common task is to split our data into groups, apply
@@ -756,21 +740,9 @@ penguins |>
2 Dream 124
3 Torgersen 52

💻Achieve this count using the
Alternatively, use the
[`count()`](https://dplyr.tidyverse.org/reference/count.html) function,
which combines `group_by()` and `tally()` in one step:

``` r
penguins |>
count(island)
```

# A tibble: 3 × 2
island n
<fct> <int>
1 Biscoe 168
2 Dream 124
3 Torgersen 52
which combines `group_by()` and `tally()` in one step.

💻Calculate the mean and standard deviation of body mass for each
combination of species and sex:
@@ -801,7 +773,7 @@ penguins |>
By default, when we apply a grouping with multiple factors, dplyr will
keep the last level of grouping after the summary. Here, the output is
still grouped by `species`. To remove grouping from a data frame, use
the `ungroup()` function or the `.groups = "drop"` argument in the
the `ungroup()` function or the `.groups = "drop"` argument in the
`summarise()` function. Both methods will allow us to continue working
with the data as a regular data frame.

@@ -838,6 +810,9 @@ df <- penguins |>
.groups = "drop")
```

Similar to what we’ve seen in other functions, we can create multiple
summaries in a single step by separating them with commas.

## 8️⃣Sort rows and extract specific values: arrange(), slice(), pull()

Sometimes, we’re interested in extracting a particular value from a data
35 changes: 10 additions & 25 deletions intro-to-dplyr/intro-to-dplyr.qmd
@@ -4,23 +4,21 @@ author: "Reiko Okamoto"
date: "`r Sys.Date()`"
format: gfm
editor: visual
execute:
echo: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 👋Welcome to the tidyverse

#### ***What is the tidyverse?***
#### *What is the tidyverse?*

The [tidyverse](https://tidyverse.tidyverse.org/) is a collection of R packages designed for data science. Arguably, two of the most popular packages in the tidyverse are [dplyr](https://dplyr.tidyverse.org/) for data manipulation and [ggplot2](https://ggplot2.tidyverse.org/) for data visualization. These are also two of the packages we are covering today!

***Why learn it?***
#### *Why learn it?*

The skills you gain are not just limited to R. The concepts, like filtering data and creating plots, are applicable to other languages like SQL and Python. This makes it easier to pick up other tools and languages later. Additionally, the tidyverse is in tune with open science practices, helping you create analyses that are more accessible, transparent, and reproducible.

***Keep in mind...***
#### *Keep in mind...*

There's no expectation for you to memorize everything. Even experienced programmers don't have every function memorized - they're constantly googling things! My goal today is to help you get comfortable with, and hopefully interested in, using the tidyverse for data analysis.

@@ -252,16 +250,6 @@ penguins |>

The function creates a new data frame with a single row containing the summary statistic.

💻Calculate the minimum and maximum of body mass at the same time:

```{r}
penguins |>
summarise(min_body_mass = min(body_mass_g, na.rm = TRUE),
max_body_mass = max(body_mass_g, na.rm = TRUE))
```

Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas.

## 7️⃣Group by one or more variables: group_by()

In data analysis, a common task is to split our data into groups, apply a function to each group, and then combine the results. This approach is known as the split-apply-combine paradigm. The [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) function helps us achieve this by allowing us to specify how we want to split our data into groups.
@@ -284,12 +272,7 @@ penguins |>
tally()
```

💻Achieve this count using the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step:

```{r}
penguins |>
count(island)
```
Alternatively, use the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step.

💻Calculate the mean and standard deviation of body mass for each combination of species and sex:

@@ -300,7 +283,7 @@ penguins |>
sd_body_mass = sd(body_mass_g, na.rm = TRUE))
```

By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. To remove grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame.
By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. To remove grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame.

```{r}
# option 1
@@ -318,6 +301,8 @@ df <- penguins |>
.groups = "drop")
```

Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas.

## 8️⃣Sort rows and extract specific values: arrange(), slice(), pull()

Sometimes, we're interested in extracting a particular value from a data frame, like finding the largest or smallest value in a column.
@@ -392,7 +377,7 @@ penguins_wide <- penguins_long |>
## 📚Resources

| Function | Description |
|-------------------------|-----------------------------------------------|
|-------------------------|--------------------------------------------------|
| `dplyr::glimpse()` | Get a glimpse of your data |
| `dplyr::select()` | Keep or drop columns using their names and types |
| `dplyr::filter()` | Keep rows that match a condition |
58 changes: 18 additions & 40 deletions intro-to-dplyr/intro-to-dplyr_blank.qmd
@@ -4,23 +4,21 @@ author: "Reiko Okamoto"
date: "`r Sys.Date()`"
format: gfm
editor: visual
execute:
echo: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 👋Welcome to the tidyverse

#### ***What is the tidyverse?***
#### *What is the tidyverse?*

The [tidyverse](https://tidyverse.tidyverse.org/) is a collection of R packages designed for data science. Arguably, two of the most popular packages in the tidyverse are [dplyr](https://dplyr.tidyverse.org/) for data manipulation and [ggplot2](https://ggplot2.tidyverse.org/) for data visualization. These are also two of the packages we are covering today!

***Why learn it?***
#### *Why learn it?*

The skills you gain are not just limited to R. The concepts, like filtering data and creating plots, are applicable to other languages like SQL and Python. This makes it easier to pick up other tools and languages later. Additionally, the tidyverse is in tune with open science practices, helping you create analyses that are more accessible, transparent, and reproducible.

***Keep in mind...***
#### *Keep in mind...*

There's no expectation for you to memorize everything. Even experienced programmers don't have every function memorized - they're constantly googling things! My goal today is to help you get comfortable with, and hopefully interested in, using the tidyverse for data analysis.

@@ -132,9 +130,9 @@ The vertical bar acts as an OR operator, meaning a row is returned if any of the

3. Filter the data to find all penguins that are either on Biscoe Island or Torgersen Island.

```{r}
# YOUR CODE HERE
```
```{r}
# YOUR CODE HERE
```
## 4️⃣Pipes
@@ -206,9 +204,9 @@ By separating the new columns with a comma, we can create multiple new variables

2. Create a new column called `flipper_size` that categorizes penguins as short, average, or long based on their `flipper_length_mm`. Hint: Define short as less than 190 mm, average as between 190 and 210 mm, and long as greater than 210 mm.

```{r}
# YOUR CODE HERE
```
```{r}
# YOUR CODE HERE
```
## 6️⃣Compute summary statistics: summarise()
@@ -222,14 +220,6 @@ We often need to summarize our data to understand key characteristics (e.g., mea

The function creates a new data frame with a single row containing the summary statistic.

💻Calculate the minimum and maximum of body mass at the same time:

```{r}
# YOUR CODE HERE
```

Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas.

## 7️⃣Group by one or more variables: group_by()

In data analysis, a common task is to split our data into groups, apply a function to each group, and then combine the results. This approach is known as the split-apply-combine paradigm. The [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) function helps us achieve this by allowing us to specify how we want to split our data into groups.
@@ -248,24 +238,22 @@ A grouped data frame has all the properties of a regular data frame but has an a
# YOUR CODE HERE
```

💻Achieve this count using the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step:

```{r}
# YOUR CODE HERE
```
Alternatively, use the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step.

💻Calculate the mean and standard deviation of body mass for each combination of species and sex:

```{r}
# YOUR CODE HERE
```

By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. To remove grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame.
By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. To remove grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame.

```{r}
# YOUR CODE HERE
```

Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas.

## 8️⃣Sort rows and extract specific values: arrange(), slice(), pull()

Sometimes, we're interested in extracting a particular value from a data frame, like finding the largest or smallest value in a column.
@@ -307,31 +295,21 @@ Imagine our observation of interest is the measurement itself, rather than the p
💻Use the [`pivot_longer()`](https://tidyr.tidyverse.org/reference/pivot_longer.html) function to reshape the `penguins` data so that all the measurements related to the penguins are in a single column, and another column indicates what measurement type it is:

```{r}
penguins_long <- penguins |>
mutate(id = row_number()) |>
pivot_longer(
cols = ends_with("_mm") | ends_with("_g"),
names_to = "measurement_type",
values_to = "value"
)
# YOUR CODE HERE
```

What if we want to go back to the original wide format?

💻Use the [`pivot_wider()`](https://tidyr.tidyverse.org/reference/pivot_wider.html) function to reverse the process:

```{r}
penguins_wide <- penguins_long |>
pivot_wider(
names_from = measurement_type,
values_from = value
)
# YOUR CODE HERE
```

## 📚Resources

| Function | Description |
|-------------------------|--------------------------------------------------|
|------------------------|-----------------------------------------------|
| `dplyr::glimpse()` | Get a glimpse of your data |
| `dplyr::select()` | Keep or drop columns using their names and types |
| `dplyr::filter()` | Keep rows that match a condition |
2 changes: 1 addition & 1 deletion intro-to-ggplot2/intro-to-ggplot2.md
@@ -1,6 +1,6 @@
# Introduction to ggplot2
Reiko Okamoto
2024-09-13
2024-09-16

## 🎨Introduction to ggplot2

8 changes: 3 additions & 5 deletions intro-to-ggplot2/intro-to-ggplot2.qmd
@@ -4,12 +4,10 @@ author: "Reiko Okamoto"
date: "`r Sys.Date()`"
format: gfm
editor: visual
execute:
echo: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 🎨Introduction to ggplot2

ggplot2 helps us create a wide range of static, informative, and visually appealing graphics. Its name comes from the Grammar of Graphics, which is a framework for building plots in a structured way. We can build a plot incrementally by adding layers like data points, axes, colours, and labels.
@@ -37,7 +35,7 @@ trains_df
🧠Explore the type and description of each variable:

| Variable | Type | Description |
|-----------------------|------------------|-------------------------------|
|---------------------------|-----------|--------------------------------------|
| `year` | double | Year of observation |
| `month` | double | Month of observation |
| `service` | character | Type of service |