From 4f2e99bb688cdf4c0b4e5c807aff6b5998a519be Mon Sep 17 00:00:00 2001 From: Reiko Okamoto Date: Mon, 16 Sep 2024 11:07:58 -0400 Subject: [PATCH] update README, edit YAML metadata --- README.md | 34 +++++++--- intro-to-dplyr/intro-to-dplyr.md | 45 +++---------- intro-to-dplyr/intro-to-dplyr.qmd | 35 +++------- intro-to-dplyr/intro-to-dplyr_blank.qmd | 58 +++++----------- intro-to-ggplot2/intro-to-ggplot2.md | 2 +- intro-to-ggplot2/intro-to-ggplot2.qmd | 8 +-- intro-to-ggplot2/intro-to-ggplot2_blank.qmd | 10 +-- more-ggplot2/more-ggplot2.md | 2 +- more-ggplot2/more-ggplot2.qmd | 2 + more-ggplot2/more-ggplot2_blank.qmd | 73 +++++++++++++++++++++ 10 files changed, 146 insertions(+), 123 deletions(-) create mode 100644 more-ggplot2/more-ggplot2_blank.qmd diff --git a/README.md b/README.md index 71148aa..3750997 100644 --- a/README.md +++ b/README.md @@ -2,20 +2,36 @@ Material for workshop at Département de science politique de l'Université de Montréal -## Directory Structure +## Getting Started with the Repository + +- Clone or download the repository + + - Clone the repository to your local machine using Git + + - Alternatively, you can click the green "\<\> Code" button and select "Download ZIP" to download the repository as a ZIP file. After downloading, extract the contents. + +- After cloning or downloading, navigate to the project folder on your computer. -- intro-to-dplyr/: Contains files for the first half of the workshop, focusing on data manipulation using dplyr +- Double click on `2024-09-18_intro-to-tidyverse.Rproj` in the project root to open the RStudio Project. 
-- intro-to-ggplot2/: Contains files for the second half of the workshop, covering basic data visualization with ggplot2 +- Before running the code, make sure you have the following R packages installed: + + - tidyverse + + - here + + - RColorBrewer + +## Directory Structure -- more-ggplot2/: Includes additional code that we probably won't have time to cover during the workshop but may be helpful for further learning +- `intro-to-dplyr/`: Contains files for the first half of the workshop, focusing on data manipulation using dplyr -## Prerequisites +- `intro-to-ggplot2/`: Contains files for the second half of the workshop, covering basic data visualization with ggplot2 -Before running the code, make sure you have the following R packages installed: +- `more-ggplot2/`: Includes additional code that we probably won't have time to cover during the workshop but is helpful for further learning -- tidyverse +## During the workshop -- here +- In each of the above folders, you'll find a `.qmd` file with `_blank` in its name. If you'd like to **code along**, you can use these files, which provide skeletons for you to fill in as we work through the material. -- RColorBrewer +- If you prefer to **follow along without coding**, open the other `.qmd` file that already contains all the code. diff --git a/intro-to-dplyr/intro-to-dplyr.md b/intro-to-dplyr/intro-to-dplyr.md index 3fd95af..86dac7d 100644 --- a/intro-to-dplyr/intro-to-dplyr.md +++ b/intro-to-dplyr/intro-to-dplyr.md @@ -1,10 +1,10 @@ # Introduction to dplyr Reiko Okamoto -2024-09-13 +2024-09-16 ## 👋Welcome to the tidyverse -#### ***What is the tidyverse?*** +#### *What is the tidyverse?* The [tidyverse](https://tidyverse.tidyverse.org/) is a collection of R packages designed for data science. Arguably, two of the most popular @@ -12,7 +12,7 @@ packages in the tidyverse are [dplyr](https://dplyr.tidyverse.org/) for data manipulation and [ggplot2](https://ggplot2.tidyverse.org/) for data visualization. 
These are also two of the packages we are covering today! -***Why learn it?*** +#### *Why learn it?* The skills you gain are not just limited to R. The concepts, like filtering data and creating plots, are applicable to other languages @@ -21,7 +21,7 @@ languages later. Additionally, the tidyverse is in tune with open science practices, helping you create analyses that are more accessible, transparent, and reproducible. -***Keep in mind…*** +#### *Keep in mind…* There’s no expectation for you to memorize everything. Even experienced programmers don’t have every function memorized - they’re constantly @@ -695,22 +695,6 @@ penguins |> The function creates a new data frame with a single row containing the summary statistic. -💻Calculate the minimum and maximum of body mass at the same time: - -``` r -penguins |> - summarise(min_body_mass = min(body_mass_g, na.rm = TRUE), - max_body_mass = max(body_mass_g, na.rm = TRUE)) -``` - - # A tibble: 1 × 2 - min_body_mass max_body_mass - - 1 2700 6300 - -Similar to what we’ve seen in other functions, we can create multiple -summaries in a single step by separating them with commas. - ## 7️⃣Group by one or more variables: group_by() In data analysis, a common task is to split our data into groups, apply @@ -756,21 +740,9 @@ penguins |> 2 Dream 124 3 Torgersen 52 -💻Achieve this count using the +Alternatively, use the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, -which combines `group_by()` and `tally()` in one step: - -``` r -penguins |> - count(island) -``` - - # A tibble: 3 × 2 - island n - - 1 Biscoe 168 - 2 Dream 124 - 3 Torgersen 52 +which combines `group_by()` and `tally()` in one step. 💻Calculate the mean and standard deviation of body mass for each combination of species and sex: @@ -801,7 +773,7 @@ penguins |> By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. 
To remove grouping from a data frame, use -the `ungroup()` function or the `.groups = "drop"` argument in the +the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame. @@ -838,6 +810,9 @@ df <- penguins |> .groups = "drop") ``` +Similar to what we’ve seen in other functions, we can create multiple +summaries in a single step by separating them with commas. + ## 8️⃣Sort rows and extract specific values: arrange(), slice(), pull() Sometimes, we’re interested in extracting a particular value from a data diff --git a/intro-to-dplyr/intro-to-dplyr.qmd b/intro-to-dplyr/intro-to-dplyr.qmd index 07099bf..0005bd2 100644 --- a/intro-to-dplyr/intro-to-dplyr.qmd +++ b/intro-to-dplyr/intro-to-dplyr.qmd @@ -4,23 +4,21 @@ author: "Reiko Okamoto" date: "`r Sys.Date()`" format: gfm editor: visual +execute: + echo: true --- -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) -``` - ## 👋Welcome to the tidyverse -#### ***What is the tidyverse?*** +#### *What is the tidyverse?* The [tidyverse](https://tidyverse.tidyverse.org/) is a collection of R packages designed for data science. Arguably, two of the most popular packages in the tidyverse are [dplyr](https://dplyr.tidyverse.org/) for data manipulation and [ggplot2](https://ggplot2.tidyverse.org/) for data visualization. These are also two of the packages we are covering today! -***Why learn it?*** +#### *Why learn it?* The skills you gain are not just limited to R. The concepts, like filtering data and creating plots, are applicable to other languages like SQL and Python. This makes it easier to pick up other tools and languages later. Additionally, the tidyverse is in tune with open science practices, helping you create analyses that are more accessible, transparent, and reproducible. -***Keep in mind...*** +#### *Keep in mind...* There's no expectation for you to memorize everything. 
Even experienced programmers don't have every function memorized - they're constantly googling things! My goal today is to help you get comfortable with, and hopefully interested in, using the tidyverse for data analysis. @@ -252,16 +250,6 @@ penguins |> The function creates a new data frame with a single row containing the summary statistic. -💻Calculate the minimum and maximum of body mass at the same time: - -```{r} -penguins |> - summarise(min_body_mass = min(body_mass_g, na.rm = TRUE), - max_body_mass = max(body_mass_g, na.rm = TRUE)) -``` - -Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas. - ## 7️⃣Group by one or more variables: group_by() In data analysis, a common task is to split our data into groups, apply a function to each group, and then combine the results. This approach is known as the split-apply-combine paradigm. The [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) function helps us achieve this by allowing us to specify how we want to split our data into groups. @@ -284,12 +272,7 @@ penguins |> tally() ``` -💻Achieve this count using the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step: - -```{r} -penguins |> - count(island) -``` +Alternatively, use the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step. 💻Calculate the mean and standard deviation of body mass for each combination of species and sex: @@ -300,7 +283,7 @@ penguins |> sd_body_mass = sd(body_mass_g, na.rm = TRUE)) ``` -By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. To remove grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. 
Both methods will allow us to continue working with the data as a regular data frame. +By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. To remove grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame. ```{r} # option 1 @@ -318,6 +301,8 @@ df <- penguins |> .groups = "drop") ``` +Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas. + ## 8️⃣Sort rows and extract specific values: arrange(), slice(), pull() Sometimes, we're interested in extracting a particular value from a data frame, like finding the largest or smallest value in a column. @@ -392,7 +377,7 @@ penguins_wide <- penguins_long |> ## 📚Resources | Function | Description | -|-------------------------|-----------------------------------------------| +|-------------------------|--------------------------------------------------| | `dplyr::glimpse()` | Get a glimpse of your data | | `dplyr::select()` | Keep or drop columns using their names and types | | `dplyr::filter()` | Keep rows that match a condition | diff --git a/intro-to-dplyr/intro-to-dplyr_blank.qmd b/intro-to-dplyr/intro-to-dplyr_blank.qmd index 0ba692d..fa8d1b4 100644 --- a/intro-to-dplyr/intro-to-dplyr_blank.qmd +++ b/intro-to-dplyr/intro-to-dplyr_blank.qmd @@ -4,23 +4,21 @@ author: "Reiko Okamoto" date: "`r Sys.Date()`" format: gfm editor: visual +execute: + echo: true --- -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) -``` - ## 👋Welcome to the tidyverse -#### ***What is the tidyverse?*** +#### *What is the tidyverse?* The [tidyverse](https://tidyverse.tidyverse.org/) is a collection of R packages designed for data science. 
Arguably, two of the most popular packages in the tidyverse are [dplyr](https://dplyr.tidyverse.org/) for data manipulation and [ggplot2](https://ggplot2.tidyverse.org/) for data visualization. These are also two of the packages we are covering today! -***Why learn it?*** +#### *Why learn it?* The skills you gain are not just limited to R. The concepts, like filtering data and creating plots, are applicable to other languages like SQL and Python. This makes it easier to pick up other tools and languages later. Additionally, the tidyverse is in tune with open science practices, helping you create analyses that are more accessible, transparent, and reproducible. -***Keep in mind...*** +#### *Keep in mind...* There's no expectation for you to memorize everything. Even experienced programmers don't have every function memorized - they're constantly googling things! My goal today is to help you get comfortable with, and hopefully interested in, using the tidyverse for data analysis. @@ -132,9 +130,9 @@ The vertical bar acts as an OR operator, meaning a row is returned if any of the 3. Filter the data to find all penguins that are either on Biscoe Island or Torgersen Island. -```{r} -# YOUR CODE HERE -``` + ```{r} + # YOUR CODE HERE + ``` ## 4️⃣Pipes @@ -206,9 +204,9 @@ By separating the new columns with a comma, we can create multiple new variables 2. Create a new column called `flipper_size` that categorizes penguins as short, average, or long based on their `flipper_length_mm`. Hint: Define short as less than 190 mm, average as between 190 and 210 mm, and long as greater than 210 mm. -```{r} -# YOUR CODE HERE -``` + ```{r} + # YOUR CODE HERE + ``` ## 6️⃣Compute summary statistics: summarise() @@ -222,14 +220,6 @@ We often need to summarize our data to understand key characteristics (e.g., mea The function creates a new data frame with a single row containing the summary statistic. 
-💻Calculate the minimum and maximum of body mass at the same time: - -```{r} -# YOUR CODE HERE -``` - -Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas. - ## 7️⃣Group by one or more variables: group_by() In data analysis, a common task is to split our data into groups, apply a function to each group, and then combine the results. This approach is known as the split-apply-combine paradigm. The [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) function helps us achieve this by allowing us to specify how we want to split our data into groups. @@ -248,11 +238,7 @@ A grouped data frame has all the properties of a regular data frame but has an a # YOUR CODE HERE ``` -💻Achieve this count using the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step: - -```{r} -# YOUR CODE HERE -``` +Alternatively, use the [`count()`](https://dplyr.tidyverse.org/reference/count.html) function, which combines `group_by()` and `tally()` in one step. 💻Calculate the mean and standard deviation of body mass for each combination of species and sex: @@ -260,12 +246,14 @@ A grouped data frame has all the properties of a regular data frame but has an a # YOUR CODE HERE ``` -By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. To remove grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame. +By default, when we apply a grouping with multiple factors, dplyr will keep the last level of grouping after the summary. Here, the output is still grouped by `species`. 
To remove grouping from a data frame, use the `ungroup()` function or the `.groups = "drop"` argument in the `summarise()` function. Both methods will allow us to continue working with the data as a regular data frame. ```{r} # YOUR CODE HERE ``` +Similar to what we've seen in other functions, we can create multiple summaries in a single step by separating them with commas. + ## 8️⃣Sort rows and extract specific values: arrange(), slice(), pull() Sometimes, we're interested in extracting a particular value from a data frame, like finding the largest or smallest value in a column. @@ -307,13 +295,7 @@ Imagine our observation of interest is the measurement itself, rather than the p 💻Use the [`pivot_longer()`](https://tidyr.tidyverse.org/reference/pivot_longer.html) function to reshape the `penguins` data so that all the measurements related to the penguins are in a single column, and another column indicates what measurement type it is: ```{r} -penguins_long <- penguins |> - mutate(id = row_number()) |> - pivot_longer( - cols = ends_with("_mm") | ends_with("_g"), - names_to = "measurement_type", - values_to = "value" - ) +# YOUR CODE HERE ``` What if we want to go back to the original wide format? @@ -321,17 +303,13 @@ What if we want to go back to the original wide format? 
💻Use the [`pivot_wider()`](https://tidyr.tidyverse.org/reference/pivot_wider.html) function to reverse the process: ```{r} -penguins_wide <- penguins_long |> - pivot_wider( - names_from = measurement_type, - values_from = value -) +# YOUR CODE HERE ``` ## 📚Resources | Function | Description | -|-------------------------|--------------------------------------------------| +|------------------------|-----------------------------------------------| | `dplyr::glimpse()` | Get a glimpse of your data | | `dplyr::select()` | Keep or drop columns using their names and types | | `dplyr::filter()` | Keep rows that match a condition | diff --git a/intro-to-ggplot2/intro-to-ggplot2.md b/intro-to-ggplot2/intro-to-ggplot2.md index 1ac6c6d..fb9b698 100644 --- a/intro-to-ggplot2/intro-to-ggplot2.md +++ b/intro-to-ggplot2/intro-to-ggplot2.md @@ -1,6 +1,6 @@ # Introduction to ggplot2 Reiko Okamoto -2024-09-13 +2024-09-16 ## 🎨Introduction to ggplot2 diff --git a/intro-to-ggplot2/intro-to-ggplot2.qmd b/intro-to-ggplot2/intro-to-ggplot2.qmd index 24229ff..bef3be3 100644 --- a/intro-to-ggplot2/intro-to-ggplot2.qmd +++ b/intro-to-ggplot2/intro-to-ggplot2.qmd @@ -4,12 +4,10 @@ author: "Reiko Okamoto" date: "`r Sys.Date()`" format: gfm editor: visual +execute: + echo: true --- -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) -``` - ## 🎨Introduction to ggplot2 ggplot2 helps us create a wide range of static, informative, and visually appealing graphics. Its name comes from the Grammar of Graphics, which is a framework for building plots in a structured way. We can build a plot incrementally by adding layers like data points, axes, colours, and labels. 
@@ -37,7 +35,7 @@ trains_df 🧠Explore the type and description of each variable: | Variable | Type | Description | -|-----------------------|------------------|-------------------------------| +|---------------------------|-----------|--------------------------------------| | `year` | double | Year of observation | | `month` | double | Month of observation | | `service` | character | Type of service | diff --git a/intro-to-ggplot2/intro-to-ggplot2_blank.qmd b/intro-to-ggplot2/intro-to-ggplot2_blank.qmd index f13439d..1db9145 100644 --- a/intro-to-ggplot2/intro-to-ggplot2_blank.qmd +++ b/intro-to-ggplot2/intro-to-ggplot2_blank.qmd @@ -4,12 +4,10 @@ author: "Reiko Okamoto" date: "`r Sys.Date()`" format: gfm editor: visual +execute: + echo: true --- -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) -``` - ## 🎨Introduction to ggplot2 ggplot2 helps us create a wide range of static, informative, and visually appealing graphics. Its name comes from the Grammar of Graphics, which is a framework for building plots in a structured way. We can build a plot incrementally by adding layers like data points, axes, colours, and labels. 
@@ -35,7 +33,7 @@ We'll use a data set sourced from a [GitHub repository](https://github.com/rford 🧠Explore the type and description of each variable: | Variable | Type | Description | -|-----------------------|------------------|-------------------------------| +|-----------------------|----------------|---------------------------------| | `year` | double | Year of observation | | `month` | double | Month of observation | | `service` | character | Type of service | @@ -193,8 +191,6 @@ Instead of working with separate `year` and `month` columns, we create a single 💻Create a line plot to show how the monthly number of trips from "PARIS MONTPARNASSE" to multiple cities in Brittany (i.e., "RENNES", "BREST", "QUIMPER") fluctuates throughout the year: ```{r} -cities <- c("RENNES", "BREST", "QUIMPER") - # YOUR CODE HERE ``` diff --git a/more-ggplot2/more-ggplot2.md b/more-ggplot2/more-ggplot2.md index a8adcff..e8d532d 100644 --- a/more-ggplot2/more-ggplot2.md +++ b/more-ggplot2/more-ggplot2.md @@ -1,6 +1,6 @@ # More ggplot2 Reiko Okamoto -2024-08-22 +2024-09-16 I can only fit so much into a three-hour workshop! In this document, we’ll use the same data set of French train delays to explore additional diff --git a/more-ggplot2/more-ggplot2.qmd b/more-ggplot2/more-ggplot2.qmd index 9027c14..95ec4e7 100644 --- a/more-ggplot2/more-ggplot2.qmd +++ b/more-ggplot2/more-ggplot2.qmd @@ -4,6 +4,8 @@ author: Reiko Okamoto date: "`r Sys.Date()`" format: gfm editor: visual +execute: + echo: true --- I can only fit so much into a three-hour workshop! In this document, we'll use the same data set of French train delays to explore additional features of ggplot2. 
diff --git a/more-ggplot2/more-ggplot2_blank.qmd b/more-ggplot2/more-ggplot2_blank.qmd new file mode 100644 index 0000000..c67aef5 --- /dev/null +++ b/more-ggplot2/more-ggplot2_blank.qmd @@ -0,0 +1,73 @@ +--- +title: "More ggplot2" +author: Reiko Okamoto +date: "`r Sys.Date()`" +format: gfm +editor: visual +execute: + echo: true +--- + +I can only fit so much into a three-hour workshop! In this document, we'll use the same data set of French train delays to explore additional features of ggplot2. + +💻Load the necessary packages: + +```{r} +# YOUR CODE HERE +``` + +💻Read in the data: + +```{r} +# YOUR CODE HERE +``` + +## 📈Box plots + +Box plots are great for visualizing the distribution of a continuous variable across different categories. + +💻Create a box plot to show the distribution of average arrival delay across different journeys from "BORDEAUX ST JEAN": + +```{r} +# YOUR CODE HERE +``` + +## 📈Violin plots + +Violin plots are similar to box plots. However, violin plots reveal the full data distribution, unlike box plots, which only highlight summary statistics like the median and interquartile range. This is especially helpful when the data has multiple peaks (i.e., multimodal distribution). + +💻Create a violin plot to show the distribution of average arrival delay across different journeys from "BORDEAUX ST JEAN": + +```{r} +# YOUR CODE HERE +``` + +Note that we only had to change one line of code! + +## 📈Facet plots + +Facets allow us to break down a plot into smaller, related subplots. + +💻Create a line plot to show how the monthly number of trips from "PARIS MONTPARNASSE" to "RENNES" fluctuates over time. Use facets to organize the subplots by year, with a separate panel for each year: + +```{r} +# YOUR CODE HERE +``` + +## 📈Modify components of a theme + +Theme elements are non-data components of a plot, which include things like the background colour, text size, font, and grid lines. 
These changes don't alter the underlying data; rather, they adjust the appearance of the plot. + +💻Create a base plot: + +```{r} +# YOUR CODE HERE +``` + +💻Modify the look and feel of your plot using the [`theme()`](https://ggplot2.tidyverse.org/reference/theme.html) function: + +```{r} +# YOUR CODE HERE +``` + +Note: In no way do I think this aesthetic makes the plot more visually appealing or easier to read. However, I hope it demonstrates just how customizable ggplot2 is 🤠 For more information on what you can modify, check out this link: