homework07.Rmd

---
title: 'Stat 579 - Homework #6'
author: "Spencer Wadsworth"
date: "10/10/2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Star Wars

FiveThirtyEight is a website founded by Statistician and writer Nate Silver to publish results from  opinion poll analysis, politics, economics, and sports blogging. 
One of the featured articles discusses [popularity of movies in the Star Wars Franchise](https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/)

This article is based on a survey collected by FiveThirtyEight and publicly available on github. Use the code below to read in the data from the survey:

```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(readr)
starwars <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv")

# the following lines are necessary to fix the multibyte problem and make proper names
# part of the names:
line1 <- names(starwars)
line2 <- unlist(starwars[1,])
varnames <- paste(line1, line2)
# clean up some of the multibyte characters:
names(starwars) <- enc2native(stringi::stri_trans_general(varnames, "latin-ascii"))

starwars <- starwars[-1,]
head(starwars)
```


1. Download the RMarkdown file with these homework instructions to use as a template for your work.
Make sure to replace "Your Name" in the YAML with your name.

3. How many people responded to the survey? How many people have seen at least one of the movies? Use the variable `Have you seen any of the 6 films in the Star Wars franchise? Response` to answer this question. Only consider responses of participants who have seen at least one of the Star Wars films for the remainder of the homework.

```{r}
dim(starwars)[1]
starwars <- starwars %>% 
  filter(`Have you seen any of the 6 films in the Star Wars franchise? Response` == 'Yes')
dim(starwars)[1]
```

### 1,186 people responded to the survey. Of those who responded, 936 have seen at least one Star Wars movie.
4. Variables `Gender Response` and `Age Response` are two of the demographic variables collected. Use `dplyr` to provide a frequency break down for each variable. Does the result surprise you? Comment. Reorder the levels in the variable `Age Response` from youngest to oldest.
```{r}
starwars <- starwars %>% 
  mutate(`Age Response` = factor(`Age Response`, levels(factor(`Age Response`))[c(2:4,1)]))

total = nrow(starwars)
starwars %>% 
  filter(!is.na(`Gender Response`) & !is.na(`Age Response`)) %>% 
  group_by(`Age Response`) %>% 
  mutate(total = n()) %>% 
  group_by(`Age Response`, `Gender Response`) %>% 
  summarize(
   n = n(), perc = n/mean(total)*100 
  )
```

###The last column of the table above shows the percentage of male and female respondents for each age group. It doesn't surprise me that for most age groups, males is slightly higher than females in having responded "Yes" to seeing the movies. The oldest group, however, shows that more women responded "Yes" than men. 

5. Variables 10 through 15 answer the question: "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film." for each of the films. 
Bring the data set into a long form. Introduce a variable for the star wars episode and the corresponding ranking. 
Find the average rank for each of the films. Are average ranks different between mens' and womens' rankings?
On how many responses are the averages based? Show these numbers together with the averages.
```{r}
colnames(starwars)[10:15] <- c('I','II','III','IV','V','VI')

starwars %>% 
  gather(key = episode, value = rank, 10:15) %>% 
  filter(`Gender Response` %in% c('Female', 'Male')) %>% 
  filter(!is.na(rank)) %>% 
  mutate(rank = as.numeric(rank)) %>% 
  mutate(responses = n()) %>% 
  group_by(episode, `Gender Response`) %>%
  summarize(
    mean = mean(rank), response = n()
  )
  
```
6. R2 D2 or C-3P0? Which of these two characters is the more popular one? Use responses to variables 25 and 26 to answer this question. Note: first you need to define what you mean by  "popularity" based on the available data.  
```{r}

```

7. Popularity contest: which of the surveyed characters is the most popular? use the popularity measure you defined in the previous question to evaluate responses for characters 16 through 29. Use an appropriate long form of the data to get to your answer. Visualize the result. 


Due date: please refer to the website and Canvas for the due date. 

For the submission: submit your solution in an R Markdown file and (just for insurance) submit the corresponding html/word file with it.