# Exercise 3 solutions
```{r setup, include = FALSE}
pacman::p_load(tidyverse, haven)
```
## Question 1
You have been provided with another .sav file which contains the interview responses from the EHS. Create and save a tidy version of this dataset, ensuring variables are classified as the correct type and names follow the style conventions (if you cannot remember these, check [here](https://style.tidyverse.org/syntax.html) for a reminder).
The variables we need in the tidy dataset are:
- The unique identifier `serialanon`
- The gross household income `HYEARGRx`
- The length of residence `lenresb`
- The weekly rent `rentwkx` and mortgage `mortwkx` payments
- Whether the property is freehold or leasehold `freeLeas`
### Solution {.unnumbered}
The first step is to load in the data. However, before this, we need the name of the file. We can look into our documents to get this, but the `list.files` function will do it for us:
```{r list.files to get file name}
list.files(path = "data")
```
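If the folder holds many files, `list.files` can also filter the listing with a regular expression through its `pattern` argument. The sketch below is a minimal illustration that builds a temporary folder with made-up file names (`interview_demo.sav` and `notes.txt` are hypothetical) rather than using the real `data` directory:

```r
# Temporary folder standing in for the real "data" directory
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Hypothetical file names, created empty purely for illustration
file.create(file.path(demo_dir, c("interview_demo.sav", "notes.txt")))

# pattern is a regular expression: keep only names ending in ".sav"
sav_files <- list.files(path = demo_dir, pattern = "\\.sav$")
sav_files
```

Adding `full.names = TRUE` would return the path including the folder, which can be passed straight to a reading function.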
From the console, we can copy and paste the file name into the `read_spss` function:
```{r load SPSS file, eval = FALSE}
ehs_interview_tidy <- read_spss("data/interviewfs21_EUL.sav")
```
Next, we can `select` just the variables we need to reduce the data size:
```{r select ehs_interview variables}
ehs_interview_tidy <- read_spss("data/interviewfs21_EUL.sav") %>%
  select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas)
```
We can now explore this data to see which variables need converting, and which are truly numeric:
```{r str interview data}
str(ehs_interview_tidy)
```
As with the general data, all variables are classified as `dbl + lbl` by R. Of these, the length of residence and freehold/leasehold variables appear to be categorical. There are also labels attached to the gross annual income (for those over £100,000) which we need to be aware of when analysing this data.
Therefore, our next step will involve converting the categorical variables into factors:
```{r ehs interview factors}
ehs_interview_tidy <- read_spss("data/interviewfs21_EUL.sav") %>%
  select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas) %>%
  mutate(length_residence = as_factor(lenresb),
         freehold_leasehold = as_factor(freeLeas))

head(ehs_interview_tidy)
```
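To see what `as_factor` does to a labelled column, it can help to try it on a small vector first. The sketch below builds a toy labelled vector with `haven::labelled`; the codes and label text are invented for illustration and are not the real EHS coding:

```r
library(haven)

# Hypothetical codes and labels, not the real EHS coding
tenure <- labelled(
  c(1, 2, 1, 2),
  labels = c("freehold" = 1, "leasehold" = 2)
)

# as_factor() replaces each numeric code with its label text
tenure_factor <- as_factor(tenure)

levels(tenure_factor)
as.character(tenure_factor)
```

The numeric codes are dropped in the result, which is why the conversion suits categorical variables but not truly numeric ones such as rent.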
Finally, we need to rename the existing variables to ensure they are informative and follow the style rules, and remove any unnecessary variables:
```{r finish tidying EHS interview}
ehs_interview_tidy <- read_spss("data/interviewfs21_EUL.sav") %>%
  select(serialanon, HYEARGRx, lenresb, rentwkx, mortwkx, freeLeas) %>%
  mutate(length_residence = as_factor(lenresb),
         freehold_leasehold = as_factor(freeLeas)) %>%
  rename(id = serialanon,
         gross_income = HYEARGRx,
         weekly_rent = rentwkx,
         weekly_mortgage = mortwkx) %>%
  select(id, gross_income, length_residence, weekly_rent, weekly_mortgage,
         freehold_leasehold)
```
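As an aside, `select` accepts `new_name = old_name` pairs, so selecting and renaming can be combined into one step. This is an optional alternative to the separate `rename` call, sketched here on a toy tibble with made-up values and a hypothetical `unused` column:

```r
library(dplyr)

# Toy tibble borrowing two of the original column names (values are made up)
toy <- tibble(
  serialanon = c(1, 2),
  HYEARGRx   = c(25000, 40000),
  unused     = c("a", "b")
)

# new_name = old_name inside select() keeps the column and renames it;
# columns that are not mentioned (here, unused) are dropped
toy_tidy <- toy %>%
  select(id = serialanon, gross_income = HYEARGRx)

names(toy_tidy)
```

Whether this reads better than `rename` followed by `select` is a matter of taste; the result is the same.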
## Question 2
Save the tidy interview dataset as a csv file with an appropriate file name.
### Solution {.unnumbered}
```{r save tidy interview data as a csv file}
write_csv(ehs_interview_tidy, file = "saved_data/ehs_interview_tidy.csv")
```
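Two practical points are worth keeping in mind when saving as CSV: the target folder must already exist, and a CSV file stores plain text, so factor columns come back as character unless the column types are declared when reading. The sketch below illustrates both on toy data in a temporary folder (`saved_data_demo` and `toy_tidy.csv` are hypothetical names):

```r
library(readr)

# Toy data standing in for the tidy interview dataset
toy <- data.frame(
  id = 1:3,
  freehold_leasehold = factor(c("freehold", "leasehold", "freehold"))
)

# Temporary folder standing in for "saved_data"; create it if missing
out_dir <- file.path(tempdir(), "saved_data_demo")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_file <- file.path(out_dir, "toy_tidy.csv")
write_csv(toy, file = out_file)

# Declare the factor column explicitly when reading the file back
toy_back <- read_csv(
  out_file,
  col_types = cols(id = col_integer(), freehold_leasehold = col_factor())
)
str(toy_back)
```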
## Question 3
Using the new, tidy dataset, answer the following questions:
- How many respondents paid weekly rent of between £150 and £300?
- How many respondents did not give a response to either the weekly rent or weekly mortgage question?
- What is the highest household gross income of these respondents?
### Solution {.unnumbered}
For the first two parts, we use the `filter` function to return the subgroup matching each condition, and combine it with the `count` function, which counts the number of rows in a tibble:
```{r count rows}
ehs_interview_tidy %>%
  filter(between(weekly_rent, 150, 300)) %>%
  count()

ehs_interview_tidy %>%
  filter(is.na(weekly_rent), is.na(weekly_mortgage)) %>%
  count()
```
There were 993 respondents that paid weekly rent of between £150 and £300.
There were 2956 respondents that did not give a response to either the weekly rent or mortgage question.
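The same logic can be checked on a small example where the answers are easy to verify by eye. The sketch below uses made-up payments and computes both counts in a single `summarise` call, an alternative to the two `filter` and `count` pipelines. It relies on `between` including both endpoints and on conditions separated by commas in `filter` being combined with "and":

```r
library(dplyr)

# Made-up payments; NA means no response was given
toy <- tibble(
  weekly_rent     = c(100, 150, 300, 310, NA, NA),
  weekly_mortgage = c(NA, NA, 50, NA, NA, 20)
)

toy_counts <- toy %>%
  summarise(
    # between() includes both endpoints; na.rm skips the missing rents
    rent_150_300 = sum(between(weekly_rent, 150, 300), na.rm = TRUE),
    # & requires both payments to be missing, like the comma in filter()
    no_payment_response = sum(is.na(weekly_rent) & is.na(weekly_mortgage))
  )

toy_counts
```

Here two rents (150 and 300) fall in the range, and only one row has both payments missing.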
The final part could usually be carried out with the base R `max` function:
```{r max income}
max(ehs_interview_tidy$gross_income)
```
However, the labels attached to the SPSS file showed that a value of 100000 actually represents a group of respondents earning at least £100,000. Therefore, we cannot answer this question from the available data. To analyse this variable, we would need to group the remaining values into income bands as well, losing a lot of information. Treating the top-coded value as an exact income would produce invalid results.
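One way to spot this kind of top-coding is to look at the value labels stored on the variable and count how many observations sit at the labelled value. The sketch below uses a toy labelled vector; the incomes and the label text are invented and are not taken from the EHS file:

```r
library(haven)

# Hypothetical top-coded incomes; the label text is invented
income <- labelled(
  c(25000, 60000, 100000, 100000),
  labels = c("100000 or more" = 100000)
)

# Value labels are stored in the "labels" attribute as a named vector
cap <- unname(attr(income, "labels"))

# zap_labels() strips the labels, leaving an ordinary numeric vector
income_plain <- zap_labels(income)

# Number of observations at the cap, whose true income is unknown
n_at_cap <- sum(income_plain >= cap)
n_at_cap

# The maximum only recovers the cap, not the highest actual income
max(income_plain)
```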