---
title: "Modeling"
---

## Catch up

```{r}
library(sparklyr)
library(dplyr)
sc <- spark_connect(method = "databricks_connect")
```

## 01 - Get sample of data
*Download a sampled data set locally to R*

1. Create a pointer to the `lendingclub` data. It is in the `samples` schema of the `workshops` catalog

```{r}
lendingclub_dat <- tbl(sc, I("workshops.samples.lendingclub"))
```

2. Using `slice_sample()`, download 20K records and name the result `lendingclub_sample`

```{r}

```
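A minimal sketch of the sampling pattern, run here on a small local data frame so that it works without a cluster. The `toy_loans` table is made up for illustration. With a Spark table such as `lendingclub_dat`, the same pipeline would end in `collect()` to bring the sampled rows into R; on a local data frame `collect()` simply returns its input.

```r
library(dplyr)

# Hypothetical stand-in for a remote table; not the workshop data
toy_loans <- tibble(
  loan_id = 1:100,
  int_rate = paste0(round(seq(5, 25, length.out = 100), 2), "%")
)

# Draw a fixed-size random sample, then materialize it as a local data frame
toy_sample <- toy_loans |>
  slice_sample(n = 20) |>
  collect()

nrow(toy_sample)
```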

3. Preview the data using the `View()` command

```{r}

```

4. Keep only the `int_rate`, `term`, `bc_util`, `bc_open_to_buy`, and `all_util`
fields. Remove the percent sign from `int_rate` and coerce it to numeric.
Save the resulting table to a new variable called `lendingclub_prep`

```{r}

```
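One way the clean-up could look, shown on a few made-up rows rather than the real sample. The `toy_sample` values and the extra `grade` column are assumptions for illustration only; the column names follow the step above.

```r
library(dplyr)

# Hypothetical rows mimicking the columns named in the step above
toy_sample <- tibble(
  int_rate = c("13.56%", " 7.21%", "19.03%"),
  term = c("36 months", "60 months", "36 months"),
  bc_util = c(45.2, NA, 88.1),
  bc_open_to_buy = c(1200, 560, NA),
  all_util = c(60, 72, 35),
  grade = c("B", "A", "D") # extra column that select() drops
)

toy_prep <- toy_sample |>
  select(int_rate, term, bc_util, bc_open_to_buy, all_util) |>
  # Strip the literal "%" and convert the remaining text to a number
  mutate(int_rate = as.numeric(sub("%", "", int_rate, fixed = TRUE)))

toy_prep
```

Using `fixed = TRUE` treats `"%"` as a plain character instead of a regular expression, which keeps the intent obvious.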

5. Preview the data using `glimpse()`

```{r}

```

6. Disconnect from Spark

```{r}

```


## 02 - Create model using `tidymodels`

1. Run the following code to create a `workflow` that contains the pre-processing
steps and a linear regression model

```{r}
library(tidymodels)

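# Pre-processing recipe: `int_rate` is the outcome, all other columns are predictors
lendingclub_rec <- recipe(int_rate ~ ., data = lendingclub_prep) |>
  # Keep the first two characters of `term` and trim surrounding whitespace
  step_mutate(term = trimws(substr(term, 1, 2))) |>
  # Coerce every column to numeric
  step_mutate(across(everything(), as.numeric)) |>
  # Center and scale the numeric predictors
  step_normalize(all_numeric_predictors()) |>
  # Replace missing values in these two columns with the column mean
  step_impute_mean(all_of(c("bc_open_to_buy", "bc_util"))) |>
  # Drop rows that still contain a missing value
  step_filter(!if_any(everything(), is.na))

# Linear regression specification (default engine)
lendingclub_lr <- linear_reg()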

lendingclub_wf <- workflow() |> 
  add_model(lendingclub_lr) |> 
  add_recipe(lendingclub_rec)

lendingclub_wf
```

2. Fit the workflow, which is now stored in `lendingclub_wf`, with the
`lendingclub_prep` data

```{r}

```

3. Measure the performance of the model using `metrics()`. Make sure to use
`augment()` to add the predictions first

```{r}

```
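The fit, augment, and measure sequence from the two steps above can be sketched on the built-in `mtcars` data. The names `toy_wf`, `toy_fit`, and `toy_metrics` are placeholders, and the formula is an arbitrary choice; the workshop workflow uses a recipe instead of a formula.

```r
library(tidymodels)

# Toy workflow on the built-in mtcars data; a stand-in for lendingclub_wf
toy_wf <- workflow() |>
  add_model(linear_reg()) |>
  add_formula(mpg ~ wt + hp)

# fit() returns a new, trained workflow; the original is left untouched
toy_fit <- fit(toy_wf, data = mtcars)

# augment() appends a .pred column, which metrics() compares to the truth
toy_metrics <- toy_fit |>
  augment(new_data = mtcars) |>
  metrics(truth = mpg, estimate = .pred)

toy_metrics
```

For a numeric outcome, `metrics()` reports RMSE, R squared, and MAE by default.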

4. Plot a histogram of the predictions

```{r}

```
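A small plotting sketch, assuming the predictions live in a `.pred` column as `augment()` produces. The `toy_preds` values are randomly generated so the example runs on its own; they are not model output.

```r
library(ggplot2)

# Made-up predictions, only to make the plot reproducible on its own
set.seed(123)
toy_preds <- data.frame(.pred = rnorm(500, mean = 13, sd = 2))

pred_hist <- ggplot(toy_preds, aes(x = .pred)) +
  geom_histogram(bins = 30) +
  labs(x = "Predicted interest rate", y = "Count")

pred_hist
```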

## 03 - Using Vetiver
*Convert the workflow into a `vetiver` model*

1. Load the `vetiver` package

```{r}

```

2. Convert the workflow to a `vetiver` model using `vetiver_model()`. Name the variable `lendingclub_vetiver`

```{r}

```

3. Save `lendingclub_vetiver` as "lendingclub.rds"

```{r}

```
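A sketch of the convert-and-save pattern on a toy model. The `mtcars` workflow, the model name `"toy_cars"`, and the temporary file path are all assumptions chosen so the example is self-contained; the workshop steps use the fitted lendingclub workflow and "lendingclub.rds".

```r
library(tidymodels)
library(vetiver)

# Toy fitted workflow on mtcars; a stand-in for the fitted lendingclub workflow
toy_fit <- workflow() |>
  add_model(linear_reg()) |>
  add_formula(mpg ~ wt + hp) |>
  fit(data = mtcars)

# Wrap the trained workflow; "toy_cars" is an arbitrary model name
toy_vetiver <- vetiver_model(toy_fit, "toy_cars")

# Serialize to disk; a temporary file keeps the sketch from touching the project
rds_path <- file.path(tempdir(), "toy_cars.rds")
saveRDS(toy_vetiver, rds_path)

file.exists(rds_path)
```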

## 04 - Create prediction function
*Create a self-contained prediction function that reads the model and then runs the predictions*

1. Create a simple function named `predict_vetiver` that takes a single
argument, `x`, which is assumed to be a data frame. Inside the function, read
the "lendingclub.rds" file and use the model to predict against `x`

```{r}

```

2. Test the `predict_vetiver` function against `lendingclub_prep`

```{r}

```

3. Modify the function so that it adds the predictions back to `x`. Name the
new variable `ret_pred`. The function should return the modified `x`

```{r}

```

4. Test the `predict_vetiver` function against `lendingclub_prep`

```{r}
predict_vetiver(lendingclub_prep)
```

5. At the beginning of the function, load the `workflows` and `vetiver` libraries

```{r}

```

6. Test the `predict_vetiver` function against `lendingclub_prep`

```{r}

```
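The steps in this section build toward a function of roughly the following shape. This is a sketch on a toy model, not the workshop solution: `predict_toy`, its `path` argument, and the `mtcars` model are hypothetical, and it assumes the installed `vetiver` version provides a `predict()` method for `vetiver_model` objects.

```r
library(tidymodels)
library(vetiver)

# Set-up: save a toy vetiver model so the function has a file to read
toy_path <- file.path(tempdir(), "toy_cars.rds")
workflow() |>
  add_model(linear_reg()) |>
  add_formula(mpg ~ wt + hp) |>
  fit(data = mtcars) |>
  vetiver_model("toy_cars") |>
  saveRDS(toy_path)

# The `path` argument is an addition for this sketch
predict_toy <- function(x, path = toy_path) {
  # Loading packages inside the function keeps it usable in a fresh R session
  library(workflows)
  library(vetiver)
  model <- readRDS(path)
  preds <- predict(model, x)
  # For a regression workflow, predictions come back in a `.pred` column
  x$ret_pred <- preds$.pred
  x
}

toy_out <- predict_toy(mtcars)
head(toy_out)
```

Returning the input with one extra column, instead of the bare predictions, keeps each prediction next to the row that produced it.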

## 05 - Predict in Spark
*Run the predictions in Spark against the entire data set*


1. Add a conditional statement to `predict_vetiver` that reads the RDS file if
it is available locally and, if not, reads it from:
"/Volumes/workshops/models/vetiver/lendingclub.rds"

```{r}

```
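The local-or-fallback logic can be sketched in isolation. `resolve_model_path()` is a hypothetical helper introduced only for this illustration; inside `predict_vetiver`, the chosen path would be handed to `readRDS()`.

```r
# Hypothetical helper: prefer a local copy, else fall back to a shared path
resolve_model_path <- function(local_path, fallback_path) {
  if (file.exists(local_path)) {
    local_path
  } else {
    fallback_path
  }
}

volume_path <- "/Volumes/workshops/models/vetiver/lendingclub.rds"

# A temporary file stands in for a model that is available locally
local_copy <- tempfile(fileext = ".rds")
saveRDS(list(), local_copy)

# A path that is not expected to exist
missing_copy <- file.path(tempdir(), "not-here.rds")

resolve_model_path(local_copy, volume_path)   # the local file exists, so it wins
resolve_model_path(missing_copy, volume_path) # falls back to the volume path
```

The fallback matters because code run through `spark_apply()` executes on the cluster, where a file saved in the local R session is generally not present.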

2. Re-connect to your Spark cluster

```{r}

```

3. Create a pointer to the `lendingclub` data. It is in the `samples` schema of the `workshops` catalog

```{r}
lendingclub_dat <- tbl(sc, I("workshops.samples.lendingclub"))
```

4. Select the `int_rate`, `term`, `bc_util`, `bc_open_to_buy`, and `all_util`
fields from `lendingclub_dat`. Then pass only the top rows (using `head()`)
to `spark_apply()`, using the updated `predict_vetiver` to run the model

```{r}

```

5. Add the `columns` specification to the `spark_apply()` call

```{r}

```

6. Append `compute()` to the end of the code, remove `head()`, and save the
results into a variable called `lendingclub_predictions`

```{r}

```

7. Preview the `lendingclub_predictions` table

```{r}

```