# Databricks notebook source
# MAGIC %md
# MAGIC
# MAGIC # Getting Started
# COMMAND ----------
# MAGIC %md ## Configuration
# COMMAND ----------
# MAGIC %run ./includes/utilities
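# COMMAND ----------
# MAGIC %md
# MAGIC Running the utilities notebook defines the `process_file` helper and the
# MAGIC project path variables used below (`projectPath`, `silverDailyPath`,
# MAGIC `dimUserPath`). As an illustrative sanity check, you can print them to
# MAGIC confirm the configuration loaded:
# COMMAND ----------
# Inspect the path variables defined by ./includes/utilities
print(projectPath)
print(silverDailyPath)
print(dimUserPath)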
# COMMAND ----------
# MAGIC %md ## Clear workspace
# COMMAND ----------
# Remove any files left over from previous runs of this project
dbutils.fs.rm(projectPath, recurse=True)
# COMMAND ----------
# MAGIC %md ## Retrieve and Load the Data
# MAGIC
# MAGIC We will be working with two files:
# MAGIC
# MAGIC - "health_profile_data.snappy.parquet"
# MAGIC - "user_profile_data.snappy.parquet"
# MAGIC
# MAGIC These files can be retrieved and loaded using the utility function `process_file`.
# MAGIC
# MAGIC This function takes three arguments:
# MAGIC
# MAGIC - `file_name: str`
# MAGIC   - the name of the file to retrieve
# MAGIC - `path: str`
# MAGIC   - the location at which to write the file as a Delta table
# MAGIC - `table_name: str`
# MAGIC   - the name under which to register the table in the Metastore
# MAGIC
# MAGIC This function does three things (a minimal sketch follows below):
# MAGIC
# MAGIC 1. Retrieves the file and loads it into your Databricks Workspace.
# MAGIC 1. Creates a Delta table from the file.
# MAGIC 1. Registers the Delta table in the Metastore so that it can be
# MAGIC    referenced using SQL or a PySpark `table` reference.
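# COMMAND ----------
# MAGIC %md
# MAGIC For reference, here is a minimal sketch of what a utility like `process_file`
# MAGIC might look like. This is an illustrative assumption, not the actual
# MAGIC implementation (which lives in `./includes/utilities`); in particular,
# MAGIC `sourceUri` is a hypothetical variable standing in for wherever the raw
# MAGIC files are hosted.
# MAGIC
# MAGIC ```python
# MAGIC def process_file(file_name: str, path: str, table_name: str) -> None:
# MAGIC     # 1. Retrieve the file (sourceUri is a hypothetical source location)
# MAGIC     df = spark.read.parquet(f"{sourceUri}/{file_name}")
# MAGIC     # 2. Create a Delta table from the file at the given path
# MAGIC     df.write.format("delta").mode("overwrite").save(path)
# MAGIC     # 3. Register the Delta table in the Metastore
# MAGIC     spark.sql(f"CREATE TABLE IF NOT EXISTS {table_name} USING DELTA LOCATION '{path}'")
# MAGIC ```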
# COMMAND ----------
# MAGIC %md ### Exercise: Retrieve and Load the Data
# MAGIC
# MAGIC Retrieve the data using the following arguments:
# MAGIC
# MAGIC | `file_name` | `path` | `table_name` |
# MAGIC |:-:|:-:|:-|
# MAGIC | `health_profile_data.snappy.parquet` | `silverDailyPath` | `health_profile_data` |
# MAGIC | `user_profile_data.snappy.parquet` | `dimUserPath` | `user_profile_data` |
# COMMAND ----------
# TODO
# # Use the utility function `process_file` to retrieve the data
# # Use the arguments in the table above.
#
# process_file(
# FILL_IN_FILE_NAME,
# FILL_IN_PATH,
# FILL_IN_TABLE_NAME
# )
#
# process_file(
# FILL_IN_FILE_NAME,
# FILL_IN_PATH,
# FILL_IN_TABLE_NAME
# )
# COMMAND ----------
# ANSWER
process_file(
"health_profile_data.snappy.parquet",
silverDailyPath,
"health_profile_data"
)
process_file(
"user_profile_data.snappy.parquet",
dimUserPath,
"user_profile_data"
)
# COMMAND ----------
# MAGIC %md ## Data Availability
# MAGIC
# MAGIC In a typical workflow, a data engineer will already have made the data
# MAGIC available to you as tables that can be queried using SQL or PySpark.
# MAGIC Here, the `process_file` utility performed the steps necessary to make
# MAGIC the files available in your workspace, so that you can focus on the
# MAGIC data science.
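# COMMAND ----------
# MAGIC %md
# MAGIC As a quick, illustrative check that the tables registered above are
# MAGIC queryable, reference them with PySpark's `table` and count the rows
# MAGIC (the equivalent SQL would be `SELECT COUNT(*) FROM health_profile_data`):
# COMMAND ----------
# Query the registered tables via the Metastore
health_profile_df = spark.table("health_profile_data")
user_profile_df = spark.table("user_profile_data")
print(health_profile_df.count(), user_profile_df.count())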