-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pdp] Refine data assessment template nb #59
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi this looks good for the most part, though i was curious why we wanted to remove the f-strings and variable names?
|
||
# COMMAND ---------- | ||
|
||
catalog = "sst_dev" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we want to remove the catalog
, read_schema
and write_schema
variables? Just curious because I am seeing the CATALOG
, INST_NAME_bronze
, and INST_NAME_silver
strings below.
I like having the variables and f-strings because I think it's less manual and not as much typing within each cell/function, but just wanted to understand the motivation of removing and having the users input directly into the different paths below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you realized why this was done later in the review -- project configs are replacing much of the hard-coded "magic" variables+logic. Will loop back round if I've misread your later comments.
# MAGIC %md | ||
# MAGIC ### filter invalid rows(?) | ||
|
||
# COMMAND ---------- | ||
|
||
# this is probably a filter you'll want to apply | ||
# these courses known to be an issue w/ PDP data | ||
df_course_valid = df_course.loc[df_course["course_number"].notna(), :] | ||
df_course_valid | ||
df_course_filtered = df_course.loc[df_course["course_number"].notna(), :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like calling this filtered vs. valid!
@@ -574,6 +589,7 @@ | |||
|
|||
ax = sb.histplot( | |||
df_course.sort_values(by="academic_year"), | |||
# df_course_filtered.sort_values(by="academic_year"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this! I kept going back and forth with looking at df_course
and df_course_valid
, so I like adding this comment for each plot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no prob! always on the lookout for minor quality-of-life improvements
Oh! I'm seeing that this was removed because we have config files now, which is good! So then the |
Yeah, there's a workflow for starting with hard-coded variable names, then migrating values to the project config when you're ready, for safekeeping. I repeat this pattern in the following notebooks as well. |
changes
updates pdp data assessment template nb to leverage new project configs, fixes a few minor bugs, and adds more instruction in the middle and end wrt action items
context
trying to standardize the full pipeline, leverage project configs, and make everything work in both training and prediction runs
a subset of changes in PR #55, broken out for reviewability
questions