[pdp] Refine data assessment template nb #59

bdewilde · 2025-01-25T16:55:42Z

changes

updates the pdp data assessment template nb to leverage the new project configs, fixes a few minor bugs, and adds more instructions in the middle and at the end wrt action items

context

trying to standardize the full pipeline, leverage project configs, and make everything work in both training and prediction runs

a subset of changes in PR #55, broken out for reviewability

questions

vishpillai123

Hi, this looks good for the most part, though I was curious why we wanted to remove the f-strings and variable names?

vishpillai123 · 2025-01-28T21:35:40Z

notebooks/pdp/01-data-assessment-eda-TEMPLATE.py

-
-# COMMAND ----------
-
-catalog = "sst_dev"


Why did we want to remove the catalog, read_schema and write_schema variables? Just curious because I am seeing the CATALOG, INST_NAME_bronze, and INST_NAME_silver strings below.

I like having the variables and f-strings because I think it's less manual and requires less typing within each cell/function, but I just wanted to understand the motivation for removing them and having users enter values directly into the different paths below.

bdewilde · 2025-01-29

Looks like you realized why this was done later in the review: project configs are replacing much of the hard-coded "magic" variables+logic. Will loop back round if I've misread your later comments.
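For reference, the variables-plus-f-strings pattern discussed in this thread can be sketched roughly as follows. Only `catalog = "sst_dev"` and the `read_schema` / `write_schema` names come from the diff and comment above. The `inst_name` placeholder, the `table_path` helper, and the `course_dataset` table name are hypothetical, and the dotted `catalog.schema.table` form is an assumption about how the paths are used.

```python
# Sketch of the "define once, reuse via f-strings" pattern (names partly hypothetical).
catalog = "sst_dev"  # from the removed cell in the diff
inst_name = "INST_NAME"  # placeholder; a real notebook would use the institution's name
read_schema = f"{inst_name}_bronze"
write_schema = f"{inst_name}_silver"


def table_path(catalog: str, schema: str, table: str) -> str:
    """Join the three parts into a dotted path (hypothetical helper)."""
    return f"{catalog}.{schema}.{table}"


# "course_dataset" is an invented table name for illustration only.
course_read_path = table_path(catalog, read_schema, "course_dataset")
course_write_path = table_path(catalog, write_schema, "course_dataset")
print(course_read_path)
print(course_write_path)
```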

vishpillai123 · 2025-01-28T21:36:48Z

notebooks/pdp/01-data-assessment-eda-TEMPLATE.py

 # MAGIC %md
 # MAGIC ### filter invalid rows(?)

 # COMMAND ----------

 # this is probably a filter you'll want to apply
 # these courses known to be an issue w/ PDP data
-df_course_valid = df_course.loc[df_course["course_number"].notna(), :]
-df_course_valid
+df_course_filtered = df_course.loc[df_course["course_number"].notna(), :]


I like calling this filtered vs. valid!
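A small illustration of what this filter does, using made-up rows. The data below is invented for the example; only the `course_number` column name and the `.loc[...notna(), :]` expression come from the diff.

```python
import pandas as pd

# Toy stand-in for the course dataset; values are invented.
df_course = pd.DataFrame(
    {
        "course_number": ["101", None, "202"],
        "academic_year": ["2020-21", "2020-21", "2021-22"],
    }
)

# Same expression as in the template: keep rows whose course_number is present.
df_course_filtered = df_course.loc[df_course["course_number"].notna(), :]

# Reporting how many rows were dropped makes the effect of the filter visible.
n_dropped = len(df_course) - len(df_course_filtered)
print(f"dropped {n_dropped} of {len(df_course)} rows with null course_number")
```

Keeping the unfiltered `df_course` around under its own name is what lets later cells switch between the two frames.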

vishpillai123 · 2025-01-28T21:39:54Z

notebooks/pdp/01-data-assessment-eda-TEMPLATE.py

@@ -574,6 +589,7 @@

 ax = sb.histplot(
    df_course.sort_values(by="academic_year"),
+    # df_course_filtered.sort_values(by="academic_year"),


Thanks for adding this! I kept going back and forth between df_course and df_course_valid, so I like having this comment on each plot.

bdewilde · 2025-01-29

no prob! always on the lookout for minor quality-of-life improvements

vishpillai123 · 2025-01-28T21:45:12Z

Oh! I'm seeing that this was removed because we have config files now, which is good! So then the CATALOG, INST_NAME_bronze, and INST_NAME_silver strings are for the exception cases, which I'm assuming is when the config for a specific school hasn't been defined. Am I understanding correctly?

bdewilde · 2025-01-29T01:26:03Z

> Oh! I'm seeing that this was removed because we have config files now, which is good! So then the CATALOG, INST_NAME_bronze, and INST_NAME_silver strings are for the exception cases, which I'm assuming is when the config for a specific school hasn't been defined. Am I understanding correctly?

Yeah, there's a workflow for starting with hard-coded variable names, then migrating values to the project config when you're ready, for safekeeping. I repeat this pattern in the following notebooks as well.
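The workflow described here (start with hard-coded values, move them into the project config later) could look something like the sketch below. The `ProjectConfig` and `DatasetPaths` classes, their fields, and `resolve_course_path` are hypothetical stand-ins, not the repo's actual config schema or API, and the example paths are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetPaths:
    """Hypothetical config section holding raw dataset locations."""

    raw_course_path: str


@dataclass
class ProjectConfig:
    """Hypothetical project config; the real schema may differ."""

    institution_name: str
    datasets: Optional[DatasetPaths] = None


def resolve_course_path(cfg: Optional[ProjectConfig], fallback: str) -> str:
    """Prefer the value stored in the config; otherwise use the hard-coded one."""
    try:
        return cfg.datasets.raw_course_path
    except AttributeError:
        # cfg is None, or its datasets section has not been filled in yet
        return fallback


hard_coded = "CATALOG.INST_NAME_bronze.course_dataset"  # placeholder string

# Early in a project: no config yet, so the hard-coded value is used.
early_path = resolve_course_path(None, hard_coded)

# Later: the value has been migrated into the config for safekeeping.
cfg = ProjectConfig("INST_NAME", DatasetPaths("sst_dev.inst_bronze.course_dataset"))
later_path = resolve_course_path(cfg, hard_coded)
```

Catching only `AttributeError` keeps the fallback narrow: a missing config or a missing section falls back, while other problems still surface as errors.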

bdewilde and others added 3 commits January 25, 2025 16:55

refine pdp data assessment template nb

a6192ea

fix: Variable typos and unneeded import

a3f1eab

Merge branch 'develop' into pdp-update-data-assessment-nb-template

b563d21

bdewilde marked this pull request as ready for review January 25, 2025 17:03

bdewilde requested review from kaylawilding and vishpillai123 as code owners January 25, 2025 17:03

bdewilde requested a review from nm3224 January 25, 2025 17:03

fix: Catch more errors on cfg field

471778f

This was referenced Jan 25, 2025

[pdp] Refine prep modeling dataset template nb #60

Open

[pdp] Refine train+eval model template nb #61

Open

vishpillai123 requested changes Jan 28, 2025

Merge branch 'develop' into pdp-update-data-assessment-nb-template

f56f109

bdewilde requested a review from vishpillai123 January 29, 2025 01:26

docs: Clarify flows in 01 template nb

9770382
