Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pdp] Refine data assessment template nb #59

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

bdewilde
Copy link
Member

@bdewilde bdewilde commented Jan 25, 2025

changes

updates pdp data assessment template nb to leverage new project configs, fixes a few minor bugs, and adds more instruction in the middle and end wrt action items

context

trying to standardize the full pipeline, leverage project configs, and make everything work in both training and prediction runs

a subset of changes in PR #55, broken out for reviewability

questions

@bdewilde bdewilde marked this pull request as ready for review January 25, 2025 17:03
@bdewilde bdewilde requested a review from nm3224 January 25, 2025 17:03
Copy link

@vishpillai123 vishpillai123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi this looks good for the most part, though i was curious why we wanted to remove the f-strings and variable names?


# COMMAND ----------

catalog = "sst_dev"
Copy link

@vishpillai123 vishpillai123 Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we want to remove the catalog, read_schema and write_schema variables? Just curious because I am seeing the CATALOG, INST_NAME_bronze, and INST_NAME_silver strings below.

I like having the variables and f-strings because I think it's less manual and not as much typing within each cell/function, but just wanted to understand the motivation of removing and having the users input directly into the different paths below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you realized why this was done later in the review -- project configs are replacing much of the hard-coded "magic" variables+logic. Will loop back round if I've misread your later comments.

# MAGIC %md
# MAGIC ### filter invalid rows(?)

# COMMAND ----------

# this is probably a filter you'll want to apply
# these courses known to be an issue w/ PDP data
df_course_valid = df_course.loc[df_course["course_number"].notna(), :]
df_course_valid
df_course_filtered = df_course.loc[df_course["course_number"].notna(), :]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like calling this filtered vs. valid!

@@ -574,6 +589,7 @@

ax = sb.histplot(
df_course.sort_values(by="academic_year"),
# df_course_filtered.sort_values(by="academic_year"),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! I kept going back and forth with looking at df_course and df_course_valid, so I like adding this comment for each plot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no prob! always on the lookout for minor quality-of-life improvements

@vishpillai123
Copy link

Oh! I'm seeing that this was removed because we have config files now, which is good! So then the CATALOG, INST_NAME_bronze, and INST_NAME_silver are for the exception cases, which is I'm assuming when the config for a specific school hasn't been defined. Am I understanding correctly?

@bdewilde
Copy link
Member Author

Oh! I'm seeing that this was removed because we have config files now, which is good! So then the CATALOG, INST_NAME_bronze, and INST_NAME_silver are for the exception cases, which is I'm assuming when the config for a specific school hasn't been defined. Am I understanding correctly?

Yeah, there's a workflow for starting with hard-coded variable names, then migrating values to the project config when you're ready, for safekeeping. I repeat this pattern in the following notebooks as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants