
Improve pdp project config structure + functionality #56

Merged 3 commits into develop from iterate-pdp-project-config on Jan 23, 2025

Conversation

@bdewilde (Member) commented Jan 23, 2025

changes

  • adds a "v2" pdp project configuration schema, which differs from the previous iteration in structure and scope, plus additional validation and documentation
  • adds an example template file for the v2 config
  • adds a unit test to check that the schema is working correctly

Out of scope: actually using this config in the notebooks. I tried that in another PR, but it got much too big.

context

This file is meant to record the full set of configuration options for a given school's pipeline in a way that's consolidated, human-readable, and not just hard-coded magic variables spread across multiple notebooks. It's especially useful when parameters must be shared and consistent across notebooks; consolidating them here makes that task much easier and less error-prone.

The "v1" config was a WIP, but then a couple of schools over in the private repo started using it, and I didn't want to break their workflows. So, here's a "v2". In the next minor (not patch) release of this package, I'll finish the transition and make v2 the only option. A bit awkward, but only temporary, for the sake of a smooth transition.

questions

  • What do you think of the config's structure and contents?
  • Is there additional validation I should be doing, to ensure things make sense?

@bdewilde bdewilde marked this pull request as ready for review January 23, 2025 02:30
@vishpillai123 (Contributor) left a comment

This looks good! I had a few questions but overall I'm good with merging.

pred_col = "pred"
pred_prob_col = "pred_prob"
pos_label = true
random_state = 12345
@vishpillai123 (Contributor) commented:

Do lines 4-12 need to go under modeling.training? Or are they meant to be used for more than just modeling.training?

@bdewilde (Member, Author) replied:

Yup! These get used in both preprocessing and model training.
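To illustrate why those fields sit at the top level rather than under modeling.training, here is a hedged, stdlib-only sketch (not the repo's actual code): both the preprocessing and training stages read the same shared values, so there is a single source of truth for the random seed and prediction column names.

```python
# Illustrative only: two pipeline stages reading the same shared config
# values, so notebooks can't drift out of sync on seeds or column names.
import random

shared = {
    "pred_col": "pred",
    "pred_prob_col": "pred_prob",
    "pos_label": True,
    "random_state": 12345,
}


def split_train_test(rows, shared):
    # preprocessing stage: deterministic shuffle via the shared random seed
    rng = random.Random(shared["random_state"])
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


def score_predictions(probs, shared):
    # modeling stage: writes prediction columns under the shared names
    return [
        {shared["pred_col"]: p >= 0.5, shared["pred_prob_col"]: p}
        for p in probs
    ]


train_rows, test_rows = split_train_test(list(range(10)), shared)
scored = score_predictions([0.9, 0.2], shared)
```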

default=None,
description="One or more column names in dataset to exclude from training.",
)
time_col: t.Optional[str] = pyd.Field(
@vishpillai123 (Contributor) commented:

Interesting that this uses a chronology column. How is this typically used?

@bdewilde (Member, Author) replied:

AFAIK we haven't used this configuration in our models, but we could. It's supported by AutoML (see the reference link), so it's included for completeness.
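For context on what a time_col enables, here is an illustrative sketch (not how AutoML actually implements it): a chronological train/test split, so the model is evaluated on the most recent records rather than a random holdout.

```python
# Hypothetical sketch: splitting on a time column so evaluation uses the
# latest records. Column and function names here are illustrative.
def chronological_split(records, time_col, test_frac=0.2):
    ordered = sorted(records, key=lambda r: r[time_col])
    cut = int(len(ordered) * (1 - test_frac))
    return ordered[:cut], ordered[cut:]


records = [{"term": year, "x": i} for i, year in enumerate([2023, 2021, 2022, 2024, 2020])]
train_recs, test_recs = chronological_split(records, time_col="term")
```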



class InferenceConfig(pyd.BaseModel):
num_top_features: int = pyd.Field(default=5)
@vishpillai123 (Contributor) commented:

We may also want to add support_threshold here as an optional parameter.

@bdewilde (Member, Author) replied:

This is included with the trained model itself; see min_prob_pos_label.

class DatasetConfig(pyd.BaseModel):
raw_course: DatasetIOConfig
raw_cohort: DatasetIOConfig
preprocessed: t.Optional[DatasetIOConfig] = None
@vishpillai123 (Contributor) commented:

would preprocessed = training dataset?

@bdewilde (Member, Author) replied:

In the case of the labeled dataset, yes. For unlabeled data, it would be the dataset to be given to the model as input in order to produce predictions. (Does that have a name?)
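A minimal sketch of the dataset-config shape under discussion, with DatasetIOConfig reduced to a single placeholder field (the real models live in this PR): preprocessed stays None until that stage has produced a dataset.

```python
# Sketch only: DatasetIOConfig's single field is a placeholder, not the
# repo's actual model; the structure mirrors the snippet above.
import typing as t

import pydantic as pyd


class DatasetIOConfig(pyd.BaseModel):
    table_path: str  # placeholder for whatever IO details the real model holds


class DatasetConfig(pyd.BaseModel):
    raw_course: DatasetIOConfig
    raw_cohort: DatasetIOConfig
    preprocessed: t.Optional[DatasetIOConfig] = None


cfg = DatasetConfig(
    raw_course=DatasetIOConfig(table_path="catalog.schema.course"),
    raw_cohort=DatasetIOConfig(table_path="catalog.schema.cohort"),
)
```

Whether preprocessed is later filled with the labeled training dataset or the unlabeled inference input, the config shape stays the same.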

@vishpillai123 (Contributor) commented:

Oh, also, I wanted to ask (though this is probably outside the scope): if a school has multiple models (a retention and a graduation model, for example), would we want multiple configs, or do we need to enable that within a single config?

I'm assuming multiple configs would be easier.

@bdewilde (Member, Author) replied:

> Oh also I wanted to ask though this is probably outside of the scope - if a school has multiple models (a retention and graduation model for example), would we want multiple configs or do we need to enable that within a config?
>
> I'm assuming multiple configs would be easier.

Good question! But yes, out of scope: it hasn't ever come up for a PDP school, right? :)

I'd probably follow the example of datasets, and make a top-level models dict with keys as model names and values as corresponding model configuration. I could do that as a fast follow-up, if you want.
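That follow-up could look roughly like this; a sketch only, with a hypothetical ModelConfig whose fields are placeholders, mirroring how datasets are keyed by name.

```python
# Sketch of the "top-level models dict" idea: keys are model names,
# values are per-model config. ModelConfig's fields are placeholders.
import pydantic as pyd


class ModelConfig(pyd.BaseModel):
    experiment_id: str  # placeholder field
    target: str  # placeholder field


class ProjectConfig(pyd.BaseModel):
    models: dict[str, ModelConfig] = pyd.Field(default_factory=dict)


cfg = ProjectConfig(
    models={
        "retention": ModelConfig(experiment_id="exp-001", target="retention"),
        "graduation": ModelConfig(experiment_id="exp-002", target="graduation"),
    }
)
```

A school with a single model would just have one entry in the dict, so the common case stays simple.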

@bdewilde bdewilde merged commit 0900dee into develop Jan 23, 2025
5 checks passed
@bdewilde bdewilde deleted the iterate-pdp-project-config branch January 23, 2025 16:38