
Improve pdp project config structure + functionality #56

Merged 3 commits into develop from iterate-pdp-project-config on Jan 23, 2025

Conversation

@bdewilde (Member) commented Jan 23, 2025

changes

  • adds a "v2" pdp project configuration schema, which differs from the previous iteration in structure and scope, plus additional validation and documentation
  • adds an example template file for the v2 config
  • adds a unit test to check that the schema is working correctly

Out of scope: actually using this config in the notebooks. I tried that in another PR, but it got much too big.

context

This file is meant to record the full set of configuration options for a given school's pipeline in a way that's consolidated, human-readable, and not just hard-coded magic variables spread across multiple notebooks. It's especially useful when parameters must be shared and consistent across notebooks; consolidating them here makes that task much easier and less error-prone.

The "v1" config was a WIP, but then a couple of schools over in the private repo started using it, and I didn't want to break their workflows. So, here's a "v2". In the next minor (not patch) release of this package, I'll finish the transition and make v2 the only option. A bit awkward, but only temporary, for the sake of a smooth transition.

questions

  • What do you think of the config's structure and contents?
  • Is there additional validation I should be doing, to ensure things make sense?

@bdewilde bdewilde marked this pull request as ready for review January 23, 2025 02:30
@vishpillai123 (Contributor) left a comment

This looks good! I had a few questions but overall I'm good with merging.

pred_col = "pred"
pred_prob_col = "pred_prob"
pos_label = true
random_state = 12345
@vishpillai123 (Contributor) commented:

Do lines 4-12 need to go under modeling.training? Or are they meant to be used for more than just modeling.training?

@bdewilde (Member, Author) replied:

Yup! These get used in both preprocessing and model training.
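To illustrate why those fields sit at the top level rather than under modeling.training, here is a hedged, stdlib-only sketch (not the repo's actual code): both the preprocessing and training stages read the same shared values, so there is a single source of truth for the random seed and prediction column names.

```python
# Illustrative only: two pipeline stages reading the same shared config
# values, so notebooks can't drift out of sync on seeds or column names.
import random

shared = {
    "pred_col": "pred",
    "pred_prob_col": "pred_prob",
    "pos_label": True,
    "random_state": 12345,
}


def split_train_test(rows, shared):
    # preprocessing stage: deterministic shuffle via the shared random seed
    rng = random.Random(shared["random_state"])
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


def score_predictions(probs, shared):
    # modeling stage: writes prediction columns under the shared names
    return [
        {shared["pred_col"]: p >= 0.5, shared["pred_prob_col"]: p}
        for p in probs
    ]


train_rows, test_rows = split_train_test(list(range(10)), shared)
scored = score_predictions([0.9, 0.2], shared)
```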

default=None,
description="One or more column names in dataset to exclude from training.",
)
time_col: t.Optional[str] = pyd.Field(
@vishpillai123 (Contributor) commented:

Interesting that this uses a chronology column. How is this typically used?

@bdewilde (Member, Author) replied:

AFAIK we haven't used this configuration in our models, but we could. It's supported by AutoML (see the reference link), so it's included for completeness.
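For context on what a time_col enables, here is an illustrative sketch (not how AutoML actually implements it): a chronological train/test split, so the model is evaluated on the most recent records rather than a random holdout.

```python
# Hypothetical sketch: splitting on a time column so evaluation uses the
# latest records. Column and function names here are illustrative.
def chronological_split(records, time_col, test_frac=0.2):
    ordered = sorted(records, key=lambda r: r[time_col])
    cut = int(len(ordered) * (1 - test_frac))
    return ordered[:cut], ordered[cut:]


records = [{"term": year, "x": i} for i, year in enumerate([2023, 2021, 2022, 2024, 2020])]
train_recs, test_recs = chronological_split(records, time_col="term")
```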



class InferenceConfig(pyd.BaseModel):
num_top_features: int = pyd.Field(default=5)
@vishpillai123 (Contributor) commented:

We may also want to add support_threshold here as an optional parameter.

@bdewilde (Member, Author) replied:

This is included with the trained model itself; see min_prob_pos_label.

class DatasetConfig(pyd.BaseModel):
raw_course: DatasetIOConfig
raw_cohort: DatasetIOConfig
preprocessed: t.Optional[DatasetIOConfig] = None
@vishpillai123 (Contributor) commented:

would preprocessed = training dataset?

@bdewilde (Member, Author) replied:

In the case of the labeled dataset, yes. For unlabeled data, it would be the dataset to be given to the model as input in order to produce predictions. (Does that have a name?)
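A minimal sketch of the dataset-config shape under discussion, with DatasetIOConfig reduced to a single placeholder field (the real models live in this PR): preprocessed stays None until that stage has produced a dataset.

```python
# Sketch only: DatasetIOConfig's single field is a placeholder, not the
# repo's actual model; the structure mirrors the snippet above.
import typing as t

import pydantic as pyd


class DatasetIOConfig(pyd.BaseModel):
    table_path: str  # placeholder for whatever IO details the real model holds


class DatasetConfig(pyd.BaseModel):
    raw_course: DatasetIOConfig
    raw_cohort: DatasetIOConfig
    preprocessed: t.Optional[DatasetIOConfig] = None


cfg = DatasetConfig(
    raw_course=DatasetIOConfig(table_path="catalog.schema.course"),
    raw_cohort=DatasetIOConfig(table_path="catalog.schema.cohort"),
)
```

Whether preprocessed is later filled with the labeled training dataset or the unlabeled inference input, the config shape stays the same.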

@vishpillai123 (Contributor) commented:

Oh, also, I wanted to ask (though this is probably outside the scope): if a school has multiple models (a retention and a graduation model, for example), would we want multiple configs, or do we need to enable that within a single config?

I'm assuming multiple configs would be easier.

@bdewilde (Member, Author) replied:

> Oh also I wanted to ask though this is probably outside of the scope - if a school has multiple models (a retention and graduation model for example), would we want multiple configs or do we need to enable that within a config?
>
> I'm assuming multiple configs would be easier.

Good question! But yes, out of scope: it hasn't ever come up for a PDP school, right? :)

I'd probably follow the example of datasets, and make a top-level models dict with keys as model names and values as corresponding model configuration. I could do that as a fast follow-up, if you want.
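That follow-up could look roughly like this; a sketch only, with a hypothetical ModelConfig whose fields are placeholders, mirroring how datasets are keyed by name.

```python
# Sketch of the "top-level models dict" idea: keys are model names,
# values are per-model config. ModelConfig's fields are placeholders.
import pydantic as pyd


class ModelConfig(pyd.BaseModel):
    experiment_id: str  # placeholder field
    target: str  # placeholder field


class ProjectConfig(pyd.BaseModel):
    models: dict[str, ModelConfig] = pyd.Field(default_factory=dict)


cfg = ProjectConfig(
    models={
        "retention": ModelConfig(experiment_id="exp-001", target="retention"),
        "graduation": ModelConfig(experiment_id="exp-002", target="graduation"),
    }
)
```

A school with a single model would just have one entry in the dict, so the common case stays simple.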

@bdewilde bdewilde merged commit 0900dee into develop Jan 23, 2025
5 checks passed
@bdewilde bdewilde deleted the iterate-pdp-project-config branch January 23, 2025 16:38