[pdp] Add inference template nb #64

Merged: 6 commits merged into develop on Feb 5, 2025

Conversation

@bdewilde (Member) commented Jan 31, 2025

changes

  • adds pdp model predict + explain template nb consistent with preceding steps of the pipeline
  • adds a utility function to load mlflow models from the registry
  • adds a hacky plot to model eval nb

context

trying to standardize the full pipeline, leverage project configs, and make everything work in both training and prediction runs

final (and original intended) component of PR #55

questions

please see the various TODOs embedded within the nb

@bdewilde marked this pull request as ready for review January 31, 2025 01:40
@bdewilde requested a review from nm3224 January 31, 2025 01:40
@vishpillai123 (Contributor) left a comment

Hi @bdewilde, this is in great shape overall, but I just left some suggestions/comments.

# MAGIC %md
# MAGIC TODO: See about adding permutation importance and/or global SHAP feature importance evaluation here
# TODO TODO TODO
result = sklearn.inspection.permutation_importance(
@vishpillai123 (Contributor) commented Feb 4, 2025

Is this an alternative to SHAP? What's the performance like?

I think we may want to align as a team before including this in the template since SHAP is the standard at the moment. Maybe comment this out for now?

@bdewilde (Member, Author) replied

This is an excellent way to get global feature importances: it's performant (the cost is negligible in this context) and more statistically robust than reading importances off the model's coefficients or feature-importance attributes. This isn't a replacement for SHAP; it's a complement.
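For reference, here's a minimal sketch of what that permutation-importance call might look like once filled in; the estimator, evaluation data, scoring metric, and repeat count below are illustrative assumptions, not the notebook's actual values.

import sklearn.inspection

# Sketch only: compute permutation importance on held-out evaluation data.
# `model`, `X_eval`, and `y_eval` are placeholders for objects the notebook
# would already have; scoring and n_repeats are illustrative choices.
result = sklearn.inspection.permutation_importance(
    model,
    X_eval,
    y_eval,
    scoring="roc_auc",
    n_repeats=10,
    random_state=42,
)

# mean importance per feature, sorted from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"{X_eval.columns[idx]}: {result.importances_mean[idx]:.4f}")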

# if not, assume this is a prediction workflow
try:
    run_type = dbutils.widgets.get("run_type")
    dataset_name = dbutils.widgets.get("dataset_name")
@vishpillai123 (Contributor) commented

For dataset_name, this is something you'll have to assign in the notebook before this one ('02-prepare-modeling-dataset-TEMPLATE.py'). So, for example, we can do this at the end of that notebook:

dbutils.jobs.taskValues.set(key="dataset_name", value=dataset_name) # noqa: F821

where dataset_name points to a dataframe etc.

We don't need to do this for run_type since this will be pulled down as one of the job parameters.
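For context, here's a minimal sketch of how the downstream notebook could then read that task value; the taskKey below is an illustrative guess at the upstream task's name, not the workflow's actual value.

# Sketch only: read the value set upstream via dbutils.jobs.taskValues.set().
# taskKey must match the name of the upstream task in the Databricks job;
# debugValue is only used when running the notebook interactively outside a job.
dataset_name = dbutils.jobs.taskValues.get(  # noqa: F821
    taskKey="02-prepare-modeling-dataset",
    key="dataset_name",
    debugValue="dev_dataset",
)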

@bdewilde (Member, Author) replied

How does one get run_type in the code, if not via this call?

I'm still a bit unsure about the best way to get parameters from a Databricks job vs. manually specifying them. Here, I'm just following the example set in prior nbs.

@vishpillai123 (Contributor) commented Feb 4, 2025

run_type is defined in the job parameters. dataset_name can be defined in the prior notebook when you create the dataset. Then you can define this as one of the task parameters. You can check out some of the DB workflows that I shared a couple weeks ago to get some examples of both of these.

[Screenshot attached, 2025-02-04 12:09 PM]

@bdewilde (Member, Author) replied

Setting aside which parameters need to be set and in what form, the big question for me is how we're supposed to get and then leverage these parameters in notebooks. Could you catch me up?

@bdewilde (Member, Author) replied

This source here suggests that dbutils.widgets.get() is the correct way to get job parameters in a notebook. Is that not so?
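For what it's worth, a minimal sketch of that pattern; the widget names and default values are illustrative, and in a job run any job/task parameters with matching names override the interactive defaults.

# Sketch only: define widgets with defaults so the notebook also runs interactively;
# when run as a job task, parameters with the same names populate these widgets.
dbutils.widgets.text("run_type", "predict")  # noqa: F821
dbutils.widgets.text("dataset_name", "")  # noqa: F821

run_type = dbutils.widgets.get("run_type")  # noqa: F821
dataset_name = dbutils.widgets.get("dataset_name")  # noqa: F821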

@bdewilde (Member, Author) replied

various confusions resolved elsewhere

We'll work out the kinks here in subsequent work.

try:
    run_type = dbutils.widgets.get("run_type")
    dataset_name = dbutils.widgets.get("dataset_name")
    model_name = dbutils.widgets.get("model_name")
@vishpillai123 (Contributor) commented

What does model_name represent here, and how are we planning on retrieving this info? Would you want it to be assigned via AutoML artifacts?

@bdewilde (Member, Author) replied

model_name is analogous to dataset_name in that both are keys in the project config's models and datasets parameters, respectively.
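A hypothetical sketch of that relationship, assuming the loaded project config (cfg here) exposes dict-like models and datasets sections; the key names below are made up for illustration, not real config entries.

# Hypothetical sketch: resolve names against the project config rather than widgets.
# `cfg` is the loaded project config (assumed); key names are illustrative only.
model_name = "model_v1"
dataset_name = "modeling_dataset"
model_config = cfg.models[model_name]
dataset_config = cfg.datasets[dataset_name]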

@vishpillai123 (Contributor) replied

Oh, I see what you mean. Okay, yeah, we can refer to the config instead of dbutils.widgets here, at least for model_name. With dataset_name, it makes more sense to me to have that as a task parameter, given that it will depend on whether we are training vs. predicting on a dataset (the path will be different).

@vishpillai123 (Contributor) commented

I would say for the sake of this PR, we can have model_name just pulled from the config. Are you okay with changing that?

@bdewilde (Member, Author) replied

This will require a change to the project config, so I'd prefer to punt that to a separate PR / chunk of work. Okay with you?

@vishpillai123 (Contributor) replied

Yes, that's fair! I'll approve then and we can adjust later.

@bdewilde (Member, Author) replied

I have an idea for how this will work 👍



# TODO: get this functionality into public repo's modeling.inference?
def predict_proba(
@vishpillai123 (Contributor) commented

Perhaps this can go in modeling.utils, since this notebook is on the longer side... what do you think?

@bdewilde (Member, Author) replied

I went back and forth on it... For now, I'd prefer to leave it here, with the TODO reminding us to clean it up later :)

if framework == "xgboost"
else mlflow.lightgbm.load_model
if framework == "lightgbm"
else mlflow.pyfunc.load_model
@vishpillai123 (Contributor) commented

I like this logic of loading based on model type, though in the final else case... I've personally had some issues with mlflow.pyfunc.load_model. Also, when would we fall into that case? I thought decision tree and logreg models would be sklearn?

@bdewilde (Member, Author) replied

Just covering the usual bases here. Strangely, it seemed like lightgbm and xgboost models were actually being saved using the "sklearn" framework, possibly via those libraries' scikit-learn-compatible wrapper classes, but I didn't delve too deeply. This function should work regardless, as long as you give it the proper framework!
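For reference, a minimal self-contained sketch of that dispatch pattern; the function name and model URI below are illustrative, not the actual utility added in this PR.

import mlflow

def load_registered_model(model_uri: str, framework: str = "sklearn"):
    # Pick the flavor-specific loader when the framework is known,
    # falling back to the generic pyfunc loader otherwise.
    load_model = (
        mlflow.sklearn.load_model
        if framework == "sklearn"
        else mlflow.xgboost.load_model
        if framework == "xgboost"
        else mlflow.lightgbm.load_model
        if framework == "lightgbm"
        else mlflow.pyfunc.load_model
    )
    return load_model(model_uri)

# e.g. load a registered model by name and version from the MLflow registry:
# model = load_registered_model("models:/my-model/1", framework="sklearn")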

@bdewilde merged commit 224f845 into develop on Feb 5, 2025
5 checks passed
@bdewilde deleted the pdp-add-inference-template-nb branch February 5, 2025 16:38