[pdp] Add inference template nb #64

Merged: 6 commits merged into develop on Feb 5, 2025

Conversation

@bdewilde (Member) commented Jan 31, 2025

changes

  • adds pdp model predict + explain template nb consistent with preceding steps of the pipeline
  • adds a utility function to load mlflow models from the registry
  • adds a hacky plot to model eval nb

context

trying to standardize the full pipeline, leverage project configs, and make everything work in both training and prediction runs

final (and original intended) component of PR #55

questions

please see the various TODOs embedded within the nb

@bdewilde marked this pull request as ready for review January 31, 2025 01:40
@bdewilde requested a review from nm3224 January 31, 2025 01:40
@vishpillai123 (Contributor) left a comment

Hi @bdewilde, this is in great shape overall, but I just left some suggestions/comments.

# MAGIC %md
# MAGIC TODO: See about adding permutation importance and/or global SHAP feature importance evaluation here
# TODO TODO TODO
result = sklearn.inspection.permutation_importance(
@vishpillai123 (Contributor) commented Feb 4, 2025

Is this an alternative to SHAP? What's the performance like?

I think we may want to align as a team before including this in the template since SHAP is the standard at the moment. Maybe comment this out for now?

@bdewilde (Member, Author) replied

This is an excellent way to get global feature importances: it's performant (the cost is negligible in this context) and more statistically robust than reading importances off the model's coefficients or feature-importance attributes. This isn't a replacement for SHAP; it's a complement.
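For reference, here's a minimal sketch of what that permutation-importance call might look like once filled in; the estimator, evaluation data, scoring metric, and repeat count below are illustrative assumptions, not the notebook's actual values.

import sklearn.inspection

# Sketch only: compute permutation importance on held-out evaluation data.
# `model`, `X_eval`, and `y_eval` are placeholders for objects the notebook
# would already have; scoring and n_repeats are illustrative choices.
result = sklearn.inspection.permutation_importance(
    model,
    X_eval,
    y_eval,
    scoring="roc_auc",
    n_repeats=10,
    random_state=42,
)

# mean importance per feature, sorted from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"{X_eval.columns[idx]}: {result.importances_mean[idx]:.4f}")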

# if not, assume this is a prediction workflow
try:
    run_type = dbutils.widgets.get("run_type")
    dataset_name = dbutils.widgets.get("dataset_name")
@vishpillai123 (Contributor) commented

For dataset_name, this is something you'll have to assign in the notebook before this one ('02-prepare-modeling-dataset-TEMPLATE.py'). So, for example, we can do this at the end of that notebook:

dbutils.jobs.taskValues.set(key="dataset_name", value=dataset_name) # noqa: F821

where dataset_name points to a dataframe etc.

We don't need to do this for run_type since this will be pulled down as one of the job parameters.
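For context, here's a minimal sketch of how the downstream notebook could then read that task value; the taskKey below is an illustrative guess at the upstream task's name, not the workflow's actual value.

# Sketch only: read the value set upstream via dbutils.jobs.taskValues.set().
# taskKey must match the name of the upstream task in the Databricks job;
# debugValue is only used when running the notebook interactively outside a job.
dataset_name = dbutils.jobs.taskValues.get(  # noqa: F821
    taskKey="02-prepare-modeling-dataset",
    key="dataset_name",
    debugValue="dev_dataset",
)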

@bdewilde (Member, Author) replied

How does one get run_type in the code, if not via this call?

I'm still a bit unsure about the best way to get parameters from a Databricks job vs. manually specifying them. Here, I'm just following the example set in prior nbs.

@vishpillai123 (Contributor) commented Feb 4, 2025

run_type is defined in the job parameters. dataset_name can be defined in the prior notebook when you create the dataset. Then you can define this as one of the task parameters. You can check out some of the DB workflows that I shared a couple weeks ago to get some examples of both of these.

[Screenshot attached, 2025-02-04 12:09 PM]

@bdewilde (Member, Author) replied

Setting aside which parameters need to be set and in what form, the big question for me is how we're supposed to get and then leverage these parameters in notebooks. Could you catch me up?

@bdewilde (Member, Author) replied

This source here suggests that dbutils.widgets.get() is the correct way to get job parameters in a notebook. Is that not so?
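For what it's worth, a minimal sketch of that pattern; the widget names and default values are illustrative, and in a job run any job/task parameters with matching names override the interactive defaults.

# Sketch only: define widgets with defaults so the notebook also runs interactively;
# when run as a job task, parameters with the same names populate these widgets.
dbutils.widgets.text("run_type", "predict")  # noqa: F821
dbutils.widgets.text("dataset_name", "")  # noqa: F821

run_type = dbutils.widgets.get("run_type")  # noqa: F821
dataset_name = dbutils.widgets.get("dataset_name")  # noqa: F821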

@bdewilde (Member, Author) replied

various confusions resolved elsewhere

We'll work out the kinks here in subsequent work.

try:
    run_type = dbutils.widgets.get("run_type")
    dataset_name = dbutils.widgets.get("dataset_name")
    model_name = dbutils.widgets.get("model_name")
@vishpillai123 (Contributor) commented

What does model_name represent here, and how are we planning on retrieving this info? Would you want it to be assigned via AutoML artifacts?

@bdewilde (Member, Author) replied

model_name is analogous to dataset_name in that both are keys in the project config's models and datasets parameters, respectively.
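A hypothetical sketch of that relationship, assuming the loaded project config (cfg here) exposes dict-like models and datasets sections; the key names below are made up for illustration, not real config entries.

# Hypothetical sketch: resolve names against the project config rather than widgets.
# `cfg` is the loaded project config (assumed); key names are illustrative only.
model_name = "model_v1"
dataset_name = "modeling_dataset"
model_config = cfg.models[model_name]
dataset_config = cfg.datasets[dataset_name]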

@vishpillai123 (Contributor) replied

Oh, I see what you mean. Okay, yeah, we can refer to the config instead of dbutils.widgets here, at least for model_name. With dataset_name, it makes more sense to me to have that as a task parameter, given that it will depend on whether we are training vs. predicting on a dataset (the path will be different).

@vishpillai123 (Contributor) commented

I would say for the sake of this PR, we can have model_name just pulled from the config. Are you okay with changing that?

@bdewilde (Member, Author) replied

This will require a change to the project config, so I'd prefer to punt that to a separate PR / chunk of work. Okay with you?

@vishpillai123 (Contributor) replied

Yes, that's fair! I'll approve then and we can adjust later.

@bdewilde (Member, Author) replied

I have an idea for how this will work 👍



# TODO: get this functionality into public repo's modeling.inference?
def predict_proba(
@vishpillai123 (Contributor) commented

Perhaps this can go in modeling.utils, since this notebook is on the longer side... what do you think?

@bdewilde (Member, Author) replied

I went back and forth on it... For now, I'd prefer to leave it here, with the TODO reminding us to clean it up later :)

if framework == "xgboost"
else mlflow.lightgbm.load_model
if framework == "lightgbm"
else mlflow.pyfunc.load_model
@vishpillai123 (Contributor) commented

I like this logic of loading based on model type, though in the final else case... I've personally had some issues with mlflow.pyfunc.load_model. Also, when would we fall into that case? I thought decision tree and logreg models would be sklearn?

@bdewilde (Member, Author) replied

Just covering the usual bases here. Strangely, it seemed like lightgbm and xgboost models were actually being saved using the "sklearn" framework, possibly via those libraries' scikit-learn-compatible wrapper classes, but I didn't delve too deeply. This function should work regardless, as long as you give it the proper framework!
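For reference, a minimal self-contained sketch of that dispatch pattern; the function name and model URI below are illustrative, not the actual utility added in this PR.

import mlflow

def load_registered_model(model_uri: str, framework: str = "sklearn"):
    # Pick the flavor-specific loader when the framework is known,
    # falling back to the generic pyfunc loader otherwise.
    load_model = (
        mlflow.sklearn.load_model
        if framework == "sklearn"
        else mlflow.xgboost.load_model
        if framework == "xgboost"
        else mlflow.lightgbm.load_model
        if framework == "lightgbm"
        else mlflow.pyfunc.load_model
    )
    return load_model(model_uri)

# e.g. load a registered model by name and version from the MLflow registry:
# model = load_registered_model("models:/my-model/1", framework="sklearn")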

@bdewilde merged commit 224f845 into develop on Feb 5, 2025
5 checks passed
@bdewilde deleted the pdp-add-inference-template-nb branch February 5, 2025 16:38