[pdp] Add inference template nb #64
Conversation
Hi @bdewilde, this is in great shape overall, but I just left some suggestions/comments.
# MAGIC %md
# MAGIC TODO: See about adding permutation importance and/or global SHAP feature importance evaluation here

# TODO TODO TODO
result = sklearn.inspection.permutation_importance(
Is this an alternative to SHAP? What's the performance like?
I think we may want to align as a team before including this in the template, since SHAP is the standard at the moment. Maybe comment this out for now?
This is an excellent way to get global feature importances, in a way that's performant (i.e., negligible cost in this context) and more statistically robust than reading importances off the model's coef_ or feature_importances_ attributes. This isn't a replacement for SHAP; it's a complement.
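For reference, the filled-out call might look roughly like this; model, X_test, and y_test are placeholders for the fitted estimator and held-out data, not names defined above:

import sklearn.inspection

# shuffle each feature column and measure the drop in the scoring metric
result = sklearn.inspection.permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
)
# mean importance per feature, sorted from most to least important
sorted_importances = sorted(
    zip(X_test.columns, result.importances_mean), key=lambda kv: kv[1], reverse=True
)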
# if not, assume this is a prediction workflow
try:
    run_type = dbutils.widgets.get("run_type")
    dataset_name = dbutils.widgets.get("dataset_name")
For the dataset_name, this is something you'll have to assign in the notebook before this one: '02-prepare-modeling-dataset-TEMPLATE.py'. So, for example, we can do this at the end of that notebook:

dbutils.jobs.taskValues.set(key="dataset_name", value=dataset_name)  # noqa: F821

where dataset_name points to a dataframe, etc.

We don't need to do this for run_type, since that will be pulled down as one of the job parameters.
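Roughly, the two sides would pair up like the sketch below; the taskKey value is just a placeholder for whatever the upstream task is named in the job definition:

# at the end of 02-prepare-modeling-dataset-TEMPLATE.py (the upstream task)
dbutils.jobs.taskValues.set(key="dataset_name", value=dataset_name)  # noqa: F821

# in this inference notebook (the downstream task)
dataset_name = dbutils.jobs.taskValues.get(
    taskKey="prepare_modeling_dataset",  # hypothetical upstream task name
    key="dataset_name",
    debugValue="labeled",  # only used when running the notebook interactively
)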
How does one get run_type in the code, if not via this call?
I'm still a bit unsure about the best way to get parameters from a Databricks "job" vs. manually specifying them. Here, I'm just following the example set in prior nbs.
Setting aside which parameters need to be set and in what form, the big question for me is how we're supposed to get and then leverage these parameters in notebooks. Could you catch me up?
This source here suggests that dbutils.widgets.get() is the correct way to get job parameters in a notebook. Is that not so?
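Concretely, the pattern I'm following looks like the sketch below; the except-branch default is just an assumption for interactive runs, not something the template specifies:

# job/task parameters surface as widget values when the notebook runs as a job task
try:
    run_type = dbutils.widgets.get("run_type")
except Exception:
    # the widget isn't defined when running interactively outside a job
    run_type = "predict"  # assumed default for local development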
Various confusions were resolved elsewhere; we'll work out the kinks here in subsequent work.
try:
    run_type = dbutils.widgets.get("run_type")
    dataset_name = dbutils.widgets.get("dataset_name")
    model_name = dbutils.widgets.get("model_name")
What does the model_name represent here, and how are we planning on retrieving this info? Would you want it to be assigned via AutoML artifacts?
model_name is analogous to dataset_name in that both are keys in the project config's models and datasets parameters, respectively.
Oh, I see what you mean. Okay, yeah, we can refer to the config instead of dbutils.widgets here, at least for the model_name. I think with dataset_name, it makes more sense to me to have that as a task parameter, given that it will depend on whether we are training vs. predicting on a dataset (the path will be different).
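To sketch the split I'm picturing (the config file path and its models/datasets structure are assumptions here, not our actual schema):

import pathlib
import yaml

# hypothetical project config with top-level "models" and "datasets" mappings
cfg = yaml.safe_load(pathlib.Path("config.yaml").read_text())
model_name = next(iter(cfg["models"]))  # model_name is a key in the models mapping

# dataset_name stays a task parameter, since the path differs for train vs. predict runs
dataset_name = dbutils.widgets.get("dataset_name")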
I would say, for the sake of this PR, we can have model_name just pulled from the config. Are you okay with changing that?
This will require a change to the project config, so I'd prefer to punt that to a separate PR / chunk of work. Okay with you?
Yes, that's fair! I'll approve then and we can adjust later.
I have an idea for how this will work 👍
# TODO: get this functionality into public repo's modeling.inference?
def predict_proba(
Perhaps this could go in modeling.utils, since this notebook is on the longer side... what do you think?
I went back and forth on it... For now, I'd prefer to leave it here, with the TODO reminding us to clean it up later :)
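For anyone skimming this thread, the helper's shape is roughly the hypothetical sketch below; the actual signature and body in the notebook may differ:

import numpy as np
import pandas as pd

def predict_proba(
    df: pd.DataFrame, model, feature_names: list[str], pos_label=1
) -> np.ndarray:
    """Return positive-class probabilities for each row of df."""
    probs = model.predict_proba(df.loc[:, feature_names])
    return probs[:, list(model.classes_).index(pos_label)]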
if framework == "xgboost"
else mlflow.lightgbm.load_model
if framework == "lightgbm"
else mlflow.pyfunc.load_model
I like this logic of loading based on model type, though in the final else case... I've personally had some issues with mlflow.pyfunc.load_model. Also, when would we fall into that case? I thought decision tree and logreg would be sklearn?
Just covering the usual bases here. Strangely, it seemed like lightgbm and xgboost models were actually being saved using the "sklearn" framework, possibly via those libs' sklearn-compatible wrappers, but I didn't delve too deeply. This function should work regardless, as long as you give it the proper framework!
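For context, here's the fragment above filled out into a full expression; the variable names and the leading xgboost branch are inferred from the visible diff, so treat it as a sketch:

import mlflow.lightgbm
import mlflow.pyfunc
import mlflow.xgboost

# pick the mlflow loader that matches how the model was logged;
# framework and model_uri are placeholders defined elsewhere in the notebook
load_model_fn = (
    mlflow.xgboost.load_model
    if framework == "xgboost"
    else mlflow.lightgbm.load_model
    if framework == "lightgbm"
    else mlflow.pyfunc.load_model
)
model = load_model_fn(model_uri)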
changes

context
- trying to standardize the full pipeline, leverage project configs, and make everything work in both training and prediction runs
- final (and original intended) component of PR #55

questions
- please see the various TODOs embedded within the nb