fix(weave): Allow back-compat for "old-style" scorers and evaluations using `model_output` kwarg #2806
Conversation
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=0739aa021e7a4a97c769350d1c86ac1302303640
This file re-implements the old-style tests, specifically:
- outputs containing `model_output`; and
- scorers that take a `model_output` param
```diff
@@ -153,29 +152,3 @@ def score(self, target, output):
             "mean": pytest.approx(0, abs=1),
         },
     }
-
-
-def test_multiclass_f1_score(client):
```
This test is moved to `test_evaluate_oldstyle`
```diff
@@ -16,7 +16,7 @@ class Scorer(Object):
         description="A mapping from column names in the dataset to the names expected by the scorer",
     )

-    def score(self, input: Any, target: Any, output: Any) -> Any:
+    def score(self, *, output: Any, **kwargs: Any) -> Any:
```
This is more representative of the actual `score` func signature. Some scorers don't take `input`, or may take other args
agree, and `input` is not necessarily an available column.
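
For context, a minimal sketch of the two signature styles being discussed (shapes taken from the diff above; the bodies are made-up placeholders):

```python
from typing import Any

class OldStyleScorer:
    # Old signature: fixed positional params; assumes `input` is always
    # an available dataset column.
    def score(self, input: Any, target: Any, output: Any) -> Any:
        return {"correct": output == target}

class NewStyleScorer:
    # New signature: keyword-only `output`; other columns arrive via
    # **kwargs, so a scorer only declares what it actually needs.
    def score(self, *, output: Any, **kwargs: Any) -> Any:
        return {"correct": output == kwargs.get("target")}
```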
weave/flow/eval.py (Outdated)
```python
util.warn_once(
    logger, "model_output is deprecated, please use output instead"
)
self._output_key = "model_output"
```
what happens if some scorers use the new `output` and some use the old? I don't think this should be stored on the eval. Just return `output` from predict and score.
are you saying for all scorers to return `output` instead of `model_output`?
i'm saying that `self._output_key` does not make sense. The output key for the arg is per-scorer, not global to the eval.
I think it does make sense to keep the output key consistent per eval though. Otherwise, if you mix and match scorers, should you get both `output` and `model_output` keys? That doesn't seem right to me.
i think it is worth being clear here - all we are talking about is the key for the output of predict and score (i think?). Which, if you mix and match, is going to flip around given the current implementation. The call stack is:
- predict_and_score
  - predict
  - score 1
  - score 2
  - score 3

The output of `predict_and_score` should not be keyed based on the last scorer param.
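
A minimal sketch of the shape this comment argues for (helper and parameter names here are hypothetical, not weave's actual internals): resolve the kwarg name per scorer call, and keep the `predict_and_score` result keyed on a fixed `"output"`:

```python
import inspect
from typing import Any, Callable, Sequence

def call_scorer(scorer_fn: Callable, model_output: Any, example: dict) -> Any:
    """Pass the model output under whichever kwarg name this scorer declares."""
    params = inspect.signature(scorer_fn).parameters
    output_key = "model_output" if "model_output" in params else "output"
    kwargs = {k: v for k, v in example.items() if k in params}
    kwargs[output_key] = model_output
    return scorer_fn(**kwargs)

async def predict_and_score(
    model: Callable, example: dict, scorers: Sequence[Callable]
) -> dict:
    model_output = model(example)
    scores = {fn.__name__: call_scorer(fn, model_output, example) for fn in scorers}
    # The result key is fixed: it never depends on which scorer ran last.
    return {"output": model_output, "scores": scores}
```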
```python
# Determine output key based on scorer types
if has_oldstyle_scorers(scorers):
    self._output_key = "model_output"
```
Can we please fix this so we can merge it? It should not be assigning to `self`.
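
For reference, a plausible sketch of the `has_oldstyle_scorers` check referenced in this hunk (the helper's actual implementation may differ):

```python
import inspect
from typing import Any, Callable, Sequence

def has_oldstyle_scorers(scorers: Sequence[Any]) -> bool:
    """True if any scorer still declares a `model_output` parameter."""
    for scorer in scorers:
        # Scorers may be plain functions/ops or objects with a `score` method.
        fn: Callable = scorer.score if hasattr(scorer, "score") else scorer
        if "model_output" in inspect.signature(fn).parameters:
            return True
    return False
```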
```diff
@@ -397,7 +420,7 @@ async def predict_and_score(
             scores[scorer_name] = result

         return {
-            "output": model_output,
+            self._output_key: model_output,
```
this should always be `"output"`, even if you have some scorers that use `"model_output"`
```diff
@@ -441,7 +463,7 @@ async def eval_example(example: dict) -> dict:
         except Exception as e:
             print("Predict and score failed")
             traceback.print_exc()
-            return {"output": None, "scores": {}}
+            return {self._output_key: None, "scores": {}}
```
this should always be `"output"`, even if you have some scorers that use `"model_output"`
```diff
@@ -458,7 +480,7 @@ async def eval_example(example: dict) -> dict:
         #     f"Evaluating... {duration:.2f}s [{n_complete} / {len(self.dataset.rows)} complete]"  # type:ignore
         # )
         if eval_row is None:
-            eval_row = {"output": None, "scores": {}}
+            eval_row = {self._output_key: None, "scores": {}}
```
this should always be `"output"`, even if you have some scorers that use `"model_output"`
```diff
@@ -370,7 +393,7 @@ async def predict_and_score(
             for param in score_signature.parameters.values()
             if param.default == inspect.Parameter.empty
         ]
-        required_arg_names.remove("output")
+        required_arg_names.remove(self._output_key)
```
this should be dynamic based on the scorer
I think this should be `required_arg_names.remove(score_output_name)` instead, or we could just remove this line.
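
A sketch of the per-scorer alternative suggested here, with `score_output_name` computed from each scorer's own signature (helper name taken from the comment; the body is assumed):

```python
import inspect
from typing import Callable

def get_score_output_name(scorer_fn: Callable) -> str:
    """Kwarg name this particular scorer expects for the model output."""
    params = inspect.signature(scorer_fn).parameters
    return "model_output" if "model_output" in params else "output"

# Per scorer, instead of a global self._output_key:
#     score_output_name = get_score_output_name(scorer_fn)
#     required_arg_names.remove(score_output_name)
```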
weave/flow/eval.py (Outdated)
```python
            )
        else:
            self._output_key = "output"
            util.warn_once(
```
why are we warning here?
tests/trace/test_evaluate.py (Outdated)
```diff
@@ -155,27 +154,77 @@ def score(self, target, output):
     }


-def test_multiclass_f1_score(client):
+@pytest.mark.asyncio
+async def test_basic_evaluation_with_mixed_scorer_styles(client):
```
@andrewtruong i added this to show the expected behavior.
I see what you're trying to show here and I'm ok with this behaviour, but this will break for anyone who currently relies on the "output" key being `model_output`.

The alternative is safer -- if we detect any "old-style" scorers, then we use `model_output` (does not break older code). If only new-style, then we use `output`. If there's a mix, we use `model_output` and give a warning.
ok, i don't love this, but i see your perspective.
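
The behaviour the thread converged on reduces to a single key decision; a sketch, reusing the hypothetical `has_oldstyle_scorers` from above:

```python
def determine_output_key(scorers) -> str:
    # Any old-style scorer pins the key to "model_output" so existing
    # consumers of eval results don't break; a warning nudges migration.
    return "model_output" if has_oldstyle_scorers(scorers) else "output"
```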
Description
Allows for back-compat with old-style scorers that used the `model_output` kwarg.

- If any old-style scorers are detected, the output key is `model_output` and a warning is displayed to the user.
- Otherwise, the output key is `output`.

"Old-style" scorers are defined as:

- functions with a `model_output` argument; or
- classes with a `score` method with a `model_output` argument.

"New-style" scorers have renamed `model_output` -> `output`.
Testing
Unit tests with both old-style and new-style scorers, and combinations of new-style, old-style, and classes.
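
For illustration, a sketch of the mixed-style scenario being tested, using weave's public `Evaluation` API (project name, dataset, and scorer bodies are made up):

```python
import asyncio
import weave

weave.init("scorer-backcompat-demo")  # hypothetical project name

@weave.op()
def oldstyle_scorer(model_output: str, target: str) -> dict:
    # Old-style: still uses the deprecated `model_output` kwarg.
    return {"match": model_output == target}

@weave.op()
def newstyle_scorer(output: str, target: str) -> dict:
    # New-style: uses the renamed `output` kwarg.
    return {"match": output == target}

@weave.op()
def model(question: str) -> str:
    return question.upper()

evaluation = weave.Evaluation(
    dataset=[{"question": "hi", "target": "HI"}],
    scorers=[oldstyle_scorer, newstyle_scorer],
)
# Mixing styles should still run; per this PR, the presence of an old-style
# scorer keys results on "model_output" and emits a deprecation warning.
asyncio.run(evaluation.evaluate(model))
```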