
fix(weave): Allow back-compat for "old-style" scorers and evaluations using model_output kwarg #2806

Merged
merged 12 commits into master on Oct 30, 2024

Conversation

andrewtruong (Collaborator) commented Oct 29, 2024:

Description

Allows for back-compat with old-style scorers that used the model_output kwarg.

  1. If any "old-style" scorers are used, the evaluation will return model_output and a warning will be displayed to the user.
  2. If all scorers are "new-style", the evaluation will return output.

"old-style" scorers are defined as:

  1. scoring functions that have a model_output argument; or
  2. scoring classes that have a score method with a model_output argument.

"new-style" scorers have renamed model_output -> output

Testing

Unit tests cover both old-style and new-style scorers, including combinations of old-style functions, new-style functions, and class-based scorers.

andrewtruong (Collaborator, Author) commented:

This file re-implements the old-style tests, specifically:

  1. outputs containing model_output; and
  2. scorers that take a model_output param

@@ -153,29 +152,3 @@ def score(self, target, output):
            "mean": pytest.approx(0, abs=1),
        },
    }


def test_multiclass_f1_score(client):
andrewtruong (Collaborator, Author) commented:

This test was moved to test_evaluate_oldstyle.

@@ -16,7 +16,7 @@ class Scorer(Object):
        description="A mapping from column names in the dataset to the names expected by the scorer",
    )

-    def score(self, input: Any, target: Any, output: Any) -> Any:
+    def score(self, *, output: Any, **kwargs: Any) -> Any:
andrewtruong (Collaborator, Author) commented:

This is more representative of the actual score function signature. Some scorers don't take input, or may take other args.
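As a hedged sketch of what the relaxed signature permits (the class name and import path are illustrative assumptions):

```python
from typing import Any

from weave import Scorer  # assumed export of the Scorer base class shown in the diff above

class LengthScorer(Scorer):
    # No `input` parameter at all: only `output` is required, and any extra
    # dataset columns passed by the evaluation arrive via **kwargs and are ignored.
    def score(self, *, output: Any, **kwargs: Any) -> Any:
        return {"output_length": len(str(output))}
```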

Contributor commented:

Agreed, and input is not necessarily an available column.

@andrewtruong andrewtruong marked this pull request as ready for review October 29, 2024 01:20
@andrewtruong andrewtruong requested a review from a team as a code owner October 29, 2024 01:20
util.warn_once(
    logger, "model_output is deprecated, please use output instead"
)
self._output_key = "model_output"
Collaborator commented:

What happens if some scorers use the new output and some use the old? I don't think this should be stored on the eval; just return output from predict and score.

andrewtruong (Collaborator, Author) commented:

Are you saying that all scorers should return output instead of model_output?

Collaborator commented:

I'm saying that self._output_key does not make sense. The output key for the arg is per-scorer, not global to the eval.

andrewtruong (Collaborator, Author) commented:

I think it does make sense to keep the output key consistent per eval though. Otherwise if you mix and match scorers, should you get both output and model_output keys? That doesn't seem right to me.

Collaborator commented:

I think it is worth being clear here: all we are talking about is the key for the output of predict_and_score (I think?), which, if you mix and match, is going to flip around given the current implementation. The call stack is:

  • predict_and_score
    • predict
    • score 1
    • score 2
    • score 3

The output of predict_and_score should not be keyed based on the last scorer's param.
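For concreteness, a minimal sketch of that point (illustrative names only, not the actual weave implementation; it assumes each example row has a target column): each scorer can be called with whichever kwarg its signature expects, while the key of the dict returned by predict_and_score stays fixed.

```python
import inspect
from typing import Any, Callable

def predict_and_score_sketch(
    predict: Callable[[dict], Any],
    scorers: dict[str, Callable[..., Any]],
    example: dict,
) -> dict:
    prediction = predict(example)
    scores = {}
    for name, scorer in scorers.items():
        # Per-scorer concern: pass the prediction under whichever kwarg
        # this scorer's signature expects ("model_output" or "output").
        params = inspect.signature(scorer).parameters
        kwarg = "model_output" if "model_output" in params else "output"
        scores[name] = scorer(target=example["target"], **{kwarg: prediction})
    # Eval-level concern: the result key is fixed and does not flip
    # depending on whichever scorer happened to run last.
    return {"output": prediction, "scores": scores}
```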


# Determine output key based on scorer types
if has_oldstyle_scorers(scorers):
    self._output_key = "model_output"
Collaborator commented:

Can we please fix this so we can merge it? It should not be assigning to self.

@@ -397,7 +420,7 @@ async def predict_and_score(
        scores[scorer_name] = result

    return {
-        "output": model_output,
+        self._output_key: model_output,
Collaborator commented:

This should always be "output", even if you have some scorers that use "model_output".

@@ -441,7 +463,7 @@ async def eval_example(example: dict) -> dict:
    except Exception as e:
        print("Predict and score failed")
        traceback.print_exc()
-        return {"output": None, "scores": {}}
+        return {self._output_key: None, "scores": {}}
Collaborator commented:

This should always be "output", even if you have some scorers that use "model_output".

@@ -458,7 +480,7 @@ async def eval_example(example: dict) -> dict:
    #     f"Evaluating... {duration:.2f}s [{n_complete} / {len(self.dataset.rows)} complete]" # type:ignore
    # )
    if eval_row is None:
-        eval_row = {"output": None, "scores": {}}
+        eval_row = {self._output_key: None, "scores": {}}
Collaborator commented:

This should always be "output", even if you have some scorers that use "model_output".

@@ -370,7 +393,7 @@ async def predict_and_score(
        for param in score_signature.parameters.values()
        if param.default == inspect.Parameter.empty
    ]
-    required_arg_names.remove("output")
+    required_arg_names.remove(self._output_key)
Collaborator commented:

This should be dynamic based on the scorer.

Contributor commented:

I think this should be required_arg_names.remove(score_output_name) instead, or we could just remove this line.
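A hedged sketch of that per-scorer suggestion (score_output_name comes from the comment above; the helper names and the rest are illustrative, not the actual implementation):

```python
import inspect
from typing import Callable

def scorer_output_arg_name(score_fn: Callable) -> str:
    # "model_output" for an old-style scorer, "output" otherwise.
    params = inspect.signature(score_fn).parameters
    return "model_output" if "model_output" in params else "output"

def required_scorer_args(score_fn: Callable) -> list[str]:
    params = inspect.signature(score_fn).parameters.values()
    required = [
        p.name for p in params
        if p.default is inspect.Parameter.empty
        and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
    ]
    # Remove whichever output kwarg *this* scorer expects, rather than
    # a key stored globally on the evaluation object.
    score_output_name = scorer_output_arg_name(score_fn)
    if score_output_name in required:
        required.remove(score_output_name)
    return required
```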

    )
else:
    self._output_key = "output"
    util.warn_once(
Collaborator commented:

Why are we warning here?

@@ -155,27 +154,77 @@ def score(self, target, output):
    }


-def test_multiclass_f1_score(client):
+@pytest.mark.asyncio
+async def test_basic_evaluation_with_mixed_scorer_styles(client):
Collaborator commented:

@andrewtruong I added this to show the expected behavior.

andrewtruong (Collaborator, Author) commented:

I see what you're trying to show here and I'm OK with this behaviour, but this will break for anyone who currently relies on the output key being model_output.

The alternative is safer: if we detect any "old-style" scorers, then we use model_output (which does not break older code). If only new-style scorers are used, then we use output. If there's a mix, we use model_output and give a warning, as sketched below.
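A sketch of that back-compat rule, assuming a helper along the lines of has_oldstyle_scorers that inspects scorer signatures (the function names here are illustrative):

```python
import inspect
import warnings

def is_oldstyle_scorer(scorer) -> bool:
    # Handles both plain scoring functions and classes with a `score` method.
    fn = scorer.score if hasattr(scorer, "score") else scorer
    return "model_output" in inspect.signature(fn).parameters

def choose_output_key(scorers: list) -> str:
    if any(is_oldstyle_scorer(s) for s in scorers):
        # Any old-style scorer keeps the old key so existing code doesn't break.
        warnings.warn("model_output is deprecated, please use output instead")
        return "model_output"
    return "output"
```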

Collaborator commented:

OK, I don't love this, but I see your perspective.

@andrewtruong andrewtruong changed the title fix(weave): Allow back-compat for Evaluation model_output kwarg fix(weave): Allow back-compat for "old-style" scorers and evaluations using model_output kwarg Oct 30, 2024
@andrewtruong andrewtruong enabled auto-merge (squash) October 30, 2024 02:20
@andrewtruong andrewtruong merged commit 3fd1b01 into master Oct 30, 2024
117 checks passed
@andrewtruong andrewtruong deleted the andrew/deprecate-output branch October 30, 2024 02:28
@github-actions github-actions bot locked and limited conversation to collaborators Oct 30, 2024