
fix(weave): Allow back-compat for "old-style" scorers and evaluations using model_output kwarg #2806

Merged
merged 12 commits into master on Oct 30, 2024

Conversation

andrewtruong (Collaborator) commented Oct 29, 2024:

Description

Allows for back-compat with old-style scorers that used the model_output kwarg.

  1. If any "old-style" scorers are used, the evaluation will return model_output and a warning will be displayed to the user.
  2. If all scorers are "new-style", the evaluation will return output.

"old-style" scorers are defined as:

  1. scoring functions that have a model_output argument; or
  2. scoring classes that have a score method with a model_output argument.

"new-style" scorers have renamed model_output -> output

Testing

Unit tests cover both old-style and new-style scorers, including combinations of old-style functions, new-style functions, and class-based scorers.

andrewtruong (Collaborator, Author) commented:

This file re-implements the old-style tests, specifically:

  1. outputs containing model_output; and
  2. scorers that take a model_output param

@@ -153,29 +152,3 @@ def score(self, target, output):
            "mean": pytest.approx(0, abs=1),
        },
    }


def test_multiclass_f1_score(client):
andrewtruong (Collaborator, Author) commented:

This test was moved to test_evaluate_oldstyle.

@@ -16,7 +16,7 @@ class Scorer(Object):
        description="A mapping from column names in the dataset to the names expected by the scorer",
    )

-    def score(self, input: Any, target: Any, output: Any) -> Any:
+    def score(self, *, output: Any, **kwargs: Any) -> Any:
andrewtruong (Collaborator, Author) commented:

This is more representative of the actual score function signature. Some scorers don't take input, or may take other args.
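As a hedged sketch of what the relaxed signature permits (the class name and import path are illustrative assumptions):

```python
from typing import Any

from weave import Scorer  # assumed export of the Scorer base class shown in the diff above

class LengthScorer(Scorer):
    # No `input` parameter at all: only `output` is required, and any extra
    # dataset columns passed by the evaluation arrive via **kwargs and are ignored.
    def score(self, *, output: Any, **kwargs: Any) -> Any:
        return {"output_length": len(str(output))}
```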

Contributor commented:

Agreed, and input is not necessarily an available column.

@andrewtruong andrewtruong marked this pull request as ready for review October 29, 2024 01:20
@andrewtruong andrewtruong requested a review from a team as a code owner October 29, 2024 01:20
util.warn_once(
    logger, "model_output is deprecated, please use output instead"
)
self._output_key = "model_output"
Collaborator commented:

What happens if some scorers use the new output and some use the old? I don't think this should be stored on the eval; just return output from predict and score.

andrewtruong (Collaborator, Author) commented:

Are you saying that all scorers should return output instead of model_output?

Collaborator commented:

I'm saying that self._output_key does not make sense. The output key for the arg is per-scorer, not global to the eval.

andrewtruong (Collaborator, Author) commented:

I think it does make sense to keep the output key consistent per eval though. Otherwise if you mix and match scorers, should you get both output and model_output keys? That doesn't seem right to me.

Collaborator commented:

I think it is worth being clear here: all we are talking about is the key for the output of predict_and_score (I think?), which, if you mix and match, is going to flip around given the current implementation. The call stack is:

  • predict_and_score
    • predict
    • score 1
    • score 2
    • score 3

The output of predict_and_score should not be keyed based on the last scorer's param.
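For concreteness, a minimal sketch of that point (illustrative names only, not the actual weave implementation; it assumes each example row has a target column): each scorer can be called with whichever kwarg its signature expects, while the key of the dict returned by predict_and_score stays fixed.

```python
import inspect
from typing import Any, Callable

def predict_and_score_sketch(
    predict: Callable[[dict], Any],
    scorers: dict[str, Callable[..., Any]],
    example: dict,
) -> dict:
    prediction = predict(example)
    scores = {}
    for name, scorer in scorers.items():
        # Per-scorer concern: pass the prediction under whichever kwarg
        # this scorer's signature expects ("model_output" or "output").
        params = inspect.signature(scorer).parameters
        kwarg = "model_output" if "model_output" in params else "output"
        scores[name] = scorer(target=example["target"], **{kwarg: prediction})
    # Eval-level concern: the result key is fixed and does not flip
    # depending on whichever scorer happened to run last.
    return {"output": prediction, "scores": scores}
```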


# Determine output key based on scorer types
if has_oldstyle_scorers(scorers):
    self._output_key = "model_output"
Collaborator commented:

Can we please fix this so we can merge it? It should not be assigning to self.

@@ -397,7 +420,7 @@ async def predict_and_score(
        scores[scorer_name] = result

    return {
-        "output": model_output,
+        self._output_key: model_output,
Collaborator commented:

This should always be "output", even if you have some scorers that use "model_output".

@@ -441,7 +463,7 @@ async def eval_example(example: dict) -> dict:
    except Exception as e:
        print("Predict and score failed")
        traceback.print_exc()
-        return {"output": None, "scores": {}}
+        return {self._output_key: None, "scores": {}}
Collaborator commented:

This should always be "output", even if you have some scorers that use "model_output".

@@ -458,7 +480,7 @@ async def eval_example(example: dict) -> dict:
    #     f"Evaluating... {duration:.2f}s [{n_complete} / {len(self.dataset.rows)} complete]" # type:ignore
    # )
    if eval_row is None:
-        eval_row = {"output": None, "scores": {}}
+        eval_row = {self._output_key: None, "scores": {}}
Collaborator commented:

This should always be "output", even if you have some scorers that use "model_output".

@@ -370,7 +393,7 @@ async def predict_and_score(
        for param in score_signature.parameters.values()
        if param.default == inspect.Parameter.empty
    ]
-    required_arg_names.remove("output")
+    required_arg_names.remove(self._output_key)
Collaborator commented:

This should be dynamic based on the scorer.

Contributor commented:

I think this should be required_arg_names.remove(score_output_name) instead, or we could just remove this line.
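A hedged sketch of that per-scorer suggestion (score_output_name comes from the comment above; the helper names and the rest are illustrative, not the actual implementation):

```python
import inspect
from typing import Callable

def scorer_output_arg_name(score_fn: Callable) -> str:
    # "model_output" for an old-style scorer, "output" otherwise.
    params = inspect.signature(score_fn).parameters
    return "model_output" if "model_output" in params else "output"

def required_scorer_args(score_fn: Callable) -> list[str]:
    params = inspect.signature(score_fn).parameters.values()
    required = [
        p.name for p in params
        if p.default is inspect.Parameter.empty
        and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
    ]
    # Remove whichever output kwarg *this* scorer expects, rather than
    # a key stored globally on the evaluation object.
    score_output_name = scorer_output_arg_name(score_fn)
    if score_output_name in required:
        required.remove(score_output_name)
    return required
```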

    )
else:
    self._output_key = "output"
    util.warn_once(
Collaborator commented:

Why are we warning here?

@@ -155,27 +154,77 @@ def score(self, target, output):
    }


-def test_multiclass_f1_score(client):
+@pytest.mark.asyncio
+async def test_basic_evaluation_with_mixed_scorer_styles(client):
Collaborator commented:

@andrewtruong I added this to show the expected behavior.

andrewtruong (Collaborator, Author) commented:

I see what you're trying to show here and I'm OK with this behaviour, but this will break for anyone who currently relies on the output key being model_output.

The alternative is safer: if we detect any "old-style" scorers, then we use model_output (which does not break older code). If only new-style scorers are used, then we use output. If there's a mix, we use model_output and give a warning, as sketched below.
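A sketch of that back-compat rule, assuming a helper along the lines of has_oldstyle_scorers that inspects scorer signatures (the function names here are illustrative):

```python
import inspect
import warnings

def is_oldstyle_scorer(scorer) -> bool:
    # Handles both plain scoring functions and classes with a `score` method.
    fn = scorer.score if hasattr(scorer, "score") else scorer
    return "model_output" in inspect.signature(fn).parameters

def choose_output_key(scorers: list) -> str:
    if any(is_oldstyle_scorer(s) for s in scorers):
        # Any old-style scorer keeps the old key so existing code doesn't break.
        warnings.warn("model_output is deprecated, please use output instead")
        return "model_output"
    return "output"
```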

Collaborator commented:

OK, I don't love this, but I see your perspective.

@andrewtruong andrewtruong changed the title fix(weave): Allow back-compat for Evaluation model_output kwarg fix(weave): Allow back-compat for "old-style" scorers and evaluations using model_output kwarg Oct 30, 2024
@andrewtruong andrewtruong enabled auto-merge (squash) October 30, 2024 02:20
@andrewtruong andrewtruong merged commit 3fd1b01 into master Oct 30, 2024
117 checks passed
@andrewtruong andrewtruong deleted the andrew/deprecate-output branch October 30, 2024 02:28
@github-actions github-actions bot locked and limited conversation to collaborators Oct 30, 2024