
chore(weave): Implement enhanced feedback structure and MVP filter/query layer #2865

Merged: 22 commits into master on Nov 6, 2024

Conversation

@tssweeney tssweeney commented Nov 5, 2024

This PR lays the groundwork for the next leg of feedback types in our system. Specifically, we have two "classes" of feedback: runnables and annotations. "Runnables" are feedback records generated by running a program (think: Op, Configured Action, Scorer), while "Annotations" are feedback records created by humans against specific types (aka human-in-the-loop, aka custom columns, etc.).

There were three problems to solve with this emerging data model:

  1. How do we store additional metadata about these feedback records, linking them to other objects in our system?
  2. How do we group/collect feedback records that belong to the same logical "column" or concept?
  3. Given the above, how can we filter/sort without reading large JSON-dumped columns?

After much iteration and discussion, the solution that seemed most suitable is as follows:

  1. feedback_type now has 2 special prefixes: wandb.runnable and wandb.annotation, where the full type should be wandb.runnable.RUNNABLE_NAME or wandb.annotation.ANNOTATION_NAME. Here, RUNNABLE_NAME or ANNOTATION_NAME is the name (aka object_id) component of the backing Object or Op. This is the most common group key and is already indexed in ClickHouse.
  2. I have added 4 new columns to the feedback table. Note, I originally had these as fields in the payload itself, but that would result in more complex, heavy lookups and a more rigid structure over the payload itself. This approach allows us to put our foreign keys in columns that can be indexed in the future if needed:
    • annotation_ref: The ref pointing to the annotation definition for this feedback.
    • runnable_ref: The ref pointing to the runnable definition for this feedback.
    • call_ref: The ref pointing to the resulting call associated with generating this feedback.
    • trigger_ref: The ref pointing to the trigger definition which resulted in this feedback.
  3. When these types of feedback are entered into the DB, our server now enforces that these ref values are filled out when required and match the correct format. Moreover, the payloads themselves conform to a very simple structure:
class AnnotationPayloadSchema(BaseModel):
    value: Any


class RunnablePayloadSchema(BaseModel):
    output: Any
  4. Finally, I implemented basic (not yet optimized) support for filter and sort on the calls query using the notation feedback.[feedback_type].payload.json.selector. This allows us to specify the feedback type (while supporting dots in the type name) and match our other field-access patterns.
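The selector notation in the last item can be sketched as a small parser. This is a hypothetical helper for illustration only; the PR's actual parsing lives in the calls query builder:

```python
import re

# Hypothetical sketch: split "feedback.[TYPE].payload.a.b" into the feedback
# type and the payload path. Brackets isolate the type, which itself
# contains dots (e.g. "wandb.runnable.my_scorer").
_FEEDBACK_FIELD_RE = re.compile(
    r"^feedback\.\[(?P<ftype>[^\]]+)\]\.payload\.(?P<path>.+)$"
)

def parse_feedback_field(field: str):
    """Return (feedback_type, payload_path_parts), or None if not a feedback field."""
    m = _FEEDBACK_FIELD_RE.match(field)
    if m is None:
        return None
    return m.group("ftype"), m.group("path").split(".")
```

Bracketing the type is what lets a dotted type name coexist with the dotted path selector that follows it.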

With all of this together, we can have code like:

@weave.op
def my_scorer(x: int, output: str) -> dict:
    expected = ["a", "b", "c", "d"][x]
    return {
        "model_output": output,
        "match": output == expected,
    }

@weave.op
def my_model(x: int) -> str:
    return [
        "a",
        "x",  # intentional "mistake"
        "c",
        "y",  # intentional "mistake"
    ][x]

ids = []
for x in range(4):
    _, c = my_model.call(x)
    ids.append(c.id)
    # Note: `_apply_scorer` is not user-facing (yet!) but will be made public during the eval api project.
    c._apply_scorer(my_scorer)

... then query ...

calls = client.server.calls_query_stream(
    tsi.CallsQueryReq(
        project_id=client._project_id(),
        filter=tsi.CallsFilter(op_names=[get_ref(my_model).uri()]),
        # Filter down to just correct matches
        query={
            "$expr": {
                "$eq": [
                    {
                        "$getField": "feedback.[wandb.runnable.my_scorer].payload.output.match"
                    },
                    {"$literal": "true"},
                ]
            }
        },
        # Sort by the model output desc
        sort_by=[
            {
                "field": "feedback.[wandb.runnable.my_scorer].payload.output.model_output",
                "direction": "desc",
            }
        ],
    )
)

This can easily be extended to support different aggregation logic and specific version selectors.
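As a rough illustration of the validation rules in point 3, a minimal sketch might look like the following. All names here are hypothetical; the real enforcement lives in the trace server:

```python
from typing import Optional

# Hypothetical sketch of the server-side checks described above.
RUNNABLE_PREFIX = "wandb.runnable."
ANNOTATION_PREFIX = "wandb.annotation."

def required_ref_column(feedback_type: str) -> Optional[str]:
    """Map a feedback_type to the ref column that must be populated."""
    if feedback_type.startswith(RUNNABLE_PREFIX):
        return "runnable_ref"
    if feedback_type.startswith(ANNOTATION_PREFIX):
        return "annotation_ref"
    return None  # free-form feedback: no structured ref required

def validate_feedback(feedback_type: str, payload: dict, refs: dict) -> None:
    """Raise if a structured feedback row is missing its ref or payload key."""
    col = required_ref_column(feedback_type)
    if col is None:
        return
    if not refs.get(col):
        raise ValueError(f"{feedback_type!r} requires {col} to be set")
    # Payloads conform to the simple schemas above: runnables carry an
    # "output", annotations carry a "value".
    key = "output" if col == "runnable_ref" else "value"
    if key not in payload:
        raise ValueError(f"{feedback_type!r} payload must contain {key!r}")
```

The point of the sketch is that both the ref columns and the payload shape are derivable from the feedback_type prefix alone.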

@tssweeney tssweeney requested a review from a team as a code owner November 5, 2024 03:22
circle-job-mirror bot commented Nov 5, 2024

assert feedback["payload"]["name"] == "score"
assert feedback["payload"]["op_ref"] == get_ref(score).uri()
assert feedback["payload"]["results"] == True
assert feedback["feedback_type"] == "wandb.runnable.score"
tssweeney (Collaborator, Author):

This is ok to change as the UI/query layer does not consume it yet.

@@ -39,9 +38,8 @@ def my_score(input_x: int, model_output: int) -> int:

assert len(calls) == 2
feedback = calls[0].summary["weave"]["feedback"][0]
assert feedback["feedback_type"] == SCORE_TYPE_NAME
assert feedback["feedback_type"] == "wandb.runnable.my_score"
tssweeney (Collaborator, Author):

Again, safe to change now that we have a good format.

# We're using "beta.1" to indicate that this is a pre-release version.
from typing import TypedDict

SCORE_TYPE_NAME = "wandb.score.beta.1"
tssweeney (Collaborator, Author):

we learned from this - no longer needed

@gtarpenning (Member) left a comment:

This new feedback query is going to be spicy in big projects, but looks good. The calls query builder is also feeling... clunky. Generally this makes sense; I wonder how much of the implementation we can abstract away from the user when adding feedback, while still creating an intuitive way for them to get the data out. It's possible that we might want some way of auto-constructing queries client-side; I'm imagining users not finding the following easy to use:
"$getField": "feedback.[wandb.runnable.my_scorer].payload.output.match"
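A client-side helper along the lines suggested here could be quite small. This is a hypothetical sketch, not part of this PR:

```python
def feedback_field(feedback_type: str, *path: str) -> str:
    """Build a feedback field selector without making users hand-write the
    bracketed notation. The brackets isolate the dotted feedback type from
    the dotted payload path that follows it."""
    return "feedback.[{}].payload.{}".format(feedback_type, ".".join(path))

# Usage: feedback_field("wandb.runnable.my_scorer", "output", "match")
# produces the same string the raw query above spells out by hand.
```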

weave/trace_server/calls_query_builder.py (review thread resolved)
)
feedback_join_sql = f"""
LEFT JOIN feedback
ON (feedback.weave_ref = concat('weave-trace-internal:///', {_param_slot(project_param, 'String')}, '/call/', calls_merged.id))
gtarpenning (Member):

Any reason to do this concat in the query vs outside and pass it in?

tssweeney (Collaborator, Author):

I don't think so, but I'm not sure. We have to do a concat either way since the last part is dynamic.

weave/trace_server/feedback.py (review threads resolved)
weave/trace_server/orm.py (review thread resolved)
@@ -686,6 +686,18 @@ class FeedbackCreateReq(BaseModel):
}
]
)
annotation_ref: Optional[str] = Field(
gtarpenning (Member):

It would be nice if we could type this to a kind of ref, like ObjectRef, with a pydantic validator, and then check its construction in the client.
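A sketch of the suggested ref-typed value. Everything here is hypothetical, and the ref format in the regex is an assumed simplification, not Weave's actual ref grammar:

```python
import re

# Assumed, simplified ref shape for illustration only:
#   weave:///<entity>/<project>/<kind>/<name>:<digest>
_OBJECT_REF_RE = re.compile(r"^weave:///[^/]+/[^/]+/(object|op)/[^:/]+:[^/]+$")

class ObjectRefStr(str):
    """A string subtype that only constructs from a well-formed object ref,
    so malformed refs fail at creation time rather than deep in the server."""

    def __new__(cls, value: str) -> "ObjectRefStr":
        if not _OBJECT_REF_RE.match(value):
            raise ValueError(f"not a valid object ref: {value!r}")
        return super().__new__(cls, value)
```

A pydantic field could then validate into this type so the check runs on both the client and the server request models.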

tssweeney (Collaborator, Author):

Agreed

@tssweeney tssweeney merged commit 87f3eef into master Nov 6, 2024
115 checks passed
@tssweeney tssweeney deleted the tim/enhanced_feedback_data_model branch November 6, 2024 02:31
@github-actions github-actions bot locked and limited conversation to collaborators Nov 6, 2024