# Evaluation Metrics

## Evaluations in Weave
In Weave, Scorers are used to evaluate AI outputs and return evaluation metrics. They take the AI's output, analyze it, and return a dictionary with the results. Scorers can use your input data as a reference if needed, and can also return extra information, such as explanations or reasoning from the evaluation.

Scorers are passed to a `weave.Evaluation` object during evaluation. There are two types of Scorers in Weave:

1. **Function-based Scorers:** Simple Python functions decorated with `@weave.op`.
2. **Class-based Scorers:** Python classes that inherit from `weave.Scorer` for more complex evaluations.

Scorers must return a dictionary. They can return multiple metrics, nested metrics, and non-numeric values such as text returned from an LLM-evaluator explaining its reasoning.
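For example, a scorer might return nested metrics alongside a free-text explanation. A purely illustrative sketch (the metric names and values are hypothetical):

```python
import weave

@weave.op
def quality_scorer(output: str) -> dict:
    # Nested metrics and non-numeric values are all valid in the returned dict.
    return {
        "answer_correct": True,
        "fluency": {"grammar_score": 0.9, "readability_score": 0.7},
        "reasoning": "Hypothetical LLM-evaluator explanation of the judgment.",
    }
```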

## Function-based Scorers
These are functions decorated with `@weave.op` that return a dictionary. They're great for simple evaluations:

```python
import weave

@weave.op
def evaluate_uppercase(text: str) -> dict:
    return {"text_is_uppercase": text.isupper()}

eval = weave.Evaluation(..., scorers=[evaluate_uppercase])
```

When the evaluation is run, `evaluate_uppercase` checks if the text is all uppercase.
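
To run a function-based scorer end to end, pass it to a `weave.Evaluation` along with a dataset and a model. A minimal sketch, assuming a toy dataset and model (the project name, dataset rows, and model below are hypothetical; in this variant the scorer reads the model's output via the special `output` parameter):

```python
import asyncio
import weave

weave.init("uppercase-demo")  # hypothetical project name

dataset = [{"text": "hello world"}, {"text": "HELLO WORLD"}]

@weave.op
def model(text: str) -> str:
    # Stand-in for a real AI system: it just uppercases its input.
    return text.upper()

@weave.op
def evaluate_uppercase(output: str) -> dict:
    # `output` receives the model's return value for each dataset row.
    return {"text_is_uppercase": output.isupper()}

evaluation = weave.Evaluation(dataset=dataset, scorers=[evaluate_uppercase])
asyncio.run(evaluation.evaluate(model))
```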

## Class-based Scorers
For more advanced evaluations, especially when you need to keep track of additional scorer metadata, try different prompts for your LLM-evaluators, or make multiple function calls, you can use the `Scorer` class.

**Requirements:**
- Inherit from `weave.Scorer`.
- Implement a `score` method decorated with `@weave.op` that returns a dictionary.
Example:

```python
import weave
from weave import Scorer

class SummarizationScorer(Scorer):
    model_id: str = "the LLM model to use"
    system_prompt: str = "Evaluate whether the summary is good."

    @weave.op
    def some_complicated_preprocessing(self, text: str) -> str:
        ...
        return text

    @weave.op
    def llm_call(self, summary: str, text: str) -> dict:
        # `create` stands in for a call to your LLM client (not shown here).
        res = create(self.system_prompt, summary, text)
        return {"summary_quality": res}

    @weave.op
    def score(self, output: str, text: str) -> dict:
        """
        output: The summary generated by an AI system
        text: The original text being summarised
        """
        text = self.some_complicated_preprocessing(text)
        return self.llm_call(output, text)

summarization_scorer = SummarizationScorer(model_id="o2")
eval = weave.Evaluation(..., scorers=[summarization_scorer])
```
For example, if your dataset has a `news_article` column, you can access it in the scorer by adding a `news_article` parameter to your `score` method.
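
For instance, a function-based scorer picking up that column might look like this (a hypothetical sketch; the scorer name and metric are illustrative):

```python
import weave

@weave.op
def news_scorer(output: str, news_article: str) -> dict:
    # `news_article` is filled from the dataset column of the same name;
    # `output` receives the model's output for that row.
    return {"mentions_article": news_article[:20] in output}
```
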
### Mapping Column Names
Sometimes, the scorer's parameter names don't match the column names in your dataset. You can fix this using a `column_map`.

If you're using a class-based scorer, pass a dictionary to the `column_map` attribute of `Scorer` when you initialise your scorer class. This dictionary maps your scorer's parameter names to the dataset's column names, in the order: `{scorer keyword argument : dataset column name}`.

Example:

```python
from weave import Scorer

class SummarizationScorer(Scorer):
    @weave.op
    def score(self, output: str, text: str) -> dict:
        """
        output: The summary generated by an AI system
        text: The original text being summarised
        """
        ...  # evaluate the quality of the summary

# create a scorer with a column_map mapping the `text` parameter
# to the `news_article` dataset column
scorer = SummarizationScorer(column_map={"text": "news_article"})
```
In this case, Weave's evaluation will automatically pass the output of the LLM summarization system to the `output` parameter of the `score` function, and will extract the value of the `news_article` key from the input dataset row and pass it to the `text` parameter of the scorer.
Here, the `text` parameter in the `score` method will receive data from the `news_article` column.
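
Concretely, with a hypothetical dataset row, the mapping works like this:

```python
# A hypothetical dataset row:
row = {"news_article": "The sky is blue...", "date": "2024-10-16"}

# With column_map={"text": "news_article"}, the evaluation effectively calls:
# scorer.score(output=<model output for this row>, text=row["news_article"])
```
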
## Predefined Scorers

### `HallucinationFreeScorer`

---
### `SummarizationScorer`
Use an LLM to compare a summary to the original text and evaluate the quality of the summary.
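
A hypothetical usage sketch (the import path and parameters are assumptions; verify them against your Weave version's API):

```python
# Assumed import path and parameters -- check your Weave version.
from weave.scorers import SummarizationScorer

summarization_scorer = SummarizationScorer(model_id="gpt-4o")
evaluation = weave.Evaluation(..., scorers=[summarization_scorer])
```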