# Evaluation Metrics

## Evaluations in Weave
In Weave, Scorers are used to evaluate AI outputs and return evaluation metrics. They take the AI's output, analyze it, and return a dictionary with the results. Scorers can use your input data as a reference if needed, and can also return extra information, such as explanations or reasoning from the evaluation.

Scorers are passed to a `weave.Evaluation` object during evaluation. There are two types of Scorers in Weave:

1. **Function-based Scorers:** Simple Python functions decorated with `@weave.op`.
2. **Class-based Scorers:** Python classes that inherit from `weave.Scorer` for more complex evaluations.

Scorers must return a dictionary. They can return multiple metrics, nested metrics, and non-numeric values such as text returned from an LLM-evaluator explaining its reasoning.
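For example, a scorer might return nested metrics alongside a free-text explanation. A purely illustrative sketch (the metric names and values are hypothetical):

```python
import weave

@weave.op
def quality_scorer(output: str) -> dict:
    # Nested metrics and non-numeric values are all valid in the returned dict.
    return {
        "answer_correct": True,
        "fluency": {"grammar_score": 0.9, "readability_score": 0.7},
        "reasoning": "Hypothetical LLM-evaluator explanation of the judgment.",
    }
```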

## Function-based Scorers
These are functions decorated with `@weave.op` that return a dictionary. They're great for simple evaluations:

```python
import weave

@weave.op
def evaluate_uppercase(text: str) -> dict:
    return {"text_is_uppercase": text.isupper()}

eval = weave.Evaluation(..., scorers=[evaluate_uppercase])
```

When the evaluation is run, `evaluate_uppercase` checks if the text is all uppercase.
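
To run a function-based scorer end to end, pass it to a `weave.Evaluation` along with a dataset and a model. A minimal sketch, assuming a toy dataset and model (the project name, dataset rows, and model below are hypothetical; in this variant the scorer reads the model's output via the special `output` parameter):

```python
import asyncio
import weave

weave.init("uppercase-demo")  # hypothetical project name

dataset = [{"text": "hello world"}, {"text": "HELLO WORLD"}]

@weave.op
def model(text: str) -> str:
    # Stand-in for a real AI system: it just uppercases its input.
    return text.upper()

@weave.op
def evaluate_uppercase(output: str) -> dict:
    # `output` receives the model's return value for each dataset row.
    return {"text_is_uppercase": output.isupper()}

evaluation = weave.Evaluation(dataset=dataset, scorers=[evaluate_uppercase])
asyncio.run(evaluation.evaluate(model))
```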

## Class-based Scorers
For more advanced evaluations, especially when you need to keep track of additional scorer metadata, try different prompts for your LLM-evaluators, or make multiple function calls, you can use the `Scorer` class.

**Requirements:**
- Inherit from `weave.Scorer`.
- Implement a `score` method decorated with `@weave.op` that returns a dictionary.
Example:

```python
import weave
from weave import Scorer

class SummarizationScorer(Scorer):
    model_id: str = "the LLM model to use"
    system_prompt: str = "Evaluate whether the summary is good."

    @weave.op
    def some_complicated_preprocessing(self, text: str) -> str:
        ...
        return text

    @weave.op
    def llm_call(self, summary: str, text: str) -> dict:
        # `create` stands in for a call to your LLM client (not shown here).
        res = create(self.system_prompt, summary, text)
        return {"summary_quality": res}

    @weave.op
    def score(self, output: str, text: str) -> dict:
        """
        output: The summary generated by an AI system
        text: The original text being summarised
        """
        text = self.some_complicated_preprocessing(text)
        return self.llm_call(output, text)

summarization_scorer = SummarizationScorer(model_id="o2")
eval = weave.Evaluation(..., scorers=[summarization_scorer])
```
For example, if your dataset has a `news_article` column, you can access it in the scorer by adding a `news_article` parameter to your `score` method.
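
For instance, a function-based scorer picking up that column might look like this (a hypothetical sketch; the scorer name and metric are illustrative):

```python
import weave

@weave.op
def news_scorer(output: str, news_article: str) -> dict:
    # `news_article` is filled from the dataset column of the same name;
    # `output` receives the model's output for that row.
    return {"mentions_article": news_article[:20] in output}
```
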
### Mapping Column Names
Sometimes, the scorer's parameter names don't match the column names in your dataset. You can fix this using a `column_map`.

If you're using a class-based scorer, pass a dictionary to the `column_map` attribute of `Scorer` when you initialise your scorer class. This dictionary maps your scorer's parameter names to the dataset's column names, in the order: `{scorer keyword argument : dataset column name}`.

Example:

```python
from weave import Scorer

class SummarizationScorer(Scorer):
    @weave.op
    def score(self, output: str, text: str) -> dict:
        """
        output: The summary generated by an AI system
        text: The original text being summarised
        """
        ...  # evaluate the quality of the summary

# create a scorer with a column_map mapping the `text` parameter
# to the `news_article` dataset column
scorer = SummarizationScorer(column_map={"text": "news_article"})
```
In this case, Weave's evaluation will automatically pass the output of the LLM summarization system to the `output` parameter of the `score` function, and will extract the value of the `news_article` key from the input dataset row and pass it to the `text` parameter of the scorer.
Here, the `text` parameter in the `score` method will receive data from the `news_article` column.
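
Concretely, with a hypothetical dataset row, the mapping works like this:

```python
# A hypothetical dataset row:
row = {"news_article": "The sky is blue...", "date": "2024-10-16"}

# With column_map={"text": "news_article"}, the evaluation effectively calls:
# scorer.score(output=<model output for this row>, text=row["news_article"])
```
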
## Predefined Scorers

### `HallucinationFreeScorer`

---
### `SummarizationScorer`
Use an LLM to compare a summary to the original text and evaluate the quality of the summary.
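
A hypothetical usage sketch (the import path and parameters are assumptions; verify them against your Weave version's API):

```python
# Assumed import path and parameters -- check your Weave version.
from weave.scorers import SummarizationScorer

summarization_scorer = SummarizationScorer(model_id="gpt-4o")
evaluation = weave.Evaluation(..., scorers=[summarization_scorer])
```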