[Evals] Evaluation docs improvements #985

Closed
ssbushi opened this issue Oct 1, 2024 · 3 comments
Labels: devui, docs

ssbushi commented Oct 1, 2024

No description provided.

@ssbushi ssbushi self-assigned this Oct 1, 2024
@ssbushi ssbushi converted this from a draft issue Oct 1, 2024
@ssbushi ssbushi added the docs Improvements or additions to documentation label Oct 1, 2024

ssbushi commented Oct 1, 2024

Autogenerated from Gemini:

This analysis highlights several areas where the documentation for Genkit, particularly around evaluation, could be improved:

* **Clarify how evaluators are standardized.** The docs acknowledge that while evaluation metrics like Faithfulness and Answer Relevance are becoming standardized, their implementations can vary. The documentation should provide more concrete information on this, perhaps by:
    *  Giving specific examples of how implementations can differ.
    *  Offering guidance on choosing the best implementation for different use cases.
    *  Explaining how Genkit handles these variations to ensure consistency.

* **Provide more guidance on quantifying output variables.** The docs mention that users can define custom evaluation metrics, but they should offer more support on how to do this effectively. Consider adding:
    *  Examples of quantifying different types of outputs.
    *  Best practices for designing custom metrics.
    *  A step-by-step guide to implementing custom evaluators (see the sketch after this list).

* **Expand on the scope of pre-defined evaluators.** Users need a clearer understanding of what metrics like "Maliciousness" actually measure. The documentation should:
    *  Provide detailed explanations of each pre-defined metric.
    *  Clarify which RAGAS metrics are included in Genkit.
    *  Offer examples of how these metrics are used in practice.

* **Improve the description of "Maliciousness"**. The current explanation is vague. The documentation should clearly define what constitutes "maliciousness" in the context of LLMs and how the evaluator identifies it.

* **Clarify the analogy to testing.** While the docs liken evaluators to E2E testing, they could be more explicit about how evaluators fit into the development process. This could involve:
    *  Explaining when and how to use evaluators during development.
    *  Providing examples of how evaluators can help identify regressions.
    *  Discussing how evaluators can be integrated into a CI/CD pipeline (see the CLI note after the closing paragraph below).
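
To make the custom-evaluator ask concrete, here is a minimal sketch of the kind of example the docs could include. It assumes Genkit's `defineEvaluator` API; the exact import path and datapoint type have shifted between Genkit releases, and the `myPlugin/regex_match` metric itself is hypothetical, not something Genkit ships:

```ts
import { defineEvaluator, BaseEvalDataPoint } from '@genkit-ai/ai/evaluator';

// Hypothetical heuristic metric: scores 1 if the flow output matches a
// regex supplied in the test case's `reference` field, 0 otherwise.
// A custom metric could instead call an LLM to grade the output, as the
// RAGAS-style evaluators do.
export const regexMatch = defineEvaluator(
  {
    name: 'myPlugin/regex_match', // hypothetical evaluator name
    displayName: 'Regex Match',
    definition: 'Checks whether the flow output matches an expected pattern.',
  },
  async (datapoint: BaseEvalDataPoint) => {
    const output = String(datapoint.output ?? '');
    const pattern = new RegExp(String(datapoint.reference ?? '.*'));
    return {
      testCaseId: datapoint.testCaseId,
      evaluation: {
        score: pattern.test(output) ? 1 : 0,
        details: { reasoning: `Tested output against /${pattern.source}/` },
      },
    };
  }
);
```

A walkthrough built around something this small would cover all three sub-points at once: it quantifies a fuzzy output into a 0/1 score, shows the result shape a custom metric must return, and is short enough to copy and run.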

By addressing these points, the documentation can better support users in understanding and effectively using Genkit's evaluation features.
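
On the CI/CD sub-point specifically, the docs could show the evaluation commands being scripted rather than run interactively, e.g. `genkit eval:flow myFlow --input testInputs.json` to execute a flow over a test set and apply the configured evaluators as a pipeline step. (The command shape follows the Genkit CLI docs; exact names and flags may differ between versions, and `myFlow` / `testInputs.json` are placeholders.)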

Context: https://discord.com/channels/1255578482214305893/1281391213550895124/1282325935038926868


odbol commented Dec 11, 2024

Also, making the example code actually compile would be nice.

@ssbushi ssbushi moved this to In Progress in Genkit Backlog Dec 12, 2024

odbol commented Dec 12, 2024

Made the code compile: https://github.com/firebase/genkit/pull/1497/files

@ssbushi ssbushi closed this as completed Jan 20, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Genkit Backlog Jan 20, 2025