Merge branch 'master' into DOCS-1217
J2-D2-3PO authored Jan 27, 2025
2 parents bfd6df3 + ce57a2e commit e054a26
Showing 31 changed files with 914 additions and 109 deletions.
1 change: 1 addition & 0 deletions .github/workflows/test.yaml
@@ -180,6 +180,7 @@ jobs:
yarn tslint
yarn prettier
yarn run tsc
yarn testci
# ==== Trace Jobs ====
lint:
68 changes: 68 additions & 0 deletions docs/docs/guides/core-types/evaluations.md
@@ -186,3 +186,71 @@ def function_to_evaluate(question: str):

asyncio.run(evaluation.evaluate(function_to_evaluate))
```

## Advanced evaluation usage

### Using `preprocess_model_input` to format dataset rows before evaluating

The `preprocess_model_input` parameter allows you to transform your dataset examples before they are passed to your evaluation function. This is useful when you need to:
- Rename fields to match your model's expected input
- Transform data into the correct format
- Add or remove fields
- Load additional data for each example

Here's a simple example that shows how to use `preprocess_model_input` to rename fields:

```python
import weave
from weave import Evaluation
import asyncio

# Our dataset has "input_text" but our model expects "question"
examples = [
{"input_text": "What is the capital of France?", "expected": "Paris"},
{"input_text": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
{"input_text": "What is the square root of 64?", "expected": "8"},
]

@weave.op()
def preprocess_example(example):
# Rename input_text to question
return {
"question": example["input_text"]
}

@weave.op()
def match_score(expected: str, model_output: dict) -> dict:
return {'match': expected == model_output['generated_text']}

@weave.op()
def function_to_evaluate(question: str):
return {'generated_text': f'Answer to: {question}'}

# Create evaluation with preprocessing
evaluation = Evaluation(
dataset=examples,
scorers=[match_score],
preprocess_model_input=preprocess_example
)

# Run the evaluation
weave.init('preprocessing-example')
asyncio.run(evaluation.evaluate(function_to_evaluate))
```

In this example, our dataset contains examples with an `input_text` field, but our evaluation function expects a `question` parameter. The `preprocess_example` function transforms each example by renaming the field, allowing the evaluation to work correctly.

The preprocessing function:
1. Receives the raw example from your dataset
2. Returns a dictionary with the fields your model expects
3. Is applied to each example before it's passed to your evaluation function

This is particularly useful when working with external datasets that may have different field names or structures than what your model expects.

### Using HuggingFace Datasets with evaluations

We are continuously improving our integrations with third-party services and libraries.

While we work on building more seamless integrations, you can use `preprocess_model_input` as a temporary workaround for using HuggingFace Datasets in Weave evaluations.

See our [Using HuggingFace Datasets in evaluations cookbook](/reference/gen_notebooks/hf_dataset_evals) for the current approach.
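
The core idea is to keep each Weave dataset row lightweight and resolve the full HuggingFace row inside `preprocess_model_input`. The sketch below illustrates that pattern under some assumptions: the `squad` dataset, project name, model, and scorer are placeholders rather than the cookbook's exact code.

```python
import asyncio

import weave
from weave import Evaluation
from datasets import load_dataset  # pip install datasets

# Load a small slice of an arbitrary HuggingFace dataset (placeholder choice)
hf_rows = load_dataset("squad", split="validation[:3]")

# Keep the Weave dataset rows small: store an index plus the expected answer
examples = [
    {"hf_index": i, "expected": hf_rows[i]["answers"]["text"][0]}
    for i in range(len(hf_rows))
]

@weave.op()
def preprocess_example(example):
    # Resolve the full HuggingFace row and map it to the model's expected input
    row = hf_rows[example["hf_index"]]
    return {"question": row["question"]}

@weave.op()
def match_score(expected: str, model_output: dict) -> dict:
    return {"match": expected == model_output["generated_text"]}

@weave.op()
def function_to_evaluate(question: str):
    # Placeholder model
    return {"generated_text": f"Answer to: {question}"}

evaluation = Evaluation(
    dataset=examples,
    scorers=[match_score],
    preprocess_model_input=preprocess_example,
)

weave.init("hf-dataset-example")  # placeholder project name
asyncio.run(evaluation.evaluate(function_to_evaluate))
```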
4 changes: 4 additions & 0 deletions docs/docs/guides/integrations/anthropic.md
@@ -2,6 +2,10 @@

Weave automatically tracks and logs LLM calls made via the [Anthropic Python library](https://github.com/anthropics/anthropic-sdk-python), after `weave.init()` is called.

:::note
Do you want to experiment with Anthropic models on Weave without any setup? Try the [LLM Playground](../tools/playground.md).
:::

## Traces

It’s important to store traces of LLM applications in a central database, both during development and in production. You’ll use these traces for debugging, and as a dataset that will help you improve your application.
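
Below is a minimal sketch of what a traced call can look like; the project name and prompt are placeholders, and the client reads `ANTHROPIC_API_KEY` from the environment:

```python
import weave
from anthropic import Anthropic

weave.init("anthropic-example")  # placeholder project name

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=256,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.content[0].text)  # the call appears as a trace in Weave
```

Because the client is patched automatically after `weave.init()`, no extra decorators should be needed for this basic case.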
4 changes: 4 additions & 0 deletions docs/docs/guides/integrations/azure.md
@@ -2,6 +2,10 @@

Weights & Biases integrates with Microsoft Azure OpenAI services, helping teams to manage, debug, and optimize their Azure AI workflows at scale. This guide introduces the W&B integration, what it means for Weave users, its key features, and how to get started.

:::tip
For the latest tutorials, visit [Weights & Biases on Microsoft Azure](https://wandb.ai/site/partners/azure).
:::

## Key features

- **LLM evaluations**: Evaluate and monitor LLM-powered applications using Weave, optimized for Azure infrastructure.
8 changes: 8 additions & 0 deletions docs/docs/guides/integrations/bedrock.md
@@ -2,6 +2,14 @@

Weave automatically tracks and logs LLM calls made via Amazon Bedrock, AWS's fully managed service that offers foundation models from leading AI companies through a unified API.

:::tip
For the latest tutorials, visit [Weights & Biases on Amazon Web Services](https://wandb.ai/site/partners/aws/).
:::

:::note
Do you want to experiment with Amazon Bedrock models on Weave without any setup? Try the [LLM Playground](../tools/playground.md).
:::

## Traces

Weave will automatically capture traces for Bedrock API calls. You can use the Bedrock client as usual after initializing Weave and patching the client:
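
As a rough sketch of that flow, assuming a patch helper along the lines of `patch_client` (the import path, project name, and model ID below are assumptions and may differ in your Weave version):

```python
import json

import boto3
import weave
from weave.integrations.bedrock.bedrock_sdk import patch_client  # assumed import path

weave.init("bedrock-example")  # placeholder project name

client = boto3.client("bedrock-runtime")
patch_client(client)  # patch the client so Weave can capture Bedrock calls

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    contentType="application/json",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    }),
)
print(json.loads(response["body"].read()))
```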
8 changes: 8 additions & 0 deletions docs/docs/guides/integrations/google-gemini.md
@@ -1,5 +1,13 @@
# Google Gemini

:::tip
For the latest tutorials, visit [Weights & Biases on Google Cloud](https://wandb.ai/site/partners/googlecloud/).
:::

:::note
Do you want to experiment with Google Gemini models on Weave without any setup? Try the [LLM Playground](../tools/playground.md).
:::

Google offers two ways of calling Gemini via API:

1. Via the [Vertex APIs](https://cloud.google.com/vertex-ai/docs).
4 changes: 4 additions & 0 deletions docs/docs/guides/integrations/groq.md
@@ -1,5 +1,9 @@
# Groq

:::note
Do you want to experiment with Groq models on Weave without any setup? Try the [LLM Playground](../tools/playground.md).
:::

[Groq](https://groq.com/) is the AI infrastructure company that delivers fast AI inference. The LPU™ Inference Engine by Groq is a hardware and software platform that delivers exceptional compute speed, quality, and energy efficiency. Weave automatically tracks and logs Groq chat completion calls.
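
A minimal sketch of a traced Groq call might look like the following; the project name and model are placeholders, and the client reads `GROQ_API_KEY` from the environment:

```python
import weave
from groq import Groq

weave.init("groq-example")  # placeholder project name

client = Groq()  # reads GROQ_API_KEY from the environment
chat_completion = client.chat.completions.create(
    model="llama3-8b-8192",  # placeholder model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat_completion.choices[0].message.content)  # traced automatically by Weave
```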

## Tracing
13 changes: 6 additions & 7 deletions docs/docs/guides/integrations/index.md
@@ -10,20 +10,19 @@ Weave provides automatic logging integrations for popular LLM providers and orch

LLM providers are the vendors that offer access to large language models for generating predictions. Weave integrates with these providers to log and trace the interactions with their APIs:

- **[Amazon Bedrock](/guides/integrations/bedrock)**
- **[Anthropic](/guides/integrations/anthropic)**
- **[Cerebras](/guides/integrations/cerebras)**
- **[Cohere](/guides/integrations/cohere)**
- **[Google Gemini](/guides/integrations/google-gemini)**
- **[Groq](/guides/integrations/groq)**
- **[LiteLLM](/guides/integrations/litellm)**
- **[Microsoft Azure](/guides/integrations/azure)**
- **[MistralAI](/guides/integrations/mistral)**
- **[NVIDIA NIM](/guides/integrations/nvidia_nim)**
- **[OpenAI](/guides/integrations/openai)**
- **[Open Router](/guides/integrations/openrouter)**
- **[Together AI](/guides/integrations/together_ai)**

**[Local Models](/guides/integrations/local_models)**: For when you're running models on your own infrastructure.

4 changes: 4 additions & 0 deletions docs/docs/guides/integrations/nvidia_nim.md
@@ -5,6 +5,10 @@ import TabItem from '@theme/TabItem';

Weave automatically tracks and logs LLM calls made via the [ChatNVIDIA](https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints/) library, after `weave.init()` is called.

:::tip
For the latest tutorials, visit [Weights & Biases on NVIDIA](https://wandb.ai/site/partners/nvidia).
:::

## Tracing

It’s important to store traces of LLM applications in a central database, both during development and in production. You’ll use these traces for debugging and to help build a dataset of tricky examples to evaluate against while improving your application.
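
As a minimal sketch, a traced ChatNVIDIA call can look roughly like this; the project name and model ID are placeholders, and `NVIDIA_API_KEY` is assumed to be set in the environment:

```python
import weave
from langchain_nvidia_ai_endpoints import ChatNVIDIA  # pip install langchain-nvidia-ai-endpoints

weave.init("nvidia-nim-example")  # placeholder project name

llm = ChatNVIDIA(model="meta/llama3-8b-instruct")  # placeholder model ID
response = llm.invoke("Say hello in one sentence.")
print(response.content)  # the call appears as a trace in Weave
```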
4 changes: 4 additions & 0 deletions docs/docs/guides/integrations/openai.md
@@ -3,6 +3,10 @@ import TabItem from '@theme/TabItem';

# OpenAI

:::note
Do you want to experiment with OpenAI models on Weave without any setup? Try the [LLM Playground](../tools/playground.md).
:::

## Tracing

It’s important to store traces of LLM applications in a central database, both during development and in production. You’ll use these traces for debugging and to help build a dataset of tricky examples to evaluate against while improving your application.
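
For example, a minimal traced call can look like the following sketch; the project name, model, and prompt are placeholders, and the client reads `OPENAI_API_KEY` from the environment:

```python
import weave
from openai import OpenAI

weave.init("openai-example")  # placeholder project name

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)  # traced automatically by Weave
```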
106 changes: 47 additions & 59 deletions docs/docs/guides/tools/playground.md
@@ -2,7 +2,7 @@

> **The LLM Playground is currently in preview.**
Evaluating LLM prompts and responses is challenging. The Weave Playground is designed to simplify the process of iterating on LLM prompts and responses, making it easier to experiment with different models and prompts. With features like prompt editing, message retrying, and model comparison, Playground helps you to quickly test and improve your LLM applications. Playground currently supports OpenAI, Anthropic, Google Gemini, Groq, and Amazon Bedrock models.

## Features

@@ -37,7 +37,7 @@ To use one of the available models, add the appropriate information to your team

- OpenAI: `OPENAI_API_KEY`
- Anthropic: `ANTHROPIC_API_KEY`
- Google Gemini: `GOOGLE_API_KEY`
- Groq: `GEMMA_API_KEY`
- Amazon Bedrock:
- `AWS_ACCESS_KEY_ID`
@@ -60,117 +60,105 @@ There are two ways to access the Playground:

You can switch the LLM using the dropdown menu in the top left. The available models from various providers are listed below:

- [Amazon Bedrock](#amazon-bedrock)
- [Anthropic](#anthropic)
- [Google Gemini](#gemini)
- [Groq](#groq)
- [OpenAI](#openai)
- [X.AI](#xai)

### [Amazon Bedrock](../integrations/bedrock.md)

- ai21.j2-mid-v1
- ai21.j2-ultra-v1
- amazon.nova-lite-v1:0
- amazon.nova-micro-v1:0
- amazon.nova-pro-v1:0
- amazon.titan-text-express-v1
- amazon.titan-text-lite-v1
- anthropic.claude-3-5-sonnet-20240620-v1:0
- anthropic.claude-3-haiku-20240307-v1:0
- anthropic.claude-3-opus-20240229-v1:0
- anthropic.claude-3-sonnet-20240229-v1:0
- anthropic.claude-instant-v1
- anthropic.claude-v2
- anthropic.claude-v2:1
- cohere.command-light-text-v14
- cohere.command-r-plus-v1:0
- cohere.command-r-v1:0
- cohere.command-text-v14
- meta.llama2-13b-chat-v1
- meta.llama2-70b-chat-v1
- meta.llama3-1-405b-instruct-v1:0
- meta.llama3-1-70b-instruct-v1:0
- meta.llama3-1-8b-instruct-v1:0
- meta.llama3-70b-instruct-v1:0
- meta.llama3-8b-instruct-v1:0
- mistral.mistral-7b-instruct-v0:2
- mistral.mistral-large-2402-v1:0
- mistral.mistral-large-2407-v1:0
- mistral.mixtral-8x7b-instruct-v0:1

### [Anthropic](../integrations/anthropic.md)

- claude-3-5-sonnet-20240620
- claude-3-5-sonnet-20241022
- claude-3-haiku-20240307
- claude-3-opus-20240229
- claude-3-sonnet-20240229

### [Google Gemini](../integrations/google-gemini.md)

- gemini/gemini-1.5-flash-001
- gemini/gemini-1.5-flash-002
- gemini/gemini-1.5-flash-8b-exp-0827
- gemini/gemini-1.5-flash-8b-exp-0924
- gemini/gemini-1.5-flash-exp-0827
- gemini/gemini-1.5-flash-latest
- gemini/gemini-1.5-flash
- gemini/gemini-1.5-pro-001
- gemini/gemini-1.5-pro-002
- gemini/gemini-1.5-pro-exp-0801
- gemini/gemini-1.5-pro-exp-0827
- gemini/gemini-1.5-pro-latest
- gemini/gemini-1.5-pro
- gemini/gemini-pro

### [Groq](../integrations/groq.md)

- groq/gemma-7b-it
- groq/gemma2-9b-it
- groq/llama-3.1-70b-versatile
- groq/llama-3.1-8b-instant
- groq/llama3-70b-8192
- groq/llama3-8b-8192
- groq/llama3-groq-70b-8192-tool-use-preview
- groq/llama3-groq-8b-8192-tool-use-preview
- groq/mixtral-8x7b-32768

### [OpenAI](../integrations/openai.md)

- gpt-3.5-turbo-0125
- gpt-3.5-turbo-1106
- gpt-3.5-turbo-16k
- gpt-3.5-turbo
- gpt-4-0125-preview
- gpt-4-0314
- gpt-4-0613
- gpt-4-1106-preview
- gpt-4-32k-0314
- gpt-4-turbo-2024-04-09
- gpt-4-turbo-preview
- gpt-4-turbo
- gpt-4
- gpt-4o-2024-05-13
- gpt-4o-2024-08-06
- gpt-4o-2024-11-20
- gpt-4o-mini-2024-07-18
- gpt-4o-mini
- gpt-4o
- o1-2024-12-17
- o1-mini-2024-09-12
- o1-mini
- o1-preview-2024-09-12
- o1-preview

### X.AI

- xai/grok-beta

## Adjust LLM parameters
