Merge pull request #27 from hack-dance/adna--port
initial port of edna evals
Showing 23 changed files with 672 additions and 13 deletions.
@@ -0,0 +1,13 @@
# evalz

## Introduction

### Overview

**evalz** is a TypeScript package designed to facilitate model-graded evaluations with a focus on structured output. Leveraging Zod schemas, **evalz** streamlines the evaluation of AI-generated responses, providing tools to assess quality against custom criteria such as relevance, fluency, and completeness. It uses OpenAI's GPT models to perform evaluations, offering both simple and weighted evaluation mechanisms.

### Key Features

- **Structured Evaluation Models**: Define your evaluation logic using Zod schemas to ensure data integrity throughout your application.
- **Flexible Evaluation Strategies**: Supports score-based and binary evaluations with customizable evaluators.
- **Easy Integration**: Designed to integrate seamlessly with existing TypeScript projects, enhancing AI and data processing workflows with minimal setup.
- **Custom Evaluations**: Define evaluation criteria tailored to your specific requirements.
- **Weighted Evaluations**: Combine multiple evaluations with custom weights to calculate a composite score.
@@ -27,7 +27,6 @@
"engines": { | ||
"node": ">=18" | ||
}, | ||
"dependencies": {}, | ||
"packageManager": "[email protected]", | ||
"workspaces": [ | ||
"apps/*", | ||
|
@@ -0,0 +1,20 @@
const { resolve } = require("node:path")

const project = resolve(__dirname, "tsconfig.lint.json")

/** @type {import("eslint").Linter.Config} */
module.exports = {
  root: true,
  ignorePatterns: [".eslintrc.cjs"],
  extends: ["@repo/eslint-config/react-internal.js"],
  parser: "@typescript-eslint/parser",
  parserOptions: {
    project
  },
  overrides: [
    {
      extends: ["plugin:@typescript-eslint/disable-type-checked"],
      files: ["./**/*.js", "*.js"]
    }
  ]
}
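The override disables type-aware lint rules for plain `.js` files, which fall outside the `tsconfig.lint.json` project and would otherwise trip the type-checked parser.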
Empty file.
@@ -0,0 +1,93 @@
# evalz

<div align="center">
  <img alt="GitHub issues" src="https://img.shields.io/github/issues/hack-dance/island-ai.svg?style=flat-square&labelColor=000000">
  <img alt="NPM version" src="https://img.shields.io/npm/v/evalz.svg?style=flat-square&logo=npm&labelColor=000000&label=evalz">
  <img alt="License" src="https://img.shields.io/npm/l/evalz.svg?style=flat-square&labelColor=000000">
</div>

**evalz** is a TypeScript package designed to facilitate model-graded evaluations with a focus on structured output. Leveraging Zod schemas, **evalz** streamlines the evaluation of AI-generated responses, providing tools to assess quality against custom criteria such as relevance, fluency, and completeness. It uses OpenAI's GPT models to perform evaluations, offering both simple and weighted evaluation mechanisms.

## Features

- **Structured Evaluation Models**: Define your evaluation logic using Zod schemas to ensure data integrity throughout your application.
- **Flexible Evaluation Strategies**: Supports score-based and binary evaluations with customizable evaluators.
- **Easy Integration**: Designed to integrate seamlessly with existing TypeScript projects, enhancing AI and data processing workflows with minimal setup.
- **Custom Evaluations**: Define evaluation criteria tailored to your specific requirements.
- **Weighted Evaluations**: Combine multiple evaluations with custom weights to calculate a composite score.
## Installation

Install `evalz` using your preferred package manager:

```bash
npm install evalz openai zod

bun add evalz openai zod

pnpm add evalz openai zod
```
## Basic Usage

### Creating an Evaluator

First, create an evaluator for assessing a single aspect of a response, such as its relevance:

```typescript
import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

function relevanceEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-1106-preview",
    evaluationDescription: "Rate the relevance from 0 to 1."
  });
}
```
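The evaluator implementation in this diff also accepts a `resultsType` option (`"score"` by default, or `"binary"`). A minimal sketch of a binary evaluator, reusing the `oai` client above (`containsAnswerEval` and its criterion are illustrative, not part of the package):

```typescript
// Binary evaluators prompt the model for a 0-or-1 verdict instead of a
// continuous score; results come back as true/false counts.
function containsAnswerEval() {
  return createEvaluator({
    client: oai,
    model: "gpt-4-1106-preview",
    resultsType: "binary",
    evaluationDescription: "Score 1 if the response answers the question, otherwise 0."
  });
}
```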
### Conducting an Evaluation

Evaluate AI-generated content by passing the response data to your evaluator:

```typescript
const evaluator = relevanceEval();

const result = await evaluator({ data: yourResponseData });
console.log(result.scoreResults);
```
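The README leaves `yourResponseData` abstract; judging from the evaluator source included in this diff, each item pairs a `prompt` with a `completion` and an optional `expectedCompletion`. A sketch with placeholder data:

```typescript
const result = await evaluator({
  data: [
    {
      prompt: "When did the French Revolution begin?",
      completion: "The French Revolution began in 1789.",
      expectedCompletion: "1789" // optional reference answer
    }
  ]
});

// For a "score" evaluator, scoreResults.value is the average score across items.
console.log(result.scoreResults);
```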
### Weighted Evaluation

Combine multiple evaluators with specified weights for a comprehensive assessment:

```typescript
import { createWeightedEvaluator } from "evalz";

const weightedEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval()
  },
  weights: {
    relevance: 0.25,
    fluency: 0.25,
    completeness: 0.5
  }
});

const result = await weightedEvaluator({ data: yourResponseData });
console.log(result.scoreResults);
```
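The keys of `weights` mirror the keys of `evaluators`, and in this example they sum to 1, so `scoreResults.value` can be read as a weighted average of the three component scores. (`createWeightedEvaluator` itself is not part of this diff, so whether non-normalized weights are rescaled is not shown here.)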
## Contributing

Contributions are welcome! Please submit a pull request or open an issue to propose changes or additions.
@@ -0,0 +1,67 @@
{
  "name": "evalz",
  "version": "0.0.1--alpha.1",
  "description": "Model graded evals with typescript",
  "publishConfig": {
    "access": "public"
  },
  "type": "module",
  "main": "./dist/index.js",
  "module": "./dist/index.js",
  "exports": {
    ".": {
      "types": "./dist/index.d.ts",
      "require": "./dist/index.cjs",
      "import": "./dist/index.js",
      "default": "./dist/index.js"
    }
  },
  "files": [
    "dist/**"
  ],
  "typings": "./dist/index.d.ts",
  "scripts": {
    "test": "bun test --coverage --verbose",
    "build": "tsup",
    "dev": "tsup --watch",
    "lint": "TIMING=1 eslint src/**/*.ts* --fix",
    "clean": "rm -rf .turbo && rm -rf node_modules && rm -rf dist",
    "type-check": "tsc --noEmit"
  },
  "repository": {
    "directory": "public-packages/edna",
    "type": "git",
    "url": "git+https://github.com/hack-dance/island-ai.git"
  },
  "keywords": [
    "llm",
    "structured output",
    "streaming",
    "evals",
    "openai",
    "zod"
  ],
  "license": "MIT",
  "author": "Dimitri Kennedy <[email protected]> (https://hack.dance)",
  "homepage": "https://island.novy.work",
  "dependencies": {
    "zod-stream": "workspace:*"
  },
  "peerDependencies": {
    "openai": ">=4.24.1",
    "zod": ">=3.22.4"
  },
  "devDependencies": {
    "@repo/eslint-config": "workspace:*",
    "@repo/typescript-config": "workspace:*",
    "zod-stream": "workspace:*",
    "@turbo/gen": "^1.10.12",
    "@types/node": "^20.5.2",
    "@types/eslint": "^8.44.7",
    "eslint": "^8.53.0",
    "tsup": "^8.0.1",
    "typescript": "^5.2.2",
    "ramda": "^0.29.0",
    "zod": "3.22.4"
  }
}
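The `exports` map dual-publishes the package: the `import` condition resolves to the ESM build and `require` to the CJS build, with types served from the same declaration file. A minimal consumer sketch (assuming only that `createEvaluator` is exported from the package entry, as the README above shows):

```typescript
// ESM consumers resolve ./dist/index.js via the "import" condition:
import { createEvaluator } from "evalz";

// CJS consumers resolve ./dist/index.cjs via the "require" condition:
// const { createEvaluator } = require("evalz");
```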
@@ -0,0 +1,15 @@
import { ResultsType } from "@/types"

export const CUSTOM_EVALUATOR_IDENTITY =
  "You are an AI evaluator tasked with scoring a language model's responses. You'll be presented with a 'prompt:' and 'response:' pair (and optionally an 'expectedResponse') and should evaluate based on the criteria provided in the subsequent system prompts. Provide only a numerical score in the range defined, not a descriptive response and no other prose."

export const RESPONSE_TYPE_EVALUATOR_SCORE =
  "Your task is to provide a numerical score ranging from 0 to 1 based on the criteria in the subsequent system prompts. The score should precisely reflect the performance of the language model's response. Do not provide any text explanation or feedback, only the numerical score."

export const RESPONSE_TYPE_EVALUATOR_BINARY =
  "Your task is to provide a binary score of either 0 or 1 based on the criteria in the subsequent system prompts. This should precisely reflect the language model's performance. Do not provide any text explanation or feedback, only a singular digit: 1 or 0."

export const RESULTS_TYPE_PROMPT: Record<ResultsType, string> = {
  score: RESPONSE_TYPE_EVALUATOR_SCORE,
  binary: RESPONSE_TYPE_EVALUATOR_BINARY
}
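Splitting the evaluator identity, the results-type instruction, and the caller's criteria into separate constants lets each be sent as its own system message; `createEvaluator` in the next file composes them in exactly that order.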
@@ -0,0 +1,114 @@
import { EvaluationResponse, Evaluator, ExecuteEvalParams, ResultsType } from "@/types"
import z from "zod"
import { createAgent, type CreateAgentParams } from "zod-stream"

import { CUSTOM_EVALUATOR_IDENTITY, RESULTS_TYPE_PROMPT } from "@/constants/prompts"

// Structured-output schema: the model must return a single numeric score.
const scoringSchema = z.object({
  score: z.number()
})

export function createEvaluator<T extends ResultsType>({
  resultsType = "score" as T,
  evaluationDescription,
  model,
  messages,
  client
}: {
  resultsType?: T
  evaluationDescription: string
  model?: CreateAgentParams["defaultClientOptions"]["model"]
  messages?: CreateAgentParams["defaultClientOptions"]["messages"]
  client: CreateAgentParams["client"]
}): Evaluator<T> {
  if (!evaluationDescription || typeof evaluationDescription !== "string") {
    throw new Error("Evaluation description was not provided.")
  }

  const execute = async ({ data }: ExecuteEvalParams): Promise<EvaluationResponse<T>> => {
    // Stack the identity, results-type, and criteria prompts as system messages.
    const agent = createAgent({
      client,
      response_model: {
        schema: scoringSchema,
        name: "Scoring"
      },
      defaultClientOptions: {
        model: model ?? "gpt-4-1106-preview",
        messages: [
          {
            role: "system",
            content: CUSTOM_EVALUATOR_IDENTITY
          },
          {
            role: "system",
            content: RESULTS_TYPE_PROMPT[resultsType]
          },
          {
            role: "system",
            content: evaluationDescription
          },
          ...(messages ?? [])
        ]
      }
    })

    // Score every prompt/completion pair concurrently.
    const evaluationResults = await Promise.all(
      data.map(async item => {
        const { prompt, completion, expectedCompletion } = item

        const response = await agent.completion({
          messages: [
            {
              role: "system",
              content: `prompt: ${prompt} \n completion: ${completion}\n ${expectedCompletion?.length ? `expectedCompletion: ${expectedCompletion}\n` : " "}Please provide your score now:`
            }
          ]
        })

        return {
          score: response["score"],
          item
        }
      })
    )

    let resultObject

    if (resultsType === "score") {
      const avgScore =
        evaluationResults.reduce((sum, { score = 0 }) => sum + score, 0) / evaluationResults.length

      resultObject = {
        results: evaluationResults,
        scoreResults: {
          value: avgScore
        }
      }
    }

    if (resultsType === "binary") {
      const binaryResults = evaluationResults.reduce(
        (acc, { score }) => {
          // The binary prompt asks for exactly 0 or 1, so treat anything
          // at or above 0.5 as a pass.
          if (score >= 0.5) {
            acc.trueCount++
          } else {
            acc.falseCount++
          }
          return acc
        },
        { trueCount: 0, falseCount: 0 }
      )

      resultObject = {
        results: evaluationResults,
        binaryResults
      }
    }

    if (!resultObject) throw new Error("No result object was created")

    return resultObject as unknown as EvaluationResponse<T>
  }

  return execute
}
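The concrete `EvaluationResponse` type lives in `@/types`, which this diff does not include; inferred from the two result branches above, the response shapes are approximately:

```typescript
// Approximate shapes inferred from createEvaluator's result branches;
// the canonical definitions live in "@/types" and are not in this diff.
type InferredScoreResponse = {
  results: { score: number; item: unknown }[]
  scoreResults: { value: number } // mean score across all items
}

type InferredBinaryResponse = {
  results: { score: number; item: unknown }[]
  binaryResults: { trueCount: number; falseCount: number }
}
```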