AI API Eval Framework #618

Open
ashitaprasad opened this issue Feb 23, 2025 · 2 comments
Labels
good first issue (Good for newcomers)

Comments

@ashitaprasad (Member)

Tell us about the task you want to perform and are unable to do so because the feature is not available

Develop an end-to-end AI API eval framework and integrate it into API Dash. This framework should (list is suggestive, not exhaustive; see the sketch after the list):

  • Provide an intuitive interface for configuring API requests, where users can input test datasets, configure request parameters, and send queries to various AI API services
  • Support evaluating AI APIs (text, multimedia, etc.) across various industry task benchmarks
  • Allow users to add custom datasets/benchmarks & evaluation criteria. These custom scoring mechanisms allow tailored evaluations based on specific project needs
  • Visualize the results of API eval via tables, charts, and graphs, making it easy to identify trends, outliers, and performance variations
  • Allow execution of batch evaluations
  • Work with both offline & online models and datasets
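
To make the requirements above concrete, here is a minimal sketch of how a single eval run might be modeled. Every class and field name below is hypothetical, not an existing API Dash or harness API:

```python
# Hypothetical model of one eval run, covering the configurable pieces
# listed above: endpoint, benchmark, request params, scorer, batching.
from dataclasses import dataclass, field

@dataclass
class EvalRun:
    api_endpoint: str                                    # AI API under test
    benchmark: str                                       # built-in name or path to a custom dataset
    request_params: dict = field(default_factory=dict)   # model name, temperature, ...
    scorer: str = "exact_match"                          # built-in or user-supplied criterion
    batch_size: int = 8                                  # requests per batch execution

# Example: evaluating a local Ollama model on MMLU.
run = EvalRun(
    api_endpoint="http://localhost:11434/api/generate",
    benchmark="mmlu",
    request_params={"model": "llama3", "temperature": 0.0},
)
print(run)
```
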
ashitaprasad added the enhancement (New feature or request) label on Feb 23, 2025

f-ei8ht commented Mar 1, 2025

Hi! @ashitaprasad I find this issue very interesting and would love to work on it as part of GSoC. The idea of building an AI API evaluation framework and integrating it into API Dash aligns well with my skills in AI, Python, and Flutter. I have experience in developing evaluation frameworks and data visualization tools.

I’d like to discuss the project in more detail. Are there any specific AI APIs or benchmarks that should be prioritized? Also, should the evaluation framework support parallel execution for batch processing?

Looking forward to contributing!


ashitaprasad commented Mar 1, 2025

@f-ei8ht Currently, lm-evaluation-harness is the most popular LLM eval framework; it supports evaluating models served via several commercial APIs or local inference APIs. However, it is not user-friendly and requires a coding background to use.

This project aims to solve that by providing an easy way to evaluate AI API responses for any task benchmark.
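
For contrast, driving lm-evaluation-harness today looks roughly like the snippet below. This is sketched from its public Python API; the model type and argument spellings may differ between harness versions, so treat it as illustrative:

```python
# Rough sketch of an lm-evaluation-harness run against a local
# OpenAI-compatible endpoint -- exactly the coding step API Dash
# would hide behind a UI.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",    # harness backend for OpenAI-style local APIs
    model_args="model=llama3,base_url=http://localhost:11434/v1/completions",
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])
```
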

Read "LLM Benchmarks Explained: Everything on MMLU, HellaSwag, BBH, and Beyond"

Let us take, for example, the MMLU benchmark and test a model served via the Ollama local API.

In this feature, the user should be able to select the benchmark (MMLU) against which the API is being evaluated. API Dash will read the benchmark datasets (downloading them if not available), process them, and create the API requests to be executed. The API responses received will be processed and used to calculate the benchmark score.
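
In Python terms, the flow for a single MMLU question might look like the sketch below. The dataset name, prompt format, and scoring rule are simplifying assumptions for illustration, not a spec:

```python
# One MMLU question -> one Ollama request -> one score.
import requests
from datasets import load_dataset  # pip install datasets

# Load one MMLU subject split from the Hugging Face hub.
ds = load_dataset("cais/mmlu", "abstract_algebra", split="test")
item = ds[0]

# Build a multiple-choice prompt from the question and its four options.
choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"]))
prompt = f"{item['question']}\n{choices}\nAnswer with a single letter."

# Send it to Ollama's local REST endpoint (non-streaming).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": prompt, "stream": False},
)
answer = resp.json()["response"].strip()

# Naive scoring: does the reply start with the correct letter?
correct = answer.upper().startswith("ABCD"[item["answer"]])
print(answer, "| correct:", correct)
```
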

Everything should happen in a user-friendly manner: the user can see the progress of the evaluation, pause/resume it, and easily visualize the end result.
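
Pause/resume over a long batch of requests can be supported with a simple gate around the request loop. This is only a hypothetical control-flow sketch in Python (API Dash itself is Flutter/Dart, so the real implementation would differ):

```python
# Hypothetical pausable batch-eval loop; all names are invented here.
import threading

class EvalController:
    def __init__(self):
        self._resume = threading.Event()
        self._resume.set()                  # start un-paused

    def pause(self):
        self._resume.clear()

    def resume(self):
        self._resume.set()

    def run(self, batch, send_fn, on_progress):
        for i, request in enumerate(batch, 1):
            self._resume.wait()             # blocks here while paused
            result = send_fn(request)
            on_progress(i, len(batch), result)   # drives the progress UI
```
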

ashitaprasad added the good first issue (Good for newcomers) label and removed the enhancement (New feature or request) label on Mar 1, 2025