AI API Eval Framework #618

Open
ashitaprasad opened this issue Feb 23, 2025 · 2 comments
Labels
good first issue (Good for newcomers)

Comments

@ashitaprasad (Member)

Tell us about the task you want to perform and are unable to do so because the feature is not available

Develop an end-to-end AI API eval framework and integrate it into API Dash. This framework should (list is suggestive, not exhaustive; see the sketch after the list):

  • Provide an intuitive interface for configuring API requests, where users can input test datasets, configure request parameters, and send queries to various AI API services
  • Support evaluating AI APIs (text, multimedia, etc.) across various industry task benchmarks
  • Allow users to add custom datasets/benchmarks & evaluation criteria. These custom scoring mechanisms allow tailored evaluations based on specific project needs
  • Visualize the results of API eval via tables, charts, and graphs, making it easy to identify trends, outliers, and performance variations
  • Allow execution of batch evaluations
  • Work with both offline & online models and datasets
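
To make the requirements above concrete, here is a minimal sketch of how a single eval run might be modeled. Every class and field name below is hypothetical, not an existing API Dash or harness API:

```python
# Hypothetical model of one eval run, covering the configurable pieces
# listed above: endpoint, benchmark, request params, scorer, batching.
from dataclasses import dataclass, field

@dataclass
class EvalRun:
    api_endpoint: str                                    # AI API under test
    benchmark: str                                       # built-in name or path to a custom dataset
    request_params: dict = field(default_factory=dict)   # model name, temperature, ...
    scorer: str = "exact_match"                          # built-in or user-supplied criterion
    batch_size: int = 8                                  # requests per batch execution

# Example: evaluating a local Ollama model on MMLU.
run = EvalRun(
    api_endpoint="http://localhost:11434/api/generate",
    benchmark="mmlu",
    request_params={"model": "llama3", "temperature": 0.0},
)
print(run)
```
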
ashitaprasad added the enhancement (New feature or request) label on Feb 23, 2025

f-ei8ht commented Mar 1, 2025

Hi! @ashitaprasad I find this issue very interesting and would love to work on it as part of GSoC. The idea of building an AI API evaluation framework and integrating it into API Dash aligns well with my skills in AI, Python, and Flutter. I have experience in developing evaluation frameworks and data visualization tools.

I’d like to discuss the project in more detail. Are there any specific AI APIs or benchmarks that should be prioritized? Also, should the evaluation framework support parallel execution for batch processing?

Looking forward to contributing!


ashitaprasad commented Mar 1, 2025

@f-ei8ht Currently, lm-evaluation-harness is the most popular LLM eval framework; it supports evaluating models served via several commercial APIs or local inference APIs. However, it is not user-friendly and requires a coding background to use.

This project aims to solve that by providing an easy way to evaluate AI API responses for any task benchmark.
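
For contrast, driving lm-evaluation-harness today looks roughly like the snippet below. This is sketched from its public Python API; the model type and argument spellings may differ between harness versions, so treat it as illustrative:

```python
# Rough sketch of an lm-evaluation-harness run against a local
# OpenAI-compatible endpoint -- exactly the coding step API Dash
# would hide behind a UI.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",    # harness backend for OpenAI-style local APIs
    model_args="model=llama3,base_url=http://localhost:11434/v1/completions",
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])
```
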

Read "LLM Benchmarks Explained: Everything on MMLU, HellaSwag, BBH, and Beyond"

Let us take, for example, the MMLU benchmark and test a model served via the Ollama local API.

In this feature, the user should be able to select the benchmark (MMLU) against which the API is being evaluated. API Dash will read the benchmark datasets (downloading them if not available), process them, and create the API requests to be executed. The API responses received will be processed and used to calculate the benchmark score.
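
In Python terms, the flow for a single MMLU question might look like the sketch below. The dataset name, prompt format, and scoring rule are simplifying assumptions for illustration, not a spec:

```python
# One MMLU question -> one Ollama request -> one score.
import requests
from datasets import load_dataset  # pip install datasets

# Load one MMLU subject split from the Hugging Face hub.
ds = load_dataset("cais/mmlu", "abstract_algebra", split="test")
item = ds[0]

# Build a multiple-choice prompt from the question and its four options.
choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"]))
prompt = f"{item['question']}\n{choices}\nAnswer with a single letter."

# Send it to Ollama's local REST endpoint (non-streaming).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": prompt, "stream": False},
)
answer = resp.json()["response"].strip()

# Naive scoring: does the reply start with the correct letter?
correct = answer.upper().startswith("ABCD"[item["answer"]])
print(answer, "| correct:", correct)
```
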

Everything should happen in a user-friendly manner: the user can see the progress of the evaluation, pause/resume it, and easily visualize the end result.
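
Pause/resume over a long batch of requests can be supported with a simple gate around the request loop. This is only a hypothetical control-flow sketch in Python (API Dash itself is Flutter/Dart, so the real implementation would differ):

```python
# Hypothetical pausable batch-eval loop; all names are invented here.
import threading

class EvalController:
    def __init__(self):
        self._resume = threading.Event()
        self._resume.set()                  # start un-paused

    def pause(self):
        self._resume.clear()

    def resume(self):
        self._resume.set()

    def run(self, batch, send_fn, on_progress):
        for i, request in enumerate(batch, 1):
            self._resume.wait()             # blocks here while paused
            result = send_fn(request)
            on_progress(i, len(batch), result)   # drives the progress UI
```
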

ashitaprasad added the good first issue (Good for newcomers) label and removed the enhancement (New feature or request) label on Mar 1, 2025