BFCL setup instruction is very difficult to follow #501

Closed

HuanzhiMao opened this issue Jul 6, 2024 · 1 comment

Comments

@HuanzhiMao (Collaborator)
We need to make it much more user-friendly!

ShishirPatil added a commit that referenced this issue Jul 7, 2024
**Purpose**:
- One step addressing #501 

**Changes**:
- Remove the steps for cloning the repo, creating symlinks, and building.
- Add `tree-sitter-java` and `tree-sitter-javascript` to
requirements.txt so they are installed via pip (sketched below).

**Test**:
- Tested end-to-end BFCL generation and checked on
gorilla-openfunctions-v2 and llama3-70B.

Co-authored-by: Shishir Patil <[email protected]>
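For reference, the dependency change amounts to something like the following. This is only a sketch: the actual requirements.txt may pin versions or list other packages.

```text
# requirements.txt (illustrative excerpt; exact pins and other entries may differ)
tree-sitter-java
tree-sitter-javascript
```

With these entries present, `pip install -r requirements.txt` pulls in the packaged grammars, so cloning and building the tree-sitter repositories by hand is no longer necessary.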
ShishirPatil pushed a commit that referenced this issue Jul 11, 2024
Previously, the test dataset was stored on HuggingFace, requiring users
to clone this repository, download the dataset separately from
HuggingFace, and then run the evaluation pipeline. This caused
considerable inconvenience and confusion in the community: users often
struggled with the inconsistency of having the possible answers in the
repository while the test dataset itself was missing.

Partially addresses #501.
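To make the resulting workflow concrete, here is a minimal sketch of reading a test file straight out of the repository checkout; the directory and file names are illustrative, not the exact BFCL layout.

```python
import json
from pathlib import Path

DATA_DIR = Path("./data")  # assumed local dataset directory inside the BFCL folder

def load_test_cases(file_name: str) -> list[dict]:
    """Read one JSON-lines test file directly from the repository checkout."""
    with open(DATA_DIR / file_name, "r") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage (file name is illustrative):
# cases = load_test_cases("gorilla_openfunctions_v1_test_simple.json")
```

Since the test data now lives next to the possible answers, no separate HuggingFace download step is required.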
ShishirPatil pushed a commit that referenced this issue Jul 17, 2024
…l_data_compilation Step (#512)

Currently, for OSS models, users must run `eval_data_compilation.py` to
merge all datasets into a single `data_total.json` file before running
inference. This forces inference over all datasets at once, with no
option to run inference on individual datasets or subsets. This PR
addresses that limitation by allowing users to perform inference on
specific datasets directly, removing the need for the
`eval_data_compilation` step.
Note: hosted models don't have this limitation.

Partially addresses #501 and #502.
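A rough sketch of the idea, assuming a per-category test file layout; the mapping and file names below are illustrative, not the actual BFCL constants.

```python
import json
from pathlib import Path

DATA_DIR = Path("./data")

# Hypothetical mapping from category name to its test file.
CATEGORY_FILES = {
    "simple": "gorilla_openfunctions_v1_test_simple.json",
    "java": "gorilla_openfunctions_v1_test_java.json",
}

def load_selected_categories(categories: list[str]) -> dict[str, list[dict]]:
    """Load only the requested categories; no merged data_total.json is needed."""
    selected = {}
    for category in categories:
        with open(DATA_DIR / CATEGORY_FILES[category]) as f:
            selected[category] = [json.loads(line) for line in f if line.strip()]
    return selected
```

Each category is read on its own, so inference can target an individual dataset or any subset without the compilation step.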
ShishirPatil pushed a commit that referenced this issue Jul 17, 2024
…ility (#508)

Currently, `apply_function_credential_config.py` requires the user to
provide the input file path manually and only accepts a single file path
at a time, which is cumbersome and user-unfriendly given that the user
must repeat this for each of the 14 test dataset files.

This PR refactors the script to improve its usability:

1. The script no longer requires an input file path. It automatically
searches for datasets in the default `./data/` directory.
2. The script can now process all files within the specified location
(when `--input-path` is set to a directory path), thereby supporting
bulk operations.
3. A single file path is still supported (see the sketch below).

This PR partially addresses #501 and #502.
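The input handling described above can be sketched as follows. The `--input-path` flag name comes from the PR; the function names and argument-parsing details are illustrative rather than the actual script.

```python
import argparse
from pathlib import Path

def resolve_input_files(input_path: str | None) -> list[Path]:
    """Default to ./data/, accept a directory for bulk processing, or a single file."""
    target = Path(input_path) if input_path else Path("./data")
    if target.is_dir():
        return sorted(target.glob("*.json"))
    return [target]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path", default=None,
                        help="File or directory; defaults to ./data/")
    args = parser.parse_args()
    for file_path in resolve_input_files(args.input_path):
        print(f"Would apply credentials to {file_path}")
```

Run with no arguments, it covers the whole default directory, so the 14-file manual loop disappears.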
ShishirPatil added a commit that referenced this issue Jul 19, 2024
…ns_evaluation.py (#506)

There are inconsistencies between the `test_category` argument used by
`eval_checker/eval_runner.py` and the one used by `openfunctions_evaluation.py`.

This PR partially addresses #501 and #502.

---------

Co-authored-by: Shishir Patil <[email protected]>
ShishirPatil pushed a commit that referenced this issue Jul 24, 2024
…taset; Handle vLLM Benign Error (#540)

In this PR:

1. **Support Multi-Model, Multi-Category Generation**:
- `openfunctions_evaluation.py` can now take a list of model names
and a list of test categories as command-line input.
   - Partially addresses #501.

2. **Handling vLLM's Error**:
- A benign error occurs during the cleanup phase after a generation task
completes, causing the pipeline to fail even though model results have
already been generated. This issue stems from vLLM and is outside our
control. [See this issue](vllm-project/vllm#6145) from the vLLM repo.
- This is annoying because when users attempt category-specific
generation for locally-hosted models (as supported in #512), only the
result for the first category of the first model is generated, since the
error occurs immediately afterwards.
- To improve the user experience, we now combine all selected test
categories into one task, submit that single task to vLLM, and split
the results afterwards (sketched below).
- Note: If multiple locally-hosted models are queued for inference, only
the tasks of the first model will complete. Subsequent tasks will still
fail due to the cleanup phase error from the first model. Therefore, we
recommend running the inference command for one model at a time until
vLLM rolls out the fix.

3. **Adding Index to Dataset**:
- Each test file and possible_answer file now includes an index to help
match entries.
  
This PR **will not** affect the leaderboard score.
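A conceptual sketch of the workaround in item 2, together with the index from item 3; function and field names here are illustrative, not the actual BFCL implementation.

```python
def run_combined_then_split(categories: dict[str, list[dict]], generate_fn):
    """Submit one combined generation task, then regroup outputs by category.

    generate_fn stands in for a single vLLM generation call over a list of entries.
    """
    combined, owners = [], []
    for category, entries in categories.items():
        for entry in entries:
            combined.append(entry)
            owners.append(category)

    # One task means the benign cleanup error can only strike after everything is generated.
    outputs = generate_fn(combined)

    results = {category: [] for category in categories}
    for category, entry, output in zip(owners, combined, outputs):
        # The per-entry index added in this PR makes it easy to line each result up
        # with the matching possible_answer entry afterwards.
        results[category].append({"id": entry.get("id"), "result": output})
    return results
```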
@shizhediao

Is there a way to evaluate BFCL without using vLLM? For example, can I use Hugging Face transformers' `generate()`?
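For context, the path the comment is asking about would look roughly like this. It uses the standard transformers API with an illustrative checkpoint and prompt, and is a sketch rather than a supported BFCL backend.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gorilla-llm/gorilla-openfunctions-v2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "..."  # a BFCL test prompt would go here
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```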
