BFCL setup instruction is very difficult to follow #501

Closed

HuanzhiMao opened this issue Jul 6, 2024 · 1 comment

Comments

@HuanzhiMao (Collaborator)
We need to make it much more user-friendly!

ShishirPatil added a commit that referenced this issue Jul 7, 2024
**Purpose**:
- One step addressing #501 

**Changes**:
- Remove the steps for cloning the repo, creating symlinks, and building.
- Add `tree-sitter-java` and `tree-sitter-javascript` to
requirements.txt so they are installed via pip (sketched below).

**Test**:
- Tested end-to-end BFCL generation and checked on
gorilla-openfunctions-v2 and llama3-70B.

Co-authored-by: Shishir Patil <[email protected]>
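For reference, the dependency change amounts to something like the following. This is only a sketch: the actual requirements.txt may pin versions or list other packages.

```text
# requirements.txt (illustrative excerpt; exact pins and other entries may differ)
tree-sitter-java
tree-sitter-javascript
```

With these entries present, `pip install -r requirements.txt` pulls in the packaged grammars, so cloning and building the tree-sitter repositories by hand is no longer necessary.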
ShishirPatil pushed a commit that referenced this issue Jul 11, 2024
Previously, the test dataset was stored on HuggingFace, requiring users
to clone this repository, download the dataset separately from
HuggingFace, and then run the evaluation pipeline. This caused
considerable inconvenience and confusion in the community: users often
struggled with the inconsistency of having the possible answers in the
repository while the test dataset itself was missing.

Partially addresses #501.
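To make the resulting workflow concrete, here is a minimal sketch of reading a test file straight out of the repository checkout; the directory and file names are illustrative, not the exact BFCL layout.

```python
import json
from pathlib import Path

DATA_DIR = Path("./data")  # assumed local dataset directory inside the BFCL folder

def load_test_cases(file_name: str) -> list[dict]:
    """Read one JSON-lines test file directly from the repository checkout."""
    with open(DATA_DIR / file_name, "r") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage (file name is illustrative):
# cases = load_test_cases("gorilla_openfunctions_v1_test_simple.json")
```

Since the test data now lives next to the possible answers, no separate HuggingFace download step is required.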
ShishirPatil pushed a commit that referenced this issue Jul 17, 2024
…l_data_compilation Step (#512)

Currently, for OSS models, users must run `eval_data_compilation.py` to
merge all datasets into a single `data_total.json` file before running
inference. This forces inference over all datasets at once, with no
option to run inference on individual datasets or subsets. This PR
addresses that limitation by allowing users to perform inference on
specific datasets directly, removing the need for the
`eval_data_compilation` step.
Note: hosted models don't have this limitation.

Partially addresses #501 and #502.
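A rough sketch of the idea, assuming a per-category test file layout; the mapping and file names below are illustrative, not the actual BFCL constants.

```python
import json
from pathlib import Path

DATA_DIR = Path("./data")

# Hypothetical mapping from category name to its test file.
CATEGORY_FILES = {
    "simple": "gorilla_openfunctions_v1_test_simple.json",
    "java": "gorilla_openfunctions_v1_test_java.json",
}

def load_selected_categories(categories: list[str]) -> dict[str, list[dict]]:
    """Load only the requested categories; no merged data_total.json is needed."""
    selected = {}
    for category in categories:
        with open(DATA_DIR / CATEGORY_FILES[category]) as f:
            selected[category] = [json.loads(line) for line in f if line.strip()]
    return selected
```

Each category is read on its own, so inference can target an individual dataset or any subset without the compilation step.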
ShishirPatil pushed a commit that referenced this issue Jul 17, 2024
…ility (#508)

Currently, `apply_function_credential_config.py` requires the user to
provide the input file path manually and only accepts a single file path
at a time, which is cumbersome and user-unfriendly given that the user
must repeat this for each of the 14 test dataset files.

This PR refactors the script to improve its usability:

1. The script no longer requires an input file path. It automatically
searches for datasets in the default `./data/` directory.
2. The script can now process all files within the specified location
(when `--input-path` is set to a directory path), thereby supporting
bulk operations.
3. A single file path is still supported (see the sketch below).

This PR partially addresses #501 and #502.
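The input handling described above can be sketched as follows. The `--input-path` flag name comes from the PR; the function names and argument-parsing details are illustrative rather than the actual script.

```python
import argparse
from pathlib import Path

def resolve_input_files(input_path: str | None) -> list[Path]:
    """Default to ./data/, accept a directory for bulk processing, or a single file."""
    target = Path(input_path) if input_path else Path("./data")
    if target.is_dir():
        return sorted(target.glob("*.json"))
    return [target]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path", default=None,
                        help="File or directory; defaults to ./data/")
    args = parser.parse_args()
    for file_path in resolve_input_files(args.input_path):
        print(f"Would apply credentials to {file_path}")
```

Run with no arguments, it covers the whole default directory, so the 14-file manual loop disappears.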
ShishirPatil added a commit that referenced this issue Jul 19, 2024
…ns_evaluation.py (#506)

There are inconsistencies between the `test_category` argument used by
`eval_checker/eval_runner.py` and the one used by `openfunctions_evaluation.py`.

This PR partially addresses #501 and #502.

---------

Co-authored-by: Shishir Patil <[email protected]>
ShishirPatil pushed a commit that referenced this issue Jul 24, 2024
…taset; Handle vLLM Benign Error (#540)

In this PR:

1. **Support Multi-Model, Multi-Category Generation**:
- `openfunctions_evaluation.py` can now take a list of model names
and a list of test categories as command-line input.
   - Partially addresses #501.

2. **Handling vLLM's Error**:
- A benign error occurs during the cleanup phase after a generation task
completes, causing the pipeline to fail even though model results have
already been generated. This issue stems from vLLM and is outside our
control. [See this issue](vllm-project/vllm#6145) from the vLLM repo.
- This is annoying because when users attempt category-specific
generation for locally-hosted models (as supported in #512), only the
result for the first category of the first model is generated, since the
error occurs immediately afterwards.
- To improve the user experience, we now combine all selected test
categories into one task, submit that single task to vLLM, and split
the results afterwards (sketched below).
- Note: If multiple locally-hosted models are queued for inference, only
the tasks of the first model will complete. Subsequent tasks will still
fail due to the cleanup phase error from the first model. Therefore, we
recommend running the inference command for one model at a time until
vLLM rolls out the fix.

3. **Adding Index to Dataset**:
- Each test file and possible_answer file now includes an index to help
match entries.
  
This PR **will not** affect the leaderboard score.
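A conceptual sketch of the workaround in item 2, together with the index from item 3; function and field names here are illustrative, not the actual BFCL implementation.

```python
def run_combined_then_split(categories: dict[str, list[dict]], generate_fn):
    """Submit one combined generation task, then regroup outputs by category.

    generate_fn stands in for a single vLLM generation call over a list of entries.
    """
    combined, owners = [], []
    for category, entries in categories.items():
        for entry in entries:
            combined.append(entry)
            owners.append(category)

    # One task means the benign cleanup error can only strike after everything is generated.
    outputs = generate_fn(combined)

    results = {category: [] for category in categories}
    for category, entry, output in zip(owners, combined, outputs):
        # The per-entry index added in this PR makes it easy to line each result up
        # with the matching possible_answer entry afterwards.
        results[category].append({"id": entry.get("id"), "result": output})
    return results
```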
@shizhediao

Is there a way to evaluate BFCL without using vLLM? For example, can I use Hugging Face transformers' `generate()`?
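For context, the path the comment is asking about would look roughly like this. It uses the standard transformers API with an illustrative checkpoint and prompt, and is a sketch rather than a supported BFCL backend.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gorilla-llm/gorilla-openfunctions-v2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "..."  # a BFCL test prompt would go here
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```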
