
Evaluation indicators need to be updated #119

Closed

JBoRu opened this issue Nov 2, 2023 · 7 comments

@JBoRu commented Nov 2, 2023

Hi,
Thanks for this great project! However, the evaluation procedure is incorrect, which leads to overestimated results. Specifically, your project runs the test-suite evaluation over the `database` directory that the original execution-accuracy metric uses. According to the official evaluation project, you should use the new `database_ts` databases instead of `database`, so the results will be lower! Here are my evaluation results for CodeLLama-13B-instruct-lora (the parameter config is the same as the one you provided) on the original `database` (78.1) and on the correct `database_ts` (70.9).

[Screenshot 2023-11-02 20:31:07]
[Screenshot 2023-11-02 20:31:24]
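
For reference, the invocation looks roughly like this (a minimal sketch: the flag names follow `evaluation.py` in taoyds/test-suite-sql-eval, and the local paths are placeholders for wherever you keep the data):

```python
import subprocess

# Test-suite databases from taoyds/test-suite-sql-eval -- NOT the original
# 95 MB Spider "database" directory used for plain execution accuracy.
DB_DIR = "data/spider/database_ts"  # placeholder path

subprocess.run(
    [
        "python3", "test-suite-sql-eval/evaluation.py",
        "--gold", "dev_gold.sql",     # gold SQL, one "query\tdb_id" per line
        "--pred", "predictions.sql",  # one predicted SQL query per line
        "--db", DB_DIR,               # which databases you point at is the whole issue
        "--etype", "exec",            # execution accuracy
    ],
    check=True,
)
```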

@wangzaistone (Member) commented Nov 3, 2023

> According to the official evaluation project, you should use the new database_ts instead of the database. Therefore, the results will be lower!

Thanks for your suggestion, but I still don't understand the details of the "new database_ts"; we followed the same link you provided, and the dev set of 1,034 examples is the one given in that linked GitHub repo. If possible, may I have your WeChat or some other contact? Perhaps we could discuss it face to face over the weekend.
Thanks again for your attention!

@wangzaistone (Member)

> the parameter config is the same as the one you provided
Junior brother, what you have raised is important to the project, especially for the experiments, and we look forward to discussing it with you if possible. Regarding the metric on the Spider set as the project currently uses it: several people in various WeChat groups have reproduced results above the 0.789 metric using the params we provided, and the adapter I trained is open on Hugging Face as CodeLlama-13b-sql-lora. We cannot guarantee that all experiments come out identical; after all, a large model has a certain degree of randomness, especially with settings such as temperature. It's very nice of you to bring this up.
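
On the randomness point, a minimal sketch (assuming a Hugging Face transformers causal LM; the checkpoint name is just illustrative) of how greedy decoding removes temperature-driven variance between eval runs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base checkpoint; swap in your own fine-tuned weights/adapter.
name = "codellama/CodeLlama-13b-Instruct-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("-- Question: list all singer names\nSELECT", return_tensors="pt")
# do_sample=False means greedy decoding: temperature is ignored, so repeated
# runs of the same checkpoint produce identical SQL and comparable scores.
out = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```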

@wangzaistone (Member)

If you have better eval code, could you contribute a PR for us? Thanks a lot~

@wangzaistone changed the title from "Incorrect Evaluation Metric!" to "Evaluation indicators need to be updated" on Nov 4, 2023
@wangzaistone (Member) commented Nov 4, 2023

Following what you pointed out, the database used in the evaluation has been updated, switching from the one downloaded from the original official Yale website (the 95 MB Spider dataset) to the one downloaded via the link in the author's GitHub repo (1.27 GB), and the score has indeed dropped. Re-predicting and re-evaluating with the weights we uploaded to HF gives an exec acc of 0.742, as shown below, which differs slightly from yours.
[screenshot: evaluation output]

Thanks again for your reminder.
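
For anyone wondering why the databases matter, here is a simplified sketch of the comparison that test-suite execution accuracy performs (my own illustration; the official evaluator also handles ORDER BY, value plugging, and query timeouts):

```python
import sqlite3
from collections import Counter

def exec_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    # True if pred_sql returns the same multiset of rows as gold_sql.
    con = sqlite3.connect(db_path)
    try:
        gold_rows = Counter(con.execute(gold_sql).fetchall())
        try:
            pred_rows = Counter(con.execute(pred_sql).fetchall())
        except sqlite3.Error:
            return False  # an unexecutable prediction counts as wrong
        return gold_rows == pred_rows
    finally:
        con.close()

def test_suite_correct(db_variants: list, gold_sql: str, pred_sql: str) -> bool:
    # Test-suite eval: the prediction must agree with gold on EVERY distilled
    # database variant for the schema, not just the one original database --
    # which is why scores drop when switching from database to database_ts.
    return all(exec_match(p, gold_sql, pred_sql) for p in db_variants)
```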

@wangzaistone (Member)

I have updated the exec acc: earlier we trained codellama-13b on around 50k examples, and the eval score was 0.825 on the previous 95 MB database; after changing the database as you suggested, the exec acc is 0.764.
[screenshot: evaluation output]

@AlphaNext commented Dec 19, 2023

@wangzaistone Thanks for your nice work. Could you add links to these two datasets (the 95 MB Spider dataset and the 1.27 GB one)? I can't find the 1.27 GB version, thanks.

Were 0.825 (95 MB) and 0.764 (1.27 GB) tested on the Spider dev set or the Spider test set?

@junewgl (Collaborator) commented Dec 21, 2023

> Were 0.825 (95 MB) and 0.764 (1.27 GB) tested on the Spider dev set or the Spider test set?

Tested on dev. The TS eval dataset is here: https://github.com/taoyds/test-suite-sql-eval @AlphaNext
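
If you're unsure which copy a run actually used, here is a quick sanity check (a helper of my own; the paths are placeholders and the sizes are the ones quoted above):

```python
from pathlib import Path

def dir_size_mb(root: str) -> float:
    # Total size of all files under root, in MB.
    return sum(f.stat().st_size for f in Path(root).rglob("*") if f.is_file()) / 1e6

# ~95 MB => original Spider databases; ~1270 MB => test-suite databases.
for d in ("data/spider/database", "data/spider/database_ts"):  # placeholder paths
    if Path(d).exists():
        print(f"{d}: {dir_size_mb(d):.0f} MB")
```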
