
Evaluation indicators need to be updated #119

Closed

JBoRu opened this issue Nov 2, 2023 · 7 comments

@JBoRu commented Nov 2, 2023

Hi,
Thanks for this great project! However, the evaluation procedure is incorrect, which leads to overestimated results. Specifically, your project runs the test-suite evaluation over the `database` directory that the original execution-accuracy metric uses. According to the official evaluation project, you should use the new `database_ts` databases instead of `database`, so the results will be lower! Here are my evaluation results for CodeLLama-13B-instruct-lora (the parameter config is the same as the one you provided) on the original `database` (78.1) and on the correct `database_ts` (70.9).

[Screenshot 2023-11-02 20:31:07]
[Screenshot 2023-11-02 20:31:24]
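
For reference, the invocation looks roughly like this (a minimal sketch: the flag names follow `evaluation.py` in taoyds/test-suite-sql-eval, and the local paths are placeholders for wherever you keep the data):

```python
import subprocess

# Test-suite databases from taoyds/test-suite-sql-eval -- NOT the original
# 95 MB Spider "database" directory used for plain execution accuracy.
DB_DIR = "data/spider/database_ts"  # placeholder path

subprocess.run(
    [
        "python3", "test-suite-sql-eval/evaluation.py",
        "--gold", "dev_gold.sql",     # gold SQL, one "query\tdb_id" per line
        "--pred", "predictions.sql",  # one predicted SQL query per line
        "--db", DB_DIR,               # which databases you point at is the whole issue
        "--etype", "exec",            # execution accuracy
    ],
    check=True,
)
```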

@wangzaistone (Member) commented Nov 3, 2023

> According to the official evaluation project, you should use the new database_ts instead of the database. Therefore, the results will be lower!

Thanks for your suggestion, but I still don't understand the details of the "new database_ts"; we followed the same link you provided, and the dev set of 1,034 examples is the one given in that linked GitHub repo. If possible, may I have your WeChat or some other contact? Perhaps we could discuss it face to face over the weekend.
Thanks again for your attention!

@wangzaistone (Member)

> the parameter config is the same as the one you provided
Junior brother, what you have raised is important to the project, especially for the experiments, and we look forward to discussing it with you if possible. Regarding the metric on the Spider set as the project currently uses it: several people in various WeChat groups have reproduced results above the 0.789 metric using the params we provided, and the adapter I trained is open on Hugging Face as CodeLlama-13b-sql-lora. We cannot guarantee that all experiments come out identical; after all, a large model has a certain degree of randomness, especially with settings such as temperature. It's very nice of you to bring this up.
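
On the randomness point, a minimal sketch (assuming a Hugging Face transformers causal LM; the checkpoint name is just illustrative) of how greedy decoding removes temperature-driven variance between eval runs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base checkpoint; swap in your own fine-tuned weights/adapter.
name = "codellama/CodeLlama-13b-Instruct-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("-- Question: list all singer names\nSELECT", return_tensors="pt")
# do_sample=False means greedy decoding: temperature is ignored, so repeated
# runs of the same checkpoint produce identical SQL and comparable scores.
out = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```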

@wangzaistone (Member)

If you have better eval code, could you contribute a PR for us? Thanks a lot~

@wangzaistone changed the title from "Incorrect Evaluation Metric!" to "Evaluation indicators need to be updated" on Nov 4, 2023
@wangzaistone (Member) commented Nov 4, 2023

Following what you pointed out, the database used in the evaluation has been updated, switching from the one downloaded from the original official Yale website (the 95 MB Spider dataset) to the one downloaded via the link in the author's GitHub repo (1.27 GB), and the score has indeed dropped. Re-predicting and re-evaluating with the weights we uploaded to HF gives an exec acc of 0.742, as shown below, which differs slightly from yours.
[screenshot: evaluation output]

Thanks again for your reminder.
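
For anyone wondering why the databases matter, here is a simplified sketch of the comparison that test-suite execution accuracy performs (my own illustration; the official evaluator also handles ORDER BY, value plugging, and query timeouts):

```python
import sqlite3
from collections import Counter

def exec_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    # True if pred_sql returns the same multiset of rows as gold_sql.
    con = sqlite3.connect(db_path)
    try:
        gold_rows = Counter(con.execute(gold_sql).fetchall())
        try:
            pred_rows = Counter(con.execute(pred_sql).fetchall())
        except sqlite3.Error:
            return False  # an unexecutable prediction counts as wrong
        return gold_rows == pred_rows
    finally:
        con.close()

def test_suite_correct(db_variants: list, gold_sql: str, pred_sql: str) -> bool:
    # Test-suite eval: the prediction must agree with gold on EVERY distilled
    # database variant for the schema, not just the one original database --
    # which is why scores drop when switching from database to database_ts.
    return all(exec_match(p, gold_sql, pred_sql) for p in db_variants)
```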

@wangzaistone (Member)

I have updated the exec acc: earlier we trained codellama-13b on around 50k examples, and the eval score was 0.825 on the previous 95 MB database; after changing the database as you suggested, the exec acc is 0.764.
[screenshot: evaluation output]

@AlphaNext commented Dec 19, 2023

@wangzaistone Thanks for your nice work. Could you add links to these two datasets (the 95 MB Spider dataset and the 1.27 GB one)? I can't find the 1.27 GB version, thanks.

Were 0.825 (95 MB) and 0.764 (1.27 GB) tested on the Spider dev set or the Spider test set?

@junewgl (Collaborator) commented Dec 21, 2023

> Were 0.825 (95 MB) and 0.764 (1.27 GB) tested on the Spider dev set or the Spider test set?

Tested on dev. The TS eval dataset is here: https://github.com/taoyds/test-suite-sql-eval @AlphaNext
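
If you're unsure which copy a run actually used, here is a quick sanity check (a helper of my own; the paths are placeholders and the sizes are the ones quoted above):

```python
from pathlib import Path

def dir_size_mb(root: str) -> float:
    # Total size of all files under root, in MB.
    return sum(f.stat().st_size for f in Path(root).rglob("*") if f.is_file()) / 1e6

# ~95 MB => original Spider databases; ~1270 MB => test-suite databases.
for d in ("data/spider/database", "data/spider/database_ts"):  # placeholder paths
    if Path(d).exists():
        print(f"{d}: {dir_size_mb(d):.0f} MB")
```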
