Evaluation gives wrong results #6

Open
ivo-1 opened this issue Jan 3, 2023 · 1 comment
ivo-1 commented Jan 3, 2023

I think there is a bug in the evaluation.

Consider this minimal example:
expected.tsv

address__post_town=BROADWAY address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=WESTCLIFF-ON-SEA address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

out_1.tsv (1 wrong answer for address__post_town in the first document)

address__post_town=Wrong address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=WESTCLIFF-ON-SEA address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

out_2.tsv (2 wrong answers for address__post_town in the first and the second document)

address__post_town=Wrong address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=Wrong address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

These two out.tsv files yield the same evaluation result for me with the official evaluation script that is used in the README.

In both cases, the evaluation yields:

        F1      P       R
(UC)    94.4±5.6        94.4±5.6        94.4±5.6
address 86±14   86±14   86±14
money   100±0   100±0   100±0
town    67±33   67±33   67±33
postcode        100±0   100±0   100±0
street  100±0   100±0   100±0
name    100±0   100±0   100±0
number  100±0   100±0   100±0
income  100±0   100±0   100±0
spending        100±0   100±0   100±0
date    100±0   100±0   100±0

F1      94.4±5.6
Accuracy        67±33
Mean-F1 93.3±6.7

This evaluation result would be correct for out_1.tsv, since its Mean-F1 (macro-average over documents) is (4/5 + 8/8 + 8/8)/3 ≈ 0.933. For out_2.tsv, however, it should be (4/5 + 7/8 + 8/8)/3 ≈ 0.892.
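
For reference, here is a minimal sketch of how I would expect the Mean-F1 to be computed from the files above (this is my own re-computation, not the official evaluation script; it assumes one document per line and whitespace-separated key=value fields):

```python
# Sketch only, not the official evaluation script.
# Assumes one document per line, whitespace-separated "key=value" fields.

def read_docs(path):
    with open(path) as f:
        return [set(line.split()) for line in f if line.strip()]

def doc_f1(expected, predicted):
    # Per-document F1 over key=value pairs.
    correct = len(expected & predicted)
    if not expected or not predicted or correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(expected)
    return 2 * precision * recall / (precision + recall)

expected = read_docs("expected.tsv")
predicted = read_docs("out_2.tsv")

# Macro-average over documents.
mean_f1 = sum(doc_f1(e, p) for e, p in zip(expected, predicted)) / len(expected)
print(f"Mean-F1: {mean_f1:.3f}")  # 0.933 for out_1.tsv, 0.892 for out_2.tsv
```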

Let me know if you can reproduce this or not.

A separate but related issue: as mentioned in #5, the F1 score at the bottom should be the micro-averaged F1 over all predictions. This also doesn't work out for either out_1.tsv or out_2.tsv, as it should be 20 correct key-value pairs / 21 key-value pairs in the solution ≈ 0.952 and 19/21 ≈ 0.905, respectively.
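
Continuing the sketch above, this is how I would expect the micro-averaged F1 to be computed (again my own sketch, reusing expected and predicted from the snippet above):

```python
# Micro-average: pool key=value pair counts over all documents
# before computing precision and recall.
correct = sum(len(e & p) for e, p in zip(expected, predicted))
precision = correct / sum(len(p) for p in predicted)
recall = correct / sum(len(e) for e in expected)
micro_f1 = 2 * precision * recall / (precision + recall)
print(f"Micro-F1: {micro_f1:.3f}")  # 20/21 ≈ 0.952 for out_1.tsv, 19/21 ≈ 0.905 for out_2.tsv
```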

ivo-1 commented Jan 29, 2023

I've additionally validated that the evaluation miscalculates when the first two rows contain the wrong answer, but not when the first and third rows contain the wrong answer and the second row is correct. It is also correct when all rows contain exactly one wrong answer.

In the case of the first and third rows each containing exactly one wrong answer (for the post town), the evaluation is correct: the reported Mean-F1 of 89.2 matches (4/5 + 8/8 + 7/8)/3 ≈ 0.892 (although the issue with the micro-averaged F1 persists):

        F1      P       R
(UC)    89.6±6.2        89.6±6.2        89.6±6.2
address 73±16   73±16   73±16
money   100±0   100±0   100±0
town    33±33   33±33   33±33
postcode        100±0   100±0   100±0
street  100±0   100±0   100±0
name    100±0   100±0   100±0
number  100±0   100±0   100±0
income  100±0   100±0   100±0
spending        100±0   100±0   100±0
date    100±0   100±0   100±0

F1      89.6±6.2
Accuracy        33±33
Mean-F1 89.2±6.7

The evaluation is also correct when the first key is wrong in all three documents, so this may just be an edge case where the calculation goes completely wrong.
