Fine-tuning using the medkit library gives worse results than directly using the Hugging Face library.
When fine-tuning a BERT model using the HFEntityMatcherTrainable component from the medkit library, the metrics I got when evaluating it afterward were far worse than with the same training done using the Hugging Face API only.
I noticed this in this tutorial, where engineers trained the same model on the same corpus as I did and got much better results.
After reproducing their code on my machine, I got the same results as they did.
They achieved an average F1 score of approximately 0.90, whereas my medkit-trained BERT model was achieving an average F1 score of 0.62.
After some investigation, I found out that this gap was explained by the fact that, when they evaluated their model with the classification_report function from sklearn, the NO_ENTITY label (usually noted "O") was taken into account in the calculation of the average F1 score.
Since the "O" label is by far the most frequent label in the samples, including it greatly inflates the average F1 score.
When discarding this label from the average F1 calculation, I found that their model performed no better than mine.
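To illustrate the effect, here is a minimal sketch with made-up tag sequences (not the tutorial's actual evaluation code): by default, sklearn's classification_report scores "O" like any other class, and its labels parameter can be used to restrict the report to entity labels only.

```python
# Minimal sketch with made-up token-level IOB2 tags, not the tutorial's data.
from sklearn.metrics import classification_report

y_true = ["O", "O", "B-DRUG", "I-DRUG", "O", "B-DOSE", "O", "O", "O", "O"]
y_pred = ["O", "O", "B-DRUG", "I-DRUG", "O", "O", "O", "B-DOSE", "O", "O"]

# Default behaviour: "O" is scored like any other class. Because it dominates
# real corpora and is almost always predicted correctly, it pulls the
# averaged F1 upward.
print(classification_report(y_true, y_pred, zero_division=0))

# Restricting the report to entity labels removes the "O" contribution and
# gives a much more informative (and usually much lower) average F1.
entity_labels = ["B-DRUG", "I-DRUG", "B-DOSE"]
print(classification_report(y_true, y_pred, labels=entity_labels, zero_division=0))
```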
In the medkit library, the SeqEvalEvaluator discards the F1 score of the "O" label, which lowers the average F1.
This is not a problem in itself, but it would be interesting for SeqEvalEvaluator to offer the option of choosing whether or not to take the "O" label into account, to give a bit more flexibility when evaluating predictions.
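For comparison, here is a minimal sketch (again with made-up tag sequences) of how the seqeval library, which SeqEvalEvaluator relies on, evaluates: it scores whole entity spans and never reports a score for "O", so the averages are computed over entity types only.

```python
# Minimal sketch with made-up tag sequences, illustrating seqeval's
# entity-level evaluation (the behaviour underlying SeqEvalEvaluator).
from seqeval.metrics import classification_report, f1_score

# seqeval expects one IOB2 tag sequence per document/sentence.
y_true = [["O", "B-DRUG", "I-DRUG", "O", "B-DOSE", "O"],
          ["B-DOSE", "O", "B-DRUG", "O"]]
y_pred = [["O", "B-DRUG", "I-DRUG", "O", "O", "O"],
          ["B-DOSE", "O", "B-DRUG", "O"]]

# Only DRUG and DOSE appear in the report; "O" never contributes to the
# averages, which is why the resulting F1 looks lower than sklearn's default.
print(classification_report(y_true, y_pred))
print("micro-averaged F1 over entities:", f1_score(y_true, y_pred))
```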
We were given a very insightful talk by @Rian-T recently. He touched on the topic of evaluation and mentioned this particular point.
From what I remember, he said that there is no consensus on whether or not to include class O in the evaluation. Perhaps the best option would be to let users choose via a parameter.
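To make the suggestion concrete, here is a purely hypothetical sketch of what such a parameter could look like. This is not the medkit API: the helper name and the include_o flag are made up, and it simply switches between entity-level scoring without "O" (seqeval) and token-level scoring with "O" (sklearn).

```python
# Hypothetical helper, not part of medkit: the name `averaged_f1` and the
# `include_o` flag are made up to illustrate the proposed option.
from itertools import chain

from seqeval.metrics import f1_score as entity_f1
from sklearn.metrics import f1_score as token_f1


def averaged_f1(y_true, y_pred, include_o=False):
    """Macro-averaged F1 over IOB2 tag sequences (one sequence per document)."""
    if include_o:
        # Token-level evaluation: flatten the sequences so "O" is scored
        # like any other label, as sklearn's classification_report does.
        flat_true = list(chain.from_iterable(y_true))
        flat_pred = list(chain.from_iterable(y_pred))
        return token_f1(flat_true, flat_pred, average="macro", zero_division=0)
    # Entity-level evaluation: seqeval ignores "O" by design.
    return entity_f1(y_true, y_pred, average="macro")
```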