Fine-tuning using the medkit library gives worse results than directly using the Hugging Face library.
When fine-tuning a BERT model using the HFEntityMatcherTrainable component from the medkit library, the metrics I got when evaluating it afterward were far worse than with the same training done using the Hugging Face API only.
I noticed this in this tutorial, where engineers trained the same model on the same corpus as I did and got much better results.
After reproducing their code on my machine, I got the same results as they did.
They achieved an average F1 score of approximately 0.90, whereas my medkit-trained BERT model was achieving an average F1 score of 0.62.
After some investigation, I found out that this gap was explained by the fact that, when they evaluated their model with the classification_report function from sklearn, the NO_ENTITY label (usually noted "O") was taken into account in the calculation of the average F1 score.
Since the "O" label is by far the most frequent label in the samples, including it greatly inflates the average F1 score.
When discarding this label from the average F1 calculation, I found that their model performed no better than mine.
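To illustrate the effect, here is a minimal sketch with made-up tag sequences (not the tutorial's actual evaluation code): by default, sklearn's classification_report scores "O" like any other class, and its labels parameter can be used to restrict the report to entity labels only.

```python
# Minimal sketch with made-up token-level IOB2 tags, not the tutorial's data.
from sklearn.metrics import classification_report

y_true = ["O", "O", "B-DRUG", "I-DRUG", "O", "B-DOSE", "O", "O", "O", "O"]
y_pred = ["O", "O", "B-DRUG", "I-DRUG", "O", "O", "O", "B-DOSE", "O", "O"]

# Default behaviour: "O" is scored like any other class. Because it dominates
# real corpora and is almost always predicted correctly, it pulls the
# averaged F1 upward.
print(classification_report(y_true, y_pred, zero_division=0))

# Restricting the report to entity labels removes the "O" contribution and
# gives a much more informative (and usually much lower) average F1.
entity_labels = ["B-DRUG", "I-DRUG", "B-DOSE"]
print(classification_report(y_true, y_pred, labels=entity_labels, zero_division=0))
```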
In the medkit library, the SeqEvalEvaluator discards the F1 score of the "O" label, which lowers the average F1.
This is not a problem in itself, but it would be interesting for SeqEvalEvaluator to offer the option of choosing whether or not to take the "O" label into account, to give a bit more flexibility when evaluating predictions.
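For comparison, here is a minimal sketch (again with made-up tag sequences) of how the seqeval library, which SeqEvalEvaluator relies on, evaluates: it scores whole entity spans and never reports a score for "O", so the averages are computed over entity types only.

```python
# Minimal sketch with made-up tag sequences, illustrating seqeval's
# entity-level evaluation (the behaviour underlying SeqEvalEvaluator).
from seqeval.metrics import classification_report, f1_score

# seqeval expects one IOB2 tag sequence per document/sentence.
y_true = [["O", "B-DRUG", "I-DRUG", "O", "B-DOSE", "O"],
          ["B-DOSE", "O", "B-DRUG", "O"]]
y_pred = [["O", "B-DRUG", "I-DRUG", "O", "O", "O"],
          ["B-DOSE", "O", "B-DRUG", "O"]]

# Only DRUG and DOSE appear in the report; "O" never contributes to the
# averages, which is why the resulting F1 looks lower than sklearn's default.
print(classification_report(y_true, y_pred))
print("micro-averaged F1 over entities:", f1_score(y_true, y_pred))
```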
We were given a very insightful talk by @Rian-T recently. He touched on the topic of evaluation and mentioned this particular point.
From what I remember, he said that there is no consensus on whether or not to include class O in the evaluation. Perhaps the best option would be to let users choose via a parameter.
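To make the suggestion concrete, here is a purely hypothetical sketch of what such a parameter could look like. This is not the medkit API: the helper name and the include_o flag are made up, and it simply switches between entity-level scoring without "O" (seqeval) and token-level scoring with "O" (sklearn).

```python
# Hypothetical helper, not part of medkit: the name `averaged_f1` and the
# `include_o` flag are made up to illustrate the proposed option.
from itertools import chain

from seqeval.metrics import f1_score as entity_f1
from sklearn.metrics import f1_score as token_f1


def averaged_f1(y_true, y_pred, include_o=False):
    """Macro-averaged F1 over IOB2 tag sequences (one sequence per document)."""
    if include_o:
        # Token-level evaluation: flatten the sequences so "O" is scored
        # like any other label, as sklearn's classification_report does.
        flat_true = list(chain.from_iterable(y_true))
        flat_pred = list(chain.from_iterable(y_pred))
        return token_f1(flat_true, flat_pred, average="macro", zero_division=0)
    # Entity-level evaluation: seqeval ignores "O" by design.
    return entity_f1(y_true, y_pred, average="macro")
```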