[WIP] Full update of the header model #580

Merged 80 commits on Aug 11, 2020

kermitt2
Owner

Here is an extensively revised header model:

  • use of clusteror to extract labelled results, better alignment
  • removal of lots of ugly old code (most of it was not actually used)
  • heuristics are no longer applied (but remain usable)
  • review of features (fix old features and start adding some new ones related to font size and spacing)
  • new training format (consistent with the other models)
  • adaptation of the xml parser for the new training data format
  • adaptation of the creation of training data for the header model
  • bootstrap and manual annotation of ~300 examples of headers in the new format
  • annotation guidelines for the training data in the new format

These improvements fix or make obsolete in particular:

This makes the header model consistent with the other models in terms of annotation approach and should eventually improve the labelling accuracy and the quality of the extracted metadata.

The old training data (around 2600 examples) is dropped entirely. 300 examples have been annotated in the new training format, using the actual header parts produced by the segmentation model. Compared to the old model, end-to-end results on PMC are already quite similar for title, keywords, and first author, but lower for the full author list (authors are duplicated more often with the new approach and need deduplication) and for the abstract. This is encouraging. Using the heuristics now degrades the end-to-end result by 2 F-score points on average.
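For readers unfamiliar with the metric, here is a minimal sketch of how a per-field F-score and its average across fields can be computed. The function names and the choice of an unweighted (macro) average are assumptions made for illustration; this is not the evaluation code used to produce the figures above.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores_by_field: dict) -> float:
    """Unweighted mean of per-field scores (assumed averaging scheme)."""
    return sum(scores_by_field.values()) / len(scores_by_field)


# Hypothetical per-field (precision, recall) pairs, for illustration only.
example = {"title": (0.90, 0.80), "abstract": (0.70, 0.60)}
per_field = {field: f_score(p, r) for field, (p, r) in example.items()}
average = macro_average(per_field)
```

A 2-point drop would then correspond to `average` decreasing by 0.02 when the scores are expressed on a 0 to 1 scale.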

To do:

  • more training data (including for the segmentation model, because the header parts now always come from the segmentation model), at least to reach the old model accuracy
  • deduplicate extracted authors and affiliations
  • add to the resulting TEI some extracted fields that are currently ignored (document type, journal title, group/collaboration, ...)
  • remove some labels not used anymore
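
The author deduplication item could be approached along the following lines. This is only a sketch under stated assumptions: authors are plain name strings, and two names count as duplicates when they match after case, accent, punctuation, and whitespace normalization. The helper names `normalize_name` and `deduplicate_authors` are hypothetical and do not correspond to existing code in the project.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(kept.lower().split())


def deduplicate_authors(authors):
    """Keep the first occurrence of each normalized name, preserving order."""
    seen = set()
    unique = []
    for author in authors:
        key = normalize_name(author)
        if key and key not in seen:
            seen.add(key)
            unique.append(author)
    return unique


# Hypothetical input: the same author extracted twice with minor variations.
names = ["Jane Doe", "jane  doe", "John Smith", "Jané Doe"]
result = deduplicate_authors(names)
```

Exact matching after normalization will not merge abbreviated forms such as "J. Doe" with "Jane Doe"; handling those would need a fuzzier comparison.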

@kermitt2 kermitt2 merged commit 670d06d into master Aug 11, 2020
@lfoppiano lfoppiano deleted the update_header branch June 9, 2024 21:59

Successfully merging this pull request may close these issues.

  • Abstract regression on bioRxiv current master
  • publication date not correctly extracted