[WIP] Full update of the header model #580

kermitt2 · 2020-05-10T05:23:29Z

Here is an extensively revised header model:

use of clusteror to extract labelled results, better alignment
removal of lots of ugly old stuff (mostly unused anyway)
heuristics switched off (but still usable)
review of features (fix old features and start adding some new ones related to font size and spacing)
new training format (consistent with the other models)
adaptation of the xml parser for the new training data format
adaptation of the creation of training data for the header model
bootstrap and manual annotation of ~300 examples of headers in the new format
annotation guidelines for the training data in the new format
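
To illustrate the first item above, here is a minimal sketch of what "clustering" labelled results means: consecutive tokens carrying the same label are grouped into one field segment. This is not the GROBID clusteror itself (which is Java and works on layout tokens). The function name, the example tokens, and the label convention (an "I-" prefix opening a new segment) are assumptions made for the illustration.

```python
def cluster_labelled_tokens(labelled_tokens):
    """Group consecutive (token, label) pairs into (label, text) segments.

    Assumption for this sketch: a label prefixed with "I-" opens a new
    segment, and the same label without the prefix continues it.
    """
    clusters = []
    for token, label in labelled_tokens:
        starts_new = label.startswith("I-")
        base = label[2:] if starts_new else label
        if clusters and not starts_new and clusters[-1][0] == base:
            # Same field as the previous token: extend the open segment.
            clusters[-1][1].append(token)
        else:
            # Explicit start marker or a label change: open a new segment.
            clusters.append((base, [token]))
    return [(label, " ".join(tokens)) for label, tokens in clusters]


# Hypothetical labelled sequence, for illustration only.
example = [
    ("Deep", "I-<title>"),
    ("parsing", "<title>"),
    ("Jane", "I-<author>"),
    ("Doe", "<author>"),
    ("John", "I-<author>"),
    ("Roe", "<author>"),
]
segments = cluster_labelled_tokens(example)
```

Keeping the start marker separate from the label name is what lets two adjacent authors come out as two segments instead of one merged string.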

These improvements fix or make obsolete in particular:

duplicated introduction/abstract
some missing text at the beginning of the body sections
lack of robustness with "noisy" content
a number of open issues, including: Different header results using processHeader or processFulltext #281, arXiv identifiers not extracted #275, Update of the header model #136, Header Feature Vector has Static Feature Values #128, train title and abstract only results in poor abstract extraction #430, header training data features columns #531, Annotation Questions on Header #491

This makes the header model consistent with the other models in terms of annotation approach and should eventually improve the labelling accuracy and the quality of extracted metadata.

The old training data (around 2600 examples) is entirely dropped. 300 examples have been annotated in the new training format, using the actual header parts produced by the segmentation model. Compared to the old model, end-to-end results on PMC are already quite similar for title, keywords, and first author, but lower for the full author list (authors are duplicated more often with the new approach and need deduplication) and for the abstract. This is encouraging. Using the heuristics now degrades the end-to-end result by 2 points in F-score on average.
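
For readers less familiar with the metric, here is a small sketch of how a per-field F-score and its average across fields can be computed. The counts below are hypothetical and are not results from this PR. The averaging shown is a plain mean over fields, which is an assumption, since the exact averaging used by the end-to-end evaluation is not stated here.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (F1), 0.0 when undefined."""
    predicted = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (TP, FP, FN) counts per field, for illustration only.
counts = {"title": (90, 10, 10), "abstract": (80, 20, 20)}
per_field = {field: f_score(*c) for field, c in counts.items()}

# Plain mean over fields (macro-average).
macro_average = sum(per_field.values()) / len(per_field)
```

A drop of "2 points" then means a difference of 0.02 between two such averages, for example with and without the heuristics.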

To do:

more training data (including for the segmentation model, because the header parts now always come from the segmentation model), at least to reach the old model accuracy
deduplicate extracted authors and affiliations
add to the resulting TEI some extracted fields that are currently ignored (document type, journal title, group/collaboration, ...)
remove some labels not used anymore
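
As a rough idea for the deduplication item above, the sketch below drops repeated author names after a simple normalization (case, accents, whitespace) and keeps the first occurrence in reading order. It is only an illustration under those assumptions, not what will be implemented. The function names and the example names are hypothetical, and a real deduplication would probably also need to handle initials and name-order variants.

```python
import unicodedata


def normalize_name(name):
    """Build a comparison key: strip accents, fold case, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    return " ".join(without_accents.casefold().split())


def deduplicate(names):
    """Keep the first occurrence of each name, preserving input order."""
    seen = set()
    unique = []
    for name in names:
        key = normalize_name(name)
        if key and key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


# Hypothetical extracted author list, for illustration only.
authors = ["Jane Doe", "JANE  DOE", "José Álvarez", "Jose Alvarez", "John Roe"]
unique_authors = deduplicate(authors)
```

The same key-based approach could be applied to affiliation strings, although affiliations usually vary more and would likely need a looser comparison.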


kermitt2 and others added 30 commits April 19, 2020 03:24

add line number support via pdfalto

f383c3a

update pdfalto

f7edbf9

Review pdfalto parameters; review training data involving line numbers

b46e17a

update resources and pdfalto

5c84f7f

add a feature in fulltext model for superscript tokens

d6d6a8f

update header parser with clusteror; update fields; minor improvements

f2fe98c

some adjustments to avoid regression with PMC 1942

bedb5e3

make the header training xml parser support new standard format

3725b27

update special symbols

8a94bfb

typo

365e1a4

erroneous added delimiters

1a01d39

cleaning

fd3b0af

Merge branch 'line_number_support' into update_header

ef06898

Adding pdfalto for mac

9aaa39a

preparing new header training data format

c93daf9

revert breaking dependency updates

6c510cf

Merge branch 'master' into line_number_support

1884155

update pdfalto

d63b914

fix conflict with master

fbad7ce

Merge branch 'supercript-feature-in-fulltext' into update_header

f47dcc8

keep title as one continuous sequence only

84f9267

cleaning

e51787d

refactoring

4889478

fix test

8f3aa88

big cleaning

7f5a200

fix sync error

34402a9

header heuristics off

ffe16b7

fix rest api citation bug

138b58a

new training data for header model, bootstrapping model

45499e5

updates training data

d8f9543

kermitt2 and others added 18 commits May 25, 2020 01:47

add latest segmentation training files from Luca

33b80e7

updated segmentation model

5aa1352

header training data and guidelines correction

cc9a188

more training

1fad0f0

update XML schemas

baa3e2a

update segmentation model

b8d6dd7

Update label markers of formulas

8decad7

Update figures and tables reference markers

4b39ea8

update header model

d7bb967

Merge branch 'update_header' of https://github.com/kermitt2/grobid into update_header

d20ca48

review feature regex for url

8f0966d

revert

aa93ff4

training data for segmentation model

1603115

update segmentation model

6cda941

add training data header model

cdd061f

update header model

e4d0bdd

merge with master; parallelize pdf processing in end-to-end evaluation

da0c622

Merge branch 'update_header' of https://github.com/kermitt2/grobid into update_header

147e89e

kermitt2 mentioned this pull request Jun 13, 2020

Where to place model files for training #587

Closed

kermitt2 and others added 5 commits June 13, 2020 23:17

minor rephrase

4a6de0f

a few more PMC examples

ce313fb

Merge branch 'master' into update_header

1edc2f0

training data

c4f64e4

update models

55e9919

This was linked to issues Jun 23, 2020

publication date not correctly extracted #109

Closed

Abstract regression on bioRxiv current master #555

Closed

kermitt2 added 2 commits June 24, 2020 06:12

add eval for bioRxiv test

c0bfae0

Merge branch 'master' into update_header

c3a4926

kermitt2 merged commit 670d06d into master Aug 11, 2020

lfoppiano deleted the update_header branch June 9, 2024 21:59
