[WIP] Full update of the header model #580

Merged: 80 commits, Aug 11, 2020
f383c3a
add line number support via pdfalto
kermitt2 Apr 19, 2020
f7edbf9
update pdfalto
kermitt2 Apr 19, 2020
b46e17a
Review pdfalto parameters; review training data involving line numbers
kermitt2 Apr 19, 2020
5c84f7f
update resources and pdfalto
kermitt2 Apr 20, 2020
d6d6a8f
add a feature in fulltext model for superscript tokens
kermitt2 Apr 21, 2020
f2fe98c
update header parser with clusteror; update fields; minor improvements
kermitt2 Apr 22, 2020
bedb5e3
some adjustments to avoid regression with PMC 1942
kermitt2 Apr 22, 2020
3725b27
make the header training xml parser support new standard format
kermitt2 Apr 22, 2020
8a94bfb
update special symbols
kermitt2 Apr 22, 2020
365e1a4
typo
kermitt2 Apr 22, 2020
1a01d39
erroneously added delimiters
kermitt2 Apr 22, 2020
fd3b0af
cleaning
kermitt2 Apr 23, 2020
ef06898
Merge branch 'line_number_support' into update_header
kermitt2 Apr 23, 2020
9aaa39a
Adding pdfalto for mac
lfoppiano Apr 23, 2020
c93daf9
preparing new header training data format
kermitt2 Apr 24, 2020
6c510cf
revert breaking dependency updates
kermitt2 Apr 24, 2020
1884155
Merge branch 'master' into line_number_support
kermitt2 Apr 24, 2020
d63b914
update pdfalto
kermitt2 Apr 24, 2020
fbad7ce
fix conflict with master
kermitt2 Apr 25, 2020
f47dcc8
Merge branch 'supercript-feature-in-fulltext' into update_header
kermitt2 Apr 25, 2020
84f9267
keep title as one continuous sequence only
kermitt2 Apr 25, 2020
e51787d
cleaning
kermitt2 Apr 25, 2020
4889478
refactoring
kermitt2 Apr 26, 2020
8f3aa88
fix test
kermitt2 Apr 26, 2020
7f5a200
big cleaning
kermitt2 Apr 27, 2020
34402a9
fix sync error
kermitt2 Apr 27, 2020
ffe16b7
header heuristics off
kermitt2 Apr 27, 2020
138b58a
fix rest api citation bug
kermitt2 Apr 28, 2020
45499e5
new training data for header model, bootstrapping model
kermitt2 Apr 28, 2020
d8f9543
updates training data
kermitt2 Apr 29, 2020
fb2a63b
minor fix training data parser; training data iterative corrections
kermitt2 Apr 29, 2020
1979a31
more training data; start training guidelines
kermitt2 Apr 30, 2020
2e27f21
Merge branch 'master' into update_header
kermitt2 Apr 30, 2020
d137013
update training data
kermitt2 May 2, 2020
e79d9ee
review author sequence processing
kermitt2 May 2, 2020
57e5bc4
training data corrections
kermitt2 May 3, 2020
58d80a1
Merge branch 'master' into update_header
kermitt2 May 3, 2020
35cd599
extend header guidelines
kermitt2 May 6, 2020
c6e5811
iteration on new header guidelines
kermitt2 May 7, 2020
2cbc183
fix doc index
kermitt2 May 7, 2020
1fba118
improve header model annotation guidelines
kermitt2 May 7, 2020
a35f51d
add training data, author deduplication
kermitt2 May 14, 2020
25fd5af
review training data and labels
kermitt2 May 15, 2020
933837b
re-merge with line number support
kermitt2 May 15, 2020
a62a8fd
for testing the exploitation of all labels
kermitt2 May 15, 2020
5d2b706
bug in doc
kermitt2 May 16, 2020
e7a206d
improving definition of formula
lfoppiano May 16, 2020
3e50f42
adding information about figures and tables captions for the segmenta…
lfoppiano May 16, 2020
ea9085f
more experiments
kermitt2 May 16, 2020
3563d4f
Merge branch 'update_header' of https://github.com/kermitt2/grobid in…
kermitt2 May 16, 2020
d65c014
a bit of re-formulation
kermitt2 May 16, 2020
509760d
update segmentation model
kermitt2 May 17, 2020
8e81ded
Merge branch 'master' into update_header
kermitt2 May 23, 2020
3c04d9d
add header examples corresponding to segmentation cases
kermitt2 May 24, 2020
3345cf8
fix easymock version
kermitt2 May 24, 2020
33b80e7
add latest segmentation training files from Luca
kermitt2 May 24, 2020
5aa1352
updated segmentation model
kermitt2 May 25, 2020
cc9a188
header training data and guidelines correction
kermitt2 May 25, 2020
1fad0f0
more training
kermitt2 May 26, 2020
baa3e2a
update XML schemas
kermitt2 May 26, 2020
b8d6dd7
update segmentation model
kermitt2 May 27, 2020
8decad7
Update label markers of formulas
lfoppiano May 27, 2020
4b39ea8
Update figures and tables reference markers
lfoppiano May 27, 2020
d7bb967
update header model
kermitt2 May 27, 2020
d20ca48
Merge branch 'update_header' of https://github.com/kermitt2/grobid in…
kermitt2 May 27, 2020
8f0966d
review feature regex for url
kermitt2 May 27, 2020
aa93ff4
revert
kermitt2 May 27, 2020
1603115
training data for segmentation model
kermitt2 Jun 1, 2020
6cda941
update segmentation model
kermitt2 Jun 2, 2020
cdd061f
add training data header model
kermitt2 Jun 2, 2020
e4d0bdd
update header model
kermitt2 Jun 3, 2020
da0c622
merge with master; parallelize pdf processing in end-to-end evaluation
kermitt2 Jun 4, 2020
147e89e
Merge branch 'update_header' of https://github.com/kermitt2/grobid in…
kermitt2 Jun 4, 2020
4a6de0f
minor rephrase
kermitt2 Jun 13, 2020
ce313fb
a few more PMC examples
kermitt2 Jun 15, 2020
1edc2f0
Merge branch 'master' into update_header
kermitt2 Jun 20, 2020
c4f64e4
training data
kermitt2 Jun 21, 2020
55e9919
update models
kermitt2 Jun 23, 2020
c0bfae0
add eval for bioRxiv test
kermitt2 Jun 24, 2020
c3a4926
Merge branch 'master' into update_header
kermitt2 Aug 11, 2020
