Update of the header model #136

kermitt2 · 2016-10-01T17:08:09Z

The header model needs a major refresh to be consistent with all the changes made to GROBID over the last three years.

modify pdf2xml to output blocks in reading order, at least for the header segment (see the fork at https://github.com/kermitt2/pdf2xml)
use only the Segmentation model to get the header segment, for both the training data and the files to be processed
modify the training format to be consistent with the rest of GROBID's training data
add some new features (line indent, centered, ...)
fix some issues with the training files: see "Duplicate ID in Header Corpus" #135; in addition, 2 PDF files in the training set have no content
use the full file name as identifier and check for duplicates
carefully review the modified training data format
run regression tests, and hopefully see some improvements ;)
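
The duplicate check mentioned in the list above could be sketched roughly as follows. This is only an illustration, not GROBID's actual code: the function name `find_duplicate_ids` and the example file names are hypothetical, and it assumes the identifier is simply the full file name with any directory part stripped.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_ids(file_names):
    """Return identifiers that are shared by more than one training file.

    Assumption: the identifier is the full file name (directory stripped),
    as suggested in the task list. The result maps each duplicated
    identifier to the list of paths that use it, in input order.
    """
    paths_by_id = defaultdict(list)
    for name in file_names:
        identifier = Path(name).name  # full file name, without directories
        paths_by_id[identifier].append(name)
    # Keep only identifiers that occur more than once
    return {ident: paths for ident, paths in paths_by_id.items() if len(paths) > 1}


# Hypothetical corpus listing; these file names are made up for the example
corpus = [
    "batch1/0001.header.tei.xml",
    "batch2/0001.header.tei.xml",
    "batch1/0002.header.tei.xml",
]
duplicates = find_duplicate_ids(corpus)
for identifier, paths in duplicates.items():
    print(f"duplicate identifier {identifier}: {paths}")
```

Using the whole file name rather than a shortened prefix keeps identifiers distinct whenever the files themselves are distinctly named, so any remaining collision points at a real duplicate in the corpus.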

kermitt2 · 2016-10-01T17:09:57Z

This is a work in progress on the branch modified-pdf2xml...

lfoppiano · 2019-12-23T01:58:54Z

I'm adding a PDF example as a test case for this extension. While pdfalto extracts the text correctly with the -readingOrder option, the model is (I suppose) mixing up the various streams.

I guess that once the header model is updated, this PDF should be extracted normally.

https://arxiv.org/ftp/cond-mat/papers/0111/0111388.pdf

kermitt2 self-assigned this Oct 1, 2016

kermitt2 added the bug and enhancement labels Oct 1, 2016

kermitt2 added this to the 0.4.2 milestone Oct 1, 2016

This was referenced Jul 16, 2017

Running 'createTrainingHeader' via grobid-service #200

Closed

Training header does not work for some papers #7

Closed

kermitt2 mentioned this issue Apr 11, 2018

Hyphen at line break removed #180

Open

kermitt2 mentioned this issue Jul 3, 2019

Author affiliations not extracted correctly #451

Open

kermitt2 mentioned this issue Aug 7, 2019

Duplicated abstract #476

Closed

kermitt2 mentioned this issue Aug 27, 2019

Annotation Questions on Header #491

Open

kermitt2 mentioned this issue Nov 27, 2019

Incomplete teiHeader extracted for paper #520

Open

kermitt2 mentioned this issue May 10, 2020

[WIP] Full update of the header model #580

Merged

lfoppiano added the implemented label Jun 9, 2024

lfoppiano closed this as completed Nov 1, 2024
