Update of the header model #136

Closed
7 tasks done
kermitt2 opened this issue Oct 1, 2016 · 2 comments

@kermitt2
Owner

kermitt2 commented Oct 1, 2016

The header model needs a major refresh to be consistent with how GROBID has evolved over the last three years.

  • modify pdf2xml so that blocks are output in reading order, at least for the header segment; see the fork https://github.com/kermitt2/pdf2xml
  • use only the Segmentation model to get the header segment, for both the training data and the files to be processed
  • modify the training format to be consistent with the rest of GROBID's training data
  • add some new features (line indent, centered, ...)
  • fix some issues with the training files: see "Duplicate ID in Header Corpus" #135; in addition, 2 PDF files in the training set have no content.
    Use the full file name as identifier and check for duplicates
  • carefully review the modified training data format
  • run regression tests, and hopefully see some improvements ;)
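
For the training-file cleanup item above, a minimal sketch of what such a check could look like. This is only an illustration, not GROBID code: `audit_training_files` is a hypothetical helper, and treating a zero-byte file as "no content" is an assumption (the affected PDFs may instead be non-empty files without extractable text).

```python
from collections import defaultdict
from pathlib import Path


def audit_training_files(paths):
    """Group files by full file name; report duplicate names and zero-byte files.

    Hypothetical helper for illustration only, not part of GROBID.
    """
    by_name = defaultdict(list)
    empty = []
    for path in map(Path, paths):
        # The full file name (with extension) is used as the identifier.
        by_name[path.name].append(path)
        # Assumption: "no content" is approximated by a size of zero bytes.
        if path.stat().st_size == 0:
            empty.append(path)
    # Keep only identifiers that occur more than once.
    duplicates = {name: found for name, found in by_name.items() if len(found) > 1}
    return duplicates, empty
```

It could be run over the header corpus with something like `audit_training_files(Path(corpus_dir).rglob("*.pdf"))`, where `corpus_dir` is a placeholder for wherever the training PDFs live.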
@kermitt2 kermitt2 self-assigned this Oct 1, 2016
@kermitt2 kermitt2 added the bug and enhancement labels Oct 1, 2016
@kermitt2 kermitt2 added this to the 0.4.2 milestone Oct 1, 2016
@kermitt2
Owner Author

kermitt2 commented Oct 1, 2016

This is a work in progress on the branch modified-pdf2xml...

@lfoppiano
Collaborator

lfoppiano commented Dec 23, 2019

I'm adding a PDF example as a test case for this extension. While the text is correctly extracted by pdfalto using the -readingOrder option, the model is (I suppose) mixing up the various streams.

I guess that once the header model is updated, this PDF should be extracted normally.

https://arxiv.org/ftp/cond-mat/papers/0111/0111388.pdf

@lfoppiano lfoppiano added the implemented label Jun 9, 2024