Training header does not work for some papers #7

holoxy · 2013-06-08T23:23:41Z

If the bibliographic data is on the second page of the paper, creating the training header file (*.header) doesn't work properly: none of the information on the second page is included in the *.header file, so training does not work.
The same happens for some other articles, where the journal name, printed on the first page, is not included in the *.header files.

kermitt2 · 2013-09-24T17:04:08Z

Grobid currently does not support the case of a cover page. More precisely, the cover page is treated like the first page of any article, and the extraction does its best with the content available on it. The second page (with the normal header) is then ignored. This means that, for the moment, the pre-annotated training data has to be generated from PDFs from which such cover pages have been removed.

We could fix that with a heuristic to detect cover pages, but the best solution is certainly to make the full text model robust enough to detect a cover page. The work on the full text model is planned for the next months!

kermitt2 · 2015-03-13T17:13:30Z

So the segmentation model introduced in version 0.3 now explicitly identifies the cover page. I am waiting to have more training data for this new model before changing the header processing accordingly.

cwenge · 2017-07-13T09:23:51Z

Could you give me a status update on this issue? Is Grobid now capable of looking at the first two pages for bibliographic data?
By the way: how many PDFs/training files would you suggest to train the fulltext model for solving this problem?

kermitt2 · 2017-07-16T11:48:14Z

Hello!

The processing of a PDF now starts with a model, called the segmentation model, that identifies the different "zones" (cover page, header, body, bibliographical references). Creating the training header file relies first on the segmentation model to identify the header zone, and only what has been identified as header zone will be present in the header training data file.

With the segmentation model, there is no longer any limitation in terms of area for the header: it can start on the second page, for instance, and even have some pieces on the last page. If you apply the current "createTrainingHeader", it will use the segmentation model to identify this complex header area (within the limits of the segmentation model in terms of training data and coverage of your particular PDF layouts).
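To make the idea concrete, here is a minimal sketch of how a header zone could be collected from segmentation output, independently of page position. It is not Grobid code: the `Line` type, the zone label strings, and `header_zone` are hypothetical names chosen for illustration, and the labels are written by hand instead of being predicted by a model.

```python
from dataclasses import dataclass

# Hypothetical zone labels; the label names used by Grobid's
# segmentation model may differ.
COVER, HEADER, BODY, REFERENCES = "cover", "header", "body", "references"


@dataclass
class Line:
    page: int   # 1-based page number
    text: str   # text content of the line
    zone: str   # zone label assigned by a (hypothetical) segmentation step


def header_zone(lines):
    """Return the text of every line labelled as header, in document order.

    The page number is deliberately ignored: the header may start on the
    second page and have pieces on the last page.
    """
    return [ln.text for ln in lines if ln.zone == HEADER]


# Hand-labelled toy document with a cover page on page 1.
doc = [
    Line(1, "Downloaded from a repository", COVER),
    Line(2, "A Study of Things", HEADER),
    Line(2, "Jane Doe", HEADER),
    Line(2, "1. Introduction", BODY),
    Line(9, "References", REFERENCES),
    Line(10, "Corresponding author: jane@example.org", HEADER),
]

header_lines = header_zone(doc)
```

In this toy document the cover page text is never passed on, while the header pieces from pages 2 and 10 end up together in `header_lines`, which is the part that would go into the header training file.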

Right now processHeader does not yet use the segmentation model; it uses heuristics to identify the header area. The reason is that the training data is adapted to this kind of "heuristics-based" header. I am working on updating the training data to work with the segmentation model, see issue #136.

kermitt2 · 2020-08-13T15:11:23Z

No more heuristics: the segmentation model identifies the cover page (when it works!) and the header as distinct zones, and the header zone is sent to the header model. The same applies when generating training data, which is now based on the header zone and not on the cover page.

kermitt2 self-assigned this Mar 13, 2015

kermitt2 mentioned this issue Aug 7, 2019

Duplicated abstract #476

Closed

kermitt2 added the "implemented" label (the issue has been implemented) and removed the "enhancement" label Aug 13, 2020

lfoppiano closed this as completed Oct 31, 2022
