Training header does not work for some papers #7
Comments
Grobid currently does not support cover pages. More precisely, a cover page is treated like the first page of any article, and the extraction does its best with the content available on it. The second page (with the normal header) is then ignored. This means that, for the moment, the pre-annotated training data has to be generated from PDFs where such a cover page has been removed. We could fix that with a heuristic to detect cover pages, but the best solution is certainly to make the full text model robust enough to detect a cover page. The work on the full text model is planned for the next months!
So the segmentation model introduced in version 0.3 now explicitly identifies the cover page. I am waiting to have more training data for this new model before changing the header processing accordingly.
Could you give me a status update on this issue? Is Grobid now capable of looking at the first two pages for bibliographic data?
Hello! PDF processing now starts with the segmentation model, which identifies the different "zones" (cover page, header, body, bibliographical references). Creating the header training file relies first on the segmentation model to identify the header zone, and only what has been identified as header zone will be present in the header training data file. With the segmentation model, there is no longer any limitation in terms of area for the header: it can start on the second page, for instance, and even include some pieces from the last page. If you apply the current "createTrainingHeader", it will use the segmentation model to identify this complex header area (subject to the limitations of the segmentation model in terms of training data and coverage of your particular PDF layouts). Right now
No more heuristics: the segmentation model identifies the cover page (when it works!) and the header as distinct zones, and the header zone is sent to the header model. The same applies when generating training data, which is based on the header zone and not on the cover page.
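To illustrate the zone-based flow described above, here is a minimal Python sketch. It is not Grobid code: the `Line` type, the zone label strings, and the `header_zone_text` helper are all hypothetical names invented for this example, and the sample lines are made up. It only shows the idea that the header input is whatever was labelled as header, on any page, while cover-page lines are left out.

```python
from dataclasses import dataclass

# Hypothetical zone labels; the real segmentation model's label set may differ.
COVER, HEADER, BODY, REFERENCES = "cover", "header", "body", "references"


@dataclass
class Line:
    page: int  # 1-based page number
    zone: str  # label assigned by the (hypothetical) segmentation step
    text: str


def header_zone_text(lines):
    """Collect every line labelled as header, whatever page it is on."""
    return [ln.text for ln in lines if ln.zone == HEADER]


# Made-up document: cover page first, header on page 2, one header piece on the last page.
doc = [
    Line(1, COVER, "Downloaded from an example repository"),
    Line(2, HEADER, "A Study of Something"),
    Line(2, HEADER, "A. Author, B. Author"),
    Line(2, BODY, "1. Introduction"),
    Line(9, REFERENCES, "[1] Some reference"),
    Line(10, HEADER, "Corresponding author: a.author@example.org"),
]

# Only these lines would be passed on to the header step or written to training data.
header_lines = header_zone_text(doc)
```

Under these assumptions, the cover-page line never reaches the header step, and the header piece on the last page is still collected even though it is far from the first page.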
If the bibliographic data is on the second page of the paper, creating the training header file (*.header) doesn't work properly: none of the information on the second page is included in the *.header file, so training does not work.
The same happens with some other articles, where the journal name, printed on the first page, is not included in the *.header files.