Author affiliations not extracted correctly #451

Open
danielrbrowne opened this issue Jun 28, 2019 · 4 comments
Labels: error cases (Some error/test case for future improvements), models:affiliation

Comments

@danielrbrowne

Attached is an example of a paper (converted to PDF in LibreOffice) where author affiliations are orphaned from their associated authors (i.e. a separate <author> element is present with a nested <affiliation>), and some of the affiliations are missing altogether. I also noted that the last author has not been extracted at all. This was using the /processHeaderDocument endpoint.

Manuscript (1).docx
Manuscript (1).pdf
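
To make the "orphaned affiliation" symptom easier to reproduce, here is a minimal sketch that scans a TEI result for <author> elements carrying an <affiliation> but no <persName>. The sample TEI fragment, the names in it, and the function name `find_orphaned_affiliations` are made up for illustration; they are not taken from the attached documents, and the real /processHeaderDocument output may be structured differently in detail.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in the result document.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def find_orphaned_affiliations(tei_xml):
    """Return the text of affiliations whose <author> has no <persName>."""
    root = ET.fromstring(tei_xml)
    orphans = []
    for author in root.iter("{http://www.tei-c.org/ns/1.0}author"):
        if author.find("tei:persName", TEI_NS) is not None:
            continue  # this author has a name, so its affiliations are attached
        for aff in author.findall("tei:affiliation", TEI_NS):
            # Collapse whitespace so the affiliation reads as one line.
            orphans.append(" ".join("".join(aff.itertext()).split()))
    return orphans


# Hypothetical fragment (assumption): one normal author, one orphaned affiliation.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <persName><forename type="first">Jane</forename><surname>Doe</surname></persName>
      <affiliation key="aff0"><orgName type="institution">Example University</orgName></affiliation>
    </author>
    <author>
      <affiliation key="aff1"><orgName type="institution">Example Institute</orgName></affiliation>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>
"""

if __name__ == "__main__":
    print(find_orphaned_affiliations(SAMPLE_TEI))
```

Running a check like this over the TEI returned for each attached PDF would give a quick count of orphaned affiliations per document.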

@danielrbrowne
Author

Here is another example where orphaned affiliations are extracted from a manuscript. I'm guessing that in this specific case the line numbers could be throwing the extraction heuristics off, but I'm not sure:
671727.full.pdf

@danielrbrowne
Author

Another example. I am attaching this one as an extreme case, and I acknowledge there may not be anything sensible that Grobid can do, considering there's no actual cross-referencing in the manuscript between the authors and their affiliations. I.e. it's a layout a human can understand easily, but a machine learning model maybe less so:

(asce)1532-3641(2001)1&3c1(21).pdf

@danielrbrowne
Author

Another example of a poorly extracted title, authors and affiliations. The abstract has also not been extracted at all. This was using the /processHeaderDocument endpoint. I think the running theme with at least some of these documents is the inclusion of line numbers throwing off Grobid?
latex 1.pdf

@kermitt2
Owner

kermitt2 commented Jul 3, 2019

Thanks a lot @danielrbrowne for the problematic use cases, it's useful to have them together with a description of the issues.

Indeed, the review format with line numbers is not well supported by GROBID for the moment and would require more layout analysis/features. It's not a problem of heuristics; it really breaks the machine learning, which is trained on uninterrupted field sequences. I think these line numbers explain most of the errors (fields not interrupted by these numbers are usually pretty okay).
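
As a possible stopgap on the user side (not something GROBID does, and not a substitute for proper layout features), one could try removing the line numbers from the extracted text before further processing. The sketch below assumes the text is already available as a list of lines with the line number at the start of each line, which will not hold for every PDF; the function name `strip_line_numbers` and the `min_run` parameter are hypothetical.

```python
import re

# A leading integer of up to 4 digits followed by whitespace and the rest of the line.
_LEADING_NUMBER = re.compile(r"^\s*(\d{1,4})\s+(.*)$")


def strip_line_numbers(lines, min_run=3):
    """Drop leading integers that form a run of consecutive values across lines.

    A leading integer is only treated as a line number if it belongs to a run
    of at least `min_run` consecutive lines numbered n, n+1, n+2, ...
    Other lines (e.g. a sentence starting with a year) are left untouched.
    """
    matches = [_LEADING_NUMBER.match(line) for line in lines]
    numbers = [int(m.group(1)) if m else None for m in matches]

    is_numbered = [False] * len(lines)
    i = 0
    while i < len(lines):
        j = i
        # Extend the run while the next line's number is exactly one higher.
        while (j + 1 < len(lines) and numbers[j] is not None
               and numbers[j + 1] == numbers[j] + 1):
            j += 1
        if numbers[i] is not None and j - i + 1 >= min_run:
            for k in range(i, j + 1):
                is_numbered[k] = True
        i = j + 1

    return [matches[k].group(2) if is_numbered[k] else lines[k]
            for k in range(len(lines))]


if __name__ == "__main__":
    sample = [
        "1 Deep learning for widgets",
        "2 Jane Doe, John Smith",
        "3 Example University",
        "2019 was a good year",
    ]
    for line in strip_line_numbers(sample):
        print(line)
```

Requiring a run of consecutive values is a deliberately conservative choice, so that isolated leading numbers such as years or section numbers are not removed.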

Regarding the first document, the layout of the header looks simple to us, but it is a bit unusual compared to the existing training data (an affiliation without address or country, like the NIEHS, for example). It's an interesting case which could typically be tackled, I think, by adding a couple of examples like that to the training data.

About the third, I think GROBID is doing great given the "no worry" affiliation list without any cross-referencing. This is really a layout never seen in the training data. Covering that would be a longer-term goal, I think (affiliation attachment is heuristics-based).

The current header model needs to be reworked entirely. It's the oldest model, and there is quite a lot of new information and improvements that could be used now, in particular the new reading order from PDF, spacing, etc. The open issue on this is from 2016 ...
#136

It requires quite a lot of work, in particular updating all the existing training data, so it's hard to plan and execute given that this project remains a side project for the contributors. It's easier to complete small "low hanging fruit" tasks :)

Thanks again for all these test cases, they are always welcome.

@kermitt2 kermitt2 added the error cases Some error/test case for future improvements label Jul 3, 2019