Author affiliations not extracted correctly #451
Comments
Here is another example where orphaned affiliations are extracted from a manuscript. I'm guessing that in this specific case the line numbers could be throwing the extraction heuristics off, but I'm not sure.
Another example. I am attaching this one as an extreme case, and I acknowledge there may not be anything sensible that Grobid can do, considering there's no actual cross-referencing in the manuscript between the authors and their affiliations. That is, it's a layout a human can understand easily, but ML maybe less so.
Another example of a poorly extracted title, authors and affiliations. The abstract was also not extracted at all. This was using the
Thanks a lot @danielrbrowne for the problematic use cases, it's useful to have them together with a description of the issues.

Indeed, the review format with line numbers is not well supported by GROBID for the moment and would require some more layout analysis/features. It's not a problem of heuristics; the line numbers really break the machine learning, which is trained on uninterrupted field sequences. These line numbers explain most of the errors, I think (fields not interrupted by these numbers are usually pretty okay).

Regarding the first document, the layout of the header looks simple to us, but it is a bit unusual compared to the existing training data (an affiliation without address or country, like the NIEHS for example). It's an interesting case which could typically be tackled, I think, by adding a couple of examples like that to the training data.

About the third, I think GROBID is doing great given the "no worry" affiliation list without any cross-referencing. This is really a layout never seen in the training data. Covering that would be a more long-term goal, I think (affiliation attachment is heuristics-based).

The current header model needs to be reworked entirely. It's the oldest model, and there is quite a lot of new information and improvements that could be used now, in particular the new reading order from the PDF, spacing, etc. The open issue on this is from 2016 ... It requires quite a lot of work, in particular updating all the existing training data, so it's hard to plan/execute given that this project remains a side work for the contributors. It's easier to realize small "low hanging fruit" tasks :)

Thanks again for all these test cases, they are always welcome.
Attached is an example of a paper (when converted to PDF in LibreOffice) where author affiliations are orphaned from their associated authors (i.e. a separate <author> is present with a nested <affiliation>), as well as some of the affiliations being missing altogether. I also noted the last author has not been extracted at all. This was using the /processHeaderDocument endpoint.

Manuscript (1).docx
Manuscript (1).pdf
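
For anyone triaging similar documents, here is a minimal sketch of how the "orphaned affiliation" pattern described above could be detected in header output. It assumes the orphan shows up as an <author> element that has a nested <affiliation> but no <persName>, inside the TEI namespace. The function name find_orphaned_affiliations and the SAMPLE_TEI fragment are illustrative only: the fragment is hand-written for this example and is not actual GROBID output for the attached manuscript.

```python
import xml.etree.ElementTree as ET
from typing import List

# TEI namespace used for the element lookups below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def find_orphaned_affiliations(tei_xml: str) -> List[str]:
    """Return the text of affiliations nested in <author> elements
    that have no <persName> child (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    orphans = []
    for author in root.iterfind(".//tei:author", TEI_NS):
        if author.find("tei:persName", TEI_NS) is not None:
            continue  # a named author: its affiliations are attached
        for aff in author.iterfind("tei:affiliation", TEI_NS):
            # Collapse whitespace across all nested text nodes.
            orphans.append(" ".join("".join(aff.itertext()).split()))
    return orphans


# Hand-written fragment for illustration, NOT real GROBID output.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <persName><forename>Jane</forename><surname>Doe</surname></persName>
      <affiliation><orgName>Example University</orgName></affiliation>
    </author>
    <author>
      <affiliation><orgName>Example Institute</orgName></affiliation>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>
"""

if __name__ == "__main__":
    print(find_orphaned_affiliations(SAMPLE_TEI))
```

In practice the XML string would be the response body returned by the header endpoint for the PDF. A check like this only flags the symptom so that problem documents can be collected; it does not re-attach affiliations to authors.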