Author affiliations not extracted correctly #451

danielrbrowne · 2019-06-28T10:18:49Z

Attached is an example of a paper (converted to PDF in LibreOffice) where author affiliations are orphaned from their associated authors (i.e. a separate <author> is present with a nested <affiliation>), and some of the affiliations are missing altogether. I also noted that the last author has not been extracted at all. This was using the /processHeaderDocument endpoint.

Manuscript (1).docx
Manuscript (1).pdf
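
The orphaned-affiliation pattern described above can be checked for mechanically in the returned TEI. The following is a minimal sketch, not part of GROBID: the function name `find_orphaned_affiliations` and the sample document are made up for illustration, and it assumes the output uses the standard TEI namespace with each author's name in a `<persName>` child of `<author>`.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the header TEI uses the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def find_orphaned_affiliations(tei_xml: str) -> List[str]:
    """Return the text of affiliations nested in an <author> that has no <persName>."""
    root = ET.fromstring(tei_xml)
    orphans = []
    for author in root.iterfind(".//tei:author", TEI_NS):
        # An author entry with a name is treated as correctly attached.
        if author.find("tei:persName", TEI_NS) is not None:
            continue
        for aff in author.iterfind("tei:affiliation", TEI_NS):
            # Collapse whitespace so nested elements read as one string.
            orphans.append(" ".join("".join(aff.itertext()).split()))
    return orphans


# Hypothetical sample, hand-written for this sketch (not real GROBID output).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <persName><forename>Jane</forename><surname>Doe</surname></persName>
    </author>
    <author>
      <affiliation><orgName type="institution">Example University</orgName></affiliation>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>"""

if __name__ == "__main__":
    print(find_orphaned_affiliations(SAMPLE))
```

Running a check like this over a batch of processed headers would give a rough count of how often affiliations end up detached, which may help when comparing documents with and without line numbers.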


danielrbrowne · 2019-06-28T10:31:33Z

Here is another example where orphaned affiliations are extracted from a manuscript. I'm guessing that in this specific case the line numbers could be throwing the extraction heuristics off, but I'm not sure:
671727.full.pdf

danielrbrowne · 2019-06-28T10:36:12Z

Another example. I am attaching this one as an extreme example, and I acknowledge there may not be anything sensible that Grobid can do, considering there's no actual cross-referencing in the manuscript between the authors and their affiliations. I.e. it's a layout a human can understand easily, but ML maybe less so:

(asce)1532-3641(2001)1&3c1(21).pdf

danielrbrowne · 2019-07-03T09:47:03Z

Another example of poorly extracted title, authors and affiliations. Also the abstract has failed to be extracted at all. This was using the /processHeaderDocument endpoint. I think the running theme with at least some of these documents seems to be the inclusion of line numbers throwing off Grobid?
latex 1.pdf

kermitt2 · 2019-07-03T22:50:21Z

Thanks a lot @danielrbrowne for the problematic use cases, it's useful to have them together with a description of the issues.

Indeed, the review format with line numbers is not something supported well by GROBID for the moment and would require some more layout analysis/features. It's not a problem of heuristics, it's really breaking the machine learning, which is trained on uninterrupted field sequences. These line numbers explain most of the errors I think (usually fields not interrupted by these numbers are pretty okay).

Regarding the first document, the layout of the header looks simple for us, but a bit unusual when compared to the existing training data (affiliation without address or country like the NIEHS for example). It's an interesting case which could be typically tackled I think by adding a couple examples like that in the training data.

About the third, I think GROBID is doing great given the "no worry" affiliation list without any cross-referencing - this is really a layout never seen in the training data. Covering that would be a longer-term goal I think (affiliation attachment is heuristics-based).

The current header model needs to be reworked entirely: it's the oldest model, and there is quite a lot of new information and improvements that could be used now - in particular the new reading order from PDF, spacing, etc. The open issue on this is from 2016 ...
#136

It requires quite a lot of work, in particular updating all the existing training data, so it's hard to plan/execute given that this project remains a side work for the contributors. It's easier to realize small "low hanging fruit" tasks :)

Thanks again for all these test cases, they are always welcome.

kermitt2 added the "error cases" label (Some error/test case for future improvements) Jul 3, 2019

This was referenced Jul 5, 2019

Author affiliation not associated correctly #309

Open

Several Dates Parsing issue #417

Closed

lfoppiano added the models:affiliation label Nov 25, 2024
