Handling Jr., Sr. in names (Affiliation and Citation) #196
Comments
I am trying GROBID with one of your files, grobid-example\src\test\resources\Wang_paperAVE2008.pdf, which I copied into \test\in for the following test. I changed the author name to "Rui Wang Jr". On Windows 7, 64-bit, with createTrainingHeader: I am looking at the generated Wang_paperAVE2008.authors.tei.xml file. Here the results are more accurate and meet my requirements. Most of the content is also present in the output. Then I ran processHeader: I am looking only at the author area of the Wang_paperAVE2008.tei.xml file. Here the element identification is mostly wrong. Both commands load the same model files, except that the first one also uses segmentation\model.wapiti. I am looking at the content of the XML and am not concerned with the structure. The difference I see is that createTrainingHeader works more accurately. |
Thanks @dominic-sps for reporting these issues! For your second post, see issue #200 for explanations and how to have the same via For the first one, these are indeed two separate issues:
|
I have access to major STM publishers' (SpringerNature, Elsevier, Wiley & TnF) header XML files but not the PDFs. |
In GROBID, author names in reference citations (your first post in this issue) are structured with a different model than author names in the header. For reference citation, the model You can also simply add examples based on the string of authors in the reference string, see the examples under |
Thank you, the citation-related training is noted. I would like to know more about training the header part. I am currently trying to structure raw (new, unpublished) manuscripts in Word format into usable XML. I automatically clean the Word document and convert it to PDF, then use GROBID to process the PDF file. At first we are targeting only the header part, not the body or the references. Now to create the training datasets to process my header part
Appreciate your suggestion on the above. |
Normally, you have to use |
I am reopening this issue because I will work both on better post-processing of initials and on adding more suffix examples to the training data for authors in the header. |
Thank you for reopening this request. Earlier I fixed it temporarily in
I am not sure how to tag the below names in my training corpus |
Hello! I think this is correct to have
<biblStruct>
  <analytic>
    <title level="a" type="main">Climatological observations and predicted sublimation rates at Lake Hoare</title>
    <author>
      <persName xmlns="http://www.tei-c.org/ns/1.0">
        <forename type="first">G</forename>
        <forename type="middle">D</forename>
        <surname>Clow</surname>
      </persName>
    </author>
    <author>
      <persName xmlns="http://www.tei-c.org/ns/1.0">
        <forename type="first">C</forename>
        <forename type="middle">P</forename>
        <surname>Mckay</surname>
      </persName>
    </author>
    <author>
      <persName xmlns="http://www.tei-c.org/ns/1.0">
        <forename type="first">G</forename>
        <forename type="middle">M</forename>
        <surname>Simmons</surname>
        <genName>Jr</genName>
      </persName>
    </author>
    <author>
      <persName xmlns="http://www.tei-c.org/ns/1.0">
        <forename type="first">R</forename>
        <forename type="middle">A</forename>
        <surname>Wharton</surname>
        <genName>Jr</genName>
      </persName>
    </author>
  </analytic>
  <monogr>
    <title level="j">Antarctica. Journal of Climate</title>
    <imprint>
      <biblScope unit="volume">1</biblScope>
      <biblScope unit="page" from="715" to="728" />
      <date type="published" when="1988" />
    </imprint>
  </monogr>
</biblStruct>
I think this corresponds to the expected result and formatting. More training data for suffixes like Sr. and Jr. would be very welcome; there are almost no examples right now. In the training data, I have annotated the sequence
So the block of initials is annotated as
As this sequence of names is now present in the training data, the above result is not a surprise; it is a way of checking that a correctly tagged sequence gets well structured and normalised. I think that getting similarly good results in a robust manner for new names in a different order, with Jr., Sr. and other suffixes, will require a few more relevant cases in the training data, but only a few! |
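To check a result like the one above without reading the XML by eye, a short script can list each author's forenames, surname and suffix. This is only a sketch in Python using the standard library: the helper name `list_authors` and the shortened `SAMPLE` string are made up for illustration and are not part of GROBID. It assumes, as in the output above, that the TEI namespace is declared on each `persName` element.

```python
import xml.etree.ElementTree as ET

# Namespace declared on each <persName> in the output shown above.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def list_authors(bibl_xml):
    """Return one dict per <persName>: forenames, surname, suffix (genName)."""
    root = ET.fromstring(bibl_xml)
    authors = []
    for pers in root.iter("{%s}persName" % TEI_NS):
        forenames = [f.text for f in pers.findall("{%s}forename" % TEI_NS)]
        authors.append({
            "forenames": forenames,
            "surname": pers.findtext("{%s}surname" % TEI_NS),
            # findtext returns None when there is no <genName> child
            "suffix": pers.findtext("{%s}genName" % TEI_NS),
        })
    return authors


# Hypothetical, shortened sample modelled on the result above.
SAMPLE = """<biblStruct><analytic>
<author><persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">G</forename><forename type="middle">M</forename>
<surname>Simmons</surname><genName>Jr</genName></persName></author>
<author><persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">G</forename><forename type="middle">D</forename>
<surname>Clow</surname></persName></author>
</analytic></biblStruct>"""

if __name__ == "__main__":
    for author in list_authors(SAMPLE):
        print(author)
```

An author whose `suffix` is `None` has no `genName`, which makes it easy to spot cases where "Jr" ended up in the surname instead.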
Great, thank you. I'll check this out. Regarding suffix samples, there is a lot of training data already available
Not sure about the purpose of the "exclude" file. If you want full references with different prefixes and suffixes, I can arrange that. |
I assembled this file with suffixes and unusual examples of names for this purpose, but using it resulted in a 2-4% loss of accuracy for author name recognition, so I have excluded it from the training. I suppose the problem is that it contains only names in isolation, not sequences of names as found in academic papers. It might also create an over-representation of this kind of unusual name in the trained model. So lesson learned: the best is to use actual data as found in academic papers, and not artificially compiled stuff like this file ;) |
I am using the latest version, 0.4.2, and checked the following issues on Windows 7 as well as CentOS 7.
Reference Citation Sample checked:
Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr. 1988. Climatological observations and predicted sublimation rates at Lake Hoare, Antarctica. Journal of Climate 1:715-728.
Issue 1. It changes the forename "GD" to "Gd", "CP" to "Cp", etc.
Issue 2. It captures "Jr." as a surname and tags "GM" as a separate surname without a forename.
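To make the expected behaviour for the two issues concrete, here is a rough rule-based sketch in Python. It is not how GROBID works (GROBID uses trained models), and the function name `parse_author`, the `SUFFIXES` set and the "at most three capital letters" rule are assumptions chosen only for illustration. It shows a block of initials such as "GM" being split into separate forenames instead of being turned into "Gm", and "Jr." being kept as a suffix instead of a surname.

```python
# Assumed list of generational suffixes, for illustration only.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def parse_author(raw):
    """Split one author string such as 'Simmons Jr. GM' into its parts."""
    forenames, surname_parts, suffix = [], [], None
    for token in raw.replace(",", " ").split():
        bare = token.rstrip(".")
        if bare.lower() in SUFFIXES:
            # 'Jr.' or 'Sr.' is a generational suffix, not a surname
            suffix = bare
        elif bare.isalpha() and bare.isupper() and len(bare) <= 3:
            # Assumed heuristic: a short all-caps token is a block of initials,
            # so 'GM' becomes 'G' and 'M' rather than 'Gm'
            forenames.extend(bare)
        else:
            surname_parts.append(token)
    return {
        "forenames": forenames,
        "surname": " ".join(surname_parts),
        "suffix": suffix,
    }


if __name__ == "__main__":
    for name in ["Clow GD", "McKay CP", "Simmons Jr. GM", "Wharton RA, Jr."]:
        print(parse_author(name))
```

For "Simmons Jr. GM" this gives the forenames "G" and "M", the surname "Simmons" and the suffix "Jr". The heuristic is deliberately naive: a short surname written in capitals would be mistaken for initials.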
For the suffix issue, attached a PDF from NCBI related to
grobid-trainer/resources/dataset/name/header/corpus/1468-6708-3-10.authors.tei.xml
1468-6708-3-10.pdf