Questions about data annotation in GROBID #1067
Hello kermitt2,
I'm currently working on annotating data for the GROBID project and have a few questions regarding the annotation process. I would appreciate it if you could provide some guidance on the following issues:
In the General Principles section of the documentation, it is mentioned that the text flow should not be changed. Does this mean that neither the order nor the content of the text flow in the pre-annotated data may be altered or removed? Or does it mean that the content inside each XML text node cannot be modified, but the order of the nodes can be freely adjusted? For example, in the pre-annotated data of references.referenceSegmenter.tei.xml, because of the order in which the text is stored in the PDF, the extracted text flow places the content of each reference before its number, with some main text content mixed in between. In this case, am I allowed to:
When I finish modifying segmentation.xml and proceed to modify fulltext.xml, I find that some content passed into fulltext.xml does not belong to the body, or that some body content was recognized as front during the segmentation stage. In this case, should I remove the content that does not belong to the body and add back the missing body content? Additionally, I would like to know whether I am allowed to adjust the order of text tokens if the extracted body order does not follow the human reading order (while still ensuring that the text child nodes remain unchanged).
I look forward to your response, and thank you for your assistance!
Best regards!
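For reference, the two readings of "the text flow should not be changed" can be told apart mechanically. The following is a minimal Python sketch, not part of GROBID: the helper names `text_tokens`, `same_text_flow`, and `same_tokens_any_order` are hypothetical, and it assumes that whitespace differences do not matter when comparing the pre-annotated file with the edited one. It does not say which reading the guidelines intend; it only checks which one an edit satisfies.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_tokens(xml_string):
    """Return the whitespace-separated tokens of an XML document, in document order.

    Tags and attributes are ignored; only text and tail content is kept.
    """
    root = ET.fromstring(xml_string)
    return "".join(root.itertext()).split()


def same_text_flow(original_xml, edited_xml):
    """Strict reading: same tokens in the same order (only the markup changed)."""
    return text_tokens(original_xml) == text_tokens(edited_xml)


def same_tokens_any_order(original_xml, edited_xml):
    """Loose reading: same tokens, but their order may have been changed."""
    return Counter(text_tokens(original_xml)) == Counter(text_tokens(edited_xml))


# Hypothetical miniature example: the reference text comes before its number.
original = "<text><bibl>Smith J. Title.<lb/></bibl> <label>[1]</label></text>"
# Only the tags were changed; the text flow is untouched.
retagged = "<text><listBibl><bibl>Smith J. Title.<lb/> [1]</bibl></listBibl></text>"
# The number was moved in front of the reference text.
reordered = "<text><label>[1]</label> <bibl>Smith J. Title.<lb/></bibl></text>"

print(same_text_flow(original, retagged))          # True
print(same_text_flow(original, reordered))         # False
print(same_tokens_any_order(original, reordered))  # True
```

An edit that passes `same_text_flow` changes only the markup, while one that passes only `same_tokens_any_order` has moved text around without adding or deleting any of it.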