Running 'createTrainingHeader' via grobid-service #200
Comments
Hi @dominic-sps, the reason this functionality is not included in grobid-service is that, at the moment, the creation of training data is a separate offline operation from the processing.
This is what I am trying now. I have merged the author and affiliation training XMLs and created my own XML. The reason is that I noticed the training XMLs have more details than the output created by 'processHeaderDocument', which actually eats up a lot of content.
Ok, what do you mean with The training XMLs (they are also produced together with feature list text files) are meant to be manually corrected by expert users. Could you also be more specific about what you intend to do with the XMLs? I'm asking because I'm not sure I've understood what you are trying to achieve :-)
Hi Dominic,
Making the header model up-to-date with the segmentation model is a work in progress, see issue #136. It takes time because it requires refreshing the current training data. So having
@dominic-sps if you want to have the same material, in lines 81-82 and 88-89 change this:

    retVal = engine.processHeader(originFile.getAbsolutePath(), consolidate, null);
    //retVal = engine.segmentAndProcessHeader(originFile, consolidate, null);

into this:

    //retVal = engine.processHeader(originFile.getAbsolutePath(), consolidate, null);
    retVal = engine.segmentAndProcessHeader(originFile, consolidate, null);

It should work... but given the current training data for the header model, the accuracy of the header model based on areas identified by the segmentation model is ~3% lower than with heuristics-based identification of the header area (which is why I have not yet switched to the new approach for header structuring). This ~3% figure comes from the end-to-end evaluation with 1943 PDF files of PubMedCentral.
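For anyone who wants to compare the two approaches without commenting lines in and out each time, here is a minimal sketch of the same switch driven by a flag. `HeaderEngine`, `extractHeader` and `useSegmentation` are hypothetical names made up for this illustration. The interface only mirrors the shape of the two calls quoted above (with the third argument typed as `Object` for simplicity) and is not GROBID's actual `Engine` class.

```java
import java.io.File;

class HeaderModeSketch {

    /** Hypothetical stand-in mirroring the two calls quoted above; NOT GROBID's real Engine class. */
    interface HeaderEngine {
        String processHeader(String pdfPath, boolean consolidate, Object result);
        String segmentAndProcessHeader(File pdf, boolean consolidate, Object result);
    }

    /**
     * Picks the header extraction path from a flag:
     * true  = header area identified by the segmentation model,
     * false = header area identified by heuristics.
     */
    static String extractHeader(HeaderEngine engine, File originFile,
                                boolean consolidate, boolean useSegmentation) {
        if (useSegmentation) {
            return engine.segmentAndProcessHeader(originFile, consolidate, null);
        }
        return engine.processHeader(originFile.getAbsolutePath(), consolidate, null);
    }

    public static void main(String[] args) {
        // A fake engine that only reports which path was taken.
        HeaderEngine fake = new HeaderEngine() {
            public String processHeader(String pdfPath, boolean consolidate, Object result) {
                return "heuristics: " + pdfPath;
            }
            public String segmentAndProcessHeader(File pdf, boolean consolidate, Object result) {
                return "segmentation: " + pdf.getName();
            }
        };
        System.out.println(extractHeader(fake, new File("sample.pdf"), false, true));
    }
}
```

With a flag like this, the same set of PDFs could be run through both paths and the outputs compared side by side.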
@kermitt2, it does load the segmentation model after the above change, but the expected tag is still missing in the output.
Could you send me maybe an example where |
Screen messages
Refer the person name tag in
Files Created (renamed as txt)
Server side screen messages
Output file: refer to the person name tag in the 22.txt file
Many thanks! There was a bug in the way name suffixes were set (nothing to do with the model or training data). It is fixed and works with your example after commit b738f1f.
This is not an issue but something that is missing from the grobid-service module. Is there any way I could run "createTrainingHeader" as a service? I would appreciate any help in this regard. It could merge all 4 files that it currently creates into a single file and send that back in the response.
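To make the "single file" idea concrete, here is a rough sketch of how the generated files could be bundled into one text payload. Everything in it is an assumption made for illustration: the class name `TrainingBundleSketch`, the `=====` delimiter format and the file names used in `main` are invented and are not what GROBID produces or exposes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TrainingBundleSketch {

    /** Concatenates named parts into one string, each preceded by a "===== name =====" line. */
    static String bundle(Map<String, String> contentsByName) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : contentsByName.entrySet()) {
            sb.append("===== ").append(entry.getKey()).append(" =====\n");
            sb.append(entry.getValue());
            // Make sure every part ends with a newline so the next delimiter starts on its own line.
            if (!entry.getValue().endsWith("\n")) {
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    /** Reads the given files as UTF-8 and bundles them in the order supplied. */
    static String bundleFiles(List<Path> files) {
        Map<String, String> contents = new LinkedHashMap<>();
        for (Path file : files) {
            try {
                contents.put(file.getFileName().toString(),
                        new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
        return bundle(contents);
    }

    public static void main(String[] args) {
        // Hypothetical part names; the real generated file names are not assumed here.
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("header-part.xml", "<tei/>");
        parts.put("features-part.txt", "token feature1 feature2\n");
        System.out.print(bundle(parts));
    }
}
```

A plain delimiter is used rather than wrapping the parts in a new XML root, since each generated XML file may carry its own XML declaration and nesting them verbatim would not give a well-formed document.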