Running 'createTrainingHeader' via grobid-service #200

Closed
dominic-sps opened this issue Jul 10, 2017 · 9 comments

Comments

@dominic-sps

dominic-sps commented Jul 10, 2017

This is not an issue but a feature missing from the grobid-service module. Is there any way I could run "createTrainingHeader" as a service? It could merge the 4 files it currently creates into a single file and return that in the response. I would appreciate any help with this.

@lfoppiano
Collaborator

Hi @dominic-sps, the reason this functionality is not included in grobid-service is that, at the moment, the creation of training data is a separate offline operation from the processing.
What would be the reason/use case for having that directly in the grobid-service?
One solution is that you implement your own service for training data creation and integrate the grobid-core library, in order to have a service for training data generation.

@dominic-sps
Author

One solution is that you implement your own service for training data creation and integrate the grobid-core library, in order to have a service for training data generation.

This is what I am trying now. I have merged the author and affiliation training XMLs and created my own XML. The reason is that I noticed the training XMLs have more details than the output created by 'processHeaderDocument', which actually eats up a lot of content.
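
As an illustration of the merge step described above, here is a minimal sketch that uses only the JDK's XML APIs. It assumes each part is a well-formed XML document; the class name `TrainingXmlMerger` and the `trainingBundle` wrapper element are hypothetical and are not part of GROBID. The plain-text feature file is not handled here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical helper: wraps several XML documents in one root element. */
final class TrainingXmlMerger {

    /** Parses each part and appends its root element under a single wrapper. */
    static String merge(List<String> xmlParts) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            Document merged = builder.newDocument();
            // "trainingBundle" is an assumed wrapper name, not a TEI element.
            Element root = merged.createElementNS(null, "trainingBundle");
            merged.appendChild(root);

            for (String part : xmlParts) {
                Document doc = builder.parse(new InputSource(new StringReader(part)));
                // importNode copies the subtree into the merged document.
                root.appendChild(merged.importNode(doc.getDocumentElement(), true));
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(merged), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not merge XML parts", e);
        }
    }

    public static void main(String[] args) {
        String authors = "<persName><forename>Rui</forename><surname>Wang</surname></persName>";
        String affiliation = "<affiliation><orgName>Example University</orgName></affiliation>";
        System.out.println(merge(List.of(authors, affiliation)));
    }
}
```

Parsing each part instead of concatenating strings avoids ending up with several XML declarations or root elements in one file.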

@lfoppiano
Collaborator

OK, what do you mean by the training XML having more details than the output created by processHeaderDocument?

The training XMLs (which are produced together with feature list text files) are meant to be manually corrected by expert users.

Could you also be more specific about what you intend to do with the XMLs? I'm asking because I'm not sure I've understood what you are trying to achieve :-)

@kermitt2
Owner

Hi Dominic,

createTrainingHeader uses the segmentation model to identify the header zone, while the current processHeaderDocument does not use it yet. This explains why you currently see more material with createTrainingHeader. Using createTrainingHeader is of course not a reliable solution, because its purpose is not to produce usable results, simply to produce usable training data.

Making the header model up-to-date with the segmentation model is a work in progress, see issue #136. It takes time because it requires refreshing the current training data.

So having createTrainingHeader available as a service for the reason you mention does not make a lot of sense; we just need to improve the normal processHeaderDocument, as planned and in progress.

@kermitt2
Owner

@dominic-sps if you want to have the same material in processHeaderDocument as with createTrainingHeader, you can use the segmentation model in processHeaderDocument by modifying the file GrobidRestProcessFiles.java (under grobid-service/src/main/java/org/grobid/service/process) as follows:

At lines 81-82 and 88-89, change this:

retVal = engine.processHeader(originFile.getAbsolutePath(), consolidate, null);
//retVal = engine.segmentAndProcessHeader(originFile, consolidate, null);

into this:

//retVal = engine.processHeader(originFile.getAbsolutePath(), consolidate, null);
retVal = engine.segmentAndProcessHeader(originFile, consolidate, null);

It should work... but given the current training data for the header model, the accuracy of the header model based on areas identified by the segmentation model is ~3% lower than with heuristics-based identification of the header area (which is why I have not yet switched to the new approach for header structuring). This ~3% figure comes from the end-to-end evaluation on 1943 PDF files from PubMedCentral.
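
To compare both approaches without commenting lines in and out, the choice could be made at runtime. The following is only a sketch under stated assumptions: `HeaderEngine` is a hypothetical stand-in that mirrors the two calls quoted above (it is not the real GROBID engine class, whose parameter types and declared exceptions may differ), and `useSegmentation` is an assumed flag.

```java
import java.io.File;

/** Hypothetical stand-in for the two engine calls quoted in this thread. */
interface HeaderEngine {
    String processHeader(String pdfPath, boolean consolidate, Object result);
    String segmentAndProcessHeader(File pdf, boolean consolidate, Object result);
}

final class HeaderStrategySwitch {

    /** Chooses the header extraction path from a flag instead of a code edit. */
    static String extractHeader(HeaderEngine engine, File pdf,
                                boolean consolidate, boolean useSegmentation) {
        if (useSegmentation) {
            // Header zone identified by the segmentation model.
            return engine.segmentAndProcessHeader(pdf, consolidate, null);
        }
        // Header zone identified by heuristics (the default in the thread).
        return engine.processHeader(pdf.getAbsolutePath(), consolidate, null);
    }

    public static void main(String[] args) {
        // Stub engine so the sketch runs without GROBID on the classpath.
        HeaderEngine stub = new HeaderEngine() {
            @Override
            public String processHeader(String pdfPath, boolean consolidate, Object result) {
                return "heuristics";
            }

            @Override
            public String segmentAndProcessHeader(File pdf, boolean consolidate, Object result) {
                return "segmentation";
            }
        };
        File pdf = new File("Wang_paperAVE2008.pdf");
        System.out.println(extractHeader(stub, pdf, false, true));
        System.out.println(extractHeader(stub, pdf, false, false));
    }
}
```

Keeping both paths behind one flag makes it easier to run the same PDF through each and compare the outputs.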

@dominic-sps
Author

@kermitt2, it does load the segmentation model after the above change, but the expected tag is still missing from the output.

@kermitt2
Owner

Could you maybe send me an example where createTrainingHeader provides more info than processHeaderDocument after the above-mentioned change, so that I can recreate the issue?

@dominic-sps
Author

  1. Running createTrainingHeader on file Wang_paperAVE2008.pdf

Screen messages

D:\grobid>java -Xmx1024m -jar \grobid\grobid-core\target\grobid-core-0.4.2-SNAPSHOT.one-jar.jar -gH \grobid\grobid-home -gP \grobid\grobid-home\config\grobid.properties -dIn \test\in -dOut \test\out -exe createTrainingHeader
JarClassLoader: Warning: org/apache/lucene/analysis/cn/smart/hhmm/SegTokenFilter.class in lib/lucene-analyzers-smartcn-4.5.1.jar is hidden by lib/wipo-analysers-0.0.1.jar (with different bytecode)
JarClassLoader: Warning: org/w3c/dom/UserDataHandler.class in lib/xom-1.2.5.jar is hidden by lib/xml-apis-1.4.01.jar (with different bytecode)
Wang_paperAVE2008.pdf
1 files to be processed.
[Wapiti] Loading model: "D:\grobid\grobid-home\models\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\header\model.wapiti
[Wapiti] Loading model: "D:\grobid\grobid-home\models\segmentation\model.wapiti"
Model path: D:\grobid\grobid-home\models\segmentation\model.wapiti
[Wapiti] Loading model: "D:\grobid\grobid-home\models\affiliation-address\model.wapiti"
Model path: D:\grobid\grobid-home\models\affiliation-address\model.wapiti
[Wapiti] Loading model: "D:\grobid\grobid-home\models\name\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\name\header\model.wapiti
[Wapiti] Loading model: "D:\grobid\grobid-home\models\name\citation\model.wapiti"
Model path: D:\grobid\grobid-home\models\name\citation\model.wapiti

See the person name tag in Wang_paperAVE2008.authors.tei.xml.txt:

<persName>
	<forename>Rui</forename>
	 <surname>Wang</surname>
	 <suffix>Jr</suffix>
	 <marker>1</marker> and
</persName>

Files Created (renamed as txt)
Wang_paperAVE2008.affiliation.tei.xml.txt
Wang_paperAVE2008.authors.tei.xml.txt
Wang_paperAVE2008.header.tei.xml.txt
Wang_paperAVE2008.header.txt

  2. Running the processHeaderDocument service. After making your suggested code change, I recompiled the service and ran:

curl --form input=@/test/in/Wang_paperAVE2008.pdf -H "Content-Type: multipart/form-data" localhost:8081/processHeaderDocument > output.txt

Server side screen messages

[DEBUG] org.grobid.service.process.GrobidRestProcessFiles: >> org.grobid.service.process.GrobidRestProcessFiles.methodLogIn
[DEBUG] org.grobid.core.utilities.IOUtilities: >> set origin document for stateless service'...
[DEBUG] org.grobid.core.factory.GrobidPoolingFactory: synchronized newPoolInstance
[INFO ] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in poolactive/max: 1/10
[DEBUG] org.grobid.core.utilities.LanguageUtilities: synchronized getNewInstance
[DEBUG] org.grobid.core.analyzers.GrobidAnalyzer: Get new instance of GrobidAnalyzer
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\header\model.wapiti (size: 36094028)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\header\model.wapiti
[DEBUG] org.grobid.core.document.DocumentSource: start pdf2xml
[DEBUG] org.grobid.core.document.DocumentSource: Executing: [D:\grobid\grobid-home\pdf2xml\win-64\pdftoxml_server, -blocks, -noImageInline, -fullFontName, -noImage, -annots, D:\grobid\grobid-home\tmp\origin6806815209628133159.pdf, D:\grobid\grobid-home\tmp\3xStPr6RB5.lxml]
[DEBUG] org.grobid.core.document.DocumentSource: pdf2xml process finished. Timeto process:170ms
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\segmentation\model.wapiti (size: 15755692)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\segmentation\model.wapiti"
Model path: D:\grobid\grobid-home\models\segmentation\model.wapiti
[DEBUG] org.grobid.core.lang.impl.CybozuLanguageDetectorFactory: synchronized getNewInstance
[DEBUG] org.grobid.core.lang.impl.CybozuLanguageDetector: [en:0.9999987071320096]
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\name\header\model.wapiti (size: 2055704)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\name\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\name\header\model.wapiti
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\name\citation\model.wapiti (size: 339957)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\name\citation\model.wapiti"
Model path: D:\grobid\grobid-home\models\name\citation\model.wapiti
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\affiliation-address\model.wapiti (size: 2646298)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\affiliation-address\model.wapiti"
Model path: D:\grobid\grobid-home\models\affiliation-address\model.wapiti
[DEBUG] org.grobid.core.utilities.GrobidProperties: loading GROBID_HOME path
[DEBUG] org.grobid.core.lang.impl.CybozuLanguageDetector: [en:0.999998180987098]
[DEBUG] org.grobid.core.utilities.IOUtilities: Removing D:\grobid\grobid-home\tmp\origin6806815209628133159.pdf
[DEBUG] org.grobid.service.process.GrobidRestProcessFiles: << org.grobid.service.process.GrobidRestProcessFiles.methodLogOut

Output file:
output.txt

See the person name tag in the 22.txt file:
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Rui</forename><surname>Wang</surname></persName>
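
The difference between the two `persName` elements above can also be checked programmatically. This is a minimal sketch that uses only the JDK's DOM parser; the class name `PersNameDiff` and its methods are hypothetical helpers, not part of GROBID, and the comparison ignores namespaces and attributes on purpose.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: lists child elements present in one snippet only. */
final class PersNameDiff {

    /** Returns the local names of the root's direct child elements, in order. */
    static List<String> childElementNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> names = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(child.getLocalName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML snippet", e);
        }
    }

    /** Elements found in the training output but not in the service output. */
    static List<String> missingFromService(String trainingXml, String serviceXml) {
        List<String> missing = new ArrayList<>(childElementNames(trainingXml));
        missing.removeAll(childElementNames(serviceXml));
        return missing;
    }

    public static void main(String[] args) {
        String training = "<persName><forename>Rui</forename> <surname>Wang</surname>"
                + " <suffix>Jr</suffix> <marker>1</marker> and</persName>";
        String service = "<persName xmlns=\"http://www.tei-c.org/ns/1.0\">"
                + "<forename type=\"first\">Rui</forename><surname>Wang</surname></persName>";
        System.out.println(missingFromService(training, service)); // prints [suffix, marker]
    }
}
```

For the two snippets in this comment, the check reports `suffix` and `marker` as absent from the service output.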

@kermitt2
Owner

Many thanks !

There was a bug in the way name suffixes were set (nothing to do with the model or training data). It is fixed and works with your example as of commit b738f1f.
