Running 'createTrainingHeader' via grobid-service #200

dominic-sps · 2017-07-10T07:26:22Z

This is not an issue but a feature that is missing from the grobid-service module. Is there any way I could run "createTrainingHeader" as a service? I would appreciate any help with this. Ideally, it would merge the 4 files it currently creates into a single file and return that in the response.

lfoppiano · 2017-07-14T14:38:36Z

Hi @dominic-sps, this functionality is not included in grobid-service because, at the moment, the creation of training data is an offline operation, separate from processing.
What would be the reason/use case for having that directly in the grobid-service?
One solution is that you implement your own service for training data creation and integrate the grobid-core library, in order to have a service for training data generation.

dominic-sps · 2017-07-15T15:36:47Z

> One solution is that you implement your own service for training data creation and integrate the grobid-core library, in order to have a service for training data generation.

This is what I am trying now. I have merged the author and affiliation training XMLs and created my own XML. The reason is that I noticed the training XMLs have more details than the output created by 'processHeaderDocument', which actually eats up a lot of content.
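
For illustration, the merge step described here could be sketched as below. This is only a minimal sketch, not part of GROBID: the class name `TrainingFileBundler`, the `=====` marker lines, and the command-line layout are all assumptions made for the example. It concatenates the files as plain text rather than building a single XML document, because the generated files include a non-XML feature file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a GROBID class): bundles several training files into one text file.
class TrainingFileBundler {

    /** Joins the contents in map order, each preceded by a marker line naming the file. */
    static String bundle(Map<String, String> contentsByName) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : contentsByName.entrySet()) {
            out.append("===== ").append(entry.getKey()).append(" =====\n");
            out.append(entry.getValue());
            // Make sure the next marker always starts on its own line.
            if (!entry.getValue().endsWith("\n")) {
                out.append('\n');
            }
        }
        return out.toString();
    }

    /** Usage (assumed layout): TrainingFileBundler <output> <input>... */
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: TrainingFileBundler <output> <input>...");
            return;
        }
        Map<String, String> contents = new LinkedHashMap<>();
        for (int i = 1; i < args.length; i++) {
            Path file = Paths.get(args[i]);
            contents.put(file.getFileName().toString(),
                    new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        }
        Files.write(Paths.get(args[0]), bundle(contents).getBytes(StandardCharsets.UTF_8));
    }
}
```

A service wrapper would still have to run the training data generation itself through grobid-core; the sketch covers only the final concatenation.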

lfoppiano · 2017-07-15T17:10:34Z

OK, what do you mean by the training XML having more details than the output created by processHeaderDocument?

The training XMLs (which are produced together with feature list text files) are meant to be manually corrected by expert users.

Could you also be more specific about what you intend to do with the XMLs? I'm asking because I'm not sure I've understood what you are trying to achieve :-)

kermitt2 · 2017-07-16T11:20:10Z

Hi Dominic,

createTrainingHeader uses the segmentation model for identifying the header zone, while the current processHeaderDocument does not use it yet. This explains why you currently see more material with createTrainingHeader. Of course, using createTrainingHeader is not a reliable solution, because its purpose is not to produce usable results, simply to produce usable training data.

Making the header model up-to-date with the segmentation model is a work in progress, see issue #136. It takes time because it requires refreshing the current training data.

So making createTrainingHeader available as a service for the reason you mention does not make a lot of sense; we just need to improve the normal processHeaderDocument, as planned and in progress.

kermitt2 · 2017-07-16T12:25:36Z

@dominic-sps if you want to have the same material in processHeaderDocument as with createTrainingHeader, you can use the segmentation model in processHeaderDocument by modifying the file GrobidRestProcessFiles.java (under grobid-service/src/main/java/org/grobid/service/process) as follows:

At lines 81-82 and 88-89, change this:

retVal = engine.processHeader(originFile.getAbsolutePath(), consolidate, null);
//retVal = engine.segmentAndProcessHeader(originFile, consolidate, null);

into this:

//retVal = engine.processHeader(originFile.getAbsolutePath(), consolidate, null);
retVal = engine.segmentAndProcessHeader(originFile, consolidate, null);

It should work... but given the current training data for the header model, the accuracy of the header model on areas identified by the segmentation model is ~3% lower than with heuristics-based identification of the header area (which is why I have not yet switched to the new approach for header structuring). This ~3% comes from the end-to-end evaluation with 1943 PDF files from PubMed Central.

dominic-sps · 2017-07-17T06:00:20Z

@kermitt2, it does load the segmentation model after the above change, but the expected tag is still missing in the output.

kermitt2 · 2017-07-17T06:42:19Z

Could you send me an example where createTrainingHeader provides more info than processHeaderDocument after the above-mentioned change, so that I can reproduce the issue?

dominic-sps · 2017-07-17T07:10:55Z

Running createTrainingHeader on file Wang_paperAVE2008.pdf

Screen messages

D:\grobid>java -Xmx1024m -jar \grobid\grobid-core\target\grobid-core-0.4.2-SNAPSHOT.one-jar.jar -gH \grobid\grobid-home -gP \grobid\grobid-home\config\grobid.properties -dIn \test\in -dOut \test\out -exe createTrainingHeader
JarClassLoader: Warning: org/apache/lucene/analysis/cn/smart/hhmm/SegTokenFilter.class in lib/lucene-analyzers-smartcn-4.5.1.jar is hidden by lib/wipo-analysers-0.0.1.jar (with different bytecode)
JarClassLoader: Warning: org/w3c/dom/UserDataHandler.class in lib/xom-1.2.5.jar is hidden by lib/xml-apis-1.4.01.jar (with different bytecode)
Wang_paperAVE2008.pdf
1 files to be processed.
[Wapiti] Loading model: "D:\grobid\grobid-home\models\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\header\model.wapiti
[Wapiti] Loading model: "D:\grobid\grobid-home\models\segmentation\model.wapiti"
Model path: D:\grobid\grobid-home\models\segmentation\model.wapiti
[Wapiti] Loading model: "D:\grobid\grobid-home\models\affiliation-address\model.wapiti"
Model path: D:\grobid\grobid-home\models\affiliation-address\model.wapiti
[Wapiti] Loading model: "D:\grobid\grobid-home\models\name\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\name\header\model.wapiti
[Wapiti] Loading model: "D:\grobid\grobid-home\models\name\citation\model.wapiti"
Model path: D:\grobid\grobid-home\models\name\citation\model.wapiti

See the person name tag in Wang_paperAVE2008.authors.tei.xml.txt:

<persName>
	<forename>Rui</forename>
	 <surname>Wang</surname>
	 <suffix>Jr</suffix>
	 <marker>1</marker> and
</persName>

Files Created (renamed as txt)
Wang_paperAVE2008.affiliation.tei.xml.txt
Wang_paperAVE2008.authors.tei.xml.txt
Wang_paperAVE2008.header.tei.xml.txt
Wang_paperAVE2008.header.txt

Running the processHeaderDocument service. After making your suggested code change, I compiled the service and ran:

curl --form input=@/test/in/Wang_paperAVE2008.pdf -H "Content-Type: multipart/form-data" localhost:8081/processHeaderDocument > output.txt

Server side screen messages

[DEBUG] org.grobid.service.process.GrobidRestProcessFiles: >> org.grobid.service.process.GrobidRestProcessFiles.methodLogIn
[DEBUG] org.grobid.core.utilities.IOUtilities: >> set origin document for stateless service'...
[DEBUG] org.grobid.core.factory.GrobidPoolingFactory: synchronized newPoolInstance
[INFO ] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in poolactive/max: 1/10
[DEBUG] org.grobid.core.utilities.LanguageUtilities: synchronized getNewInstance
[DEBUG] org.grobid.core.analyzers.GrobidAnalyzer: Get new instance of GrobidAnalyzer
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\header\model.wapiti (size: 36094028)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\header\model.wapiti
[DEBUG] org.grobid.core.document.DocumentSource: start pdf2xml
[DEBUG] org.grobid.core.document.DocumentSource: Executing: [D:\grobid\grobid-home\pdf2xml\win-64\pdftoxml_server, -blocks, -noImageInline, -fullFontName, -noImage, -annots, D:\grobid\grobid-home\tmp\origin6806815209628133159.pdf, D:\grobid\grobid-home\tmp\3xStPr6RB5.lxml]
[DEBUG] org.grobid.core.document.DocumentSource: pdf2xml process finished. Timeto process:170ms
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\segmentation\model.wapiti (size: 15755692)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\segmentation\model.wapiti"
Model path: D:\grobid\grobid-home\models\segmentation\model.wapiti
[DEBUG] org.grobid.core.lang.impl.CybozuLanguageDetectorFactory: synchronized getNewInstance
[DEBUG] org.grobid.core.lang.impl.CybozuLanguageDetector: [en:0.9999987071320096]
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\name\header\model.wapiti (size: 2055704)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\name\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\name\header\model.wapiti
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\name\citation\model.wapiti (size: 339957)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\name\citation\model.wapiti"
Model path: D:\grobid\grobid-home\models\name\citation\model.wapiti
[INFO ] org.grobid.core.jni.WapitiModel: Loading model: D:\grobid\grobid-home\models\affiliation-address\model.wapiti (size: 2646298)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\affiliation-address\model.wapiti"
Model path: D:\grobid\grobid-home\models\affiliation-address\model.wapiti
[DEBUG] org.grobid.core.utilities.GrobidProperties: loading GROBID_HOME path
[DEBUG] org.grobid.core.lang.impl.CybozuLanguageDetector: [en:0.999998180987098]
[DEBUG] org.grobid.core.utilities.IOUtilities: Removing D:\grobid\grobid-home\tmp\origin6806815209628133159.pdf
[DEBUG] org.grobid.service.process.GrobidRestProcessFiles: << org.grobid.service.process.GrobidRestProcessFiles.methodLogOut

Output file:
output.txt

See the person name tag in the 22.txt file:
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Rui</forename><surname>Wang</surname></persName>
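
The difference between the two fragments above can also be checked mechanically. The following is a minimal sketch under stated assumptions: the class `PersNameDiff` and its methods are hypothetical helpers written for this example, not part of GROBID, and the comparison only looks at the names of the direct child elements, ignoring namespaces, attributes and text.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not a GROBID class): compares the child elements of two XML fragments.
class PersNameDiff {

    /** Collects the local names of the direct child elements of the root element. */
    static Set<String> childElementNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness lets getLocalName() ignore the TEI namespace.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Set<String> names = new LinkedHashSet<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(child.getLocalName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML fragment", e);
        }
    }

    /** Returns element names found in the training fragment but not in the service fragment. */
    static Set<String> missingInService(String trainingXml, String serviceXml) {
        Set<String> missing = childElementNames(trainingXml);
        missing.removeAll(childElementNames(serviceXml));
        return missing;
    }

    public static void main(String[] args) {
        String training = "<persName><forename>Rui</forename> <surname>Wang</surname> "
                + "<suffix>Jr</suffix> <marker>1</marker> and</persName>";
        String service = "<persName xmlns=\"http://www.tei-c.org/ns/1.0\">"
                + "<forename type=\"first\">Rui</forename><surname>Wang</surname></persName>";
        // Prints the names of elements that appear only in the training fragment.
        System.out.println(missingInService(training, service));
    }
}
```

A name-only comparison like this says nothing about whether a missing element is expected in the final output, so each reported name still needs to be judged by hand.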

kermitt2 · 2017-07-17T08:02:42Z

Many thanks !

There was a bug in the way name suffixes were set (nothing to do with the model or training data). It is fixed and works with your example after commit b738f1f.

kermitt2 mentioned this issue Jul 16, 2017

Handling Jr., Sr. in names (Affiliation and Citation) #196

Open

dominic-sps closed this as completed Jul 21, 2017
