
Commit

update doc, cleaning, support python env>3.9
kermitt2 committed Jun 16, 2024
1 parent b6a2a20 commit 4675511
Showing 18 changed files with 61 additions and 100 deletions.
2 changes: 1 addition & 1 deletion Readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,10 +105,10 @@ Detailed end-to-end [benchmarking](https://grobid.readthedocs.io/en/latest/Bench
A series of additional modules have been developed for performing __structure aware__ text mining directly on scholar PDF, reusing GROBID's PDF processing and sequence labelling weaponry:

- [software-mention](https://github.com/ourresearch/software-mentions): recognition of software mentions and associated attributes in scientific literature
- [datastet](https://github.com/kermitt2/datastet): identification of sections and sentences introducing datasets in a scientific article, identification of dataset names and attributes (implict and named datasets) and classification of the type of datasets
- [grobid-quantities](https://github.com/kermitt2/grobid-quantities): recognition and normalization of physical quantities/measurements
- [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors): recognition of superconductor material and properties in scientific literature
- [entity-fishing](https://github.com/kermitt2/entity-fishing), a tool for extracting Wikidata entities from text and document, which can also use Grobid to pre-process scientific articles in PDF, leading to more precise and relevant entity extraction and the capacity to annotate the PDF with interactive layout
- [datastet](https://github.com/kermitt2/datastet): identification of sections and sentences introducing datasets in a scientific article, identification of dataset names (implict and named datasets) and classification of the type of these datasets
- [grobid-ner](https://github.com/kermitt2/grobid-ner): named entity recognition
- [grobid-astro](https://github.com/kermitt2/grobid-astro): recognition of astronomical entities in scientific papers
- [grobid-bio](https://github.com/kermitt2/grobid-bio): a toy bio-entity tagger using BioNLP/NLPBA 2004 dataset
2 changes: 1 addition & 1 deletion doc/Deep-Learning-models.md
Expand Up @@ -20,7 +20,7 @@ Current neural models can be up to 50 times slower than CRF, depending on the ar

By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](https://grobid.readthedocs.io/en/latest/Configuration/#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing.

For current GROBID version 0.8.0, we recommend considering the usage of the following Deep Learning models:
For current GROBID version 0.8.1, we recommend considering the usage of the following Deep Learning models:

- `citation` model: for bibliographical parsing, the `BidLSTM_CRF_FEATURES` architecture provides currently the best accuracy, significantly better than CRF (+3 to +5 points in F1-Score). With a GPU, there is normally no runtime impact by selecting this model. SciBERT fine-tuned model performs currently at lower accuracy.

2 changes: 1 addition & 1 deletion doc/Frequently-asked-questions.md
Expand Up @@ -56,7 +56,7 @@ In addition, consider more RAM memory when running Deep Learning model on CPU, e
You will get the embedded images converted into `.png` by using the normal batch command. For instance:

```console
java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn ~/test/in/ -dOut ~/test/out -exe processFullText
java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn ~/test/in/ -dOut ~/test/out -exe processFullText
```

There is a web service doing the same, returning everything in a big zip file, `processFulltextAssetDocument`, still usable but deprecated.
30 changes: 15 additions & 15 deletions doc/Grobid-batch.md
Expand Up @@ -20,7 +20,7 @@ The following command display some help for the batch commands:

Be sure to replace `<current version>` with the current version of GROBID that you have installed and built. For example:
```bash
> java -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -h
> java -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -h
```
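
Because every batch example repeats the version inside the jar path, one way to avoid editing each command on a version bump is to keep the version in a single shell variable. The following is a minimal sketch under that assumption; `GROBID_VERSION`, `GROBID_JAR` and `GROBID_LIBS` are illustrative names chosen here, not variables that GROBID itself reads.

```bash
#!/usr/bin/env bash
# Assumption: these variable names are hypothetical conveniences for the shell
# session; GROBID does not read them.
GROBID_VERSION="0.8.1"
GROBID_JAR="grobid-core/build/libs/grobid-core-${GROBID_VERSION}-onejar.jar"
GROBID_LIBS="grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep"

# Print the help command built from the variables.
# Remove the leading `echo` to actually run it from the GROBID root directory.
echo java -Djava.library.path="${GROBID_LIBS}" -jar "${GROBID_JAR}" -h
```

The same variables can then be reused in the other batch commands, so that only `GROBID_VERSION` changes between releases.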

The available batch commands are listed below. For those commands, at least `-Xmx1G` is used to set the JVM memory to avoid *OutOfMemoryException*, given the current size of the Grobid models and the craziness of some PDFs. For complete fulltext processing, which involves all the GROBID models, `-Xmx4G` is recommended (although allocating less memory is usually fine).
Expand All @@ -42,7 +42,7 @@ The needed parameters for that command are:

Example:
```bash
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader
```

WARNING: the expected extension of the PDF files to be processed is .pdf
Expand All @@ -68,7 +68,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText
> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText
```

WARNING: the expected extension of the PDF files to be processed is .pdf
Expand All @@ -82,7 +82,7 @@ WARNING: the expected extension of the PDF files to be processed is .pdf

Example:
```bash
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format"
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -exe processDate -s "some date to extract and format"
```

### processAuthorsHeader
Expand All @@ -94,7 +94,7 @@ Example:

Example:
```bash
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors"
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -exe processAuthorsHeader -s "some authors"
```

### processAuthorsCitation
Expand All @@ -106,7 +106,7 @@ Example:

Example:
```bash
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors"
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -exe processAuthorsCitation -s "some authors"
```

### processAffiliation
Expand All @@ -118,7 +118,7 @@ Example:

Example:
```bash
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation"
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -exe processAffiliation -s "some affiliation"
```

### processRawReference
Expand All @@ -130,7 +130,7 @@ Example:

Example:
```bash
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string"
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -exe processRawReference -s "a reference string"
```

### processReferences
Expand All @@ -146,7 +146,7 @@ Example:

Example:
```bash
> java -Xmx2G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences
> java -Xmx2G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences
```

WARNING: the expected extension of the PDF files to be processed is `.pdf`
Expand All @@ -162,7 +162,7 @@ WARNING: the expected extension of the PDF files to be processed is `.pdf`

Example:
```bash
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36
```

WARNING: extension of the ST.36 files to be processed must be `.xml`
Expand All @@ -178,7 +178,7 @@ WARNING: extension of the ST.36 files to be processed must be `.xml`

Example:
```
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT
```

WARNING: extension of the text files to be processed must be `.txt`, and expected encoding is `UTF-8`
Expand All @@ -194,7 +194,7 @@ WARNING: extension of the text files to be processed must be `.txt`, and expecte

Example:
```
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF
> java -Xmx1G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF
```

WARNING: extension of the PDF files to be processed must be `.pdf`
Expand All @@ -210,7 +210,7 @@ WARNING: extension of the text files to be processed must be `.pdf`

Example:
```bash
> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining
> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining
```

WARNING: the expected extension of the PDF files to be processed is `.pdf`
Expand All @@ -226,7 +226,7 @@ WARNING: the expected extension of the PDF files to be processed is `.pdf`

Example:
```bash
> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank
> java -Xmx4G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTrainingBlank
```

WARNING: the expected extension of the PDF files to be processed is `.pdf`
Expand All @@ -244,7 +244,7 @@ The needed parameters for that command are:

Example:
```bash
> java -Xmx2G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.0-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation
> java -Xmx2G -Djava.library.path=grobid-home/lib/lin-64:grobid-home/lib/lin-64/jep -jar grobid-core/build/libs/grobid-core-0.8.1-onejar.jar -gH grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation
```

WARNING: extension of the PDF files to be processed must be `.pdf`
28 changes: 14 additions & 14 deletions doc/Grobid-docker.md
Expand Up @@ -26,13 +26,13 @@ The process for retrieving and running the image is as follow:
Current latest version:

```bash
> docker pull grobid/grobid:0.8.0
> docker pull grobid/grobid:0.8.1
```

- Run the container:

```bash
> docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0
> docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.1
```

The image will automatically use the GPU and CUDA version available on your host machine, but only on Linux. GPU usage via a container on Windows and macOS machines is currently not supported by Docker. If no GPU is available, the CPU will be used.
Expand Down Expand Up @@ -88,7 +88,7 @@ The process for retrieving and running the image is as follow:
Latest version:

```bash
> docker pull lfoppiano/grobid:0.8.0
> docker pull lfoppiano/grobid:0.8.1
```

- Run the container:
Expand All @@ -100,7 +100,7 @@ Latest version:
Latest version:

```bash
> docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
> docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.1
```

Note the default version is running on port `8070`, however it can be mapped on the more traditional port `8080` of your host with the following command:
Expand All @@ -121,7 +121,7 @@ Grobid web services are then available as described in the [service documentatio
The simplest way to pass a modified configuration to the docker image is to mount the yaml GROBID config file `grobid.yaml` when running the image. Modify the config file `grobid/grobid-home/config/grobid.yaml` according to your requirements on the host machine and mount it when running the image as follow:

```bash
docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.0
docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.1
```

You need to use an absolute path to specify your modified `grobid.yaml` file.
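
As a rough sketch of the absolute-path requirement, the mount source can be built from the current working directory instead of being typed by hand. This assumes the command is launched from the directory that contains the `grobid/` checkout; `CONFIG_REL` and `CONFIG_ABS` are hypothetical variable names introduced only for this illustration.

```bash
#!/usr/bin/env bash
# Assumption: the current directory contains the grobid/ checkout.
# CONFIG_REL and CONFIG_ABS are illustrative names, not GROBID settings.
CONFIG_REL="grobid/grobid-home/config/grobid.yaml"
CONFIG_ABS="$(pwd)/${CONFIG_REL}"

# Print the resulting docker command with the absolute host path mounted read-only.
# Remove the leading `echo` to actually start the container.
echo docker run --rm --gpus all --init --ulimit core=0 \
  -p 8080:8070 -p 8081:8071 \
  -v "${CONFIG_ABS}:/opt/grobid/grobid-home/config/grobid.yaml:ro" \
  grobid/grobid:0.8.1
```

Printing the command first makes it easy to check that the host side of the `-v` mount starts with `/` before launching the container.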
Expand Down Expand Up @@ -222,25 +222,25 @@ Without this requirement, the image might default to CPU, even if GPU are availa
To be able to use both CRF and Deep Learning models, use the dockerfile `./Dockerfile.delft`. The only important information then is the version, which will be checked out from the tags.

```bash
> docker build -t grobid/grobid:0.8.0 --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.delft .
> docker build -t grobid/grobid:0.8.1 --build-arg GROBID_VERSION=0.8.1 --file Dockerfile.delft .
```

Similarly, if you want to create a docker image from the current master, development version:

```bash
docker build -t grobid/grobid:0.8.1-SNAPSHOT --build-arg GROBID_VERSION=0.8.1-SNAPSHOT --file Dockerfile.delft .
docker build -t grobid/grobid:0.8.2-SNAPSHOT --build-arg GROBID_VERSION=0.8.2-SNAPSHOT --file Dockerfile.delft .
```

In order to run the container of the newly created image, for example for the development version `0.8.1-SNAPSHOT`, using all GPU available:
In order to run the container of the newly created image, for example for the development version `0.8.2-SNAPSHOT`, using all GPU available:

```bash
> docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 grobid/grobid:0.8.1-SNAPSHOT
> docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 grobid/grobid:0.8.2-SNAPSHOT
```

In practice, you need to indicate which models should use a Deep Learning model implementation and which ones can remain with a faster CRF model implementation, which is done currently in the `grobid.yaml` file. Modify the config file `grobid/grobid-home/config/grobid.yaml` accordingly on the host machine and mount it when running the image as follow:

```bash
docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.1-SNAPSHOT
docker run --rm --gpus all --init --ulimit core=0 -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.2-SNAPSHOT
```

You need to use an absolute path to specify your modified `grobid.yaml` file.
Expand All @@ -262,19 +262,19 @@ The container name is given by the command:
For building a CRF-only image, the dockerfile to be used is `./Dockerfile.crf`. The only important information then is the version which will be checked out from the tags.

```bash
> docker build -t grobid/grobid:0.8.0 --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.crf .
> docker build -t grobid/grobid:0.8.1 --build-arg GROBID_VERSION=0.8.1 --file Dockerfile.crf .
```

Similarly, if you want to create a docker image from the current master, development version:

```bash
> docker build -t grobid/grobid:0.8.1-SNAPSHOT --build-arg GROBID_VERSION=0.8.1-SNAPSHOT --file Dockerfile.crf .
> docker build -t grobid/grobid:0.8.2-SNAPSHOT --build-arg GROBID_VERSION=0.8.2-SNAPSHOT --file Dockerfile.crf .
```

In order to run the container of the newly created image, for example for version `0.8.1`:
In order to run the container of the newly created image, for example for version `0.8.2-SNAPSHOT`:

```bash
> docker run --rm --init --ulimit core=0 -p 8080:8070 -p 8081:8071 grobid/grobid:0.8.1
> docker run --rm --init --ulimit core=0 -p 8080:8070 -p 8081:8071 grobid/grobid:0.8.2-SNAPSHOT
```

For testing or debugging purposes, you can connect to the container with a bash shell (logs are under `/opt/grobid/logs/`):