init code

wenwenyu · Jul 18, 2020 · commit c7e98fe
Showing 47 changed files with 4,445 additions and 0 deletions.
diff --git a/.flake8 b/.flake8
@@ -0,0 +1,6 @@
+[flake8]
+ignore = F401, F403
+max-line-length = 120
+exclude =
+    .git,
+    __pycache__,
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,112 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+.hypothesis/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# SageMath parsed files
+*.sage.py
+
+# dotenv
+.env
+
+# virtualenv
+.venv
+venv/
+ENV/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+
+# input data, saved log, checkpoints
+input/
+saved/
+
+# editor, os cache directory
+.vscode/
+.idea/
+__MACOSX/
+/data/train_tickets/
+/config_train_ticket.json
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2020 Wenwen Yu
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,171 @@
+# PICK-PyTorch
+PyTorch reimplementation of ["PICK: Processing Key Information Extraction from Documents using Improved Graph 
+Learning-Convolutional Networks"](https://arxiv.org/abs/2004.07464) (ICPR 2020). This project is different from 
+our original implementation.
+
+<!-- @import "[TOC]" {cmd="toc" depthFrom=1 depthTo=6 orderedList=false} -->
+
+<!-- code_chunk_output -->
+
+* Contents
+    * [Introduction](#introduction)
+	* [Requirements](#requirements)
+	* [Usage](#usage)
+		* [Training with config files](#training-with-config-files)
+		* [Using Multiple GPU](#using-multiple-gpu)
+		* [Resuming from checkpoints](#resuming-from-checkpoints)
+		* [Testing from checkpoints](#testing-from-checkpoints)
+	* [Customization](#customization)
+	    * [Training custom datasets](#training-custom-datasets)
+		* [Checkpoints](#checkpoints)
+        * [Tensorboard Visualization](#tensorboard-visualization)
+	* [TODOs](#todos)
+    * [Citations](#citations)
+	* [License](#license)
+	* [Acknowledgements](#acknowledgements)
+
+<!-- /code_chunk_output -->
+
+## Introduction
+PICK is a framework for Key Information Extraction (KIE) that handles complex document layouts effectively and robustly. It
+combines graph learning with graph convolution, yielding a richer semantic representation
+that contains textual and visual features as well as the global layout, without ambiguity. The overall architecture is shown below.
+
+![Overall](assets/overall.png)
+
+## Requirements
+* python = 3.6 
+* torchvision = 0.6.1
+* tabulate = 0.8.7
+* overrides = 3.0.0
+* opencv_python = 4.3.0.36
+* numpy = 1.16.4
+* pandas = 1.0.5
+* allennlp = 1.0.0
+* torchtext = 0.6.0
+* tqdm = 4.47.0
+* torch = 1.5.1
+```bash
+pip install -r requirements.txt
+```
+
+## Usage
+
+### Training with config files
+Modify the configuration in the `config.json` file, then run:
+
+  ```
+  python train.py --config config.json
+  ```
+
+### Using Multiple GPU
+You can enable single-node multi-GPU training by setting the `n_gpu` argument in the config file to a larger number.
+If it is configured to use fewer GPUs than are available, the first n devices are used by default.
+To select specific GPUs by index, pass `--device` or set the `CUDA_VISIBLE_DEVICES` environment variable.
+  ```
+  python train.py --device 2,3 -c config.json
+  ```
+  This is equivalent to
+  ```
+  CUDA_VISIBLE_DEVICES=2,3 python train.py -c config.json
+  ```
+
+### Resuming from checkpoints
+You can resume from a previously saved checkpoint by:
+
+  ```
+  python train.py --resume path/to/checkpoint -d 2,3
+  ```
+
+### Testing from checkpoints
+You can test from a previously saved checkpoint by:
+
+  ```
+  python test.py --checkpoint path/to/checkpoint --boxes_transcripts path/to/boxes_transcripts \
+                 --images_path path/to/images_path --output_folder path/to/output_folder \
+                 --gpu 0 --batch_size 2
+  ```
+
+## Customization
+
+### Training custom datasets
+You can train on your own datasets by following the steps outlined below.
+1. Prepare files in the same format as those provided in the `data` folder.
+2. Modify the `train_dataset` and `validation_dataset` args in the `config.json` file, including `files_name`,
+`images_folder`, `boxes_and_transcripts_folder`, `entities_folder`, `iob_tagging_type` and `resized_image_size`.
+3. Modify `Entities_list` in the `utils/entities_list.py` file according to the entity types of your dataset.
+4. Modify `MAX_BOXES_NUM` and `MAX_TRANSCRIPT_LEN` in the `data_utils/documents.py` file. (Optional)
+
+**Note**: The self-built datasets used in our paper cannot be shared because of patient privacy and proprietary issues.
+
+### Checkpoints
+You can specify the name of the training session in the `config.json` file:
+  ```json
+  "name": "PICK_Default",
+  "run_id": "test"
+  ```
+
+The checkpoints will be saved in `save_dir/name/run_id_timestamp/checkpoint_epoch_n`, with timestamp in mmdd_HHMMSS format.
+
+A copy of the `config.json` file will be saved in the same folder.
+
+**Note**: checkpoints contain:
+  ```python
+  {
+    'arch': arch,
+    'epoch': epoch,
+    'state_dict': self.model.state_dict(),
+    'optimizer': self.optimizer.state_dict(),
+    'monitor_best': self.monitor_best,
+    'config': self.config
+  }
+  ```
+
+### Tensorboard Visualization
+This project supports Tensorboard visualization by using either  `torch.utils.tensorboard` or [TensorboardX](https://github.com/lanpa/tensorboardX).
+
+1. **Install**
+
+    If you are using PyTorch 1.1 or higher, install TensorBoard with `pip install 'tensorboard>=1.14.0'`.
+
+    Otherwise, install TensorboardX by following the installation guide in [TensorboardX](https://github.com/lanpa/tensorboardX).
+
+2. **Run training** 
+
+    Make sure that `tensorboard` option in the config file is turned on.
+
+    ```
+     "tensorboard" : true
+    ```
+
+3. **Open Tensorboard server** 
+
+    Run `tensorboard --logdir saved/log/` from the project root; the server will then be available at `http://localhost:6006`.
+
+By default, loss values are logged. If you need more visualizations, use `add_scalar('tag', data)`, `add_image('tag', image)`, etc. in the `trainer._train_epoch` method.
+The `add_something()` methods in this project are essentially wrappers around those of `tensorboardX.SummaryWriter` and `torch.utils.tensorboard.SummaryWriter`.
+
+**Note**: You don't have to specify the current step, since the `WriterTensorboard` class defined in `logger/visualization.py` tracks it.
+
+## TODOs
+- [ ] Multi-node multi-gpu setup (DistributedDataParallel)
+- [ ] Dataset cache mechanism to speed up training loop
+- [x] One-node multi-gpu setup (DataParallel)
+
+## Citations
+If you find this code useful, please cite our [paper](https://arxiv.org/abs/2004.07464):
+```bibtex
+@inproceedings{Yu2020PICKPK,
+  title={{PICK}: Processing Key Information Extraction from Documents using 
+  Improved Graph Learning-Convolutional Networks},
+  author={Wenwen Yu and Ning Lu and Xianbiao Qi and Ping Gong and Rong Xiao},
+  booktitle={2020 25th International Conference on Pattern Recognition (ICPR)},
+  year={2020}
+}
+```
+
+## License
+This project is licensed under the MIT License. See LICENSE for more details.
+
+## Acknowledgements
+The structure of this project is modeled on the [PyTorch Template Project](https://github.com/victoresque/pytorch-template).
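
A note on the README's Checkpoints section above: the save location `save_dir/name/run_id_timestamp/checkpoint_epoch_n` can be illustrated with a minimal Python sketch. Only the path layout and the mmdd_HHMMSS timestamp format come from the README. The helper name `checkpoint_path`, the `saved` directory, and the fixed example time are assumptions made for illustration and are not taken from the project's code.

```python
from datetime import datetime
from pathlib import Path


def checkpoint_path(save_dir: str, name: str, run_id: str, epoch: int, now: datetime) -> Path:
    """Build a checkpoint path following the layout described in the README.

    Hypothetical helper: the project's own trainer may assemble this differently.
    """
    # Timestamp in mmdd_HHMMSS format, as stated in the README.
    timestamp = now.strftime("%m%d_%H%M%S")
    return Path(save_dir) / name / f"{run_id}_{timestamp}" / f"checkpoint_epoch_{epoch}"


# "saved" is an assumed save_dir; "PICK_Default" and "test" are the README's example values.
example = checkpoint_path("saved", "PICK_Default", "test", 3, datetime(2020, 7, 18, 9, 30, 5))
print(example.as_posix())  # saved/PICK_Default/test_0718_093005/checkpoint_epoch_3
```

Because the timestamp is part of the directory name, two runs with the same `name` and `run_id` should land in separate folders as long as they start at different seconds.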
diff --git a/assets/overall.png b/assets/overall.png