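This project reproduces VST, the method proposed in the paper Visual-Semantic Transformer for Scene Text Recognition.
The paper uses visual features to model the semantic information associated with them. The method consists of five key modules: ConvNet Module (CNN feature extraction), Visual Module (visual modeling), Vsalign Module (Visual-Semantic Alignment), Interaction Module (interaction within and across the two modalities), and Semantic Module (semantic reasoning).
The packages used in this project are listed below: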
torch==1.1.0
torchvision==0.3.0
fastai==1.0.60
LMDB
Pillow
opencv-python
tensorboardX
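
To check whether an environment matches the pinned versions above, a small script along the following lines may help. It is only a sketch: parse_pins, find_mismatches and installed_version are hypothetical helper names that are not part of this repository, and only the three packages that carry an explicit version pin in the list are checked.

```python
# Pinned versions copied from the package list above.
PINNED = """\
torch==1.1.0
torchvision==0.3.0
fastai==1.0.60
"""


def parse_pins(text):
    """Return {package: version} for every 'name==version' line in text."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and unpinned package names.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, lookup):
    """Compare pins against lookup(name), which returns a version or None.

    Returns {package: (wanted, found)} for every package that differs.
    """
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def installed_version(name):
    """Look up an installed version via importlib.metadata (Python 3.8+).

    On older interpreters this sketch cannot query anything and returns None.
    """
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    report = find_mismatches(parse_pins(PINNED), installed_version)
    for name, (wanted, found) in report.items():
        print("%s: expected %s, found %s" % (name, wanted, found or "not installed"))
```

The comparison is an exact string match, so a build that reports a local suffix (for example a CUDA-specific torch wheel) would be flagged even if the base version is the pinned one.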
- Training datasets
  - MJSynth (MJ):
    - Use tools/create_lmdb_dataset.py to convert images into LMDB dataset
    - LMDB dataset BaiduNetdisk(passwd:n23k)
  - SynthText (ST):
    - Use tools/crop_by_word_bb.py to crop images from original SynthText dataset, and convert images into LMDB dataset by tools/create_lmdb_dataset.py
    - LMDB dataset BaiduNetdisk(passwd:n23k)
- Evaluation datasets. The LMDB datasets can be downloaded from BaiduNetdisk(passwd:1dbv), GoogleDrive.
  - ICDAR 2013 (IC13)
  - ICDAR 2015 (IC15)
  - IIIT5K Words (IIIT)
  - Street View Text (SVT)
  - Street View Text-Perspective (SVTP)
  - CUTE80 (CUTE)
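
The datasets above are stored as LMDB databases. As a rough illustration of what an image-to-LMDB conversion produces, the sketch below assumes the key layout commonly used by scene-text LMDB tools (image-%09d, label-%09d and num-samples, with 1-based indices). This layout is an assumption: tools/create_lmdb_dataset.py in this repository may use a different one. lmdb_entries and write_lmdb are hypothetical helper names.

```python
def lmdb_entries(samples):
    """Build the key/value pairs for one text-recognition LMDB dataset.

    samples: iterable of (image_bytes, label) pairs.
    Assumed layout (not confirmed by this repository):
      image-000000001 -> encoded image bytes
      label-000000001 -> UTF-8 label
      num-samples     -> number of samples as a decimal string
    """
    entries = {}
    count = 0
    for count, (image_bytes, label) in enumerate(samples, start=1):
        entries[("image-%09d" % count).encode("utf-8")] = image_bytes
        entries[("label-%09d" % count).encode("utf-8")] = label.encode("utf-8")
    entries[b"num-samples"] = str(count).encode("utf-8")
    return entries


def write_lmdb(output_dir, samples, map_size=1 << 30):
    """Write the entries to output_dir using the third-party lmdb package."""
    import lmdb  # imported here so lmdb_entries works without it

    env = lmdb.open(output_dir, map_size=map_size)
    with env.begin(write=True) as txn:
        for key, value in lmdb_entries(samples).items():
            txn.put(key, value)
    env.close()
```

map_size is the maximum size of the database in bytes; the 1 GiB default used here is an arbitrary choice and would have to be raised for the full MJ or ST sets.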
The data directory is organized as follows:

    data
    ├── charset_36.txt
    ├── evaluation
    │   ├── CUTE80
    │   ├── IC13_857
    │   ├── IC15_1811
    │   ├── IIIT5k_3000
    │   ├── SVT
    │   └── SVTP
    └── training
        ├── MJ
        │   ├── MJ_test
        │   ├── MJ_train
        │   └── MJ_valid
        └── ST
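
Before starting a run, it can be useful to confirm that the data directory matches the layout described above. The following is a minimal sketch; missing_entries is a hypothetical helper that is not part of this repository, and the paths in EXPECTED are copied from the layout above.

```python
from pathlib import Path

# Relative paths taken from the directory layout described above.
EXPECTED = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC15_1811",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
]


def missing_entries(data_root):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing entries under data/:")
        for rel in missing:
            print("  " + rel)
    else:
        print("data/ layout looks complete.")
```

The check only tests that each path exists; it does not open the LMDB files or validate their contents.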
- Train the model:
  CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/train_vstnet.yaml
- Evaluate the model:
  CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_abinet.yaml --phase test
  Additional arguments:
  - --checkpoint /path/to/checkpoint: set the path of evaluation model
  - --test_root /path/to/dataset: set the path of evaluation dataset
  - --model_eval [alignment|vision]: which sub-model to evaluate
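
For readers unfamiliar with how such command-line flags are typically handled, here is a minimal argparse sketch covering the options named above. It is an illustration only: the actual main.py may parse its arguments differently, build_parser is a hypothetical name, and the "train" phase value and all defaults are assumptions (only "test" appears in the commands above).

```python
import argparse


def build_parser():
    """Build a parser for the flags named in the commands above (a sketch)."""
    parser = argparse.ArgumentParser(description="Train or evaluate a model.")
    parser.add_argument("--config", required=True,
                        help="path to the YAML configuration file")
    # Only 'test' is documented; 'train' is an assumed counterpart.
    parser.add_argument("--phase", choices=["train", "test"], default=None,
                        help="run phase; left to the config when omitted")
    parser.add_argument("--checkpoint", default=None,
                        help="path of the model to evaluate")
    parser.add_argument("--test_root", default=None,
                        help="path of the evaluation dataset")
    parser.add_argument("--model_eval", choices=["alignment", "vision"],
                        default=None, help="which sub-model to evaluate")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, omitting an optional flag leaves its value as None, so the corresponding setting from the YAML config could be kept unchanged.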