Source code for "Train No Evil: Selective Masking for Task-Guided Pre-Training"
The datasets can be downloaded from this link and should be placed in data/datasets.
- Modify config/test.json to set the input path, output path, BERT model path, GPU usage, etc.
- Run bash scripts/run_all_pipeline.sh.
The meaning of each step can be found in the appendix of our paper. The input/output paths are also set in config/test.json. Run python3 convert_config.py config/test.json to convert the .json file to a .sh file.
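To illustrate the idea behind converting a .json config into a .sh file, here is a minimal sketch. It is not the repository's convert_config.py: the function name json_to_shell and the config keys in the example are hypothetical, and the sketch assumes a flat JSON object whose values can be rendered as strings.

```python
import json
import shlex


def json_to_shell(config):
    """Render a flat dict as shell `export KEY=value` lines.

    Assumption: keys are valid identifiers once upper-cased and
    values are scalars. This is an illustration only, not the
    logic of the repository's convert_config.py.
    """
    lines = []
    for key, value in config.items():
        name = key.upper()
        # shlex.quote protects values containing spaces or shell metacharacters
        lines.append(f"export {name}={shlex.quote(str(value))}")
    return "\n".join(lines) + "\n"


# Hypothetical keys; the real config/test.json may use different names.
example = json.loads('{"bert_model_path": "/path/to/bert", "gpu_list": "0,1"}')
print(json_to_shell(example), end="")
```

A generated file of this kind can be sourced by the shell scripts so that every step reads the same paths.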
We use the training scripts from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT for general pre-training.
bash scripts/finetune_origin.sh
bash data/create_data_rule/run.sh
bash scripts/run_mask_model.sh
bash data/create_data_model/run.sh
bash scripts/run_pretraining.sh
bash scripts/finetune_ckpt_all_seed.sh
python3 gather_results.py $PATH_TO_THE_FINETUNE_OUTPUT
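The order of the steps above can be sketched as a small Python driver that runs each command in turn and stops at the first failure. This is only an illustration under stated assumptions: scripts/run_all_pipeline.sh presumably already chains these steps, the helper name run_steps is hypothetical, and the final gather_results.py call is left out because its argument is a placeholder path.

```python
import shlex
import subprocess

# Commands copied from the step list above, in the same order.
# The gather_results.py step is omitted: its argument is a placeholder.
STEPS = [
    "bash scripts/finetune_origin.sh",
    "bash data/create_data_rule/run.sh",
    "bash scripts/run_mask_model.sh",
    "bash data/create_data_model/run.sh",
    "bash scripts/run_pretraining.sh",
    "bash scripts/finetune_ckpt_all_seed.sh",
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and return the commands that completed.

    `runner` is injectable so the function can be exercised without
    launching the real scripts (hypothetical design choice).
    """
    completed = []
    for command in steps:
        # check=True makes subprocess.run raise CalledProcessError
        # on a non-zero exit status, which aborts the remaining steps.
        runner(shlex.split(command), check=True)
        completed.append(command)
    return completed
```

Calling run_steps(STEPS) from the repository root would execute the steps sequentially. Because each step consumes the output of the previous one, stopping on the first error avoids training on incomplete data.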
If you use the code, please cite this paper:
@inproceedings{gu2020train,
title={Train No Evil: Selective Masking for Task-Guided Pre-Training},
author={Yuxian Gu and Zhengyan Zhang and Xiaozhi Wang and Zhiyuan Liu and Maosong Sun},
year={2020},
booktitle={Proceedings of EMNLP 2020},
}