# README for Normalization System using Sequence to Sequence Learning

By Ankur Khanna (ankur)

Following is a description of the system I built for normalization of noisy user-generated text (such as tweets and YouTube comments). All source code files are version controlled with git.

First, let's look at some of the important files:

* data/
  * train_data.json -- benchmark training data
  * test_truth.json -- benchmark test data
  * clear_tweets.json -- tweets with no OOVs
* trans_norm_train.py
* tf_normalize.py
* tf_normalize_skel.py
* tf_normalize_merge.py

The main script to train the model is './trans_norm_train.py'. If you supply it with the command line flag '--test', it will run the model on the test set. The other important script is './tf_normalize.py', which uses the trained model together with all the rules to run the entire normalization pipeline. If you want to analyze the system, and in particular see how the deep learning model performs by itself, you should run './tf_normalize_skel.py', which does not use the rules.

The script './tf_normalize_skel.py' runs only the model (no rules) on the test set and creates an output file that is then used by './tf_normalize_merge.py'. The suffix 'skel' stands for skeleton, because it does not run the entire normalization system but only the prediction part, where the deep learning model is used. The script './tf_normalize_merge.py' then merges the output of the skeleton script with the rule-based modules. The two modules (prediction and rule-based) are run separately because the prediction module takes a long time to load the model and predict the output, while you will often want to make a small change in the rule-based module and check its effect on the output. It is therefore convenient to save the output of the prediction as the skeleton output and then run the rule-based module on top of it.

All other files are older versions that were used either for inspiration (shamelessly copying) or for running/testing/tuning the model.

To run the scripts, the first thing you need to do is change the directory name in the PATHS dictionary for the key 'root'. It should properly reflect the full path of the root of this repository.

Following is some information on the individual scripts.

## trans_norm_train.py

Input files:

* ./data/clear_tweets.json [global variable 'train_files']
* ./data/train_data.json [global variable 'train_files']
* ./data/test_truth.json [global variable 'test_files']

Output files:

* ./train/checkpoint
* ./train/transNormModel_mixed_3L_1024U.ckpt-xxx
* ./train/transNormModel_mixed_3L_1024U.ckpt-xxx.meta

This script trains the sequence to sequence model for normalization. The first thing it does is preprocess the data from the train and test files into a binary format that is easier to feed to TensorFlow. You can change the way the data is preprocessed by changing the arguments given to the 'DataSource' object during initialization; refer to the class description in 'dataUtils.py' for details on all the available options.

Next, the learning model is initialized. You can tune the model parameters by changing the values given to the 'TransNormModel' object. I would advise you to change these values in the VARS dictionary in 'utilities.py' (a rough sketch follows at the end of this section). Depending on whether the command line option '--test' was supplied, the model is either trained or used for prediction on the test set. The test set prediction is on a per-word basis, including words that are in the vocabulary, so it should not be a surprise that the reported score is high. If you want to check the actual performance of the system for normalization, you should run the scripts 'tf_normalize_skel.py' and 'tf_normalize_merge.py' as explained earlier.

The option '--test' indicates that the model should use the provided test files and run prediction on them, so you can replace these files with development set files too. The option '--gpu' can be used to specify which GPU the model's graph should be loaded onto. This is helpful when you want to train multiple models, each with a different set of parameters, because a model tends to take up the entire memory of a single GPU, especially if it has a large number of parameters.
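As a rough illustration, the values you tune here are of the following shape. The key names below are placeholders rather than the actual entries in 'utilities.py' (the checkpoint name 'transNormModel_mixed_3L_1024U' suggests a 3-layer, 1024-unit model), so check the dictionary itself for the real names.

```python
# Illustrative sketch only -- the key names are placeholders, not the actual
# entries of the VARS dictionary in utilities.py. The checkpoint name
# 'transNormModel_mixed_3L_1024U' suggests 3 layers of 1024 units each.
VARS = {
    'num_layers': 3,     # placeholder key: depth of the seq2seq model
    'layer_size': 1024,  # placeholder key: units per layer
}
```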
## tf_normalize.py

Input files:

* ./data/test_truth.json [command line argument 'test_file']

Output files:

* ./data/test_out.json [command line argument 'out_file']

This script can be used to run the trained model on the test set (or development set). It runs the entire pipeline for normalization and stores the output of the prediction as a json file at './data/test_out.json'. Along with the predictions, the script also stores flags indicating whether a token was passed to the model or whether one of the rules was applied to it. The purpose of the flags is to help with debugging: you can parse the json and ask different questions, such as 'which words were ignored/in vocab/predicted by the pipeline' or 'of all the words that were predicted by the model, how many were predicted correctly', using the flag value to filter the results.

The option '--gpu' can be used to specify which GPU the model's graph should be loaded onto. This is helpful when you want to train one or more models on some GPUs and test the performance of another model on a different GPU.

## tf_normalize_skel.py

Input files:

* ./data/test_truth.json [specified in the loadJSON call in main()]

Output files:

* ./data/test_out_only_oov.json [specified in the saveJSON call in main()]

This script, as mentioned earlier, should be used to run the trained model on the test set (or any other set, for that matter). It uses a single filter, which checks whether an input token is in the vocabulary (see the sketch at the end of this section). If the token is in the vocabulary, the token is left alone; otherwise, it is passed to the model. The script stores the output of the prediction as a json file. Along with the predictions, it also stores flags indicating whether a token was passed to the model or not; the list of flags gets updated later in the pipeline when the next script is run. As before, the purpose of the flags is to help with debugging: you can load the json, ask the same kinds of questions as above, and use the flag value to filter the results.

The option '--gpu' can be used to specify which GPU the model's graph should be loaded onto. This is helpful when you want to train one or more models on some GPUs and test the performance of another model on a different GPU.
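To make the filtering step concrete, the following is a minimal sketch of the logic described above. The names ('vocab', 'model.predict', and the flag strings) are placeholders, not the identifiers actually used in 'tf_normalize_skel.py'.

```python
# Minimal sketch of the skel-stage filter described above. 'vocab',
# 'model.predict' and the flag strings are placeholders, not the
# identifiers used in tf_normalize_skel.py.
def normalize_skel(tokens, vocab, model):
    outputs, flags = [], []
    for tok in tokens:
        if tok in vocab:
            outputs.append(tok)                  # in-vocabulary: leave the token as-is
            flags.append('in_vocab')
        else:
            outputs.append(model.predict(tok))   # OOV: ask the seq2seq model
            flags.append('model')
    return outputs, flags
```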
## tf_normalize_merge.py

Input files:

* ./data/test_truth.json [specified in the loadJSON call in main()]
* ./data/test_out_only_oov.json [specified in the loadJSON call for the global variable 'pred_cache'; this should be the same file as the output of 'tf_normalize_skel.py']

Output files:

* ./data/test_out.json [specified in the saveJSON call in __main__]

This is the final script, and it should be run after the skeleton script (tf_normalize_skel.py). It combines the deep learning prediction with the rule-based prediction. It uses the rules defined in 'dict/sub.dict' and others (please check the script for all the other sources) to decide whether the deep learning prediction should be kept or one of the rules should be applied, and the flag for that token is changed accordingly (a rough sketch follows below).

It also performs one extra step: a diff. This is not required for prediction but can help with error analysis. In this step, the current output of the system is compared with the previous run of this script, so apart from the performance, the script also reports the number of changes in the predictions on the same set. It additionally creates a file './data/diff_errors.txt' listing all the changes so you can go through them. I used this while tuning (adding/removing) rules against the performance on the development set. Please refer to the function 'perform_diff()' in './diff_errors.py' for more details.
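For orientation, the following is a minimal sketch of the merge decision described above. The substitution-dictionary format (treated here as a plain token-to-replacement mapping) and the flag names are assumptions, not the script's actual code.

```python
# Minimal sketch of the merge decision described above. The format of
# dict/sub.dict (treated here as a plain token -> replacement mapping) and
# the flag names are assumptions, not tf_normalize_merge.py's actual code.
def merge_prediction(token, model_pred, sub_dict):
    if token in sub_dict:
        return sub_dict[token], 'rule'    # a hand-written rule covers this token
    return model_pred, 'model'            # otherwise keep the seq2seq prediction
```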