A sequence to sequence learning model for text normalization with TensorFlow

junkmechanic/text-norm-seq2seq

README for Normalization System using Sequence to Sequence Learning

By Ankur Khanna (ankur)

The following is a description of the system I built for the normalization of
noisy user-generated text (such as tweets, YouTube comments etc).

All the source code files are version controlled using git. First, let's look at
some of the important files:

data/
    train_data.json                     -- benchmark training data
    test_truth.json                     -- benchmark test data
    clear_tweets.json                   -- tweets with no OOVs
trans_norm_train.py
tf_normalize.py
tf_normalize_skel.py
tf_normalize_merge.py

The main script to train the model is './trans_norm_train.py'. If you supply it
with the command line flag '--test', it will run the model on the test set.

The other important script is './tf_normalize.py' which uses the trained model
as well as all the rules to run the entire normalization pipeline.

If you want to analyze the system, and in particular see how the deep learning
model performs by itself, run './tf_normalize_skel.py'. It runs only the model
(no rules) on the test set and creates an output file that is later used by
'tf_normalize_merge.py'. The suffix 'skel' stands for skeleton because the
script does not run the entire normalization system, only the prediction part
where the deep learning model is used.

The script './tf_normalize_merge.py' then merges the output of the skeleton
script with the rule-based modules.

The two modules (prediction and rule-based) are run separately because the
prediction module takes a long time to load the model and then predict the
output. Often, you may want to make a small change in the rule-based module and
check how the output changes. Hence, it is convenient to save the output of the
prediction as the skeleton output and then run the rule-based module on top of
it.

All other files are older versions that were used for either inspiration
(shamelessly copying) or running/testing/tuning the model.

To run the scripts, the first thing you need to do is change the directory
name stored under the key 'root' in the PATHS dictionary. It should be the full
path of the root of this repository.
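
As a rough illustration, the edit might look like the sketch below. Only the
dictionary name PATHS and the key 'root' come from this README; the example
path and the helper 'in_root' are hypothetical, and the real dictionary may
contain other keys.

```python
import os

# Only the name PATHS and the key 'root' are taken from the README.
# The path below is a placeholder: replace it with the absolute path of
# your own checkout.
PATHS = {
    'root': '/home/user/text-norm-seq2seq',
}


def in_root(*parts):
    """Hypothetical helper: build a path relative to the configured root."""
    return os.path.join(PATHS['root'], *parts)


train_file = in_root('data', 'train_data.json')
```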

Following is some information on individual scripts:

trans_norm_train.py
    Input files:
        ./data/clear_tweets.json   [global variable 'train_files']
        ./data/train_data.json     [global variable 'train_files']
        ./data/test_truth.json     [global variable 'test_files']
    Output files:
        ./train/checkpoint
        ./train/transNormModel_mixed_3L_1024U.ckpt-xxx
        ./train/transNormModel_mixed_3L_1024U.ckpt-xxx.meta

    This script trains the sequence to sequence model for normalization. The
    first thing it does is preprocess the data from the train and test files to
    convert it into a binary format which is easier to feed to TensorFlow. You
    can change the options to change the way the data is preprocessed. To do
    that, change the arguments given to the 'DataSource' object during
    initialization. You can refer to the class description in 'dataUtils.py'
    for more details on all the options available.

    Next, the learning model is initialized. You can tune the model parameters
    by changing the values given to the 'TransNormModel' object. I would advise
    you to change these values in the VARS dictionary in 'utilities.py'.

    Depending on whether the command line option '--test' was supplied, the
    model is either trained or used for prediction on the test set. The test
    set score is computed on a per-word basis and includes words that are
    already in the vocabulary, so it should not be a surprise that the value is
    high. If you want to check the actual performance of the system for
    normalization, you should run the scripts 'tf_normalize_skel.py' and
    'tf_normalize_merge.py' as explained earlier.

    The option '--test' is used to indicate that the model should use the
    provided test files and run prediction on them. So you can replace the files
    with development set files too.

    The option '--gpu' can be used to specify which GPU the model's graph
    should be loaded onto. This can be helpful when you want to train multiple
    models, each with a different set of parameters, because a model tends to
    take up the entire memory of a single GPU, especially if it has a large
    number of parameters.
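
    The two options can be pictured with the sketch below. It is not the
    script's actual argument handling: it assumes that '--gpu' takes an integer
    index, and it restricts the process to that device through the
    CUDA_VISIBLE_DEVICES environment variable, which is only one possible way
    to pin a model to a GPU. The function names are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the two options described above (assumed shapes, see lead-in)."""
    parser = argparse.ArgumentParser(description='Train or test the model.')
    parser.add_argument('--test', action='store_true',
                        help='run prediction on the test files instead of training')
    parser.add_argument('--gpu', type=int, default=0,
                        help='index of the GPU to load the graph onto (assumed integer)')
    return parser.parse_args(argv)


def select_gpu(gpu_index):
    """Make only one GPU visible to this process.

    This must happen before TensorFlow initializes CUDA. Inside the process
    the selected device then shows up as the only GPU.
    """
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_index)


# Equivalent to running the script with: --test --gpu 1
args = parse_args(['--test', '--gpu', '1'])
select_gpu(args.gpu)
```

    With this approach, two training runs started with different '--gpu'
    values do not compete for the memory of the same device.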

tf_normalize.py
    Input files:
        ./data/test_truth.json     [commandline argument 'test_file']
    Output files:
        ./data/test_out.json       [commandline argument 'out_file']

    This script can be used to run the trained model on the test set (or a
    development set). It runs the entire normalization pipeline.

    This script stores the output of the prediction as a JSON file at the
    following location: './data/test_out.json'. Along with the predictions, the
    script also stores flags indicating whether a token was passed to the model
    or whether one of the rules was applied to it.

    The purpose of the flags is to help with debugging. This way, you can parse
    the JSON and ask different questions such as 'what words were ignored/in
    vocab/predicted by the pipeline' or 'of all the words that were predicted by
    the model, how many were correctly predicted'. In such cases, the flag
    value can be used to filter the results.
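
    The kind of question described above could be answered with a few lines of
    Python. The sketch below assumes a schema that this README does not
    confirm: each record is a dictionary with parallel lists under the keys
    'input', 'output' and 'flags', and the flag values are strings such as
    'predicted'. Check the actual output file and adjust the keys and flag
    values before using it.

```python
def tokens_with_flag(records, flag):
    """Return (input token, output token) pairs whose flag equals `flag`."""
    pairs = []
    for rec in records:
        for src, out, tok_flag in zip(rec['input'], rec['output'], rec['flags']):
            if tok_flag == flag:
                pairs.append((src, out))
    return pairs


def flagged_accuracy(pred_records, truth_records, flag):
    """Fraction of tokens carrying `flag` whose output matches the truth."""
    correct = 0
    total = 0
    for pred, truth in zip(pred_records, truth_records):
        for out, gold, tok_flag in zip(pred['output'], truth['output'], pred['flags']):
            if tok_flag == flag:
                total += 1
                if out == gold:
                    correct += 1
    return float(correct) / total if total else 0.0


# Tiny hand-made example, not real system output.
predictions = [{
    'input': ['u', 'shud', 'c', 'it'],
    'output': ['you', 'should', 'c', 'it'],
    'flags': ['rule', 'predicted', 'predicted', 'in_vocab'],
}]
truth = [{
    'input': ['u', 'shud', 'c', 'it'],
    'output': ['you', 'should', 'see', 'it'],
}]

predicted_pairs = tokens_with_flag(predictions, 'predicted')
accuracy = flagged_accuracy(predictions, truth, 'predicted')
```

    For real data, the two lists would be loaded from './data/test_out.json'
    and './data/test_truth.json' with the standard 'json' module.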

    The option '--gpu' can be used to specify which GPU the model's graph
    should be loaded onto. This can be helpful when you want to train one or
    more models on some GPUs while testing another model on a different GPU.

tf_normalize_skel.py
    Input files:
        ./data/test_truth.json         [specified in loadJSON call in main()]
    Output files:
        ./data/test_out_only_oov.json  [specified in saveJSON call in main()]

    This script, as mentioned earlier, should be used to run the trained model
    on the test set, or any other set for that matter. There is a single filter
    in this script, which checks whether an input token is in the vocabulary.
    If the token is in the vocabulary, it is left unchanged. Otherwise, the
    token is passed to the model.
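
    A minimal sketch of this filter is shown below. It is not the code from the
    script: the function name, the flag strings and the 'predict' callable
    (a stand-in for the trained model) are all hypothetical.

```python
def skeleton_pass(tokens, vocab, predict):
    """Leave in-vocabulary tokens alone and send the rest to the model.

    `predict` is any callable mapping one token to its normalized form.
    Returns the output tokens and one flag per token.
    """
    outputs = []
    flags = []
    for token in tokens:
        if token in vocab:
            outputs.append(token)
            flags.append('in_vocab')
        else:
            outputs.append(predict(token))
            flags.append('predicted')
    return outputs, flags


# Toy stand-ins for the vocabulary and the model.
vocab = {'see', 'you', 'tomorrow'}
fake_model = {'2moro': 'tomorrow'}

outputs, flags = skeleton_pass(['see', 'you', '2moro'], vocab,
                               lambda tok: fake_model.get(tok, tok))
```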

    This script stores the output of the prediction as a JSON file. Along with
    the predictions, the script also stores flags indicating whether a token was
    passed to the model or not. The list of flags is updated later in the
    pipeline when the next script is run.

    The purpose of the flags is to help with debugging. This way, you can load
    the JSON and ask different questions such as 'what words were ignored/in
    vocab/predicted by the pipeline' or 'of all the words that were predicted by
    the model, how many were correctly predicted'. In such cases, the flag
    value can be used to filter the results.

    The option '--gpu' can be used to specify which GPU the model's graph
    should be loaded onto. This can be helpful when you want to train one or
    more models on some GPUs while testing another model on a different GPU.

tf_normalize_merge.py
    Input files:
        ./data/test_truth.json         [specified in loadJSON call in main()]
        ./data/test_out_only_oov.json  [specified in loadJSON call for global
                                        variable 'pred_cache'. This should be
                                        the same file as the output of
                                        'tf_normalize_skel.py']
    Output files:
        ./data/test_out.json           [specified in saveJSON call in __main__]

    This is the final script and should be run after the skeleton script
    (tf_normalize_skel.py). It combines the deep learning prediction with the
    rule-based prediction. It uses the rules defined in 'dict/sub.dict' and
    other sources (please check the script for the full list) to decide whether
    the deep learning prediction should be kept or one of the rules should be
    applied instead. The flag for that token is updated accordingly.
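
    The merge step can be pictured roughly as follows. This sketch makes
    assumptions the README does not state: that a substitution dictionary
    (standing in for 'dict/sub.dict', whose file format is not shown here)
    takes priority over the cached model output, and that the flag is a plain
    string. The real script consults more sources and may order them
    differently.

```python
def merge_record(tokens, cached_outputs, cached_flags, sub_dict):
    """Combine cached model outputs with a substitution rule.

    Assumed precedence: a token found in `sub_dict` is replaced by the rule
    and flagged 'rule'; otherwise the cached output and flag are kept.
    """
    outputs = []
    flags = []
    for token, cached, flag in zip(tokens, cached_outputs, cached_flags):
        if token in sub_dict:
            outputs.append(sub_dict[token])
            flags.append('rule')
        else:
            outputs.append(cached)
            flags.append(flag)
    return outputs, flags


# Toy data: the cached skeleton output left 'c' unchanged.
tokens = ['c', 'u', '2moro']
cached_outputs = ['c', 'you', 'tomorrow']
cached_flags = ['predicted', 'predicted', 'predicted']
sub_dict = {'c': 'see'}

merged_outputs, merged_flags = merge_record(tokens, cached_outputs,
                                            cached_flags, sub_dict)
```

    Because the model output is read from the cache, changing 'sub_dict' and
    re-running this step does not require loading the model again.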

    It also performs one extra diff step. This is not required for prediction
    but can help with error analysis. In this step, the current output of the
    system is compared with the output of the previous run of this script. So,
    apart from the performance, this script also reports the number of changes
    in the predictions on the same set. It also creates a file
    './data/diff_errors.txt' listing all the changes so that you can go through
    them. I used it while tuning (adding/removing) rules against the
    performance on the development set. Please refer to the function
    'perform_diff()' in './diff_errors.py' for more details.
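
    The idea behind the diff step can be sketched as below. This is not the
    actual 'perform_diff()' implementation: the function name is hypothetical,
    and the record layout (dictionaries with parallel 'input' and 'output'
    lists) is an assumption.

```python
def diff_outputs(previous, current):
    """List every token whose prediction changed between two runs.

    Each change is (record index, token index, input token, old, new).
    Both runs are assumed to cover the same records in the same order.
    """
    changes = []
    for rec_idx, (old_rec, new_rec) in enumerate(zip(previous, current)):
        triples = zip(new_rec['input'], old_rec['output'], new_rec['output'])
        for tok_idx, (token, old, new) in enumerate(triples):
            if old != new:
                changes.append((rec_idx, tok_idx, token, old, new))
    return changes


# Toy data: a new rule changed the prediction for 'c'.
previous_run = [{'input': ['c', 'u'], 'output': ['c', 'you']}]
current_run = [{'input': ['c', 'u'], 'output': ['see', 'you']}]

changes = diff_outputs(previous_run, current_run)
num_changes = len(changes)
```

    Writing 'changes' to a text file, one change per line, would give
    something similar in spirit to './data/diff_errors.txt'.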
