
Assemble BiDAF model and add training script #15

Open
wants to merge 46 commits into base: bidaf

Conversation

@Ishitori Ishitori commented Aug 2, 2018

Description

The BiDAF model is assembled and the training script can be run for multiple epochs. Everything marked with TODO: in the code still needs to be fixed.

[X] Reduce data processing time (or cache it in file system between runs)
[X] Update evaluation metrics to properly use official evaluation script (metric.py)
[ ] Question max length was set to 400 (equal to the paragraph max length) to make sure that the dot-product similarity doesn't fail. It needs to be set back to the original 30.
[X] Validation dataset is not used for calculating F1 and Exact Match (EM) scores.

Checklist

Essentials

  • [ ] Changes are complete (i.e. I finished coding on this PR)
  • [ ] All changes have test coverage
  • [ ] Code is well-documented

Changes

  • Assembled BiDAF model
  • Training script

@cgraywang cgraywang (Owner) left a comment

I mainly commented on the model part this time and will continue reviewing the script. Thanks Sergey!

dropout=0.2, prefix=None, params=None):
super(BiDAFModelingLayer, self).__init__(prefix=prefix, params=params)

self._modeling_layer = LSTM(hidden_size=input_dim, num_layers=nlayers, dropout=dropout,
Owner:

Adding a dropout layer in the modeling layer would be more feasible and could possibly help to improve the performance later. I suggest we use mxnet.gluon.nn.HybridSequential to stack the LSTM and Dropout layers (see the sketch below).
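A minimal sketch of that suggestion. The values of `input_dim`, `nlayers`, and `dropout` are placeholders mirroring the constructor arguments above; the exact layer configuration is an assumption, not the final implementation:

```python
from mxnet.gluon import nn, rnn

# Placeholder hyperparameters mirroring the constructor arguments above.
input_dim, nlayers, dropout = 100, 2, 0.2

# Stack the LSTM and an explicit Dropout layer inside a HybridSequential so
# the modeling layer is a single stacked block, as the review suggests.
modeling_layer = nn.HybridSequential()
with modeling_layer.name_scope():
    modeling_layer.add(rnn.LSTM(hidden_size=input_dim, num_layers=nlayers, dropout=dropout))
    modeling_layer.add(nn.Dropout(dropout))
```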


# TODO: Loss function applies softmax by default, so this code is commented here
# Will need to reuse it to actually make predictions
# start_index_softmax_output = start_index_dense_output.softmax(axis=1)
Owner:

Actually we do need this: we need it in the evaluation to connect with the official eval script. A suggestion: we can split out the computation of the logits here and use utils.masked_softmax to generate the probabilities (see the sketch below).
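For illustration, a hedged sketch of that split. The real code would reuse utils.masked_softmax; the helper below and the shapes are stand-ins, not the PR's implementation:

```python
from mxnet import nd

def masked_softmax(logits, mask):
    # Stand-in for utils.masked_softmax: push masked-out positions to a very
    # negative value so they get ~0 probability after the softmax.
    very_negative = -1e30
    masked_logits = logits * mask + (1 - mask) * very_negative
    return nd.softmax(masked_logits, axis=-1)

# Keep the dense output as raw logits for SoftmaxCrossEntropyLoss during training...
start_index_logits = nd.random.uniform(shape=(2, 5))        # stand-in for start_index_dense_output
context_mask = nd.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # 1 = real token, 0 = padding
# ...and only convert to probabilities at evaluation time.
start_index_probs = masked_softmax(start_index_logits, context_mask)
```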

options.highway_num_layers,
options.embedding_size,
prefix="question_embedding")
self._attention_layer = AttentionFlow(DotProductSimilarity())
Owner:

We can provide a similarity_option argument here, with DotProductSimilarity() as the default. Later on we can experiment with more similarity options in a more flexible way (see the sketch below).
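A sketch of that change. AttentionFlow and DotProductSimilarity are the PR's own classes, so the tiny stubs below only make the example self-contained; the argument name `similarity_function` is illustrative:

```python
class DotProductSimilarity(object):     # stub standing in for the PR's class
    pass


class AttentionFlow(object):            # stub standing in for the PR's class
    def __init__(self, similarity_function):
        self._similarity_function = similarity_function


class BiDAFModelSketch(object):
    """Sketch: only the constructor argument is the point here."""

    def __init__(self, similarity_function=None):
        # Default to DotProductSimilarity, but let callers swap in another
        # similarity function to experiment with.
        self._attention_layer = AttentionFlow(similarity_function or DotProductSimilarity())
```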

q_embedding_states)

attention_layer_output = self._attention_layer(ctx_embedding_output, q_embedding_output)
modeling_layer_output = self._modeling_layer(attention_layer_output)
Owner:

Update this accordingly with the dropout in the modeling layer?

return nd.transpose(output, axes=(1, 0, 2))


class BiDAFModel(Block):
Owner:

Please feel free to add some docs that at least explain the input and the output and the corresponding dimensions (for example, something like the sketch below).
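A hedged docstring sketch: the input names follow the training script later in this PR, and the shapes are assumptions that should be checked against the implementation:

```python
from mxnet.gluon import Block

class BiDAFModel(Block):
    """Bidirectional Attention Flow model for SQuAD-style QA (docstring sketch only).

    Inputs (assumed)
        q_words   : NDArray, shape (batch_size, q_max_len), question word indices
        ctx_words : NDArray, shape (batch_size, ctx_max_len), context word indices
        q_chars   : NDArray, shape (batch_size, q_max_len, word_max_len), question char indices
        ctx_chars : NDArray, shape (batch_size, ctx_max_len, word_max_len), context char indices

    Outputs (assumed)
        start, end : NDArray, each of shape (batch_size, ctx_max_len), scores over
        context positions for the answer-span start and end.
    """
```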

@Ishitori (Author):
Added multi-GPU support and official evaluation script support.

It works only with a small change in the SQuAD dataset (#16), because I need to provide the JSON to the evaluation script.

I haven't applied fixes to your comments yet.

super(BiDAFOutputLayer, self).__init__(prefix=prefix, params=params)

units = 10 * span_start_input_dim if units is None else units
units = 4 * span_start_input_dim if units is None else units
Owner:

Why is this four times?

@@ -10,7 +10,7 @@
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# Unless required by applicable law or agreed to in writinConvolutionalEncoderg,
Owner:

mistake?

Owner:

typo: writinConvolutionalEncoderg

@cgraywang cgraywang (Owner) left a comment

I think the code is in good shape in general now. We might want to improve the code quality further before moving it to dmlc. For now it needs around ~1650 seconds per epoch, so we might also improve the performance. @Ishitori Thanks for the good work!

@@ -80,7 +80,7 @@ def __init__(self, segment='train', root=os.path.join('~', '.mxnet', 'datasets',
self._segment = segment
self._get_data()

super(SQuAD, self).__init__(self._read_data())
super(SQuAD, self).__init__(SQuAD._get_records(self._read_data()))

def _get_data(self):
Owner:

Should the data always come from SQuAD._get_records?

Author:

Yep.

The thing is that the official evaluation script needs the original JSON content, and instead of trying to figure out the path on the filesystem and loading it manually, it is easier to have a separate method that reads the JSON from disk and parses it. Then I can send the JSON as-is to the official evaluation script without needing to know where MXNet stores the file.

Owner:

In MXNet in general, we normally download the data into the MXNet root dir; then we check whether the data is already there and load it from there if it exists, otherwise we download the data from S3.
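For illustration, the usual pattern looks roughly like this; the URL, filename, and root below are placeholders, not the real SQuAD locations:

```python
import os
from mxnet import gluon

def get_or_download(root='~/.mxnet/datasets/squad',
                    url='https://example.com/dataset.json',  # placeholder URL
                    filename='dataset.json'):
    """Load the file from the MXNet root dir if present, otherwise download it."""
    root = os.path.expanduser(root)
    path = os.path.join(root, filename)
    if not os.path.exists(path):
        if not os.path.isdir(root):
            os.makedirs(root)
        gluon.utils.download(url, path=path)
    return path
```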



def hybrid_forward(self, F, matrix_1, matrix_2):
# pylint: disable=arguments-differ
tiled_matrix_1 = matrix_1.expand_dims(2).broadcast_to(shape=(self._batch_size,
Owner:

To make this hybridizable, we can use
tiled_matrix_1 = F.broadcast_to(F.expand_dims(matrix_1, 2), shape=(self._batch_size, self._passage_length, self._question_length, self._embedding_size))

Author:

It is hybridizable even if we call expand_dims directly on matrix_1.

Owner:

Based on offline discussion, it sounds like this is a bug in MXNet; we should have automatic registration of the operators in mxnet.sym and mxnet.sym.Symbol.

self._passage_length,
self._question_length,
self._embedding_size))
tiled_matrix_2 = matrix_2.expand_dims(1).broadcast_to(shape=(self._batch_size,
Owner:

Same as above.

Author:

This block is hybridizable even when calling matrix_2.expand_dims. I would leave it like that, because it is more readable.

self._passage_length = passage_length
self._question_length = question_length
self._encoding_dim = encoding_dim
self._precision = precision
Owner:

Since we haven't verified the fp16 setting, I suggest we remove this option for now. Once we have verified it later, we can probably add it back.

return _last_dimension_applicator(F, masked_log_softmax, tensor, mask, tensor_shape, mask_shape)


def weighted_sum(F, matrix, attention, matrix_shape, attention_shape):
Owner:

F seems not needed.

Author:

It is needed, because of batch_dot.
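A small illustration of that dependence; the shapes are placeholders:

```python
from mxnet import nd as F

# batch_dot is called through the F namespace (mx.nd imperatively, mx.sym when
# hybridized), which is why weighted_sum keeps the F argument.
matrix = F.random.uniform(shape=(2, 5, 8))      # (batch, seq_len, dim)
attention = F.random.uniform(shape=(2, 5))      # (batch, seq_len)
weighted = F.batch_dot(attention.expand_dims(1), matrix).reshape((2, 8))  # (batch, dim)
```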

return intermediate.sum(axis=-2)


def replace_masked_values(F, tensor, mask, replace_with):
Owner:

F seems not needed.

Author:

It is needed, because of broadcast_add and broadcast_mul.
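For illustration, a common formulation of replace_masked_values that relies on those broadcast ops; the values here are placeholders, not the PR's exact code:

```python
from mxnet import nd as F

tensor = F.array([[1.0, 2.0, 3.0]])
mask = F.array([[1.0, 1.0, 0.0]])   # 1 = keep, 0 = replace
replace_with = -1e7

# Keep values where mask == 1, write `replace_with` where mask == 0, using
# F.broadcast_mul / F.broadcast_add from the F namespace.
result = F.broadcast_add(F.broadcast_mul(tensor, mask), (1 - mask) * replace_with)
```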

self._scale_output = scale_output

def hybrid_forward(self, F, array_1, array_2):
result = (array_1 * array_2).sum(axis=-1)
Owner:

To make use of hybridization, we can rewrite to result = F.sum(F.dot(array_1, array_2), axis=-1)

Author:

It is not needed, it can be hybridized even like that.

def hybrid_forward(self, F, array_1, array_2):
normalized_array_1 = F.broadcast_div(array_1, F.norm(array_1, axis=-1, keepdims=True))
normalized_array_2 = F.broadcast_div(array_2, F.norm(array_2, axis=-1, keepdims=True))
return (normalized_array_1 * normalized_array_2).sum(axis=-1)
Owner:

Similar to the above comment, we could use F based operators.

Author:

Yes we could, and it seems to hybridize just fine even without it.

avg_loss_scalar = avg_loss.asscalar()
epoch_time = time() - e_start

print("\tEPOCH {:2}: train loss {:6.4f} | batch {:4} | lr {:5.3f} | "
Owner:

We could also print out the throughput (# of samples/sec); it would help us do better speed and profiling analysis later on (see the sketch below).
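A hedged sketch of adding that to the epoch summary. The variable values are placeholders standing in for the surrounding training loop (the ~1650 s epoch time is taken from the review comment above), not real measurements:

```python
# Placeholders standing in for values from the surrounding training loop.
e, avg_loss_scalar, records_per_epoch_count, epoch_time = 1, 2.3456, 87599, 1650.0

throughput = records_per_epoch_count / epoch_time  # samples per second
print("\tEPOCH {:2}: train loss {:6.4f} | throughput {:8.2f} samples/s | time {:6.2f} s"
      .format(e, avg_loss_scalar, throughput, epoch_time))
```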

@cgraywang cgraywang (Owner) left a comment

The code in general is in good shape and needs a few improvements. We can decide what should be done in this iteration and what should be done as next steps. I think you might want to try to improve some of them before the PR to the dmlc feature branch. I have created the feature branch in dmlc: https://github.com/dmlc/gluon-nlp/tree/bidaf


record_index = gluon.utils.split_and_load(record_index, ctx, even_split=False)
q_words = gluon.utils.split_and_load(q_words, ctx, even_split=False)
ctx_words = gluon.utils.split_and_load(ctx_words, ctx, even_split=False)
q_chars = gluon.utils.split_and_load(q_chars, ctx, even_split=False)
ctx_chars = gluon.utils.split_and_load(ctx_chars, ctx, even_split=False)

ctx_embedding_begin_state_list = net.ctx_embedding.begin_state(ctx)
Owner:

We still need to initialize the begin_state, am I right?

Author:

They will be initialized by default. I only had to use this code before to support float16 precision mode. I even created an issue, apache/mxnet#12650, regarding that, but there has been no progress on it so far.

self._encoding_dim = encoding_dim
self._precision = precision

def _get_big_negative_value(self):
Owner:

We can keep this as-is. Another way is to use Python's best_val = float('Inf').

batch_sizes)]
return state_list

def hybrid_forward(self, F, x, state, *args):
Owner:

Why is state ignored here?

Author:

Refactored and removed state from the code. Using the default state is enough, as we don't pass state between batches.

# under the License.

from mxnet import gluon
import numpy as np
Owner:

Please see the comment on _get_big_negative_value.


@@ -0,0 +1,195 @@
# coding: utf-8
Owner:

We can plan, as a next step, to integrate this function into the gluonnlp attention_cell.

Author:

Next step for us

print("Training time {:6.2f} seconds".format(time() - train_start))


def get_learning_rate_per_iteration(iteration, options):
Owner:

Rename this to warm_up_steps so that others can easily understand what this is.

Author:

Done

return True if "predefined_embedding_layer" in name else False


def save_model_parameters(net, epoch, options):
Owner:

Can we merge save_model_parameters and save_ema_parameters? Since whether to use EMA is the user's setting, we should only expose one inference path, i.e. one copy of the model parameters (see the sketch below).
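A hypothetical sketch of the merged entry point. The use_exponential_moving_average flag, the ema.get_params() accessor, and the path layout are all assumptions for illustration, not the PR's actual API:

```python
import os

def save_model_parameters(net, ema, epoch, options):
    """Save one copy of the parameters: EMA if enabled, raw otherwise."""
    save_path = os.path.join(options.save_dir, 'epoch{:d}.params'.format(epoch))  # assumed layout
    if getattr(options, 'use_exponential_moving_average', False) and ema is not None:
        ema.get_params().save(save_path)   # hypothetical EMA accessor
    else:
        net.save_parameters(save_path)
```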

.format(e, avg_loss_scalar, options.batch_size, trainer.learning_rate,
records_per_epoch_count / epoch_time, epoch_time))

save_model_parameters(net, e, options)
Owner:

According to our discussion, we might want to store the parameters of the last several epochs as checkpoints, since we will eventually need to ensemble the results.

Author:

Parameters of all epochs are stored in separate files, so if we want to ensemble them together, we can already reference them now.

net.save_parameters(save_path)


def save_ema_parameters(ema, epoch, options):
Owner:

We don't need to save the EMA parameters, since they are only used in evaluation.

Author:

Yes, I will refactor it to store either the EMA or the raw parameters, depending on the flag.
