
Assemble BiDAF model and add training script #15

Open
wants to merge 46 commits into base: bidaf

Conversation

@Ishitori Ishitori commented Aug 2, 2018

Description

The BiDAF model is assembled and the training script can be run for multiple epochs. Everything marked with TODO: in the code still needs to be fixed.

[X] Reduce data processing time (or cache it in file system between runs)
[X] Update evaluation metrics to properly use official evaluation script (metric.py)
[ ] Question max length was set to 400 (equal to the paragraph max length) to make sure that the dot-product similarity doesn't fail. It needs to be set back to the original 30.
[X] Validation dataset is not used for calculating F1 and Exact Match (EM) scores.

Checklist

Essentials

  • [ ] Changes are complete (i.e. I finished coding on this PR)
  • [ ] All changes have test coverage
  • [ ] Code is well-documented

Changes

  • Assembled BiDAF model
  • Training script

@cgraywang cgraywang (Owner) left a comment

I mainly commented on the model part this time and will continue reviewing the script. Thanks Sergey!

dropout=0.2, prefix=None, params=None):
super(BiDAFModelingLayer, self).__init__(prefix=prefix, params=params)

self._modeling_layer = LSTM(hidden_size=input_dim, num_layers=nlayers, dropout=dropout,
Owner:

Adding a dropout layer in the modeling layer would be more feasible and could possibly help to improve the performance later. I suggest we use mxnet.gluon.nn.HybridSequential to stack the LSTM and Dropout layers (see the sketch below).
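A minimal sketch of that suggestion. The values of `input_dim`, `nlayers`, and `dropout` are placeholders mirroring the constructor arguments above; the exact layer configuration is an assumption, not the final implementation:

```python
from mxnet.gluon import nn, rnn

# Placeholder hyperparameters mirroring the constructor arguments above.
input_dim, nlayers, dropout = 100, 2, 0.2

# Stack the LSTM and an explicit Dropout layer inside a HybridSequential so
# the modeling layer is a single stacked block, as the review suggests.
modeling_layer = nn.HybridSequential()
with modeling_layer.name_scope():
    modeling_layer.add(rnn.LSTM(hidden_size=input_dim, num_layers=nlayers, dropout=dropout))
    modeling_layer.add(nn.Dropout(dropout))
```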


# TODO: Loss function applies softmax by default, so this code is commented here
# Will need to reuse it to actually make predictions
# start_index_softmax_output = start_index_dense_output.softmax(axis=1)
Owner:

Actually we do need this: we need it in the evaluation to connect with the official eval script. A suggestion: we can split out the computation of the logits here and use utils.masked_softmax to generate the probabilities (see the sketch below).
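For illustration, a hedged sketch of that split. The real code would reuse utils.masked_softmax; the helper below and the shapes are stand-ins, not the PR's implementation:

```python
from mxnet import nd

def masked_softmax(logits, mask):
    # Stand-in for utils.masked_softmax: push masked-out positions to a very
    # negative value so they get ~0 probability after the softmax.
    very_negative = -1e30
    masked_logits = logits * mask + (1 - mask) * very_negative
    return nd.softmax(masked_logits, axis=-1)

# Keep the dense output as raw logits for SoftmaxCrossEntropyLoss during training...
start_index_logits = nd.random.uniform(shape=(2, 5))        # stand-in for start_index_dense_output
context_mask = nd.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # 1 = real token, 0 = padding
# ...and only convert to probabilities at evaluation time.
start_index_probs = masked_softmax(start_index_logits, context_mask)
```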

options.highway_num_layers,
options.embedding_size,
prefix="question_embedding")
self._attention_layer = AttentionFlow(DotProductSimilarity())
Owner:

We can provide a similarity_option argument here, with DotProductSimilarity() as the default. Later on we can experiment with more similarity options in a more flexible way (see the sketch below).
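A sketch of that change. AttentionFlow and DotProductSimilarity are the PR's own classes, so the tiny stubs below only make the example self-contained; the argument name `similarity_function` is illustrative:

```python
class DotProductSimilarity(object):     # stub standing in for the PR's class
    pass


class AttentionFlow(object):            # stub standing in for the PR's class
    def __init__(self, similarity_function):
        self._similarity_function = similarity_function


class BiDAFModelSketch(object):
    """Sketch: only the constructor argument is the point here."""

    def __init__(self, similarity_function=None):
        # Default to DotProductSimilarity, but let callers swap in another
        # similarity function to experiment with.
        self._attention_layer = AttentionFlow(similarity_function or DotProductSimilarity())
```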

q_embedding_states)

attention_layer_output = self._attention_layer(ctx_embedding_output, q_embedding_output)
modeling_layer_output = self._modeling_layer(attention_layer_output)
Owner:

Update this accordingly with the dropout in the modeling layer?

return nd.transpose(output, axes=(1, 0, 2))


class BiDAFModel(Block):
Owner:

Please feel free to add some docs that at least explain the input and the output and the corresponding dimensions (for example, something like the sketch below).
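A hedged docstring sketch: the input names follow the training script later in this PR, and the shapes are assumptions that should be checked against the implementation:

```python
from mxnet.gluon import Block

class BiDAFModel(Block):
    """Bidirectional Attention Flow model for SQuAD-style QA (docstring sketch only).

    Inputs (assumed)
        q_words   : NDArray, shape (batch_size, q_max_len), question word indices
        ctx_words : NDArray, shape (batch_size, ctx_max_len), context word indices
        q_chars   : NDArray, shape (batch_size, q_max_len, word_max_len), question char indices
        ctx_chars : NDArray, shape (batch_size, ctx_max_len, word_max_len), context char indices

    Outputs (assumed)
        start, end : NDArray, each of shape (batch_size, ctx_max_len), scores over
        context positions for the answer-span start and end.
    """
```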

@Ishitori (Author):
Added multi-GPU support and official evaluation script support.

It works only with a small change in the SQuAD dataset (#16), because I need to provide the JSON to the evaluation script.

I haven't applied fixes to your comments yet.

super(BiDAFOutputLayer, self).__init__(prefix=prefix, params=params)

units = 10 * span_start_input_dim if units is None else units
units = 4 * span_start_input_dim if units is None else units
Owner:

Why is this four times?

@@ -10,7 +10,7 @@
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# Unless required by applicable law or agreed to in writinConvolutionalEncoderg,
Owner:

mistake?

Owner:

typo: writinConvolutionalEncoderg

@cgraywang cgraywang (Owner) left a comment

I think the code is in good shape in general now. We might want to improve the code quality further before moving it to dmlc. For now it needs around ~1650 seconds per epoch, so we might also improve the performance. @Ishitori Thanks for the good work!

@@ -80,7 +80,7 @@ def __init__(self, segment='train', root=os.path.join('~', '.mxnet', 'datasets',
self._segment = segment
self._get_data()

super(SQuAD, self).__init__(self._read_data())
super(SQuAD, self).__init__(SQuAD._get_records(self._read_data()))

def _get_data(self):
Owner:

Should the data always come from SQuAD._get_records?

Author:

Yep.

The thing is that the official evaluation script needs the original JSON content, and instead of trying to figure out the path on the filesystem and loading it manually, it is easier to have a separate method that reads the JSON from disk and parses it. Then I can send the JSON as-is to the official evaluation script without needing to know where MXNet stores the file.

Owner:

In MXNet in general, we normally download the data into the MXNet root dir; then we check whether the data is already there and load it from there if it exists, otherwise we download the data from S3.
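For illustration, the usual pattern looks roughly like this; the URL, filename, and root below are placeholders, not the real SQuAD locations:

```python
import os
from mxnet import gluon

def get_or_download(root='~/.mxnet/datasets/squad',
                    url='https://example.com/dataset.json',  # placeholder URL
                    filename='dataset.json'):
    """Load the file from the MXNet root dir if present, otherwise download it."""
    root = os.path.expanduser(root)
    path = os.path.join(root, filename)
    if not os.path.exists(path):
        if not os.path.isdir(root):
            os.makedirs(root)
        gluon.utils.download(url, path=path)
    return path
```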



def hybrid_forward(self, F, matrix_1, matrix_2):
# pylint: disable=arguments-differ
tiled_matrix_1 = matrix_1.expand_dims(2).broadcast_to(shape=(self._batch_size,
Owner:

To make this hybridizable, we can use
tiled_matrix_1 = F.broadcast_to(F.expand_dims(matrix_1, 2), shape=(self._batch_size, self._passage_length, self._question_length, self._embedding_size))

Author:

It is hybridizable even if we call expand_dims directly on matrix_1.

Owner:

Based on offline discussion, it sounds like this is a bug in MXNet; we should have automatic registration of the operators in mxnet.sym and mxnet.sym.Symbol.

self._passage_length,
self._question_length,
self._embedding_size))
tiled_matrix_2 = matrix_2.expand_dims(1).broadcast_to(shape=(self._batch_size,
Owner:

Same as above.

Author:

This block is hybridizable even when calling matrix_2.expand_dims. I would leave it like that, because it is more readable.

self._passage_length = passage_length
self._question_length = question_length
self._encoding_dim = encoding_dim
self._precision = precision
Owner:

Since we haven't verified the fp16 setting, I suggest we remove this option for now. Once we have verified it later, we can probably add it back.

return _last_dimension_applicator(F, masked_log_softmax, tensor, mask, tensor_shape, mask_shape)


def weighted_sum(F, matrix, attention, matrix_shape, attention_shape):
Owner:

F seems not needed.

Author:

It is needed, because of batch_dot.
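A small illustration of that dependence; the shapes are placeholders:

```python
from mxnet import nd as F

# batch_dot is called through the F namespace (mx.nd imperatively, mx.sym when
# hybridized), which is why weighted_sum keeps the F argument.
matrix = F.random.uniform(shape=(2, 5, 8))      # (batch, seq_len, dim)
attention = F.random.uniform(shape=(2, 5))      # (batch, seq_len)
weighted = F.batch_dot(attention.expand_dims(1), matrix).reshape((2, 8))  # (batch, dim)
```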

return intermediate.sum(axis=-2)


def replace_masked_values(F, tensor, mask, replace_with):
Owner:

F seems not needed.

Author:

It is needed, because of broadcast_add and broadcast_mul.
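For illustration, a common formulation of replace_masked_values that relies on those broadcast ops; the values here are placeholders, not the PR's exact code:

```python
from mxnet import nd as F

tensor = F.array([[1.0, 2.0, 3.0]])
mask = F.array([[1.0, 1.0, 0.0]])   # 1 = keep, 0 = replace
replace_with = -1e7

# Keep values where mask == 1, write `replace_with` where mask == 0, using
# F.broadcast_mul / F.broadcast_add from the F namespace.
result = F.broadcast_add(F.broadcast_mul(tensor, mask), (1 - mask) * replace_with)
```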

self._scale_output = scale_output

def hybrid_forward(self, F, array_1, array_2):
result = (array_1 * array_2).sum(axis=-1)
Owner:

To make use of hybridization, we can rewrite to result = F.sum(F.dot(array_1, array_2), axis=-1)

Author:

It is not needed, it can be hybridized even like that.

def hybrid_forward(self, F, array_1, array_2):
normalized_array_1 = F.broadcast_div(array_1, F.norm(array_1, axis=-1, keepdims=True))
normalized_array_2 = F.broadcast_div(array_2, F.norm(array_2, axis=-1, keepdims=True))
return (normalized_array_1 * normalized_array_2).sum(axis=-1)
Owner:

Similar to the above comment, we could use F based operators.

Author:

Yes we could, and it seems to hybridize just fine even without it.

avg_loss_scalar = avg_loss.asscalar()
epoch_time = time() - e_start

print("\tEPOCH {:2}: train loss {:6.4f} | batch {:4} | lr {:5.3f} | "
Owner:

We could also print out the throughput (# of samples/sec); it would help us do better speed and profiling analysis later on (see the sketch below).
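A hedged sketch of adding that to the epoch summary. The variable values are placeholders standing in for the surrounding training loop (the ~1650 s epoch time is taken from the review comment above), not real measurements:

```python
# Placeholders standing in for values from the surrounding training loop.
e, avg_loss_scalar, records_per_epoch_count, epoch_time = 1, 2.3456, 87599, 1650.0

throughput = records_per_epoch_count / epoch_time  # samples per second
print("\tEPOCH {:2}: train loss {:6.4f} | throughput {:8.2f} samples/s | time {:6.2f} s"
      .format(e, avg_loss_scalar, throughput, epoch_time))
```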

@cgraywang cgraywang (Owner) left a comment

The code in general is in good shape and needs a few improvements. We can decide what should be done in this iteration and what should be done as next steps. I think you might want to try to improve some of them before the PR to the dmlc feature branch. I have created the feature branch in dmlc: https://github.com/dmlc/gluon-nlp/tree/bidaf


record_index = gluon.utils.split_and_load(record_index, ctx, even_split=False)
q_words = gluon.utils.split_and_load(q_words, ctx, even_split=False)
ctx_words = gluon.utils.split_and_load(ctx_words, ctx, even_split=False)
q_chars = gluon.utils.split_and_load(q_chars, ctx, even_split=False)
ctx_chars = gluon.utils.split_and_load(ctx_chars, ctx, even_split=False)

ctx_embedding_begin_state_list = net.ctx_embedding.begin_state(ctx)
Owner:

We still need to initialize the begin_state, am I right?

Author:

They will be initialized by default. I only had to use this code before to support float16 precision mode. I even created an issue, apache/mxnet#12650, regarding that, but there has been no progress on it so far.

self._encoding_dim = encoding_dim
self._precision = precision

def _get_big_negative_value(self):
Owner:

We can keep this as-is. Another way is to use Python's best_val = float('Inf').

batch_sizes)]
return state_list

def hybrid_forward(self, F, x, state, *args):
Owner:

Why is state ignored here?

Author:

Refactored and removed state from the code. Using the default state is enough, as we don't pass state between batches.

# under the License.

from mxnet import gluon
import numpy as np
Owner:

Please see the comment on _get_big_negative_value.


@@ -0,0 +1,195 @@
# coding: utf-8
Owner:

We can plan, as a next step, to integrate this function into the gluonnlp attention_cell.

Author:

Next step for us

print("Training time {:6.2f} seconds".format(time() - train_start))


def get_learning_rate_per_iteration(iteration, options):
Owner:

Rename this to warm_up_steps so that others can easily understand what this is.

Author:

Done

return True if "predefined_embedding_layer" in name else False


def save_model_parameters(net, epoch, options):
Owner:

Can we merge save_model_parameters and save_ema_parameters? Since whether to use EMA is the user's setting, we should only expose one inference path, i.e. one copy of the model parameters (see the sketch below).
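A hypothetical sketch of the merged entry point. The use_exponential_moving_average flag, the ema.get_params() accessor, and the path layout are all assumptions for illustration, not the PR's actual API:

```python
import os

def save_model_parameters(net, ema, epoch, options):
    """Save one copy of the parameters: EMA if enabled, raw otherwise."""
    save_path = os.path.join(options.save_dir, 'epoch{:d}.params'.format(epoch))  # assumed layout
    if getattr(options, 'use_exponential_moving_average', False) and ema is not None:
        ema.get_params().save(save_path)   # hypothetical EMA accessor
    else:
        net.save_parameters(save_path)
```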

.format(e, avg_loss_scalar, options.batch_size, trainer.learning_rate,
records_per_epoch_count / epoch_time, epoch_time))

save_model_parameters(net, e, options)
Owner:

According to our discussion, we might want to store the parameters of the last several epochs as checkpoints, since we will eventually need to ensemble the results.

Author:

Parameters of all epochs are stored in separate files, so if we want to ensemble them together, we can already reference them now.

net.save_parameters(save_path)


def save_ema_parameters(ema, epoch, options):
Owner:

We don't need to save the EMA parameters, since they are only used in evaluation.

Author:

Yes, I will refactor it to store either the EMA or the raw parameters, depending on the flag.
