Assemble BiDAF model and add training script #15
base: bidaf
Conversation
I mainly commented on the model part this time and will continue reviewing the script. Thanks, Sergey!
dropout=0.2, prefix=None, params=None):
super(BiDAFModelingLayer, self).__init__(prefix=prefix, params=params)

self._modeling_layer = LSTM(hidden_size=input_dim, num_layers=nlayers, dropout=dropout,
Adding a dropout layer in the modeling layer would be more flexible and could later help improve performance. I suggest we use mxnet.gluon.nn.HybridSequential to stack the LSTM and Dropout layers.
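A minimal sketch of that suggestion (layer sizes here are illustrative, and whether the LSTM itself hybridizes depends on the MXNet version):

```python
from mxnet.gluon import nn, rnn

# Stack the LSTM and an explicit Dropout layer in one container, so dropout
# is also applied to the final LSTM layer's outputs.
modeling_layer = nn.HybridSequential()
with modeling_layer.name_scope():
    modeling_layer.add(rnn.LSTM(hidden_size=100, num_layers=2,
                                dropout=0.2, bidirectional=True))
    modeling_layer.add(nn.Dropout(0.2))
```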
# TODO: Loss function applies softmax by default, so this code is commented here
# Will need to reuse it to actually make predictions
# start_index_softmax_output = start_index_dense_output.softmax(axis=1)
Actually we do need this: the evaluation has to produce probabilities to connect with the official eval script. A suggestion: we can split out the computation of the logits here and use utils.masked_softmax to generate the probabilities.
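A hedged sketch of that split; masked_softmax below is a stand-in for the utils.masked_softmax mentioned above, not its actual implementation:

```python
from mxnet import nd

def masked_softmax(logits, mask):
    # Push masked (padding) positions to a very negative logit so they end up
    # with ~zero probability, then normalize over the last axis.
    masked_logits = nd.where(mask, logits, nd.ones_like(logits) * -1e30)
    return nd.softmax(masked_logits, axis=-1)

logits = nd.array([[2.0, 1.0, 0.5]])
mask = nd.array([[1, 1, 0]])         # last position is padding
print(masked_softmax(logits, mask))  # third probability is ~0
```

Training would then consume the raw logits through the softmax-applying loss, while evaluation feeds the masked probabilities to the official script.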
options.highway_num_layers,
options.embedding_size,
prefix="question_embedding")
self._attention_layer = AttentionFlow(DotProductSimilarity())
We could provide a similarity option here, with DotProductSimilarity() as the default. Later on we can experiment with more similarity options in a more flexible way.
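A small sketch of that parameterization, assuming the PR's AttentionFlow and DotProductSimilarity classes are importable (the keyword name similarity_function and the CosineSimilarity alternative are illustrative):

```python
def make_attention_layer(similarity_function=None):
    # Default to dot-product similarity, but let callers swap in another
    # similarity class without touching the model assembly code.
    return AttentionFlow(similarity_function or DotProductSimilarity())

attention_layer = make_attention_layer()                     # current default
cosine_attention = make_attention_layer(CosineSimilarity())  # an experiment
```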
q_embedding_states)

attention_layer_output = self._attention_layer(ctx_embedding_output, q_embedding_output)
modeling_layer_output = self._modeling_layer(attention_layer_output)
Update this accordingly with the dropout change in the modeling layer?
return nd.transpose(output, axes=(1, 0, 2))


class BiDAFModel(Block):
Please feel free to add some docs that at least explain the inputs and outputs and their corresponding dimensions.
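For example, something along these lines (the shapes are illustrative, not taken from the PR):

```python
from mxnet.gluon import Block

class BiDAFModel(Block):
    """Bidirectional Attention Flow model for SQuAD.

    Inputs:
        - ctx_words: NDArray, shape (batch_size, ctx_max_length)
        - q_words: NDArray, shape (batch_size, q_max_length)

    Outputs:
        - start and end logits: NDArrays, each of shape
          (batch_size, ctx_max_length), scoring the answer span boundaries.
    """
```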
Added multi-GPU support and official evaluation script support. It works only with a small change in the SQuAD dataset (#16), because I need to provide JSON to the evaluation script. I haven't applied fixes to your comments yet.
super(BiDAFOutputLayer, self).__init__(prefix=prefix, params=params)

-units = 10 * span_start_input_dim if units is None else units
+units = 4 * span_start_input_dim if units is None else units
Why is this four times?
scripts/question_answering/bidaf.py
Outdated
@@ -10,7 +10,7 @@
 #
 # http://www.apache.org/licenses/LICENSE-2.0
 #
-# Unless required by applicable law or agreed to in writing,
+# Unless required by applicable law or agreed to in writinConvolutionalEncoderg,
mistake?
typo: writinConvolutionalEncoderg
I think the code in general is in good shape now. We might want to improve the code quality further before moving to dmlc. For now it needs around ~1650 seconds per epoch, so we might improve the performance as well. @Ishitori Thanks for the good work!
@@ -80,7 +80,7 @@ def __init__(self, segment='train', root=os.path.join('~', '.mxnet', 'datasets',
 self._segment = segment
 self._get_data()

-super(SQuAD, self).__init__(self._read_data())
+super(SQuAD, self).__init__(SQuAD._get_records(self._read_data()))

 def _get_data(self):
Should data always come from SQuAD._get_records?
Yep.
The thing is that the official evaluation script needs the original JSON content, and instead of trying to figure out the path in the file system and loading it manually, it is easier to have a separate method that reads the JSON from disk and parses it. Then I can send the JSON as-is to the official evaluation script without needing to know where MXNet stores the file.
In MXNet in general, we normally download the data into the MXNet root dir; we then check whether the data is already there and load it from there if it exists, otherwise we download it from S3.
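A sketch of that usual pattern (the URL, filename, and hash here are placeholders, not the real SQuAD entries):

```python
import os
from mxnet.gluon.utils import download, check_sha1

def _fetch(root, url, filename, sha1_hash):
    # Load from the MXNet root dir if the file is already there and intact;
    # otherwise download it (e.g. from S3) into that dir first.
    path = os.path.join(os.path.expanduser(root), filename)
    if not os.path.exists(path) or not check_sha1(path, sha1_hash):
        download(url, path=path, sha1_hash=sha1_hash)
    return path
```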
def hybrid_forward(self, F, matrix_1, matrix_2):
# pylint: disable=arguments-differ
tiled_matrix_1 = matrix_1.expand_dims(2).broadcast_to(shape=(self._batch_size,
To make this hybridizable, we can use
tiled_matrix_1 = F.broadcast_to(F.expand_dims(matrix_1, 2), shape=(self._batch_size, self._passage_length, self._question_length, self._embedding_size))
It is hybridizable even if we call expand_dims directly on matrix_1.
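A quick illustrative check of that claim: both mx.nd.NDArray and mx.sym.Symbol expose these operators as methods, which is why the fluent style still hybridizes:

```python
import mxnet as mx

x = mx.sym.Variable('x')
# Method-style calls are registered on Symbol too, so a HybridBlock using
# matrix_1.expand_dims(...).broadcast_to(...) traces fine under hybridize().
y = x.expand_dims(2).broadcast_to(shape=(2, 4, 5, 8))
print(type(y))  # <class 'mxnet.symbol.symbol.Symbol'>
```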
Sounds like this is a bug in MXNet based on our offline discussion; we should have automatic registration of the operators in mxnet.sym and mxnet.sym.Symbol.
self._passage_length,
self._question_length,
self._embedding_size))
tiled_matrix_2 = matrix_2.expand_dims(1).broadcast_to(shape=(self._batch_size,
Same as above.
This block is hybridizable even when calling matrix_2.expand_dims directly. I would leave it like that, because it is more readable.
scripts/question_answering/bidaf.py
Outdated
self._passage_length = passage_length
self._question_length = question_length
self._encoding_dim = encoding_dim
self._precision = precision
Since we haven't verified the fp16 setting, I suggest we remove this option for now. Once we have verified it later, we can add it back.
return _last_dimension_applicator(F, masked_log_softmax, tensor, mask, tensor_shape, mask_shape)


def weighted_sum(F, matrix, attention, matrix_shape, attention_shape):
F seems not needed.
It is needed because of batch_dot.
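For reference, a trimmed sketch of why the F argument matters here (shapes in comments; this mirrors, rather than reproduces, the PR's weighted_sum):

```python
def weighted_sum(F, matrix, attention):
    # batch_dot is only available as a module-level operator, so it has to be
    # reached through F (mx.nd when imperative, mx.sym when hybridized).
    # attention: (batch, seq_len), matrix: (batch, seq_len, dim)
    return F.batch_dot(attention.expand_dims(1), matrix).reshape((0, -1))
```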
return intermediate.sum(axis=-2)


def replace_masked_values(F, tensor, mask, replace_with):
F seems not needed.
It is needed because of broadcast_add and broadcast_mul.
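Likewise, a hedged sketch of the broadcast-based form (mirroring AllenNLP's helper of the same name, not necessarily the exact code in this PR):

```python
def replace_masked_values(F, tensor, mask, replace_with):
    # mask is 1 where values are kept and 0 where they are replaced; the
    # broadcast_* operators again only exist at module level, hence F.
    one_minus_mask = 1.0 - mask
    values_to_add = replace_with * one_minus_mask
    return F.broadcast_add(F.broadcast_mul(tensor, mask), values_to_add)
```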
self._scale_output = scale_output

def hybrid_forward(self, F, array_1, array_2):
result = (array_1 * array_2).sum(axis=-1)
To make use of hybridization, we can rewrite this to result = F.sum(array_1 * array_2, axis=-1)
It is not needed; it can be hybridized even like that.
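A minimal check of that claim, using a trimmed stand-in for the block (scale_output omitted):

```python
import mxnet as mx
from mxnet.gluon import HybridBlock

class DotProduct(HybridBlock):
    def hybrid_forward(self, F, array_1, array_2):
        # NDArray and Symbol both expose .sum as a method, so no F call needed.
        return (array_1 * array_2).sum(axis=-1)

block = DotProduct()
block.hybridize()
out = block(mx.nd.ones((2, 3, 4)), mx.nd.ones((2, 3, 4)))
print(out.shape)  # (2, 3)
```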
def hybrid_forward(self, F, array_1, array_2):
normalized_array_1 = F.broadcast_div(array_1, F.norm(array_1, axis=-1, keepdims=True))
normalized_array_2 = F.broadcast_div(array_2, F.norm(array_2, axis=-1, keepdims=True))
return (normalized_array_1 * normalized_array_2).sum(axis=-1)
Similar to the above comment, we could use F-based operators.
Yes, we could, and it seems to hybridize just fine even without it.
avg_loss_scalar = avg_loss.asscalar()
epoch_time = time() - e_start

print("\tEPOCH {:2}: train loss {:6.4f} | batch {:4} | lr {:5.3f} | "
We could also print out the throughput (# of samples/sec); it would help us do better speed and profiling analysis later on.
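Something as simple as the following would do (it reuses the variables already present in this script):

```python
# Throughput is just samples processed per wall-clock second of the epoch.
throughput = records_per_epoch_count / epoch_time
print("\tEPOCH {:2}: train loss {:6.4f} | throughput {:6.1f} samples/sec"
      .format(e, avg_loss_scalar, throughput))
```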
The code is in good shape in general and needs a few improvements. We can decide what should be done in this iteration and what should be done as next steps. I think you might want to improve some of them before the PR to the dmlc feature branch. I have created the feature branch in dmlc: https://github.com/dmlc/gluon-nlp/tree/bidaf
record_index = gluon.utils.split_and_load(record_index, ctx, even_split=False)
q_words = gluon.utils.split_and_load(q_words, ctx, even_split=False)
ctx_words = gluon.utils.split_and_load(ctx_words, ctx, even_split=False)
q_chars = gluon.utils.split_and_load(q_chars, ctx, even_split=False)
ctx_chars = gluon.utils.split_and_load(ctx_chars, ctx, even_split=False)

ctx_embedding_begin_state_list = net.ctx_embedding.begin_state(ctx)
We still need to initialize the begin_state, am I right?
They will be initialized by default. I had to use this code before only to support float16 precision mode; I even created an issue (apache/mxnet#12650) about that, but no progress so far.
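An illustrative check of the default behavior (standalone, not the PR's code):

```python
import mxnet as mx
from mxnet.gluon import rnn

lstm = rnn.LSTM(hidden_size=8, num_layers=1)
lstm.initialize()
x = mx.nd.ones((5, 2, 4))  # TNC layout: (seq_len, batch_size, input_size)
out = lstm(x)              # no states passed: a zero begin state is created
out2, states = lstm(x, lstm.begin_state(batch_size=2))  # explicit variant
print(out.shape)           # (5, 2, 8)
```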
self._encoding_dim = encoding_dim
self._precision = precision

def _get_big_negative_value(self):
We can keep this as-is. Another way is to use plain Python: best_val = float('Inf')
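Both options side by side (illustrative; the suggested float('Inf') is negated here because this helper returns a negative sentinel):

```python
import numpy as np

big_negative = np.finfo(np.float32).min  # dtype-aware variant
big_negative_inf = -float('Inf')         # plain-Python variant
```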
batch_sizes)]
return state_list

def hybrid_forward(self, F, x, state, *args):
Why is state ignored here?
Refactored and removed state from the code. Using the default state is enough, as we don't pass state between batches.
# under the License.

from mxnet import gluon
import numpy as np
Please see the comment on _get_big_negative_value.
@@ -0,0 +1,195 @@
# coding: utf-8
We can plan next steps to integrate this function into the gluonnlp attention_cell.
Next step for us
print("Training time {:6.2f} seconds".format(time() - train_start)) | ||
|
||
|
||
def get_learning_rate_per_iteration(iteration, options): |
Rename this to warm_up_steps so that others can easily understand what this is.
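A hedged sketch of what such a warm-up could look like (the options fields here are assumptions, not the script's actual flags):

```python
def warm_up_steps(iteration, options):
    # Linearly ramp the learning rate over the first lr_warmup_steps
    # iterations, then keep it at the configured base rate.
    if iteration < options.lr_warmup_steps:
        return options.learning_rate * iteration / options.lr_warmup_steps
    return options.learning_rate
```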
Done
return True if "predefined_embedding_layer" in name else False


def save_model_parameters(net, epoch, options):
Can we merge save_model_parameters and save_ema_parameters? Since whether to use EMA is the user's setting, we should only expose one copy of the model parameters for inference.
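A hypothetical shape for the merge (save_dir, the EMA flag name, and ema.get_params() are assumptions about this script's API):

```python
import os

def save_model_parameters(net, ema, epoch, options):
    # Persist exactly one copy of the parameters per epoch: the EMA shadow
    # weights if the user enabled EMA, otherwise the raw model weights.
    save_path = os.path.join(options.save_dir, 'epoch{:d}.params'.format(epoch))
    if options.use_exponential_moving_average:
        ema.get_params().save(save_path)  # assumed EMA helper API
    else:
        net.save_parameters(save_path)
```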
.format(e, avg_loss_scalar, options.batch_size, trainer.learning_rate,
records_per_epoch_count / epoch_time, epoch_time))

save_model_parameters(net, e, options)
We might want to store the parameters of the last several epochs as checkpoints, according to our discussion, since we will eventually need to ensemble the results.
Parameters of all epochs are stored in separate files, so if we want to ensemble them, we can already reference them now.
net.save_parameters(save_path)


def save_ema_parameters(ema, epoch, options):
We don't need to save the parameters of the ema, since they are only used in evaluation.
Yes, I will refactor to store either EMA or raw parameters depending on the flag.
Description
The BiDAF model is assembled and the training script can be run for multiple epochs. Everything marked with TODO: in the code still needs to be fixed.
[x] Reduce data processing time (or cache it in the file system between runs)
[x] Update evaluation metrics to properly use the official evaluation script (metric.py)
[ ] Question max length was set to 400 (equal to paragraph max length) to make sure that dot-product similarity doesn't fail. Need to return it to the original 30.
[x] Validation dataset is not used for calculating F1 and Exact Match (EM) scores.