Assemble BiDAF model and add training script #15

Status: Open. This pull request wants to merge 46 commits into base branch bidaf.

Commits
a751587 - Model is assembled, but TODOs need to be addressed (Aug 2, 2018)
32787aa - Commenting useless code (Aug 2, 2018)
23f1e9e - Fix epoch time display (Aug 2, 2018)
9c6c98e - Multigpu support + official evaluation (Aug 31, 2018)
00b051d - Add save params + support for uneven data splits (Sep 1, 2018)
29d8b1e - Showcase last batch fails to be processed (Sep 4, 2018)
51f7d2a - Use correct attention layer (Sep 4, 2018)
7cbd563 - Multiple bug fixes in Attention flow layer (Sep 6, 2018)
ca9c948 - kvstore set to local to prevent malloc exception (Sep 6, 2018)
1f39fb1 - Fix to get 1 epoch on 4 gpu ~1.8 hours (Sep 11, 2018)
fa24368 - Merge branch 'master' into bidaf_assembled (Sep 11, 2018)
8b14acd - Make evaluation faster (Sep 18, 2018)
f8368f6 - Update hyperparameters (Sep 20, 2018)
152bb4c - Float16 works, hybridization not (Sep 24, 2018)
5e4975b - Float16 + hybridize works. TODO: replace hard codes (Sep 28, 2018)
69ea71c - Hard code removed (Sep 28, 2018)
74707fc - Some useless code removed (Sep 28, 2018)
20ffcd9 - Merge https://github.com/dmlc/gluon-nlp into bidaf_assembled (Sep 28, 2018)
d35c802 - Bug fix in data preprocessing (Oct 6, 2018)
19c37d2 - EMA is added to code + loss function change (Oct 10, 2018)
60a4374 - EMA can be used for prediction (Oct 10, 2018)
ddaac06 - Caching of vocabs is added (Oct 10, 2018)
8379a2c - Making utils function support FP16 (Oct 10, 2018)
e163925 - Dev set also present in vocab (Oct 12, 2018)
e72eb64 - GlobalGradClip seems to work on 4 gpu, 15 items (Oct 12, 2018)
5742bea - Bug fixes. EM=39.8, F1=51.965 after 23 epochs (Oct 16, 2018)
bc6b8a7 - Evaluation changed and resume training added (Oct 16, 2018)
bf70d66 - Training resuming added (Oct 17, 2018)
dde6539 - Clean up code (Oct 19, 2018)
a1fcdcc - Parameters parsing fixes (Oct 20, 2018)
b3d95e6 - Early stop is added (Oct 22, 2018)
8998836 - Can log results without early stopping (Oct 22, 2018)
2528b8e - NLTK tokenizer is used to fix [citation needed] (Oct 24, 2018)
2a28253 - Bidaf similarity is used (Oct 25, 2018)
f62e1c5 - Add comments to code (Oct 25, 2018)
634b069 - Return static_alloc (Oct 25, 2018)
c4f4a3c - Multigpu and arbitrary batch support for evals (Oct 30, 2018)
6ee7732 - Add terminate training if need to reach F1 only (Nov 1, 2018)
11e66c8 - Code review comments addressed (Nov 7, 2018)
5bddd48 - FP16 removed and tests are fixed (Nov 7, 2018)
b0dbe34 - Merge branch 'bidaf' into bidaf_assembled (Ishitori, Nov 15, 2018)
1c15cca - Code review changes (Nov 15, 2018)
7303444 - Optimize imports (Nov 15, 2018)
1571548 - Only EMA or original params get saved (Nov 16, 2018)
cbbfcb3 - Get hybridization back (Nov 16, 2018)
355116c - Make pylint happy (Nov 16, 2018)
6 changes: 3 additions & 3 deletions gluonnlp/data/question_answering.py
@@ -80,7 +80,7 @@ def __init__(self, segment='train', root=os.path.join('~', '.mxnet', 'datasets',
self._segment = segment
self._get_data()

-        super(SQuAD, self).__init__(self._read_data())
+        super(SQuAD, self).__init__(SQuAD._get_records(self._read_data()))

def _get_data(self):
Owner: Should data always come from SQuAD._get_records?

Author: Yep. The official evaluation script needs the original JSON content. Instead of trying to figure out the path on the filesystem and loading the file manually, it is easier to have a separate method that reads the JSON from disk and parses it. Then I can pass the JSON as-is to the official evaluation script without needing to know where MXNet stores the file.
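A minimal sketch of the resulting split (method names follow the diff above; the evaluation call is illustrative, not code from this PR):

from gluonnlp.data import SQuAD

dataset = SQuAD(segment='dev')

# _read_data() now returns the raw parsed JSON, which is exactly the form
# the official SQuAD evaluation script expects ...
raw_json = dataset._read_data()

# ... while the static _get_records() flattens that JSON into per-question
# records, which is what the Dataset itself is built from in __init__.
records = SQuAD._get_records(raw_json)

# Illustrative usage: the official evaluate-v1.1.py defines
# evaluate(dataset, predictions), where `dataset` is raw_json['data'].
# score = evaluate(raw_json['data'], predictions)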

Owner: In MXNet, in general, we normally download the data into the MXNet root dir, then check whether the data is already there and load it from there if it exists; otherwise we download the data from S3.
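A generic sketch of that convention (the helper name, root path, and URL are placeholders, not this PR's code):

import os
from mxnet.gluon.utils import download

def fetch_dataset_file(root, url, file_name):
    # Load from the local MXNet data root if the file is already there;
    # otherwise download it (e.g. from S3) first.
    root = os.path.expanduser(root)
    path = os.path.join(root, file_name)
    if not os.path.exists(path):
        download(url, path=path)
    return path

# e.g. fetch_dataset_file('~/.mxnet/datasets/squad', url, 'dev-v1.1.json')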

"""Load data from the file. Does nothing if data was loaded before
@@ -116,9 +116,9 @@ def _read_data(self):
_, data_file_name, _ = self._data_file[self._segment]

with open(os.path.join(self._root, data_file_name)) as f:
-            samples = json.load(f)
+            json_data = json.load(f)

-        return SQuAD._get_records(samples)
+        return json_data

@staticmethod
def _get_records(json_dict):
70 changes: 70 additions & 0 deletions scripts/question_answering/attention_flow.py
@@ -0,0 +1,70 @@
# coding: utf-8

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Attention Flow Layer"""
from mxnet import gluon

from .similarity_function import DotProductSimilarity


class AttentionFlow(gluon.HybridBlock):
"""
This ``block`` takes two ndarrays as input and returns an ndarray of attentions.

We compute the similarity between each row in each matrix and return unnormalized similarity
scores. Because these scores are unnormalized, we don't take a mask as input; it's up to the
caller to deal with masking properly when this output is used.

By default similarity is computed with a dot product, but you can alternatively use a
parameterized similarity function if you wish.


Input:
- ndarray_1: ``(batch_size, num_rows_1, embedding_dim)``
- ndarray_2: ``(batch_size, num_rows_2, embedding_dim)``

Output:
- ``(batch_size, num_rows_1, num_rows_2)``

Parameters
----------
similarity_function: ``SimilarityFunction``, optional (default=``DotProductSimilarity``)
The similarity function to use when computing the attention.
"""
def __init__(self, similarity_function, batch_size, passage_length,
question_length, embedding_size, **kwargs):
super(AttentionFlow, self).__init__(**kwargs)

self._similarity_function = similarity_function or DotProductSimilarity()
self._batch_size = batch_size
self._passage_length = passage_length
self._question_length = question_length
self._embedding_size = embedding_size

def hybrid_forward(self, F, matrix_1, matrix_2):
# pylint: disable=arguments-differ
tiled_matrix_1 = matrix_1.expand_dims(2).broadcast_to(shape=(self._batch_size,
Owner: To make this hybridizable, we can use

tiled_matrix_1 = F.broadcast_to(F.expand_dims(matrix_1, 2), shape=(self._batch_size, self._passage_length, self._question_length, self._embedding_size))

Author: It is hybridizable even if we call expand_dims directly on matrix_1 (see the sketch after this file's diff).

Owner: Based on offline discussion, this sounds like a bug in MXNet; the operators should be automatically registered on both mxnet.sym and mxnet.sym.Symbol.

self._passage_length,
self._question_length,
self._embedding_size))
tiled_matrix_2 = matrix_2.expand_dims(1).broadcast_to(shape=(self._batch_size,
Owner: Same as above.

Author: This block is hybridizable even when calling matrix_2.expand_dims directly. I would leave it like that because it is more readable.

self._passage_length,
self._question_length,
self._embedding_size))
return self._similarity_function(tiled_matrix_1, tiled_matrix_2)
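A minimal standalone check of the point debated above, assuming an MXNet version where these operator methods are registered on both NDArray and Symbol (shapes are made up for illustration):

from mxnet import gluon, nd

class Tile(gluon.HybridBlock):
    def hybrid_forward(self, F, x):
        # Method-call style, as used in attention_flow.py above ...
        a = x.expand_dims(2).broadcast_to(shape=(2, 3, 4, 5))
        # ... and the equivalent F-style suggested in the review.
        b = F.broadcast_to(F.expand_dims(x, 2), shape=(2, 3, 4, 5))
        return a + b

block = Tile()
block.hybridize()
print(block(nd.ones((2, 3, 5))).shape)  # (2, 3, 4, 5) under both styles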
121 changes: 121 additions & 0 deletions scripts/question_answering/bidaf.py
@@ -0,0 +1,121 @@
# coding: utf-8

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Bidirectional attention flow layer"""
from mxnet import gluon
import numpy as np
Owner: One question: why are we using numpy instead of ndarray here?

Author: It is only used to get the epsilon and the minimum value for different precisions, instead of hardcoding the numbers. Look at how _get_big_negative_value() and _get_small_positive_value() are implemented below. The rest of the code uses NDArray and is hybridizable.

Do we have this info available via the nd package?

Owner: Please see the comment on _get_big_negative_value.
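For reference, a short sketch of what numpy provides here: dtype limits queried instead of hardcoded magic constants. These are the values returned by the two helpers below:

import numpy as np

# Most negative representable float32, used to mask positions out before max():
big_negative = np.finfo(np.float32).min    # -3.4028235e+38

# float32 machine epsilon, used to keep softmax denominators away from zero:
small_positive = np.finfo(np.float32).eps  # 1.1920929e-07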


from .utils import last_dim_softmax, weighted_sum, replace_masked_values, masked_softmax


class BidirectionalAttentionFlow(gluon.HybridBlock):
"""
This class implements Minjoon Seo's `Bidirectional Attention Flow model
<https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/7586b7cca1deba124af80609327395e613a20e9d>`_
for answering reading comprehension questions (ICLR 2017).
"""

def __init__(self,
batch_size,
passage_length,
question_length,
encoding_dim,
**kwargs):
super(BidirectionalAttentionFlow, self).__init__(**kwargs)

self._batch_size = batch_size
self._passage_length = passage_length
self._question_length = question_length
self._encoding_dim = encoding_dim

def _get_big_negative_value(self):
Owner: We might want to change this based on ndarray.

Author: Sorry, I didn't get it. Do you want to do this based on the dtype of the NDArray instead of using an __init__() parameter?

Owner: We can keep this as-is. Another way is to use plain Python: best_val = float('Inf').

"""Provides maximum negative Float32 value
Returns
-------
value : float32
Maximum negative float32 value
"""
return np.finfo(np.float32).min

def _get_small_positive_value(self):
Owner: We might want to change this based on ndarray.

"""Provides minimal possible Float32 value
Returns
-------
value : float32
Minimal float32 value
"""
return np.finfo(np.float32).eps

def hybrid_forward(self, F, passage_question_similarity,
encoded_passage, encoded_question, question_mask, passage_mask):
# pylint: disable=arguments-differ
# Shape: (batch_size, passage_length, question_length)
passage_question_similarity_shape = (self._batch_size, self._passage_length,
self._question_length)

question_mask_shape = (self._batch_size, self._question_length)
# Shape: (batch_size, passage_length, question_length)
passage_question_attention = last_dim_softmax(F,
Owner: There seems to be no need to pass F into last_dim_softmax; what is the reason for doing so?

Author: The problem is that the 'Symbol' object has no attribute 'broadcast_div'. You get this exception because the last step of masked_softmax() applies broadcast_div (line 286 of utils.py):

result = F.broadcast_div(result, (result.sum(axis=1, keepdims=True) + epsilon))

Since the Symbol object doesn't have it (while the mx.sym package does), I have to pass the current backend in. If we could figure out another way to do it, passing the backend would not be needed; the rest of the functions seem to be defined on the Symbol object. (See the sketch after this diff for a standalone illustration.)

passage_question_similarity,
question_mask,
passage_question_similarity_shape,
question_mask_shape,
epsilon=self._get_small_positive_value())
# Shape: (batch_size, passage_length, encoding_dim)
encoded_question_shape = (self._batch_size, self._question_length, self._encoding_dim)
passage_question_attention_shape = (self._batch_size, self._passage_length,
self._question_length)
passage_question_vectors = weighted_sum(F, encoded_question, passage_question_attention,
Owner: Same as the comment above: what is the reason for passing F into weighted_sum?

Author: Same reason as above, but in this case the problem is the batch_dot() method, which exists in mx.sym but not on the mx.sym.Symbol object.

encoded_question_shape,
passage_question_attention_shape)

# We replace masked values with something really negative here, so they don't affect the
# max below.
masked_similarity = passage_question_similarity if question_mask is None else \
replace_masked_values(F,
Owner: F?

Author: Same issue; here it is both broadcast_add and broadcast_mul.

passage_question_similarity,
question_mask.expand_dims(1),
replace_with=self._get_big_negative_value())

# Shape: (batch_size, passage_length)
question_passage_similarity = masked_similarity.max(axis=-1)

# Shape: (batch_size, passage_length)
question_passage_attention = masked_softmax(F, question_passage_similarity, passage_mask,
Owner: F?

Author: Same: broadcast_div.

epsilon=self._get_small_positive_value())

# Shape: (batch_size, encoding_dim)
encoded_passage_shape = (self._batch_size, self._passage_length, self._encoding_dim)
question_passage_attention_shape = (self._batch_size, self._passage_length)
question_passage_vector = weighted_sum(F, encoded_passage, question_passage_attention,
Owner: F?

Author: batch_dot.

encoded_passage_shape,
question_passage_attention_shape)

# Shape: (batch_size, 1, encoding_dim), broadcast over passage_length below
tiled_question_passage_vector = question_passage_vector.expand_dims(1)

# Shape: (batch_size, passage_length, encoding_dim * 4)
final_merged_passage = F.concat(encoded_passage,
passage_question_vectors,
encoded_passage * passage_question_vectors,
F.broadcast_mul(encoded_passage,
tiled_question_passage_vector),
dim=-1)

return final_merged_passage
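To make the recurring "F" discussion above concrete, here is a minimal standalone sketch, simplified relative to this PR's utils.py and assuming an MXNet 1.x-era API. Module-level operators such as broadcast_div exist on both mxnet.nd and mxnet.sym, while in the version discussed they were not all registered as methods on Symbol, so the helpers receive the backend explicitly:

import mxnet as mx

def masked_softmax(F, scores, mask, epsilon):
    # Softmax over the last axis that ignores masked-out positions.
    exp = F.exp(scores) * mask                        # zero out masked entries
    norm = exp.sum(axis=-1, keepdims=True) + epsilon  # keep denominator non-zero
    return F.broadcast_div(exp, norm)                 # defined on both backends

# Imperative backend: F = mx.nd
scores = mx.nd.array([[1.0, 2.0, 3.0]])
mask = mx.nd.array([[1.0, 1.0, 0.0]])
print(masked_softmax(mx.nd, scores, mask, 1e-7))

# Symbolic backend: the very same helper builds a graph with F = mx.sym
out = masked_softmax(mx.sym, mx.sym.Variable('scores'),
                     mx.sym.Variable('mask'), 1e-7)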