GluonNLP 0.9.x RoBERTa model returns inconsistent encoded results compared to Fairseq #1183
Comments
Hi, any updates on this issue? |
@leezu was there any known inconsistency issue with the refactoring PR? |
dependencies:
with commit 4e55539
with commit 184a000
|
We can fix commit 184a000 as follows, but there are more bugs introduced in later commits.
commit 2a964133c43516db5dc97ae320e80c337b0317e8 (HEAD -> fixfairseq, ec2/fixfairseq)
Author: Leonard Lausen <[email protected]>
Date: Thu Apr 30 07:02:55 2020 +0000
Fix layer_norm_eps in BERTEncoder
diff --git a/src/gluonnlp/model/bert.py b/src/gluonnlp/model/bert.py
index 27bea07d0..ec234c5b3 100644
--- a/src/gluonnlp/model/bert.py
+++ b/src/gluonnlp/model/bert.py
@@ -112,11 +112,12 @@ class BERTEncoder(HybridBlock, Seq2SeqEncoder):
self._output_attention = output_attention
self._output_all_encodings = output_all_encodings
self._dropout = dropout
+ self._layer_norm_eps = layer_norm_eps
with self.name_scope():
if dropout:
self.dropout_layer = nn.Dropout(rate=dropout)
- self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=1e-12)
+ self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=self._layer_norm_eps)
self.position_weight = self.params.get('position_weight', shape=(max_length, units),
init=weight_initializer)
self.transformer_cells = nn.HybridSequential()
@@ -344,7 +345,7 @@ class BERTModel(HybridBlock):
decoder = nn.HybridSequential(prefix=prefix)
decoder.add(nn.Dense(units, flatten=False))
decoder.add(GELU())
- decoder.add(nn.LayerNorm(in_channels=units, epsilon=1e-12))
+ decoder.add(nn.LayerNorm(in_channels=units, epsilon=self.encoder._layer_norm_eps))
decoder.add(nn.Dense(vocab_size, flatten=False, params=embed.collect_params()))
assert decoder[3].weight == list(embed.collect_params().values())[0], \
'The weights of word embedding are not tied with those of decoder'
With that fix, we get
|
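For context on the fix above: fairseq's RoBERTa/XLM-R checkpoints use a LayerNorm epsilon of 1e-5 (the PyTorch default), while the BERT code in GluonNLP hard-coded 1e-12. Below is a minimal sketch of the numerical effect of the epsilon alone, assuming mxnet 1.x and an illustrative tensor shape:

```python
# Minimal sketch (assuming mxnet 1.x) of the numerical effect of the
# LayerNorm epsilon alone: the per-call difference is small but nonzero,
# and an epsilon mismatch adds a systematic deviation in every layer.
import mxnet as mx
from mxnet.gluon import nn

x = mx.nd.random.normal(shape=(2, 8, 16))   # illustrative shape

ln_bert = nn.LayerNorm(epsilon=1e-12)     # value hard-coded in the BERT code above
ln_roberta = nn.LayerNorm(epsilon=1e-5)   # value used by fairseq RoBERTa/XLM-R
ln_bert.initialize()
ln_roberta.initialize()

diff = (ln_bert(x) - ln_roberta(x)).abs().max()
print('max abs difference from epsilon alone:', diff.asscalar())
```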
@eric-haibin-lin @kaonashi-tyc: @acphile helped investigate which commit introduced the regression (after the layer_norm eps fix is applied). He found that 75c29a3 is the first bad commit |
Maybe because we switched to softmax with length, which assigns exactly 0 to the masked positions, whereas the older version assigns a number very close to 0... |
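A small sketch of the two masking strategies mentioned here, assuming mxnet 1.x and an illustrative score matrix (not taken from the actual model):

```python
# Minimal sketch (assuming mxnet 1.x) of the two masking strategies.
# softmax with use_length=True gives the padded positions weight exactly 0
# by construction; filling the padded scores with a very negative value
# before softmax only drives their weight toward 0.
import mxnet as mx

scores = mx.nd.array([[2.0, 1.0, 0.5, -0.3]])   # one query over 4 keys (illustrative)
length = mx.nd.array([2], dtype='int32')         # only the first 2 keys are valid
mask = mx.nd.array([[1.0, 1.0, 0.0, 0.0]])

# Newer path: softmax with length.
w_length = mx.nd.softmax(scores, axis=-1, length=length, use_length=True)

# Older-style path: replace padded scores with a large negative value, then softmax.
filled = mx.nd.where(mask, scores, mx.nd.full(scores.shape, -1e18))
w_filled = mx.nd.softmax(filled, axis=-1)

print(w_length)   # padded entries are exactly 0
print(w_filled)   # padded entries are pushed to ~0 (and may underflow to 0 in float32)
```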
Just checking back: is this issue resolved now? |
We found a similar issue in the GluonNLP numpy version and came to this fix: apache/mxnet#18827. Let's verify whether this patch fixes this problem too. |
@barry-jin could you help test it with the latest MXNet and see if the issue persists? |
@eric-haibin-lin Sure. |
Dependencies:
Steps:
Output:
|
@barry-jin thanks for sharing. What's the GluonNLP version? |
@szha GluonNLP version is 0.9.1
|
Thanks. Could you try the nightly builds on https://dist.mxnet.io/python ? Let's check whether the master and v1.x branches are fixed. You can install the nightly builds for 2.0 with
and 1.x with
|
I tried 2.0 as follows. Dependency change: mxnet
Stack trace for error message:
|
This looks very weird. I can confirm that the loss module is there... |
Output for v1.x:
Output:
No error message, but stdev is still around 0.2 |
@kaonashi-tyc could you take a look at the result above on 1.7? |
The stdev is still too large |
Yeah, the stdev is still too large IMO |
In my opinion, there may be something wrong with the versions prior to 4e55539. In those versions of bert.py, DotProductAttentionCell is used, and in its attention-weight computation softmax is applied first and then the mask. This differs from the attention mechanism described in d2l, where masked_softmax applies the mask first and then softmax. The current version of bert.py uses softmax with length; in its definition the mask is applied first and then softmax, which is consistent with d2l. So the way the current bert.py computes masked softmax is correct. I'm not sure whether my opinion is right, but during debugging I found this affected the attention weights a lot, and it may have some impact on the final stdev. |
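To make the ordering point concrete, here is a minimal sketch (assuming mxnet 1.x, with hypothetical scores and mask) of "softmax first, mask after" versus "mask first, softmax after":

```python
# Minimal sketch (assuming mxnet 1.x): masking after softmax versus masking
# the scores before softmax. The attention weights over the valid positions
# differ, and the difference propagates through every attention layer.
import mxnet as mx

scores = mx.nd.array([[1.5, 0.2, -0.7, 0.9]])   # hypothetical attention scores
mask = mx.nd.array([[1.0, 1.0, 1.0, 0.0]])       # last position is padding

# Softmax first, mask after: the padded weight is zeroed out, but the
# remaining weights were normalized over all 4 positions, so they sum to < 1.
w_softmax_then_mask = mx.nd.softmax(scores, axis=-1) * mask

# Mask first (send padded scores toward -inf), then softmax: the weights are
# normalized over the valid positions only and sum to 1.
masked_scores = mx.nd.where(mask, scores, mx.nd.full(scores.shape, -1e18))
w_mask_then_softmax = mx.nd.softmax(masked_scores, axis=-1)

print(w_softmax_then_mask.sum(axis=-1))   # < 1
print(w_mask_then_softmax.sum(axis=-1))   # ~1
```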
I have a patch that fixes the incorrect bias term: master...eric-haibin-lin:fix-attn |
Description
I am working on porting the XLM-R model to GluonNLP (see the conversion script convert_xlmr_model_to_gluon.py).
The XLM-R model reuses the same RoBERTa model architecture from fairseq.
When using gluonnlp 0.8.x, the converted model matches the fairseq reference implementation within a small error margin (within 5e-6), and the std-dev of the difference is small (< 8e-7).
After updating gluonnlp to 0.9.x, the results no longer match, and the std-dev increases to the 0.2+ range, which indicates drastically different output.
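For reference, the consistency check behind these numbers has roughly the following shape (a hedged sketch with illustrative names, not the actual conversion script):

```python
# Hedged sketch (not the actual script): given final-layer encodings from
# fairseq and from the converted GluonNLP model for the same input, report
# the max absolute difference and the std-dev of the element-wise
# difference, which are the numbers quoted above.
import numpy as np

def report_mismatch(fairseq_out: np.ndarray, gluon_out: np.ndarray) -> None:
    """Both arrays are assumed to have shape (batch, seq_len, hidden)."""
    diff = fairseq_out - gluon_out
    print('max abs diff   :', np.max(np.abs(diff)))
    print('std-dev of diff:', np.std(diff))
```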
Error Message
To Reproduce
Steps to reproduce
Download the XLM-R model artifacts from the fairseq repo:
https://github.com/pytorch/fairseq/tree/master/examples/xlmr
Install fairseq, gluonnlp, and sentencepiece via pip.
Run the following command:
python convert_xlmr_model_to_gluon.py --ckpt_dir xlmr_ckpt/xlmr.base/ --verbose
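For context, the fairseq reference encodings for XLM-R are typically produced along these lines (a sketch following the fairseq XLM-R README; the sample sentence is illustrative):

```python
# Sketch of producing the fairseq reference encodings for XLM-R, following
# the fairseq XLM-R README (the sample sentence is illustrative).
import torch
from fairseq.models.roberta import XLMRModel

xlmr = XLMRModel.from_pretrained('xlmr_ckpt/xlmr.base', checkpoint_file='model.pt')
xlmr.eval()  # disable dropout so the outputs are deterministic

tokens = xlmr.encode('Hello world!')
with torch.no_grad():
    reference = xlmr.extract_features(tokens)  # shape (1, seq_len, 768) for xlmr.base
print(reference.shape)
```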
What have you tried to solve it?
I have notified @eric-haibin-lin and identified that the regression happens after this commit:
184a000
Further investigation will be needed to pinpoint the exact change causing this bug.
Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below: