Skip to content
This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

GluonNLP 0.9.x RoBERTa model return inconsistent encoded result comparing to Fairseq #1183

Closed
kaonashi-tyc opened this issue Mar 13, 2020 · 22 comments
Assignees
Labels
bug Something isn't working

Comments

@kaonashi-tyc
Copy link

kaonashi-tyc commented Mar 13, 2020

Description

I am working on porting XLM-R model to gluonnlp (view this script here convert_xlmr_model_to_gluon.py).

The XLM-R model reuses the same RoBERTa model architecture from fairseq.

When using gluonnlp 0.8.x, the converted model can match the fairseq reference implementation with a small error range within 5e-6, and the std-dev is small (< 8e-7)

Once updated gluonnlp to 0.9.x, the result no longer matches, and std-dev increases to 0.2+ ranges, which indicates drastically different output.

Error Message

*** Maximum errors for vector of size 3840:  rtol=0.001, atol=0.001

  1: Error 1004.472290  Location of error: (0, 3, 152), a=1.01276851, b=-1.85498929
  2: Error 704.275513  Location of error: (0, 4, 588), a=1.44983745, b=7.28419113
  3: Error 647.055481  Location of error: (0, 3, 465), a=-0.56997168, b=0.21840204
  4: Error 600.452820  Location of error: (0, 0, 588), a=1.60323822, b=5.51547194
  5: Error 538.981934  Location of error: (0, 2, 749), a=-0.70366722, b=-0.10700922
  6: Error 522.929871  Location of error: (0, 2, 465), a=-0.22267081, b=0.62938148
  7: Error 508.802521  Location of error: (0, 3, 467), a=0.50823522, b=-0.00115502
  8: Error 508.074951  Location of error: (0, 2, 459), a=-0.04906127, b=-1.13256359
  9: Error 489.669586  Location of error: (0, 2, 152), a=0.31114388, b=-0.34982380
 10: Error 481.314819  Location of error: (0, 2, 145), a=1.19826353, b=0.48399478
Traceback (most recent call last):
  File "xlmr_conversion/convert_vocab.py", line 198, in <module>
    check_output('Hello world!')
  File "xlmr_conversion/convert_vocab.py", line 192, in check_output
    mx.test_utils.assert_almost_equal(mx_output.asnumpy(), torch_output, atol=1e-3, rtol=1e-3)
  File "/efs/users/tiayuche/mxnet16/lib/python3.6/site-packages/mxnet/test_utils.py", line 627, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1004.472290 exceeds tolerance rtol=1.000000e-03, atol=1.000000e-03 (mismatch at least 0.286458%).
Location of maximum error: (0, 3, 152), a=1.01276851, b=-1.85498929
 ACTUAL: array([[[ 0.09790944,  0.08255608,  0.01538194, ..., -0.20328012,
          0.02295702,  0.06142968],
        [-0.13235903,  0.02904704, -0.09925546, ..., -0.03289408,...
 DESIRED: array([[[ 0.28058335,  0.18484427,  0.09018373, ..., -0.2651089 ,
          0.10111443, -0.05324648],
        [-0.05585899,  0.07528387, -0.04672416, ...,  0.12345381,...

To Reproduce

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. download the xlm-r model artifacts from fairseq repo:
    https://github.com/pytorch/fairseq/tree/master/examples/xlmr

  2. Install both fairseq and gluonnlp and sentencepieces via pip

  3. run the following command

python convert_xlmr_model_to_gluon.py --ckpt_dir xlmr_ckpt/xlmr.base/ --verbose

What have you tried to solve it?

I have notified @eric-haibin-lin and identified the regression happens after this commit:

184a000

Further investigation will be needed to pinpoint the exact change causing this bug.

Environment

We recommend using our script for collecting the diagnositc information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 20.0.2
Directory    : XXXXX
----------MXNet Info-----------
Version      : 1.6.0
Directory    : XXXX
Num GPUs     : 4
Commit Hash   : 6eec9da55c5096079355d1f1a5fa58dcf35d6752
----------System Info----------
Platform     : Linux-4.4.0-1102-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-21-222
release      : 4.4.0-1102-aws
version      : #113-Ubuntu SMP Wed Jan 29 14:54:54 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0011 sec, LOAD: 0.0248 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0003 sec, LOAD: 0.0223 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0004 sec, LOAD: 0.0200 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0003 sec, LOAD: 0.0035 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0003 sec, LOAD: 0.0081 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0005 sec, LOAD: 0.3333 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0004 sec, LOAD: 0.0586 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0003 sec, LOAD: 0.0333 sec.```
@kaonashi-tyc kaonashi-tyc added the bug Something isn't working label Mar 13, 2020
@kaonashi-tyc
Copy link
Author

Hi, any updates on this issue?

@eric-haibin-lin
Copy link
Member

@leezu was there any known inconsistency issue with the refactoring PR?

@eric-haibin-lin
Copy link
Member

dependencies:

ubuntu@ip-172-31-57-164:~$ pip3 list | grep mxnet
mxnet-cu100         1.6.0
ubuntu@ip-172-31-57-164:~$ pip3 list | grep torch
torch               1.5.0
ubuntu@ip-172-31-57-164:~$ pip3 list | grep fairseq
fairseq             0.9.0
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/src/xlmr.base/ --verbose

with commit 4e55539

torch.Size([514, 768])
stdev =  6.2218635e-07
stdev =  2.409921e-07
stdev =  2.4556107e-07
stdev =  2.0841426e-07
stdev =  2.5782302e-07

with commit 184a000


torch.Size([514, 768])
stdev =  0.00013029973
stdev =  0.00016538663

*** Maximum errors for vector of size 4608:  rtol=0.001, atol=0.001

  1: Error 4.499456  Location of error: (0, 1, 152), a=-0.09637016, b=-0.10132553
  2: Error 1.222937  Location of error: (0, 4, 152), a=0.10587245, b=0.10452169
stdev =  8.325626e-05
stdev =  0.00012943218

*** Maximum errors for vector of size 5376:  rtol=0.001, atol=0.001

  1: Error 1.608742  Location of error: (0, 4, 145), a=-0.08318494, b=-0.08493032
  2: Error 1.293719  Location of error: (0, 5, 152), a=0.03806081, b=0.03671959
  3: Error 1.101096  Location of error: (0, 4, 459), a=-0.71366501, b=-0.71555400
stdev =  8.076113e-05

@leezu
Copy link
Contributor

leezu commented Apr 30, 2020

We can fix commit 184a000 as follows

But there are more bugs introduced later

commit 2a964133c43516db5dc97ae320e80c337b0317e8 (HEAD -> fixfairseq, ec2/fixfairseq)
Author: Leonard Lausen <[email protected]>
Date:   Thu Apr 30 07:02:55 2020 +0000

    Fix layer_norm_eps in BERTEncoder

diff --git a/src/gluonnlp/model/bert.py b/src/gluonnlp/model/bert.py
index 27bea07d0..ec234c5b3 100644
--- a/src/gluonnlp/model/bert.py
+++ b/src/gluonnlp/model/bert.py
@@ -112,11 +112,12 @@ class BERTEncoder(HybridBlock, Seq2SeqEncoder):
         self._output_attention = output_attention
         self._output_all_encodings = output_all_encodings
         self._dropout = dropout
+        self._layer_norm_eps = layer_norm_eps

         with self.name_scope():
             if dropout:
                 self.dropout_layer = nn.Dropout(rate=dropout)
-            self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=1e-12)
+            self.layer_norm = nn.LayerNorm(in_channels=units, epsilon=self._layer_norm_eps)
             self.position_weight = self.params.get('position_weight', shape=(max_length, units),
                                                    init=weight_initializer)
             self.transformer_cells = nn.HybridSequential()
@@ -344,7 +345,7 @@ class BERTModel(HybridBlock):
             decoder = nn.HybridSequential(prefix=prefix)
             decoder.add(nn.Dense(units, flatten=False))
             decoder.add(GELU())
-            decoder.add(nn.LayerNorm(in_channels=units, epsilon=1e-12))
+            decoder.add(nn.LayerNorm(in_channels=units, epsilon=self.encoder._layer_norm_eps))
             decoder.add(nn.Dense(vocab_size, flatten=False, params=embed.collect_params()))
         assert decoder[3].weight == list(embed.collect_params().values())[0], \
             'The weights of word embedding are not tied with those of decoder'

With that fix, we get

stdev =  3.8439632e-07
stdev =  2.3018535e-07
stdev =  2.380526e-07
stdev =  2.1728458e-07
stdev =  5.5794425e-07

@leezu
Copy link
Contributor

leezu commented Apr 30, 2020

@eric-haibin-lin @kaonashi-tyc: @acphile helped to investigate which commit introduces the regression (after the layer_norm eps fix is applied). He finds that 75c29a3 is the first bad commit
where regression still occurs after modifying src/gluonnlp/model/bert.py

@eric-haibin-lin
Copy link
Member

Maybe because we switched to softmax with length, which assigns exact 0 to the masked position, where the older version assigns a number very close to 0...

@kaonashi-tyc
Copy link
Author

Just check-back, is this issue resolved as of now?

@szha
Copy link
Member

szha commented Aug 5, 2020

We found some similar issue in GluonNLP numpy version and came to this fix: apache/mxnet#18827. Let's verify if this patch fixes this problem too.

@eric-haibin-lin
Copy link
Member

@barry-jin could you help test it with latest mxnet and see if the issue persist?

@barry-jin
Copy link
Contributor

@eric-haibin-lin Sure.

@barry-jin
Copy link
Contributor

Dependencies:

ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102         1.6.0
ubuntu@ip-172-31-20-128:~$ pip3 list | grep torch
torch               1.6.0
ubuntu@ip-172-31-20-128:~$ pip3 list | grep fairseq
fairseq             0.9.0

Steps:

wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz
tar -xzvf xmlr.base.tar.gz
python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/tmp/xlmr.base/ --verbose

Output:

torch.Size([514, 768])
stdev =  0.19624774

*** Maximum errors for vector of size 3840:  rtol=0.001, atol=0.001

  1: Error 1004.471985  Location of error: (0, 3, 152), a=1.01276755, b=-1.85499024
  2: Error 704.276367  Location of error: (0, 4, 588), a=1.44983697, b=7.28421259
  3: Error 647.054993  Location of error: (0, 3, 465), a=-0.56997108, b=0.21840222
  4: Error 600.453857  Location of error: (0, 0, 588), a=1.60323966, b=5.51549196
  5: Error 538.982544  Location of error: (0, 2, 749), a=-0.70366699, b=-0.10700862
  6: Error 522.929871  Location of error: (0, 2, 465), a=-0.22267099, b=0.62938124
  7: Error 508.801910  Location of error: (0, 3, 467), a=0.50823474, b=-0.00115474
  8: Error 508.074127  Location of error: (0, 2, 459), a=-0.04906178, b=-1.13256073
  9: Error 489.671600  Location of error: (0, 2, 152), a=0.31114697, b=-0.34982315
 10: Error 481.318054  Location of error: (0, 2, 145), a=1.19826555, b=0.48399293
stdev =  0.20739098

*** Maximum errors for vector of size 4608:  rtol=0.001, atol=0.001

  1: Error 1183.268433  Location of error: (0, 2, 588), a=2.24659586, b=-5.80201578
  2: Error 1161.049805  Location of error: (0, 4, 588), a=2.24659729, b=-6.74044371
  3: Error 928.264038  Location of error: (0, 1, 152), a=-1.12364161, b=-0.10132299
  4: Error 601.523010  Location of error: (0, 2, 259), a=-0.30029136, b=0.75595784
  5: Error 580.307129  Location of error: (0, 2, 145), a=-0.06280632, b=1.23304677
  6: Error 566.850525  Location of error: (0, 3, 145), a=0.53026021, b=-0.08447501
  7: Error 517.987793  Location of error: (0, 4, 145), a=-0.06280673, b=0.94433534
  8: Error 485.135864  Location of error: (0, 2, 459), a=0.22280854, b=-0.50950789
  9: Error 450.751129  Location of error: (0, 2, 12), a=0.06082506, b=0.93141055
 10: Error 446.660492  Location of error: (0, 3, 465), a=-0.33150461, b=0.20811081
stdev =  0.18995003

*** Maximum errors for vector of size 3840:  rtol=0.001, atol=0.001

  1: Error 1085.388550  Location of error: (0, 2, 588), a=1.46543539, b=-4.45078850
  2: Error 1066.772461  Location of error: (0, 1, 588), a=1.46543479, b=-5.97044420
  3: Error 787.812927  Location of error: (0, 3, 152), a=0.77056408, b=-0.08129084
  4: Error 655.578003  Location of error: (0, 2, 459), a=0.15703337, b=-1.44748211
  5: Error 601.337708  Location of error: (0, 3, 459), a=0.07274741, b=-1.32591033
  6: Error 526.234802  Location of error: (0, 1, 145), a=-0.01829262, b=1.07213926
  7: Error 472.145905  Location of error: (0, 1, 459), a=0.15703396, b=-0.59696794
  8: Error 416.252808  Location of error: (0, 3, 626), a=-0.33784640, b=0.13431576
  9: Error 395.673187  Location of error: (0, 2, 221), a=0.00158666, b=-0.65210831
 10: Error 373.908081  Location of error: (0, 2, 612), a=-0.22256255, b=0.24173050
stdev =  0.15333436

*** Maximum errors for vector of size 5376:  rtol=0.001, atol=0.001

  1: Error 1053.334717  Location of error: (0, 3, 588), a=1.30375504, b=-4.69525385
  2: Error 891.753967  Location of error: (0, 2, 152), a=0.84285378, b=-0.45175028
  3: Error 760.975037  Location of error: (0, 1, 145), a=0.68549037, b=-0.31580281
  4: Error 739.728027  Location of error: (0, 4, 459), a=0.55348921, b=-0.71555465
  5: Error 735.922119  Location of error: (0, 2, 459), a=0.12832984, b=-2.30080748
  6: Error 682.559387  Location of error: (0, 4, 152), a=0.38436258, b=-0.93937868
  7: Error 634.900940  Location of error: (0, 5, 152), a=0.69493502, b=0.03672028
  8: Error 586.230469  Location of error: (0, 2, 465), a=-0.43865040, b=0.35667226
  9: Error 583.885071  Location of error: (0, 3, 259), a=-0.20851725, b=0.90207750
 10: Error 562.726746  Location of error: (0, 4, 145), a=0.52558923, b=-0.08492993
stdev =  0.18738568

*** Maximum errors for vector of size 3840:  rtol=0.001, atol=0.001

  1: Error 1129.763306  Location of error: (0, 2, 588), a=1.67255735, b=-4.18295527
  2: Error 1094.432373  Location of error: (0, 3, 588), a=1.67261219, b=-6.12268019
  3: Error 638.781616  Location of error: (0, 1, 459), a=0.08820118, b=-1.52423167
  4: Error 575.069885  Location of error: (0, 1, 152), a=0.27381456, b=-0.70895278
  5: Error 545.580383  Location of error: (0, 2, 259), a=-0.20532960, b=0.74875915
  6: Error 454.918396  Location of error: (0, 1, 626), a=-0.42154688, b=0.06122302
  7: Error 442.742065  Location of error: (0, 3, 259), a=-0.20532867, b=0.42603865
  8: Error 399.461243  Location of error: (0, 1, 391), a=-0.13588569, b=0.43889853
  9: Error 396.245575  Location of error: (0, 1, 221), a=0.37708509, b=-0.03173557
 10: Error 390.489594  Location of error: (0, 2, 465), a=-0.22364777, b=0.27373093

@szha
Copy link
Member

szha commented Aug 8, 2020

@barry-jin thanks for sharing. What's the GluonNLP version?

@barry-jin
Copy link
Contributor

@szha GluonNLP version is 0.9.1

ubuntu@ip-172-31-20-128:~$ pip3 list | grep gluonnlp
gluonnlp                 0.9.1               /home/ubuntu/.local/lib/python3.6/site-packages

@szha
Copy link
Member

szha commented Aug 8, 2020

Thanks. Could you try the nightly builds on https://dist.mxnet.io/python ? Let's check whether the master and v1.x branches are fixed. You can install the nightly builds for 2.0 with

pip3 install --pre --upgrade mxnet-cu102 -f https://dist.mxnet.io/python

and 1.x with

pip3 install --pre --upgrade 'mxnet-cu102<2' -f https://dist.mxnet.io/python

@barry-jin
Copy link
Contributor

I tried 2.0 as follows:

Dependency change: mxnet

ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102              2.0.0b20200806

Stack trace for error message:

ubuntu@ip-172-31-20-128:~$ python3 convert_xlmr_model_to_gluon.py --ckpt_dir /home/ubuntu/tmp/xlmr.base/ --verbose
Traceback (most recent call last):
  File "convert_xlmr_model_to_gluon.py", line 13, in <module>
    import gluonnlp as nlp
  File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/__init__.py", line 24, in <module>
    from . import loss
  File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/loss/__init__.py", line 21, in <module>
    from . import activation_regularizer, loss, label_smoothing
  File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/loss/activation_regularizer.py", line 23, in <module>
    from mxnet.gluon.loss import Loss
ModuleNotFoundError: No module named 'mxnet.gluon.loss'

@szha
Copy link
Member

szha commented Aug 8, 2020

This looks very weird. I can confirm that the loss module is there...

@barry-jin
Copy link
Contributor

Output for v1.x:
Dependency:

ubuntu@ip-172-31-20-128:~$ pip3 list | grep mxnet
mxnet-cu102              1.7.0b20200807

Output:

torch.Size([514, 768])
stdev =  0.19625182
stdev =  0.20733927
stdev =  0.18992558
stdev =  0.15340993
stdev =  0.1872536

No error message, but stdev is still around 0.2

@szha
Copy link
Member

szha commented Aug 8, 2020

@kaonashi-tyc could you take a look at the result above on 1.7?

@eric-haibin-lin
Copy link
Member

The stdev is still too large

@kaonashi-tyc
Copy link
Author

kaonashi-tyc commented Aug 11, 2020

Yeah, the stdev is still too large IMO
The previous deviation is between 1e-5

@barry-jin
Copy link
Contributor

barry-jin commented Aug 12, 2020

In my opinion, there may be something wrong with the versions prior to 4e55539.

For bert.py versions prior to 4e55539, Dot Product Attention Cell is used and in its attention weight computing function, softmax goes first and then mask is applied. This is different from the attention mechanisms stated in d2l, where, for masked_softmax, mask should be applied first and then softmax.

For bert.py current versions, softmax with length is used. In its definition file, mask is applied first and then softmax, which is consistent with d2l. So, current version of bert.py to calculate masked_softmax is right.

I'm not sure my opinion is right or not, but I found this had affected the attention weights a lot in my debug process and may have some impact on the final stdev.

@eric-haibin-lin
Copy link
Member

I have a patch that fixes the incorrect bias term: master...eric-haibin-lin:fix-attn

@eric-haibin-lin eric-haibin-lin mentioned this issue Aug 12, 2020
6 tasks
@szha szha closed this as completed Sep 4, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants