Update transformer_tutorial.py (#2363)
Fix for "perhaps there is a misprint at line 40" (#2111).

A review of the referenced paper (https://arxiv.org/pdf/1706.03762.pdf, section 3.2.3) notes:
"Similarly, self-attention layers in the decoder allow each position in the decoder to attend to
all positions in the decoder up to and including that position. We need to prevent leftward
information flow in the decoder to preserve the auto-regressive property. We implement this
inside of scaled dot-product attention by masking out (setting to −∞) all values in the input
of the softmax which correspond to illegal connections. See Figure 2."
Thus the suggested change in the reference from nn.TransformerEncoder to nn.TransformerDecoder seems reasonable.
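For illustration only (not part of this commit): a minimal sketch of the masking the quoted passage describes, in which attention scores for future positions ("illegal connections") are set to -inf before the softmax so they receive zero weight. The function and tensor names below are hypothetical.

import math
import torch

def masked_scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_k) tensors for a single attention head (illustrative shapes).
    seq_len = q.size(0)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Additive causal mask: -inf strictly above the diagonal, 0 elsewhere,
    # so softmax assigns zero weight to future positions (auto-regressive property).
    causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
    weights = torch.softmax(scores + causal_mask, dim=-1)
    return weights @ v

# Example: position i attends only to positions 0..i.
q = k = v = torch.randn(5, 8)
out = masked_scaled_dot_product_attention(q, k, v)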
frasertajima authored May 31, 2023
1 parent 921f4fb commit 510f82e
Showing 1 changed file with 1 addition and 1 deletion.
beginner_source/transformer_tutorial.py: 2 changes (1 addition & 1 deletion)
@@ -37,7 +37,7 @@
 # ``nn.TransformerEncoder`` consists of multiple layers of
 # `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
 # Along with the input sequence, a square attention mask is required because the
-# self-attention layers in ``nn.TransformerEncoder`` are only allowed to attend
+# self-attention layers in ``nn.TransformerDecoder`` are only allowed to attend
 # the earlier positions in the sequence. For the language modeling task, any
 # tokens on the future positions should be masked. To produce a probability
 # distribution over output words, the output of the ``nn.TransformerEncoder``
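Again for illustration only: a hedged sketch of how the square attention mask the tutorial comment refers to is typically built and passed to nn.TransformerEncoder; the model dimensions below are illustrative, not taken from the tutorial.

import torch
import torch.nn as nn

d_model, nhead, seq_len = 512, 8, 10
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Square additive mask: -inf above the diagonal, 0 elsewhere,
# so position i cannot attend to any later position j > i.
src_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

src = torch.rand(seq_len, 1, d_model)  # (seq_len, batch, d_model); batch_first=False by default
output = encoder(src, mask=src_mask)   # same shape as src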
