Question about attention mask #23

Open
W-rudder opened this issue Dec 1, 2024 · 3 comments

W-rudder commented Dec 1, 2024

> I don't think you need to set an attention mask because for the causal attention, padding tokens at the right of the sequences will not affect the training result.

Originally posted by @getao in #3 (comment)

The padding token is at the end of the sequence, but based on the provided code, the memory token is positioned to the right of the padding token. If the input sequence lengths are inconsistent, the padding token could affect the attention score of the memory token, right?
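
For illustration, here is a minimal sketch of the layout I mean (the token ids and padding scheme are made up, not taken from the repo's code):

```python
import torch

PAD, MEM = 0, 99                      # hypothetical pad / memory token ids
batch = [
    [11, 12, 13, 14],                 # length 4
    [21, 22],                         # length 2 -> right-padded
]
max_len = max(len(s) for s in batch)
padded = torch.tensor([s + [PAD] * (max_len - len(s)) for s in batch])
mem_col = torch.full((len(batch), 1), MEM)
with_mem = torch.cat([padded, mem_col], dim=1)
print(with_mem)
# tensor([[11, 12, 13, 14, 99],
#         [21, 22,  0,  0, 99]])
# In the second row the memory token sits to the right of the pad tokens, so under
# causal attention it still attends to them.
```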


RewindL commented Dec 4, 2024

> > I don't think you need to set an attention mask because for the causal attention, padding tokens at the right of the sequences will not affect the training result.
> >
> > Originally posted by @getao in #3 (comment)
>
> The padding token is at the end of the sequence, but based on the provided code, the memory token is positioned to the right of the padding token. If the input sequence lengths are inconsistent, the padding token could affect the attention score of the memory token, right?

I have the same question. I believe the pad tokens will affect the output mem_slots, but since the training loss ignores the pad token positions, the mem_slots might learn to "ignore" information from pad tokens during training.
Still, the implementation does not prevent the pad tokens from influencing the mem_slots, which needs to be addressed.
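
One possible fix, sketched below under the assumption that the model is a standard Hugging Face causal LM (this is not code from this repo), is to pass an explicit attention_mask that zeros out the pad positions, so the mem_slots no longer attend to them:

```python
import torch

PAD = 0                                   # hypothetical pad token id
input_ids = torch.tensor([
    [11, 12, 13, 14, 99],                 # 99 = hypothetical memory token id
    [21, 22, PAD, PAD, 99],
])
# 1 = attend, 0 = masked out; Hugging Face models turn the zeros into a large
# negative bias on those key columns, on top of the internal causal mask.
attention_mask = (input_ids != PAD).long()
print(attention_mask)
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 0, 0, 1]])
# outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```

With such a mask, the memory tokens' attention scores over pad positions are suppressed even though the pads sit to their left.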

getao (Owner) commented Dec 4, 2024

Thank you for your question.

> I don't think you need to set an attention mask because for the causal attention, padding tokens at the right of the sequences will not affect the training result.

The post quoted above was about the v1 code.

I think you are asking about the v2 code. For v2, only batch=1 is supported, as we mentioned at: https://github.com/getao/icae/tree/main/code#updated-april-2024

W-rudder (Author) commented Dec 4, 2024

> > I don't think you need to set an attention mask because for the causal attention, padding tokens at the right of the sequences will not affect the training result.
> >
> > Originally posted by @getao in #3 (comment)
> >
> > The padding token is at the end of the sequence, but based on the provided code, the memory token is positioned to the right of the padding token. If the input sequence lengths are inconsistent, the padding token could affect the attention score of the memory token, right?
>
> I have the same question. I believe the pad tokens will affect the output mem_slots, but since the training loss ignores the pad token positions, the mem_slots might learn to "ignore" information from pad tokens during training. Still, the implementation does not prevent the pad tokens from influencing the mem_slots, which needs to be addressed.

In v1, I guess the author appended the memory tokens directly to the end of each text and then applied padding. That approach seems reasonable, but without the relevant code it's impossible to confirm whether that's the case.

In v2, I noticed that the author uses a batch size of 1, so the memory token embeddings are appended directly to the end of the sequence. This works only because the batch size is 1; for batch sizes greater than 1, the padding tokens would affect the memory tokens.
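
For clarity, here is a sketch of the v1-style layout I am guessing at (hypothetical token ids, not the actual training code): the memory tokens are appended to each sequence first and padding is applied afterwards, so every pad token ends up to the right of the memory tokens and, under causal attention, cannot influence them.

```python
import torch

PAD, MEM = 0, 99                       # hypothetical pad / memory token ids
num_mem = 1
batch = [[11, 12, 13, 14], [21, 22]]
with_mem = [s + [MEM] * num_mem for s in batch]          # memory tokens first
max_len = max(len(s) for s in with_mem)
padded = torch.tensor([s + [PAD] * (max_len - len(s)) for s in with_mem])
print(padded)
# tensor([[11, 12, 13, 14, 99],
#         [21, 22, 99,  0,  0]])
# Pads are now strictly to the right of each memory token, so no extra mask is needed.
```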
