Question about attention mask #23

Open
W-rudder opened this issue Dec 1, 2024 · 3 comments

W-rudder commented Dec 1, 2024

> I don't think you need to set an attention mask because for the causal attention, padding tokens at the right of the sequences will not affect the training result.

Originally posted by @getao in #3 (comment)

The padding token is at the end of the sequence, but based on the provided code, the memory token is positioned to the right of the padding token. If the input sequence lengths are inconsistent, the padding token could affect the attention score of the memory token, right?
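
For illustration, here is a minimal sketch of the layout I mean (the token ids and padding scheme are made up, not taken from the repo's code):

```python
import torch

PAD, MEM = 0, 99                      # hypothetical pad / memory token ids
batch = [
    [11, 12, 13, 14],                 # length 4
    [21, 22],                         # length 2 -> right-padded
]
max_len = max(len(s) for s in batch)
padded = torch.tensor([s + [PAD] * (max_len - len(s)) for s in batch])
mem_col = torch.full((len(batch), 1), MEM)
with_mem = torch.cat([padded, mem_col], dim=1)
print(with_mem)
# tensor([[11, 12, 13, 14, 99],
#         [21, 22,  0,  0, 99]])
# In the second row the memory token sits to the right of the pad tokens, so under
# causal attention it still attends to them.
```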


RewindL commented Dec 4, 2024

> > I don't think you need to set an attention mask because for the causal attention, padding tokens at the right of the sequences will not affect the training result.
> >
> > Originally posted by @getao in #3 (comment)
>
> The padding token is at the end of the sequence, but based on the provided code, the memory token is positioned to the right of the padding token. If the input sequence lengths are inconsistent, the padding token could affect the attention score of the memory token, right?

I have the same question. I believe the pad tokens will affect the output mem_slots, but since the training loss ignores the pad token positions, the mem_slots might learn to "ignore" information from pad tokens during training.
Still, the implementation does not prevent the pad tokens from influencing the mem_slots, which needs to be addressed.
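
One possible fix, sketched below under the assumption that the model is a standard Hugging Face causal LM (this is not code from this repo), is to pass an explicit attention_mask that zeros out the pad positions, so the mem_slots no longer attend to them:

```python
import torch

PAD = 0                                   # hypothetical pad token id
input_ids = torch.tensor([
    [11, 12, 13, 14, 99],                 # 99 = hypothetical memory token id
    [21, 22, PAD, PAD, 99],
])
# 1 = attend, 0 = masked out; Hugging Face models turn the zeros into a large
# negative bias on those key columns, on top of the internal causal mask.
attention_mask = (input_ids != PAD).long()
print(attention_mask)
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 0, 0, 1]])
# outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```

With such a mask, the memory tokens' attention scores over pad positions are suppressed even though the pads sit to their left.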

getao (Owner) commented Dec 4, 2024

Thank you for your question.

> I don't think you need to set an attention mask because for the causal attention, padding tokens at the right of the sequences will not affect the training result.

The post quoted above was about the v1 code.

I think you are asking about the v2 code. For v2, only batch=1 is supported, as we mentioned at: https://github.com/getao/icae/tree/main/code#updated-april-2024

W-rudder (Author) commented Dec 4, 2024

> > I don't think you need to set an attention mask because for the causal attention, padding tokens at the right of the sequences will not affect the training result.
> >
> > Originally posted by @getao in #3 (comment)
> >
> > The padding token is at the end of the sequence, but based on the provided code, the memory token is positioned to the right of the padding token. If the input sequence lengths are inconsistent, the padding token could affect the attention score of the memory token, right?
>
> I have the same question. I believe the pad tokens will affect the output mem_slots, but since the training loss ignores the pad token positions, the mem_slots might learn to "ignore" information from pad tokens during training. Still, the implementation does not prevent the pad tokens from influencing the mem_slots, which needs to be addressed.

In v1, I guess the author appended the memory tokens directly to the end of each text and then applied padding. That approach seems reasonable, but without the relevant code it's impossible to confirm whether that's the case.

In v2, I noticed that the author uses a batch size of 1, so the memory token embeddings are appended directly to the end of the sequence. This works only because the batch size is 1; for batch sizes greater than 1, the padding tokens would affect the memory tokens.
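
For clarity, here is a sketch of the v1-style layout I am guessing at (hypothetical token ids, not the actual training code): the memory tokens are appended to each sequence first and padding is applied afterwards, so every pad token ends up to the right of the memory tokens and, under causal attention, cannot influence them.

```python
import torch

PAD, MEM = 0, 99                       # hypothetical pad / memory token ids
num_mem = 1
batch = [[11, 12, 13, 14], [21, 22]]
with_mem = [s + [MEM] * num_mem for s in batch]          # memory tokens first
max_len = max(len(s) for s in with_mem)
padded = torch.tensor([s + [PAD] * (max_len - len(s)) for s in with_mem])
print(padded)
# tensor([[11, 12, 13, 14, 99],
#         [21, 22, 99,  0,  0]])
# Pads are now strictly to the right of each memory token, so no extra mask is needed.
```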
