Question about attention mask #23
Comments
I have the same question. I believe pad_token will affect the output mem_slots, but the training loss ignores the pad_token positions, so mem_slots might learn to "ignore" the pad_tokens and not record their information during training.
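To make the "loss ignores pad positions" point concrete, here is a minimal sketch of the standard PyTorch pattern (not the repo's exact code): pad positions are mapped to the ignore index so they contribute nothing to the cross-entropy, even though they still pass through attention. The ids and shapes below are illustrative.

```python
# Sketch: excluding pad positions from the training loss via ignore_index.
import torch
import torch.nn.functional as F

vocab_size = 32000
pad_token_id = 0  # hypothetical pad id

# Hypothetical decoder logits and target ids for a batch of 2, seq_len 6.
logits = torch.randn(2, 6, vocab_size)          # (batch, seq_len, vocab)
labels = torch.randint(1, vocab_size, (2, 6))   # target token ids
labels[0, 4:] = pad_token_id                    # pretend the first sample is padded

# Replace pad positions with -100 so F.cross_entropy ignores them
# (ignore_index defaults to -100 in PyTorch).
masked_labels = labels.masked_fill(labels == pad_token_id, -100)
loss = F.cross_entropy(logits.view(-1, vocab_size), masked_labels.view(-1))
print(loss)
```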
Thank you for your question.
The post above was about the v1 code; I think you are asking about the v2 code. For v2, only batch=1 is supported, as we mentioned at: https://github.com/getao/icae/tree/main/code#updated-april-2024
In v1, I guess the author directly appended the memory token to the end of each text and then applied padding. This approach seems reasonable, but without the relevant code, it's impossible to confirm whether that is the case. In v2, I noticed that the author used a batch size of 1, so the memory token embedding is directly added to the end of the sentence. This works only for a batch size of 1; for batch sizes greater than 1, the padding tokens would affect the memory token.
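A hedged sketch of the v1-style layout described above (mem_token_id, pad_token_id, and the toy token ids are illustrative, not the repo's actual constants): appending the memory token to each text *before* padding keeps the memory token adjacent to the real tokens, and the pads end up to its right where they can be masked out.

```python
# Sketch: append memory token per sample, then right-pad to the batch max length.
import torch

pad_token_id = 0
mem_token_id = 99          # hypothetical id for a single memory slot
texts = [
    [11, 12, 13, 14],      # sample 1 (longer)
    [21, 22],              # sample 2 (shorter)
]

with_mem = [t + [mem_token_id] for t in texts]
max_len = max(len(t) for t in with_mem)
input_ids = torch.tensor(
    [t + [pad_token_id] * (max_len - len(t)) for t in with_mem]
)
attention_mask = (input_ids != pad_token_id).long()

print(input_ids)
# tensor([[11, 12, 13, 14, 99],
#         [21, 22, 99,  0,  0]])
print(attention_mask)
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 1, 0, 0]])
```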
Originally posted by @getao in #3 (comment)
The padding tokens are at the end of the sequence, but based on the provided code, the memory token is positioned to the right of the padding tokens. If the input sequence lengths within a batch differ, the padding tokens could affect the attention scores of the memory token, right?
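A small sketch of the concern, under toy assumptions (random query/key vectors, a single-head score computation): whether the pads sitting to the left of the memory token influence its representation depends entirely on the attention mask. With an additive mask of -inf on pad columns their softmax weight becomes exactly 0; without it, they leak into the memory token's attention.

```python
# Sketch: effect of masking pad positions on the memory token's attention row.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq = ["tok1", "tok2", "<pad>", "<pad>", "<mem>"]   # pads to the left of <mem>
is_pad = torch.tensor([False, False, True, True, False])

d = 8
q = torch.randn(len(seq), d)   # toy query vectors
k = torch.randn(len(seq), d)   # toy key vectors

scores = q @ k.T / d ** 0.5                          # (5, 5) raw attention scores
masked = scores.masked_fill(is_pad[None, :], float("-inf"))

attn_no_mask = F.softmax(scores, dim=-1)
attn_masked = F.softmax(masked, dim=-1)

mem_row = len(seq) - 1
print(attn_no_mask[mem_row])   # pads get non-zero weight -> they affect <mem>
print(attn_masked[mem_row])    # pad columns are exactly 0 -> <mem> ignores pads
```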