Including reference to teacher_forcing_ratio and padding_idx to seq2seq_translation_tutorial #2870
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2870. Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
CC: @spro
Hello! Possibly a silly question, but I see my PR failed due to a spell-check issue. It looks like it's complaining about AttnDecoderRNN.forward_step(). Is this something I should add as an ignore (regex?) in the .pyspelling.yml file, or something else?
Hi, I am still interested in this issue, as I am still working on porting this tutorial to R torch. When using batch training, for the encoder it is necessary to pad and pack the sentences to take advantage of torch's built-in support for packed sequences (see the sketch below).
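For illustration, here is a minimal Python/PyTorch sketch of padding and packing a batch for the encoder; the sentence tensors, vocabulary size, and hidden size are made up for the example:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Hypothetical batch of tokenized sentences of different lengths (index 0 reserved for padding)
sentences = [torch.tensor([5, 12, 7, 2]), torch.tensor([9, 4, 2]), torch.tensor([3, 2])]
lengths = torch.tensor([len(s) for s in sentences])

embedding = nn.Embedding(num_embeddings=100, embedding_dim=16, padding_idx=0)
encoder_gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

padded = pad_sequence(sentences, batch_first=True, padding_value=0)    # (batch, max_len)
embedded = embedding(padded)                                           # (batch, max_len, 16)

# Packing lets the GRU skip the padded positions entirely
packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
packed_outputs, encoder_hidden = encoder_gru(packed)
encoder_outputs, _ = pad_packed_sequence(packed_outputs, batch_first=True)  # (batch, max_len, 32)
```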
For the decoder, it is not possible to take advantage of padding and packing in the same way, since the decoder is run one step at a time.
My implementation loosely follows the tutorial. In my implementation, the attention modules take the encoder hidden states (H) and the current decoder hidden state.
In order to compute the attention weights, the attention function scores each encoder hidden state against the current decoder hidden state and normalizes the scores with a softmax.
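A Python sketch of this step, closely following the additive (Bahdanau) attention used in the tutorial, could look like this; the hidden size and shapes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Scores each encoder hidden state against the decoder state, then softmax-normalizes."""
    def __init__(self, hidden_size):
        super().__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)  # transforms the decoder state (query)
        self.Ua = nn.Linear(hidden_size, hidden_size)  # transforms the encoder states (keys)
        self.Va = nn.Linear(hidden_size, 1)            # reduces each combined score to a scalar

    def forward(self, query, keys):
        # query: (batch, 1, hidden), keys: (batch, seq_len, hidden)
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))  # (batch, seq_len, 1)
        scores = scores.squeeze(2).unsqueeze(1)                       # (batch, 1, seq_len)
        weights = F.softmax(scores, dim=-1)                           # attention weights
        context = torch.bmm(weights, keys)                            # (batch, 1, hidden)
        return context, weights
```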
The following decoder module performs a single decoding step.
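In Python, a single step along the lines of the tutorial's AttnDecoderRNN.forward_step might look like the following sketch; it reuses the AdditiveAttention module and imports from the snippet above:

```python
class DecoderStep(nn.Module):
    """One decoding step: embed the previous token, attend over the encoder states, run the GRU."""
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size, padding_idx=0)
        self.attention = AdditiveAttention(hidden_size)
        self.gru = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, prev_token, hidden, encoder_outputs):
        # prev_token: (batch, 1), hidden: (1, batch, hidden), encoder_outputs: (batch, seq_len, hidden)
        embedded = self.embedding(prev_token)                      # (batch, 1, hidden)
        query = hidden.permute(1, 0, 2)                            # (batch, 1, hidden)
        context, weights = self.attention(query, encoder_outputs)  # (batch, 1, hidden)
        gru_input = torch.cat((embedded, context), dim=2)          # (batch, 1, 2*hidden)
        output, hidden = self.gru(gru_input, hidden)
        log_probs = F.log_softmax(self.out(output), dim=-1)        # (batch, 1, output_size)
        return log_probs, hidden, weights
```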
Note that the input to the decoder at each step is the concatenation of the embedding of the previously generated word and the attention context. A final module puts everything together and implements the full encoding and decoding process.
The implementation is a bit tricky because the number of sentences still being processed in the batch decreases as sentences reach the end-of-sentence token. In this implementation of the decoder, I use two indices to keep track of which sentences are still being decoded. The full model can then be tested.
There is still an issue when computing the gradient in this setup. I am also interested to see whether there is a more efficient implementation of the decoding process.
@spro do the changes look good to you? Please leave a comment.
@gavril0 To your last point about optimizing the decoder step of an RNN, I think that would be an interesting tutorial! This could be discussed based on beam search; I'm unaware of other optimization techniques. I would argue this could be its own tutorial, perhaps under advanced. If others disagree and would like this included in this tutorial, I'd be happy to add it.
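For reference, a bare-bones beam search over a single-step decoder could look roughly like this; the step_fn signature and the special-token indices are assumptions, and length normalization, batching, and attention are left out for brevity:

```python
import torch

def beam_search(step_fn, hidden, sos_idx, eos_idx, beam_width=3, max_len=20):
    """step_fn(token, hidden) is assumed to return (log-probs over the vocab, new hidden state)."""
    beams = [(0.0, [sos_idx], hidden)]           # (cumulative log-prob, tokens so far, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, h in beams:
            if tokens[-1] == eos_idx:            # this hypothesis is already complete
                finished.append((score, tokens))
                continue
            prev = torch.tensor([[tokens[-1]]])
            log_probs, new_h = step_fn(prev, h)
            top_lp, top_idx = log_probs.view(-1).topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((score + lp, tokens + [idx], new_h))
        if not candidates:                       # every beam has finished
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]          # keep only the best hypotheses
    finished.extend((s, t) for s, t, _ in beams)
    return max(finished, key=lambda c: c[0])[1]  # best-scoring token sequence
```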
Co-authored-by: Svetlana Karslioglu <[email protected]>
@brcolli Thanks for the work and the pointer toward beam search. I am still refining my implementation with indices and waiting for an issue/bug in the R torch port to be fixed before making it public on GitHub. I looked again at the Python code. For info, I had an error about a variable being modified in place when computing the gradient, because I assigned the value 0 to some elements of the attention weights matrix (see the sketch below).
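For anyone hitting the same autograd error, a minimal sketch of the difference; the shapes are arbitrary, and the idea is to mask the scores out of place before the softmax rather than zeroing the weights in place afterwards:

```python
import torch

scores = torch.randn(4, 1, 10, requires_grad=True)   # raw attention scores
mask = torch.zeros(4, 1, 10, dtype=torch.bool)
mask[:, :, 7:] = True                                 # e.g. positions beyond a sentence's length

weights = torch.softmax(scores, dim=-1)
# weights[mask] = 0.0   # in-place edit of softmax's output; autograd may complain at backward()

# Out-of-place alternative: mask the scores first, so the softmax renormalizes over valid positions
masked_scores = scores.masked_fill(mask, float('-inf'))
weights_ok = torch.softmax(masked_scores, dim=-1)
weights_ok.sum().backward()
```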
Sorry for taking so long to respond.
@brcolli This is what I was thinking. One also needs to make sure that outputs that are longer than the target sentences are padded with zeros during training, because I don't think the computed loss is correct otherwise. I wanted to compare the gradients computed in this manner to those computed using indices, but I have yet to do it.
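A minimal Python sketch of one way to keep padded positions out of the loss; the shapes and indices are made up, and NLLLoss's ignore_index does the masking:

```python
import torch
import torch.nn as nn

PAD_IDX = 0
criterion = nn.NLLLoss(ignore_index=PAD_IDX)   # padded target positions contribute nothing

batch, max_len, vocab = 2, 5, 20
decoder_outputs = torch.log_softmax(torch.randn(batch, max_len, vocab), dim=-1)
targets = torch.tensor([[4, 7, 2, 0, 0],       # shorter sentence, padded with PAD_IDX
                        [3, 9, 8, 5, 2]])      # 2 stands for EOS here

loss = criterion(decoder_outputs.view(-1, vocab), targets.view(-1))
```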
No need for my approval but yes!
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
Fixes #2840
Description
I have added details regarding teacher_forcing_ratio, as well as padding_idx. I agree that padding_idx needs to be set to 0 to properly handle the padding.
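As a rough illustration of how teacher_forcing_ratio is typically used (the decoder here is reduced to a plain GRU so the snippet stands alone; the real tutorial decoder also uses attention, and all sizes below are made up):

```python
import random
import torch
import torch.nn as nn

SOS_token, PAD_token = 1, 0
teacher_forcing_ratio = 0.5          # probability of feeding ground-truth tokens

# Hypothetical shapes just to make the loop runnable
batch_size, max_len, vocab, hidden_size = 2, 6, 20, 16
target_tensor = torch.randint(2, vocab, (batch_size, max_len))
decoder_hidden = torch.zeros(1, batch_size, hidden_size)

embedding = nn.Embedding(vocab, hidden_size, padding_idx=PAD_token)  # padding rows stay zero
gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
out = nn.Linear(hidden_size, vocab)

decoder_input = torch.full((batch_size, 1), SOS_token, dtype=torch.long)
use_teacher_forcing = random.random() < teacher_forcing_ratio

for t in range(max_len):
    step_out, decoder_hidden = gru(embedding(decoder_input), decoder_hidden)
    log_probs = torch.log_softmax(out(step_out), dim=-1)
    if use_teacher_forcing:
        decoder_input = target_tensor[:, t].unsqueeze(1)        # ground truth as next input
    else:
        decoder_input = log_probs.argmax(dim=-1).detach()       # model's own prediction
```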
I do think having batch processing is still meaningful, even if the sentences are short (max 10 words in the tutorial).
I didn't want to include too much discussion about the pros/cons of batch processing and padding, as I feel that might be out of scope.
Checklist
cc @albanD @jbschlosser