This is essentially a copy of this page: http://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder-and-decoder-stacks — I'm just working through it line by line so that I can figure out how it works.