batch_encode_plus doesn't work correctly #1704

Open
tempdeltavalue opened this issue Dec 18, 2024 · 2 comments

Comments


tempdeltavalue commented Dec 18, 2024

@tempdeltavalue tempdeltavalue changed the title batch_encode doesn't work correctly batch_encode_plus doesn't work correctly Dec 18, 2024
tempdeltavalue (Author) commented

The same happens with tokenizer() batch encoding.
(Two screenshots of the notebook output attached: Screenshot 2024-12-18 151941, Screenshot 2024-12-18 152002.)
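
For reference, a minimal sketch (with hypothetical inputs, not taken from the screenshots) showing that calling the tokenizer directly and calling batch_encode_plus go through the same batch-padding logic:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

texts = ["hello there", "a somewhat longer input sentence"]

# Both APIs pad every sequence to the longest one in the batch
via_call = tokenizer(texts, padding=True)
via_batch_encode_plus = tokenizer.batch_encode_plus(texts, padding=True)

print(via_call["input_ids"] == via_batch_encode_plus["input_ids"])  # True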


jonvet commented Jan 11, 2025

In your notebook you initialise the tokenizer as follows:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium", padding_side='left')
tokenizer.pad_token = tokenizer.eos_token

so you're padding sequences on the left with the EOS token.
The reason you're getting more pad tokens for the same input sequence when you encode X[0:99] than when you encode X[0:3] is that some sequence in X[3:99] is longer than the longest sequence in X[0:3]. With padding enabled, every sequence in a batch is padded to the length of the longest sequence in that batch, so the number of pad tokens a given sentence receives depends on which batch it is encoded in. Where is this going wrong in your opinion?
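
For illustration, here is a minimal sketch (with made-up example sentences, not taken from the notebook) of this behaviour: padding=True pads every sequence up to the longest one in the current call, so the same sentence gets a different number of pad tokens in different batches.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

short = "hello there"
longer = "this is a noticeably longer sentence that sets the padded length for the whole batch"

# padding=True pads everything to the longest sequence in *this* call
small_batch = tokenizer([short, "how are you"], padding=True, return_tensors="pt")
large_batch = tokenizer([short, "how are you", longer], padding=True, return_tensors="pt")

# The same `short` sentence ends up with more left-pad tokens in the larger batch,
# because that batch contains a longer sequence.
print(small_batch["input_ids"].shape)
print(large_batch["input_ids"].shape)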
