Fix prefix space issues with certain tokenizers #156
Merged
This adds a heuristic to catch tokenizers (e.g., Llama) that treat a word at the start of a string as having a prefix space. Previously, the extra space added at the start of "continuations" in ranked classification tasks produced a spurious space token with such tokenizers, so, for instance, the total probability mass over the answer choices A/B/C/D in multiple-choice tasks dropped from around 1 to near zero.
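
To illustrate the kind of check involved, here is a minimal sketch (not the code from this PR) assuming a Hugging Face `transformers`-style tokenizer; the helper names `splits_with_spurious_space` and `encode_continuation` are hypothetical. The idea is that for prefix-space tokenizers, encoding a context and a space-prefixed continuation separately and concatenating the ids differs from encoding the joined string in one pass, because the explicit leading space becomes its own token.

```python
from transformers import AutoTokenizer


def splits_with_spurious_space(tokenizer) -> bool:
    """Heuristic: detect tokenizers that treat the first word of a string as
    implicitly space-prefixed (e.g. SentencePiece / Llama).  For such
    tokenizers, encoding "context" and " continuation" separately and
    concatenating the ids is NOT the same as encoding "context continuation"
    in one pass -- the continuation's explicit leading space becomes a
    spurious extra token."""
    joint = tokenizer.encode("context continuation", add_special_tokens=False)
    split = (
        tokenizer.encode("context", add_special_tokens=False)
        + tokenizer.encode(" continuation", add_special_tokens=False)
    )
    return joint != split


def encode_continuation(tokenizer, continuation: str) -> list[int]:
    """If the tokenizer already supplies a prefix space, strip the leading
    space we would otherwise add in front of the continuation."""
    if splits_with_spurious_space(tokenizer) and continuation.startswith(" "):
        continuation = continuation.lstrip(" ")
    return tokenizer.encode(continuation, add_special_tokens=False)


if __name__ == "__main__":
    # A GPT-2-style BPE tokenizer should not trigger the heuristic,
    # so the leading space of " B" is kept as part of the token.
    tok = AutoTokenizer.from_pretrained("gpt2")
    print(splits_with_spurious_space(tok))  # expected: False
    print(encode_continuation(tok, " B"))
```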