Fix prefix space issues with certain tokenizers #156

Merged

OyvindTafjord
Contributor

This adds a heuristic to catch tokenizers (e.g., Llama) that treat a word at the start of a string as having a prefixed space. Previously, with such tokenizers, the extra space added at the start of "continuations" in ranked classification tasks led to a spurious space token (so, for instance, the total probability mass over answer choices A/B/C/D in MC tasks dropped from around 1 to near zero).
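
To illustrate the failure mode, here is a minimal sketch and not the code from this PR. The two toy tokenizers, the probe-based check `treats_start_as_prefixed_space`, and the helper `prepare_continuation` are all hypothetical names made up for this example. It assumes a Llama-style tokenizer behaves roughly like "prepend a dummy space, then attach each space to the following word", which is a simplification.

```python
import re
from typing import Callable, List

Encode = Callable[[str], List[str]]


def toy_llama_encode(text: str) -> List[str]:
    # Hypothetical stand-in for a Llama-style tokenizer: a dummy space is
    # prepended, spaces become "▁", and each "▁" starts a new token.
    marked = (" " + text).replace(" ", "▁")
    return re.findall(r"▁[^▁]*", marked)


def toy_gpt2_encode(text: str) -> List[str]:
    # Hypothetical stand-in for a GPT-2-style tokenizer: no dummy prefix,
    # a leading space is folded into the following word as "Ġ".
    marked = text.replace(" ", "Ġ")
    return re.findall(r"Ġ?[^Ġ]+|Ġ", marked)


def treats_start_as_prefixed_space(encode: Encode, probe: str = "A") -> bool:
    # Assumed heuristic: if encoding " A" yields the tokens of "A" plus one
    # extra leading token, the tokenizer already implies a prefix space.
    bare = encode(probe)
    spaced = encode(" " + probe)
    return len(spaced) == len(bare) + 1 and spaced[1:] == bare


def prepare_continuation(encode: Encode, choice: str) -> List[str]:
    # Continuations normally get a leading space; drop it when the tokenizer
    # would turn it into a spurious standalone space token.
    continuation = " " + choice
    if treats_start_as_prefixed_space(encode):
        continuation = continuation.lstrip(" ")
    return encode(continuation)


if __name__ == "__main__":
    print(toy_llama_encode(" A"))                      # naive continuation
    print(prepare_continuation(toy_llama_encode, "A"))  # with the heuristic
    print(prepare_continuation(toy_gpt2_encode, "A"))   # unchanged behavior
```

With the toy Llama-style tokenizer, the naive continuation `" A"` splits into a lone space token followed by the answer token, so the model is scored on a token sequence it would rarely produce; stripping the leading space avoids that, while the GPT-2-style tokenizer keeps its leading space as before.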

@OyvindTafjord OyvindTafjord merged commit 576443d into allenai:olmo-eval Nov 17, 2023