Spaces are not being added after added tokens when `legacy: true` is used #7094
Comments
I think the MistralAI models behave differently compared to ChatML-based models. This note is from the mistralai 8x7B model card on HF (the pseudo-code is truncated here):

> `def tokenize(text): ... [BOS_ID] + ...`
>
> In the pseudo-code above, note that the tokenize method should not add a BOS or EOS token automatically, but should add a prefix space. In the Transformers library, one can use chat templates which make sure the right format is applied.

The problem I found is that the Jinja chat templates of even the official mistralai models on HF do not seem to correspond to the above note. However, when I implemented the tokenization exactly as described above through a patched llama.cpp, where I take complete control of both the chat template definitions and whether the space prefix is added in the `llama_tokenize()` call, Mistral 8x7B behaved noticeably differently on a couple of short test prompts (it got rid of a leading `:` in one response that should not have been there).

The Dolphin 2.8 Transformers tokenization can be matched using these template definitions, with the spaces manually added:

```
# ChatML with special tokens and inserted spaces to match transformers tokenizer spaces
[1715004344] input prefix: '<|im_start|> user
```

Most likely Llama 3 Instruct from Meta does not have these spaces (guessing), but I am not sure whether the new Dolphin 2.9 and Hermes 2 Pro ChatML-based Llama 3 fine-tunes coming out still have them or not.

I think the bottom line is that exactly matching instruct-tune templates is always going to be hit or miss unless the model creators define text-in -> tokens-out test cases for one full turn in their documentation, exactly as you have done in your issue note. I am still not sure about Mistral 7B 0.1, 0.2, 8x22B, etc., whether they change this space behaviour from fine-tune to fine-tune; it seems hard to reverse-engineer from the model itself except by testing with and without spaces in various places and empirically determining which works best.
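For reference, here is a minimal sketch of the manual tokenization scheme the note describes, using Transformers' `encode` with `add_special_tokens=False`. The model id and messages are illustrative placeholders, not taken from the issue:

```python
# Sketch of the Mistral note's scheme: tokenize() adds no BOS/EOS itself,
# and the special tokens are spliced in as raw ids around each turn.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

def tokenize(text):
    # No automatic BOS/EOS; the tokenizer's prefix-space handling should
    # still supply the leading space the note asks for.
    return tok.encode(text, add_special_tokens=False)

user_message = "What is your favourite condiment?"
bot_message = "Well, I'm quite partial to fresh lemon juice."

ids = (
    [tok.bos_token_id]
    + tokenize("[INST] " + user_message + " [/INST]")
    + tokenize(bot_message)
    + [tok.eos_token_id]
)
print(tok.convert_ids_to_tokens(ids))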
This issue still exists.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I think LLaMa-1, LLaMa-2, Mistral-v0.1, Mistral-v0.2, Solar (which is based on Mistral-v0.1), and probably a few others all use `"legacy": true`.

Trainers like Axolotl use Transformers to tokenize datasets for training, and if this setting is `true` it will add a space after special/added tokens. A bit weird in my opinion, but that's probably why they consider it legacy. Weirdness aside, this all seems fine as long as inference tokenization matches training tokenization, which happens with anything that uses Transformers, but doesn't seem to be the case with llama.cpp (see the sketch at the end of this post).

LLaMa-3 has no mention of `legacy` in its `tokenizer_config.json`, so it likely no longer follows this behaviour, and llama.cpp won't need any changes in this regard and works as-is.

I used the latest KoboldCPP here because I can't figure out how to tokenize with llama.cpp, since I only ever use KoboldCPP and wanted to open this issue sooner rather than later. I assume they tokenize the same.
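To make the difference concrete, here is a hedged sketch of the behaviour via the slow `LlamaTokenizer` and its `legacy` kwarg. The model id is just one example of a tokenizer whose config sets `"legacy": true`, and the commented outputs are what I would expect rather than captured output:

```python
# Sketch of the legacy space behaviour: with legacy=True, the text segment
# following a special/added token picks up a SentencePiece prefix space.
from transformers import LlamaTokenizer

repo = "mistralai/Mistral-7B-v0.1"  # example of a config with "legacy": true
legacy = LlamaTokenizer.from_pretrained(repo, legacy=True)
fixed = LlamaTokenizer.from_pretrained(repo, legacy=False)

text = "<s>user"  # a special token immediately followed by plain text
print(legacy.tokenize(text))  # e.g. ['<s>', '▁user'] -- space inserted after the special token
print(fixed.tokenize(text))   # e.g. ['<s>', 'user']  -- no inserted space
```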
@ehartford pinging you here since I used your model to test and figured you would want to know about this behaviour.