
Parsing of merges field in tokenizer.json fails #391

Closed
Tracreed opened this issue Oct 21, 2024 · 1 comment · Fixed by #392

Comments

@Tracreed

So I was trying to run the examples that use a model downloaded from Hugging Face, like the qwen2_chat example, but I kept running into this error:

cargo run --release --bin qwen2_chat qwen2-0.5b/model.rten qwen2-0.5b/tokenizer.json
    Finished `release` profile [optimized] target(s) in 0.02s
     Running `/home/trac/coding/rten/target/release/qwen2_chat qwen2-0.5b/model.rten qwen2-0.5b/tokenizer.json`
Error: JsonError(Error("invalid type: sequence, expected a string", line: 757273, column: 1))

After trying a few other examples, I decided to take a look at the code, and to my surprise it was very straightforward.

The tokenizer.json that I downloaded with optimum was structured like this:

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
   ...
  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": false,
    "vocab": {
      "!": 0,
      "\"": 1,
      "#": 2,
    },
    "merges": [
      [
        "Ġ",
        "Ġ"
      ],
     ...
    ]
  }
}

But the JSON structure expected by the parser models the merges a bit differently:

#[derive(Deserialize)]
pub(crate) struct BpeModel {
    /// Mapping from token text to token ID.
    pub vocab: HashMap<String, TokenId>,
    /// List of `<token_a> [SPACE] <token_b>` containing tokens to merge.
    pub merges: Vec<String>,
}

So to fix it, at least for myself, I had to change the field to:

pub merges: Vec<Vec<String>>,

It seems the format of the JSON file has changed since this was originally implemented about 10 months ago. I'm not sure whether the difference between tokenizer file versions needs to be detected, but the formats do differ, and none of the examples currently work if you actually follow the steps of downloading the files with optimum-cli and then running the models.

Lastly, my change to get it working, at least locally, was to change this:

let merges: Vec<_> = model.merges.iter().map(|s| s.as_str()).collect();

to

// Join each merge pair back into the legacy "token_a token_b" form.
let merge_vec: Vec<_> = model.merges.into_iter().map(|m| m.join(" ")).collect();
// Borrow each joined string as &str, as the existing builder expects.
let merges: Vec<_> = merge_vec.iter().map(|s| s.as_str()).collect();

which is definitely not the best solution, but I didn't feel like putting that much thought into it right now.
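A more robust direction would be to let serde accept both layouts and normalize each entry to the joined form the existing builder expects. A minimal sketch of that idea; the MergeEntry type and its helper are hypothetical names, not rten's actual fix:

use serde::Deserialize;

/// One entry in `merges`: either a legacy "token_a token_b" string or a
/// newer ["token_a", "token_b"] pair. With `untagged`, serde tries each
/// variant in order and keeps the first one that matches.
#[derive(Deserialize)]
#[serde(untagged)]
pub(crate) enum MergeEntry {
    Pair(String, String),
    Joined(String),
}

impl MergeEntry {
    /// Normalize to the `<token_a> [SPACE] <token_b>` form.
    fn into_joined(self) -> String {
        match self {
            MergeEntry::Pair(a, b) => format!("{a} {b}"),
            MergeEntry::Joined(s) => s,
        }
    }
}

With merges declared as Vec<MergeEntry>, both file versions parse, and the call site only needs model.merges.into_iter().map(MergeEntry::into_joined).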

@robertknight (Owner)

robertknight commented Oct 21, 2024

Thanks for the bug report. You're right that this was changed. The canonical implementation in the tokenizers crate models this field as:

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum MergeType {
    Tuple(Vec<(String, String)>),
    Legacy(Vec<String>),
}

Reference: https://github.com/huggingface/tokenizers/blob/9b77c054ef4297c7057fa8db875368c7c02f1bfc/tokenizers/src/models/bpe/serialization.rs#L87

The reason for the change was to fix a limitation where tokens couldn't contain spaces (huggingface/tokenizers#909). See also the related llama.cpp issue.
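For illustration, a minimal standalone sketch of how that untagged enum distinguishes the two formats; the main function and sample inputs are hypothetical, added here only to show the behavior:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum MergeType {
    // Current format: a list of ["token_a", "token_b"] pairs.
    Tuple(Vec<(String, String)>),
    // Legacy format: a list of "token_a token_b" strings.
    Legacy(Vec<String>),
}

fn main() -> serde_json::Result<()> {
    // A JSON string cannot deserialize as a (String, String) tuple, so
    // serde falls through to the Legacy variant for the old format.
    let new: MergeType = serde_json::from_str(r#"[["Ġ", "Ġ"]]"#)?;
    let legacy: MergeType = serde_json::from_str(r#"["Ġ Ġ"]"#)?;
    println!("{new:?}");    // Tuple([("Ġ", "Ġ")])
    println!("{legacy:?}"); // Legacy(["Ġ Ġ"])
    Ok(())
}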

@robertknight robertknight changed the title Incorrect parsing of tokenizer.json files Parsing of merges field in tokenizer.json fails Oct 21, 2024