
Parsing of merges field in tokenizer.json fails #391

Closed
Tracreed opened this issue Oct 21, 2024 · 1 comment · Fixed by #392

Comments

@Tracreed

So I was trying to run the examples that use a model downloaded from Hugging Face, like the qwen2_chat example, but I kept running into this error:

cargo run --release --bin qwen2_chat qwen2-0.5b/model.rten qwen2-0.5b/tokenizer.json
    Finished `release` profile [optimized] target(s) in 0.02s
     Running `/home/trac/coding/rten/target/release/qwen2_chat qwen2-0.5b/model.rten qwen2-0.5b/tokenizer.json`
Error: JsonError(Error("invalid type: sequence, expected a string", line: 757273, column: 1))

After trying a few other examples, I decided to take a look at the code, and to my surprise it was very straightforward.

The tokenizer.json that I downloaded with optimum was structured like this:

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
   ...
  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": false,
    "vocab": {
      "!": 0,
      "\"": 1,
      "#": 2,
    },
    "merges": [
      [
        "Ġ",
        "Ġ"
      ],
     ...
    ]
  }
}

But the JSON structure expected by the parser models the merges a bit differently:

#[derive(Deserialize)]
pub(crate) struct BpeModel {
    /// Mapping from token text to token ID.
    pub vocab: HashMap<String, TokenId>,
    /// List of `<token_a> [SPACE] <token_b>` containing tokens to merge.
    pub merges: Vec<String>,
}

So to fix it, at least for myself, I had to change the field to:

pub merges: Vec<Vec<String>>,

It seems the format of the JSON file has changed since this was originally implemented about 10 months ago. I'm not sure whether the difference between tokenizer file versions needs to be detected, but the formats do differ, and none of the examples currently work if you actually follow the steps of downloading the files with optimum-cli and then running the models.

Lastly, my change to get it working, at least locally, was to change this:

let merges: Vec<_> = model.merges.iter().map(|s| s.as_str()).collect();

to

// Join each merge pair back into the legacy "token_a token_b" form.
let merge_vec: Vec<_> = model.merges.into_iter().map(|m| m.join(" ")).collect();
// Borrow each joined string as &str, as the existing builder expects.
let merges: Vec<_> = merge_vec.iter().map(|s| s.as_str()).collect();

which is definitely not the best solution, but I didn't feel like putting that much thought into it right now.
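A more robust direction would be to let serde accept both layouts and normalize each entry to the joined form the existing builder expects. A minimal sketch of that idea; the MergeEntry type and its helper are hypothetical names, not rten's actual fix:

use serde::Deserialize;

/// One entry in `merges`: either a legacy "token_a token_b" string or a
/// newer ["token_a", "token_b"] pair. With `untagged`, serde tries each
/// variant in order and keeps the first one that matches.
#[derive(Deserialize)]
#[serde(untagged)]
pub(crate) enum MergeEntry {
    Pair(String, String),
    Joined(String),
}

impl MergeEntry {
    /// Normalize to the `<token_a> [SPACE] <token_b>` form.
    fn into_joined(self) -> String {
        match self {
            MergeEntry::Pair(a, b) => format!("{a} {b}"),
            MergeEntry::Joined(s) => s,
        }
    }
}

With merges declared as Vec<MergeEntry>, both file versions parse, and the call site only needs model.merges.into_iter().map(MergeEntry::into_joined).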

@robertknight (Owner)

robertknight commented Oct 21, 2024

Thanks for the bug report. You're right that this was changed. The canonical implementation in the tokenizers crate models this field as:

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum MergeType {
    Tuple(Vec<(String, String)>),
    Legacy(Vec<String>),
}

Reference: https://github.com/huggingface/tokenizers/blob/9b77c054ef4297c7057fa8db875368c7c02f1bfc/tokenizers/src/models/bpe/serialization.rs#L87

The reason for the change was to fix a limitation where tokens couldn't contain spaces (huggingface/tokenizers#909). See also the related llama.cpp issue.
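For illustration, a minimal standalone sketch of how that untagged enum distinguishes the two formats; the main function and sample inputs are hypothetical, added here only to show the behavior:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum MergeType {
    // Current format: a list of ["token_a", "token_b"] pairs.
    Tuple(Vec<(String, String)>),
    // Legacy format: a list of "token_a token_b" strings.
    Legacy(Vec<String>),
}

fn main() -> serde_json::Result<()> {
    // A JSON string cannot deserialize as a (String, String) tuple, so
    // serde falls through to the Legacy variant for the old format.
    let new: MergeType = serde_json::from_str(r#"[["Ġ", "Ġ"]]"#)?;
    let legacy: MergeType = serde_json::from_str(r#"["Ġ Ġ"]"#)?;
    println!("{new:?}");    // Tuple([("Ġ", "Ġ")])
    println!("{legacy:?}"); // Legacy(["Ġ Ġ"])
    Ok(())
}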

@robertknight robertknight changed the title Incorrect parsing of tokenizer.json files Parsing of merges field in tokenizer.json fails Oct 21, 2024