This repository was archived by the owner on Sep 30, 2023. It is now read-only.

Investigate tokenizer behaviour to understand why it is different to Huggingface Tokenizer #4

Open
thakkarparth007 opened this issue Apr 12, 2023 · 6 comments

Comments

@thakkarparth007

If we consider the following prompt, Huggingface's tokenizer says there are 1144 tokens, whereas the 2B model's logs show 1473 tokens and the 6B model's logs show 1222 tokens. I downloaded the models from Google Drive and have not quantized them myself. I'm not sure of the cause of this discrepancy.

"# Here are some relevant code fragments from other files of the repo:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/add.py\n# --------------------------------------------------\n# import subprocess\n# from typing import Tuple\n# \n# from mindflow.utils.execute import execute_no_trace\n# \n# \n# def run_add(args: Tuple[str]):\n#     \"\"\"\n#     Add command.\n#     \"\"\"\n#     command = [\"git\", \"add\"] + list(args)\n# \n#     # Execute the git diff command and retrieve the output as a string\n#     execute_no_trace(command)\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n#         return\n# \n#     title, body = title_body_tuple\n# \n#     command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body]  # type: ignore\n#     print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n#     base_branch, title: Optional[str], body: Optional[str]\n# ) -> Optional[Tuple[str, str]]:\n#     settings = Settings()\n# \n#     diff_output = run_diff((base_branch,))\n#     if not diff_output:\n#         diff_output = \"\"\n# \n#     title_response: Union[ModelError, str]\n#     body_response: Union[ModelError, str]\n#     if title is None and body is None:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# from mindflow.utils.prompts import PR_BODY_PREFIX\n# from mindflow.utils.prompts import PR_TITLE_PREFIX\n# \n# \n# def run_pr(args: Tuple[str], title: Optional[str] = None, body: Optional[str] = None):\n#     base_branch = get_flag_value(args, [\"--base\", \"-B\"])\n# \n#     if base_branch is None:\n#         # Determine the name of the default branch\n#         base_branch = (\n#             subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n#             .decode()\n#             .strip()\n#             .split(\"/\")[-1]\n#         )\n# \n#     if not title or not body:\n#         title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n#     if not title_body_tuple:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n#             subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n#             .decode()\n#             .strip()\n#             .split(\"/\")[-1]\n#         )\n# \n#     if not title or not body:\n#         title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n#     if not title_body_tuple:\n#         return\n# \n#     title, body = title_body_tuple\n# \n#     command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body]  # type: ignore\n#     print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n#     base_branch, title: Optional[str], body: Optional[str]\n# --------------------------------------------------\n\nfrom typing import Optional, Tuple, List\n\nfrom mindflow.core.git.pr import create_title_and_body\nfrom mindflow.utils.command_parse import get_flag_value\nfrom mindflow.utils.execute import execute_no_trace\n\n\ndef run_mr(\n    args: Tuple[str], title: Optional[str] = None, description: Optional[str] = 
None\n):\n    base_branch = get_flag_value(args, [\"--target-branch\", \"-b\"])\n\n    if base_branch is None:\n        # Determine the name of the default branch\n        base_branch = (\n            subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])"
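
For reference, here is a minimal sketch of how the Huggingface-side count could be reproduced. The `Salesforce/codegen-2B-multi` checkpoint and the `prompt.txt` file are assumptions on my part, not something stated in this issue:

```python
# Sketch: count tokens with the Huggingface Codegen tokenizer.
# Assumes the long prompt above has been saved to prompt.txt and that
# Salesforce/codegen-2B-multi is the tokenizer being compared against.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-multi")

with open("prompt.txt") as f:
    prompt = f.read()

input_ids = tokenizer(prompt)["input_ids"]
print(len(input_ids))  # Huggingface reports ~1144 tokens for this prompt,
                       # vs 1473 (2B) / 1222 (6B) in the GGML logs
```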
@ravenscroftj
Owner

ravenscroftj commented Apr 12, 2023

Thanks for the ticket. I think this could be a bit of a tricky one to debug because the GGML GPT-J tokenizer is implemented from scratch, whereas the Huggingface Codegen tokenizer also has a bunch of token-merging logic which I don't think GGML's tokenizer has (I will try to confirm).

I can't comment on whether this is likely to significantly impact the performance of the model - that would need testing empirically.

Was there a specific use case you have in mind that this is blocking?

@thakkarparth007
Author

Hey, yeah, I was planning to use this for benchmarking the 4-bit performance of the Codegen models. Most of the prompts I have are 1500 tokens or more, and these overflow the 2048-token context window when tokenized incorrectly. I guess one way to get around this is to accept pretokenized inputs.

@ravenscroftj
Owner

Ah OK, that makes sense, thanks for clarifying. I will look into the tokenizer behaviour properly, probably over the weekend, but in the meantime I will see if I can add a REST endpoint to the codegen server that accepts an array of tokens as a JSON list. Then you can pretokenize your input using the Huggingface tokenizer. I'll keep you posted!
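
As a rough sketch of what that workflow might look like once such an endpoint exists (the URL, endpoint path, and JSON field names below are purely illustrative and not the actual codegen-server API):

```python
# Sketch only: pretokenize locally with the Huggingface tokenizer and send
# the raw token ids to the server. The endpoint path and payload shape here
# are made up for illustration; the real API may differ.
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-multi")

prompt = "def fib(n):"
token_ids = tokenizer(prompt)["input_ids"]

resp = requests.post(
    "http://localhost:8080/v1/completions_tokens",    # hypothetical endpoint
    json={"tokens": token_ids, "max_new_tokens": 64},  # hypothetical payload
)
print(resp.json())
```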

@thakkarparth007
Author

Thanks! I just created a PR here to allow pretokenized inputs: ravenscroftj/ggml#2

It seems to work fine for me.

@ravenscroftj
Owner

That's really cool, thank you for your contribution - I have accepted the MR. I will leave this ticket open as a reminder to look into the tokenizer behaviour anyway.

Sidenote - I'd be really interested in your evaluation of the 4 bit model if you're willing to share it!

@thakkarparth007
Author

thakkarparth007 commented Apr 13, 2023

Thanks!

I have performed a preliminary evaluation of the 6B 4-bit model on Python. I ran the model on ~2000 code-completion scenarios in Python (I have a custom dataset) and found about a 15% degradation in the exact-match metric at the first line. Here's what the graph looks like:
[attached graph: first-line exact-match comparison]

I manually looked at some of the mispredictions and they seemed okay to me, but they were getting penalized because they weren't exact matches. I think one interesting thing to do would be to check how different the probabilities of the 16-bit and 4-bit predictions are.

@ravenscroftj ravenscroftj changed the title Incorrect tokenization Investigate tokenizer behaviour to understand why it is different to Huggingface Tokenizer Apr 14, 2023