This repository was archived by the owner on Sep 30, 2023. It is now read-only.

Investigate tokenizer behaviour to understand why it is different to Huggingface Tokenizer #4

Open
thakkarparth007 opened this issue Apr 12, 2023 · 6 comments

Comments

@thakkarparth007

If we consider the following prompt, Huggingface's tokenizer says there are 1144 tokens, whereas the 2B model's logs show 1473 tokens and the 6B model's logs show 1222 tokens. I downloaded the models from Google Drive and have not quantized them myself. I'm not sure of the cause of this discrepancy.

"# Here are some relevant code fragments from other files of the repo:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/add.py\n# --------------------------------------------------\n# import subprocess\n# from typing import Tuple\n# \n# from mindflow.utils.execute import execute_no_trace\n# \n# \n# def run_add(args: Tuple[str]):\n#     \"\"\"\n#     Add command.\n#     \"\"\"\n#     command = [\"git\", \"add\"] + list(args)\n# \n#     # Execute the git diff command and retrieve the output as a string\n#     execute_no_trace(command)\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n#         return\n# \n#     title, body = title_body_tuple\n# \n#     command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body]  # type: ignore\n#     print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n#     base_branch, title: Optional[str], body: Optional[str]\n# ) -> Optional[Tuple[str, str]]:\n#     settings = Settings()\n# \n#     diff_output = run_diff((base_branch,))\n#     if not diff_output:\n#         diff_output = \"\"\n# \n#     title_response: Union[ModelError, str]\n#     body_response: Union[ModelError, str]\n#     if title is None and body is None:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n# from mindflow.utils.prompts import PR_BODY_PREFIX\n# from mindflow.utils.prompts import PR_TITLE_PREFIX\n# \n# \n# def run_pr(args: Tuple[str], title: Optional[str] = None, body: Optional[str] = None):\n#     base_branch = get_flag_value(args, [\"--base\", \"-B\"])\n# \n#     if base_branch is None:\n#         # Determine the name of the default branch\n#         base_branch = (\n#             subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n#             .decode()\n#             .strip()\n#             .split(\"/\")[-1]\n#         )\n# \n#     if not title or not body:\n#         title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n#     if not title_body_tuple:\n# --------------------------------------------------\n# the below code fragment can be found in:\n# mindflow/core/git/pr.py\n# --------------------------------------------------\n#             subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])\n#             .decode()\n#             .strip()\n#             .split(\"/\")[-1]\n#         )\n# \n#     if not title or not body:\n#         title_body_tuple = create_title_and_body(base_branch, title, body)\n# \n#     if not title_body_tuple:\n#         return\n# \n#     title, body = title_body_tuple\n# \n#     command: List[str] = [\"gh\", \"pr\", \"create\"] + list(args) + [\"--title\", title, \"--body\", body]  # type: ignore\n#     print(execute_no_trace(command))\n# \n# \n# def create_title_and_body(\n#     base_branch, title: Optional[str], body: Optional[str]\n# --------------------------------------------------\n\nfrom typing import Optional, Tuple, List\n\nfrom mindflow.core.git.pr import create_title_and_body\nfrom mindflow.utils.command_parse import get_flag_value\nfrom mindflow.utils.execute import execute_no_trace\n\n\ndef run_mr(\n    args: Tuple[str], title: Optional[str] = None, description: Optional[str] = 
None\n):\n    base_branch = get_flag_value(args, [\"--target-branch\", \"-b\"])\n\n    if base_branch is None:\n        # Determine the name of the default branch\n        base_branch = (\n            subprocess.check_output([\"git\", \"symbolic-ref\", \"refs/remotes/origin/HEAD\"])"
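
For reference, here is a minimal sketch of how the Huggingface-side count could be reproduced. The `Salesforce/codegen-2B-multi` checkpoint and the `prompt.txt` file are assumptions on my part, not something stated in this issue:

```python
# Sketch: count tokens with the Huggingface Codegen tokenizer.
# Assumes the long prompt above has been saved to prompt.txt and that
# Salesforce/codegen-2B-multi is the tokenizer being compared against.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-multi")

with open("prompt.txt") as f:
    prompt = f.read()

input_ids = tokenizer(prompt)["input_ids"]
print(len(input_ids))  # Huggingface reports ~1144 tokens for this prompt,
                       # vs 1473 (2B) / 1222 (6B) in the GGML logs
```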
@ravenscroftj
Owner

ravenscroftj commented Apr 12, 2023

Thanks for the ticket. I think this could be a bit of a tricky one to debug because the GGML GPT-J tokenizer is implemented from scratch, whereas the Huggingface Codegen tokenizer also has a bunch of token-merging logic which I don't think GGML's tokenizer has (I will try to confirm).

I can't comment on whether this is likely to significantly impact the performance of the model - that would need testing empirically.

Was there a specific use case you have in mind that this is blocking?

@thakkarparth007
Author

Hey, yeah, I was planning to use this for benchmarking the 4-bit performance of the Codegen models. Most of the prompts I have are 1500 tokens or more, and these overflow the 2048-token context window when tokenized incorrectly. I guess one way to get around this is to accept pretokenized inputs.

@ravenscroftj
Owner

Ah OK, that makes sense, thanks for clarifying. I will look into the tokenizer behaviour properly, probably over the weekend, but in the meantime I will see if I can add a REST endpoint to the codegen server that accepts an array of tokens as a JSON list. Then you can pretokenize your input using the Huggingface tokenizer. I'll keep you posted!
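
As a rough sketch of what that workflow might look like once such an endpoint exists (the URL, endpoint path, and JSON field names below are purely illustrative and not the actual codegen-server API):

```python
# Sketch only: pretokenize locally with the Huggingface tokenizer and send
# the raw token ids to the server. The endpoint path and payload shape here
# are made up for illustration; the real API may differ.
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-multi")

prompt = "def fib(n):"
token_ids = tokenizer(prompt)["input_ids"]

resp = requests.post(
    "http://localhost:8080/v1/completions_tokens",    # hypothetical endpoint
    json={"tokens": token_ids, "max_new_tokens": 64},  # hypothetical payload
)
print(resp.json())
```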

@thakkarparth007
Author

Thanks! I just created a PR here to allow pretokenized inputs: ravenscroftj/ggml#2

It seems to work fine for me.

@ravenscroftj
Owner

That's really cool, thank you for your contribution - I have accepted the MR. I will leave this ticket open as a reminder to look into the tokenizer behaviour anyway.

Sidenote - I'd be really interested in your evaluation of the 4 bit model if you're willing to share it!

@thakkarparth007
Author

thakkarparth007 commented Apr 13, 2023

Thanks!

I have performed a preliminary evaluation of the 6B 4-bit model on Python. I ran the model on ~2000 code-completion scenarios in Python (I have a custom dataset) and found about a 15% degradation in the exact-match metric at the first line. Here's what the graph looks like:
[attached graph: first-line exact-match comparison]

I manually looked at some of the mispredictions and they seemed okay to me, but they were getting penalized because they weren't exact matches. I think one interesting thing to do would be to check how different the probabilities of the 16-bit and 4-bit predictions are.

@ravenscroftj ravenscroftj changed the title Incorrect tokenization Investigate tokenizer behaviour to understand why it is different to Huggingface Tokenizer Apr 14, 2023