Please support phi-1_5 and phi-2 #34

Closed

RadixSeven opened this issue Apr 13, 2024 · 1 comment

@RadixSeven (Contributor)

The README asks us to report unsupported models, and supported_models.yaml lists microsoft/phi-1_5 as supported.

However, when I run:

python cfg_generate.py -m "microsoft/phi-1_5" 'You would represent "My dog Sparky is 7 years old and weighs 21 kg" in JSON as '

where cfg_generate.py is only a minor modification of the code in the README:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import (
    GrammarConstrainedLogitsProcessor,
)

DEFAULT_GRAMMAR = "examples/grammars/json.ebnf"

DEFAULT_MODEL = "openai-community/gpt2-xl"


def main():
    parser = argparse.ArgumentParser(
        description="Generate text using a specified model and grammar."
    )
    parser.add_argument(
        "-m",
        "--model_id",
        default=DEFAULT_MODEL,
        help=f"The ID of the model to use. (default: {DEFAULT_MODEL})",
    )
    parser.add_argument(
        "-g",
        "--grammar_file",
        default=DEFAULT_GRAMMAR,
        help=f"The path to the grammar file. (default: {DEFAULT_GRAMMAR})",
    )
    parser.add_argument(
        "prompts", nargs="+", help="The prompts to use for generation."
    )
    args = parser.parse_args()

    device = torch.device("cpu")
    print(f"Using device: {device}")

    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(args.model_id).to(device)
    model.generation_config.pad_token_id = model.generation_config.eos_token_id

    # Load json grammar
    with open(args.grammar_file, "r") as file:
        grammar_str = file.read()
    grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
    grammar_processor = GrammarConstrainedLogitsProcessor(grammar)

    # Generate
    input_ids = tokenizer(
        args.prompts,
        add_special_tokens=False,
        return_tensors="pt",
        padding=True,
    )["input_ids"]
    output = model.generate(
        input_ids,
        max_length=50,
        logits_processor=[grammar_processor],
        repetition_penalty=1.1,
        num_return_sequences=1,
    )

    # decode output
    generations = tokenizer.batch_decode(output, skip_special_tokens=True)
    print(generations)


if __name__ == "__main__":
    main()

I get:

Using device: cpu
tokenizer_config.json: 100%|███████████████████| 237/237 [00:00<00:00, 4.06MB/s]
vocab.json: 100%|████████████████████████████| 798k/798k [00:00<00:00, 10.9MB/s]
merges.txt: 100%|████████████████████████████| 456k/456k [00:00<00:00, 9.77MB/s]
tokenizer.json: 100%|██████████████████████| 2.11M/2.11M [00:00<00:00, 15.0MB/s]
added_tokens.json: 100%|███████████████████| 1.08k/1.08k [00:00<00:00, 21.9MB/s]
special_tokens_map.json: 100%|███████████████| 99.0/99.0 [00:00<00:00, 1.64MB/s]
config.json: 100%|█████████████████████████████| 864/864 [00:00<00:00, 15.9MB/s]
pytorch_model.bin: 100%|███████████████████| 2.84G/2.84G [03:33<00:00, 13.3MB/s]
generation_config.json: 100%|████████████████| 74.0/74.0 [00:00<00:00, 1.07MB/s]
WARNING:transformers_cfg.vocab_struct:Warning: unrecognized tokenizer: using default token formatting
Traceback (most recent call last):
  File "/home/eric/Prj/cfg_llm_security/cfg_generate.py", line 73, in <module>
    main()
  File "/home/eric/Prj/cfg_llm_security/cfg_generate.py", line 59, in main
    output = model.generate(
             ^^^^^^^^^^^^^^^
  File "/home/eric/venv/cfg_llm_security/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric/venv/cfg_llm_security/lib/python3.11/site-packages/transformers/generation/utils.py", line 1479, in generate
    return self.greedy_search(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/eric/venv/cfg_llm_security/lib/python3.11/site-packages/transformers/generation/utils.py", line 2353, in greedy_search
    next_tokens_scores = logits_processor(input_ids, next_token_logits)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric/venv/cfg_llm_security/lib/python3.11/site-packages/transformers/generation/logits_process.py", line 97, in __call__
    scores = processor(input_ids, scores)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric/venv/cfg_llm_security/lib/python3.11/site-packages/transformers_cfg/generation/logits_process.py", line 102, in __call__
    return self.process_logits(input_ids, scores)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric/venv/cfg_llm_security/lib/python3.11/site-packages/transformers_cfg/generation/logits_process.py", line 95, in process_logits
    self.mask_logits(scores, scores.device)
  File "/home/eric/venv/cfg_llm_security/lib/python3.11/site-packages/transformers_cfg/generation/logits_process.py", line 57, in mask_logits
    logits[~acceptance] = -math.inf
    ~~~~~~^^^^^^^^^^^^^
IndexError: The shape of the mask [1, 50295] at index 1 does not match the shape of the indexed tensor [1, 51200] at index 1

Because of the following line in the error output:

WARNING:transformers_cfg.vocab_struct:Warning: unrecognized tokenizer: using default token formatting

I suspect the problem is that the tokenizer selected by AutoTokenizer differs from what your code expects. The same thing happens with phi-2.

I am working on a time-sensitive project right now, so I won't be able to help beyond reporting the bug. Feel free to close this as "won't do"; I won't feel bad. I've already received the rest of your work for free. (And if I can't complete the project without fixing it, I might submit a pull request.)

Saibo-creator self-assigned this Apr 16, 2024

@Saibo-creator (Collaborator) commented Apr 16, 2024

Hello @RadixSeven,
Thank you for reporting this error! I understand its cause and will post a solution and explanation in this discussion. Another user has reported the same issue with T5.

Reason

The problem stems from a deliberate design decision in Phi (and in similar models such as T5): there is a discrepancy between the tokenizer vocabulary (50295 tokens) and the model's embedding size (51200). The extra embedding rows leave room for special tokens to be added in the future.
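
To see the mismatch concretely, a quick check along these lines (a sketch; the exact figures may vary with the transformers version) compares the tokenizer length with the number of embedding rows:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Compare the tokenizer vocabulary with the model's embedding rows for microsoft/phi-1_5.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")

print(len(tokenizer))                                # tokenizer vocabulary, e.g. 50295
print(model.get_input_embeddings().weight.shape[0])  # embedding matrix rows, e.g. 51200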

Fix

To resolve this, call model.resize_token_embeddings(len(tokenizer)) before running inference:

    model = AutoModelForCausalLM.from_pretrained(args.model_id).to(device)
    model.generation_config.pad_token_id = model.generation_config.eos_token_id
    # Shrink the embedding matrix (and thus the logits) to the tokenizer vocabulary size.
    model.resize_token_embeddings(len(tokenizer))

    # Load json grammar
    with open(args.grammar_file, "r") as file:
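
As a quick sanity check after the resize (a suggestion, not something transformers-cfg requires), you can assert that the embedding matrix now matches the tokenizer length, which is the width the grammar mask expects:

    # The logits width now equals len(tokenizer), so the acceptance mask fits.
    assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)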

With your example I get the following output: 'You would represent "My dog Sparky is 7 years old and weighs 21 kg" in JSON as {"name":["Dog","Sparky"],"age":[7,21],"weight":[21]}'

We will soon update our code to handle this automatically, so users won't have to manage it on their own.

More details:

There are two related discussions in the HF community:

A similar practice has also been used with other models, such as T5.
