
do not support unknown special tokens #234

Closed
yxchng opened this issue Dec 8, 2024 · 1 comment
yxchng commented Dec 8, 2024

How do I resolve this error when fine-tuning?

ValueError: For now, we do not support unknown special tokens
In the future, if there is a need for this, we can add special tokens to the tokenizer
starting from rank 100261 - 100263 and then 100266 - 100275.
And finally, we can re-construct the enc object back
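The error message hints at how the fix would work: the base vocabulary leaves a few id slots (ranks) unassigned, and new special tokens would be mapped onto those free ranks before the encoder is rebuilt. A minimal sketch of that mapping, using the rank ranges named in the error; the token names are hypothetical placeholders, not part of any real encoding:

```python
# Sketch: assign new special tokens to the free rank ranges the error
# message mentions (100261-100263 and 100266-100275). The token names
# below are hypothetical placeholders.
free_ranks = list(range(100261, 100264)) + list(range(100266, 100276))

new_tokens = ["<|custom_1|>", "<|custom_2|>", "<|custom_3|>"]
if len(new_tokens) > len(free_ranks):
    raise ValueError("not enough free ranks for the requested special tokens")

# Map each new token onto the next unused rank.
special_tokens = {tok: rank for tok, rank in zip(new_tokens, free_ranks)}
print(special_tokens)
# {'<|custom_1|>': 100261, '<|custom_2|>': 100262, '<|custom_3|>': 100263}
```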
@leestott (Contributor) commented:

This error typically occurs when the tokenizer encounters special tokens that it doesn't recognize. Here are some steps to resolve this issue:

  1. Identify the Unknown Tokens: Check your dataset to identify any special tokens that are not recognized by the tokenizer.

  2. Add Special Tokens to the Tokenizer:

    • You can add the special tokens to the tokenizer manually. Here's an example in Python using the Hugging Face transformers library:

      from transformers import AutoTokenizer
      
      tokenizer = AutoTokenizer.from_pretrained('your-model-name')
      
      special_tokens_dict = {'additional_special_tokens': ['<special1>', '<special2>', '<special3>']}
      num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
      
      print(f"Added {num_added_toks} special tokens.")
  3. Reconstruct the Encoding Object: After adding the special tokens, you may need to re-encode your dataset to ensure the new tokens are properly integrated.

  4. Update the Model: If you're using a pre-trained model, make sure to update it to recognize the new special tokens:

    from transformers import AutoModelForSequenceClassification
    
    model = AutoModelForSequenceClassification.from_pretrained('your-model-name')
    model.resize_token_embeddings(len(tokenizer))
  5. Re-run the Fine-tuning Process: With the tokenizer and model updated, you can re-run your fine-tuning process.
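The key dependency in the steps above is between step 2 and step 4: adding tokens grows the vocabulary, so the model's embedding table must grow to match, or the new ids would index past it. A toy illustration of that relationship, using simplified stand-in classes rather than the Hugging Face API:

```python
# Toy stand-ins for a tokenizer and model, illustrating why step 4
# (resizing embeddings) must follow step 2 (adding special tokens).
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = dict(vocab)

    def add_special_tokens(self, tokens):
        added = 0
        for tok in tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)  # next free id
                added += 1
        return added

    def __len__(self):
        return len(self.vocab)


class ToyModel:
    def __init__(self, vocab_size, dim=4):
        # one embedding row per vocabulary entry
        self.embeddings = [[0.0] * dim for _ in range(vocab_size)]

    def resize_token_embeddings(self, new_size):
        dim = len(self.embeddings[0])
        while len(self.embeddings) < new_size:
            self.embeddings.append([0.0] * dim)


tokenizer = ToyTokenizer({"hello": 0, "world": 1})
model = ToyModel(len(tokenizer))

tokenizer.add_special_tokens(["<special1>", "<special2>"])
# Without this resize, the new ids 2 and 3 would have no embedding row.
model.resize_token_embeddings(len(tokenizer))
assert len(model.embeddings) == len(tokenizer)
```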

I would also suggest looking at the Phi-3 Fine-tuning with Microsoft Olive lab in the provided resources: https://github.com/microsoft/Phi-3CookBook/blob/main/code/04.Finetuning/olive-lab/readme.md

@leestott closed this as completed Jan 8, 2025