-
Notifications
You must be signed in to change notification settings - Fork 9.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Tokenizer Class: Error Handling, Flexibility #640
base: main
Are you sure you want to change the base?
Conversation
Hi @zeelsheladiya! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
meta-llama#640 In this commit, I have addressed the feedback from the code review by replacing instances of `Union[str, None]` with `Optional[str]` in the Tokenizer class. This change aligns with the Python Typing documentation's recommendation for better type hinting.
Hi @DamienAllonsius , I hope you're doing well! I wanted to let you know that I've made the requested change to the code. I replaced the usage of Union[str, None] with Optional[str] as per your feedback. You can review the updated code at. I'd greatly appreciate it if you could take a look and let me know if everything looks good to you now. If you have any further suggestions or feedback, please feel free to share them. Thank you for your time and guidance throughout this process. I look forward to hearing your thoughts on the changes. |
Hi @zeelsheladiya I solved the conflicts with main. |
Hi @DamienAllonsius, As per my Understanding, The We can use a list comprehension to check each token ID returned by Please Let me know your thoughts. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, tested the code but could not generate any character with unk_id encoding
@@ -13,29 +13,32 @@ | |||
|
|||
class Tokenizer: | |||
"""tokenizing and encoding/decoding text using SentencePiece.""" | |||
def __init__(self, model_path: str): | |||
def __init__(self, model_path: Optional[str] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make this optional? How would you initialize the tokenizer without the model file?
# Handle unknown tokens | ||
t = [token_id if token_id in range(self.n_words) else self.unk_id for token_id in t] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you have an example of an input string that requires this?
try: | ||
t = self.sp_model.encode(s) | ||
except Exception as e: | ||
raise ValueError(f"Error during tokenization: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this, the exception itself should be clear enough?
Input Validation and Error Handling:
bos
andeos
parameters in theencode
method to ensure they are boolean values.Flexible Model Loading:
Handling Unknown Tokens:
unk_id
. Tokens outside the vocabulary range are replaced with the<UNK>
token.These changes contribute to the overall reliability and usability of the Tokenizer class, enabling smoother integration into various projects.