Mitigate prompt injection attacks by supporting "safe" encoding (encoding without special tokens) #1347
Comments
Hey! This is planned! The equivalent was merged for transformers in this PR. The changes for Rust are a little bit more advanced, but definitely on my todo!
Great news, thank you for the update @ArthurZucker. I would have loved to help but I'm unfortunately not very knowledgeable in Rust. I'll gladly follow the topic and test the feature when it comes out!
I feel like this is not a sane default. It may have merit in certain contexts, but I wouldn't call it safe by any means. Prompt injection is by no means limited to injecting special tokens; basically, any form of text can already escape. This feels like a very weak form of safety, and it defeats the purpose of having a very flexible input ground (where users can create arbitrarily complex prompts, like for chat, without having to handle any special new API in this lib). We can definitely add it. It should be quite easy, since I think we should only be skipping the added_vocabulary step (this depends on whether the special tokens are also in the core vocab, which might vary from tokenizer to tokenizer).
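To make the caveat about the core vocab concrete, here is a minimal toy sketch in pure Python. It is not the library's implementation: `ADDED_TOKENS`, `greedy_encode`, `encode`, and the `use_added_vocabulary` flag are all hypothetical names, and the "core model" is just a greedy longest-match over a small set of strings. It only illustrates why skipping the added-vocabulary step may or may not be enough, depending on what the core vocab contains.

```python
import re

# Hypothetical added (special) tokens, matched before the core model runs.
ADDED_TOKENS = ["<SYSTEM>"]


def greedy_encode(text, vocab):
    """Toy core model: greedy longest match over `vocab`, with a
    single-character fallback for anything the vocab does not cover."""
    longest = max((len(t) for t in vocab), default=1)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                out.append(piece)
                i += size
                break
    return out


def encode(text, core_vocab, use_added_vocabulary=True):
    """Optionally split on added tokens first, then run the toy core model."""
    if not use_added_vocabulary:
        # "Safe" path in this sketch: the added-vocabulary step is skipped.
        return greedy_encode(text, core_vocab)
    pattern = "(" + "|".join(re.escape(t) for t in ADDED_TOKENS) + ")"
    out = []
    for chunk in re.split(pattern, text):
        if chunk in ADDED_TOKENS:
            out.append(chunk)  # kept whole as a special token
        elif chunk:
            out.extend(greedy_encode(chunk, core_vocab))
    return out


vocab_without_special = {"SYSTEM", "hi"}
vocab_with_special = {"SYSTEM", "hi", "<SYSTEM>"}

print(encode("<SYSTEM>hi", vocab_without_special))
print(encode("<SYSTEM>hi", vocab_without_special, use_added_vocabulary=False))
print(encode("<SYSTEM>hi", vocab_with_special, use_added_vocabulary=False))
```

In this toy setup, skipping the added-vocabulary step splits `<SYSTEM>` into ordinary pieces only when the string is absent from the core vocab. When the core vocab itself contains `<SYSTEM>`, the last call still emits it as a single token, which is the tokenizer-dependent case mentioned above.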
Hi @Narsil, I'm not sure I understand your point. I agree that injecting special tokens is not the only way to do prompt injection, but it's at least one failure mode that should be addressable; the proposal was not meant to solve the entire issue. As for the safety part of it, I see your point. I think the rationale behind having this default is that there is an inherent ambiguity with respect to how special tokens should be handled. Without an explicit intent from the developer, no sane default can be inferred, because there are situations in which one way of handling special tokens is desired but not the other. In these sorts of cases, I tend to think that throwing an exception is a sane default, or at least a better one than silently making a wrong assumption. I'm not particularly attached to the idea of throwing an exception though: having a warning, requiring a mandatory keyword argument, or even just explicitly documenting the default behavior and providing an alternative behavior would accomplish roughly the same goal.
@ArthurZucker Great idea. I also like "split_special_tokens" to handle special tokens (huggingface/transformers#26468).
Any updates?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
The issue is still relevant
Yep sorry, I'll finally have time to pick it up!
Hi @ArthurZucker, great news, no need to be sorry, keep up the amazing work 🚀 If you need help during testing don't hesitate to reach out!
Sorry for the delay, I have a lot on my plate, but I'm prioritizing a release next week, including this!
There may already exist a way of accomplishing what I'm going to describe but I didn't find it by reading the documentation.
In certain applications, we should be careful about how special tokens are encoded, as they can be used to trigger special capabilities in models or give them special positional clues (system prompt, etc.). Hence, when serving a model to end users, we need to prevent injection attacks, in which the user sends the representation of a special token as plain text (e.g. `<SYSTEM>`), and the tokenizer interprets this text as a special token. In this regard, OpenAI's `tiktoken` tokenizer has a very safe default of raising an exception if it encounters text that corresponds to a special token, see the corresponding docstring. This effectively forces the developer to be very intentional about how special tokens should be handled, thus preventing injection attacks.
Such a default behavior would break existing code. An alternative would be to have a `.safe_encode` method that would throw an exception if it encounters text that corresponds to a special token, mirroring what `tiktoken` is doing, and allowing/disallowing special tokens using a whitelist/blacklist argument. Disallowed special tokens should be treated as plain text and NOT as representations of special tokens, i.e. `<SYSTEM>` should be tokenized as `["<", "SYSTEM", ">"]` or otherwise depending on the vocabulary, but most importantly it should NOT be interpreted as the `<SYSTEM>` token unless explicitly enabled by the developer.
Is there an existing way of mirroring `tiktoken` behavior, and if not, would such a feature be useful to the library?
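For illustration, the behavior requested in this issue could look roughly like the following pure-Python sketch. It is not an existing API of this library: `safe_encode`, `plain_tokenize`, and the `SPECIAL_TOKENS` table are hypothetical, and the word/punctuation splitter stands in for a real tokenization model. The `allowed_special` and `disallowed_special` parameter names are borrowed from `tiktoken`'s `encode` signature, under the assumption that mirroring them is what is wanted.

```python
import re

# Hypothetical special tokens known to the (toy) tokenizer.
SPECIAL_TOKENS = {"<SYSTEM>": 0, "<USER>": 1}


def plain_tokenize(text):
    """Toy stand-in for the real model: split into words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


def safe_encode(text, allowed_special=frozenset(), disallowed_special="all"):
    """Encode `text`, treating special tokens as special only when allowed.

    - Tokens in `allowed_special` are kept as single special tokens.
    - Tokens in `disallowed_special` raise ValueError if present in `text`
      ("all" means every special token that is not explicitly allowed).
    - Any other special-token text is tokenized as plain text.
    """
    if disallowed_special == "all":
        disallowed = set(SPECIAL_TOKENS) - set(allowed_special)
    else:
        disallowed = set(disallowed_special)
    for token in disallowed:
        if token in text:
            raise ValueError(f"Disallowed special token in input: {token!r}")

    allowed = [t for t in SPECIAL_TOKENS if t in allowed_special]
    if not allowed:
        return plain_tokenize(text)

    # Split on allowed special tokens only; longest first to avoid partial matches.
    ordered = sorted(allowed, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
    pieces = []
    for chunk in re.split(pattern, text):
        if chunk in allowed:
            pieces.append(chunk)
        elif chunk:
            pieces.extend(plain_tokenize(chunk))
    return pieces
```

With this sketch, `safe_encode("<SYSTEM> hi")` raises `ValueError`, `safe_encode("<SYSTEM> hi", disallowed_special=())` treats the text as plain input, and `safe_encode("<SYSTEM> hi", allowed_special={"<SYSTEM>"})` keeps `<SYSTEM>` as a single token. The developer has to state the intent explicitly in every case where special-token text appears.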