llama3 custom regex split #6965
Conversation
As you can see in the commits: …
Finally, in Python:

```python
bytes([0xFF,0xFD,0,0]).decode("utf-32")  # Ok, returns '\ufdff'
bytes([0xFF,0xFE,0,0]).decode("utf-32")  # Fail, returns empty string '', expected '\ufeff'
bytes([0xFF,0xFF,0,0]).decode("utf-32")  # Ok, returns '\uffff'
bytes([0xFE,0xFE,0,0]).decode("utf-32")  # Ok, returns '\ufefe'
```

The failing case is the codec's BOM handling: `FF FE 00 00` is the UTF-32-LE byte-order mark, so Python's `"utf-32"` decoder consumes it instead of yielding U+FEFF. So I used …
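For context, a minimal C++ sketch (my illustration, not from the thread) decoding the same four bytes by hand shows the codepoint is really there; nothing outside the Python codec treats U+FEFF specially:

```cpp
#include <cstdint>
#include <cstdio>

// Assemble a UTF-32LE code unit manually; unlike Python's "utf-32" codec,
// this does not strip a leading BOM, so U+FEFF survives as a codepoint.
static uint32_t utf32le_cpt(const uint8_t b[4]) {
    return (uint32_t) b[0] | ((uint32_t) b[1] << 8) | ((uint32_t) b[2] << 16) | ((uint32_t) b[3] << 24);
}

int main() {
    const uint8_t bom[4] = {0xFF, 0xFE, 0x00, 0x00};
    std::printf("U+%04X\n", (unsigned) utf32le_cpt(bom));  // prints U+FEFF
}
```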
Also, I'm looking into generating the table …
Maybe we can generate a not-so-random corpus trying to cover all category tokens.
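As an illustration of that idea, here is a rough sketch (not from the PR) that walks the codepoint space and keeps a few samples per `CODEPOINT_TYPE_*` bucket. `unicode_cpt_to_utf8` is assumed to be the UTF-8 encoder from `unicode.h`:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

int unicode_cpt_type(uint32_t cp);             // from unicode.h
std::string unicode_cpt_to_utf8(uint32_t cp);  // assumed helper: encode cp as UTF-8

// Build a "not-so-random" corpus: a few sample codepoints from every
// category, so each CODEPOINT_TYPE_* bucket is exercised at least once.
std::string make_category_corpus(size_t per_category = 8) {
    std::map<int, size_t> taken;  // CODEPOINT_TYPE_* -> samples collected so far
    std::string corpus;
    for (uint32_t cp = 0x20; cp <= 0x10FFFF; cp++) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;  // skip surrogates
        const int type = unicode_cpt_type(cp);
        if (taken[type]++ < per_category) {
            corpus += unicode_cpt_to_utf8(cp);
            corpus += ' ';
        }
    }
    return corpus;
}
```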
Instead of doing:

```cpp
#define CODEPOINT_TYPE_UNIDENTIFIED 0
#define CODEPOINT_TYPE_NUMBER       1
#define CODEPOINT_TYPE_LETTER       2
#define CODEPOINT_TYPE_SEPARATOR    3
#define CODEPOINT_TYPE_ACCENT_MARK  4
#define CODEPOINT_TYPE_PUNCTUATION  5
#define CODEPOINT_TYPE_SYMBOL       6
#define CODEPOINT_TYPE_CONTROL      7

int unicode_cpt_type(uint32_t cp);
int unicode_cpt_type(const std::string & utf8);
```

What do you think about using bitfields?

```cpp
struct unicode_codepoint_flags {
    // CODEPOINT_TYPE_*
    uint8_t is_undefined   : 1;
    uint8_t is_number      : 1;  // regex: \p{N}
    uint8_t is_letter      : 1;  // regex: \p{L}
    uint8_t is_separator   : 1;  // regex: \p{Z}
    uint8_t is_accent_mark : 1;  // regex: \p{M}
    uint8_t is_punctuation : 1;  // regex: \p{P}
    uint8_t is_symbol      : 1;  // regex: \p{S}
    uint8_t is_control     : 1;  // regex: \p{C}
    // helper flags
    uint8_t is_whitespace  : 1;  // regex: \s
    uint8_t is_lowercase   : 1;
    uint8_t is_uppercase   : 1;
    uint8_t is_nfd         : 1;
};

unicode_codepoint_flags unicode_cpt_flags(uint32_t cp);
unicode_codepoint_flags unicode_cpt_flags(const std::string & utf8);
```

This way we can track in parallel the unicode category and additional useful flags like …
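To make the proposal concrete, a small usage sketch (my illustration, assuming only the declarations above): one flags lookup can answer several regex-class questions at once, instead of one `unicode_cpt_type()` call per category.

```cpp
// Illustrative only: classify a codepoint once and reuse the flags.
bool is_word_cpt(uint32_t cp) {
    const unicode_codepoint_flags flags = unicode_cpt_flags(cp);
    return flags.is_letter || flags.is_number;  // \p{L} or \p{N}
}

bool is_whitespace_boundary(uint32_t prev, uint32_t cur) {
    // a candidate split point for the \s+ alternatives:
    // the whitespace status changes between adjacent codepoints
    return unicode_cpt_flags(prev).is_whitespace != unicode_cpt_flags(cur).is_whitespace;
}
```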
Let's change the target branch to …
A regex engine is forbidden because parsing by hand and syncing unicode tables is easier, but a huge library with dozens of files is an acceptable …
Yes, seems like a good idea. P.S. I changed the target branch to …
There are still models failing (bert, phi-3), but they are not related to this PR (llama-3).
Ok, I'm going to open another one with a new …
Force-pushed from 9d346d0 to 12a7b69.
I discovered a failing case using `llama-bpe`: 'Cửa Việt'

```
src: 'Cửa Việt'
res: 'Cửa Việt'
tok: 34 91163 11655 26298 83

main : failed test:     'Cửa Việt'
main : detokenized to:  'Cửa Việt' instead of 'Cửa Việt'
main : expected tokens: 34 'C', 91163 'ửa', 101798 ' Việt',
main : got tokens:      34 'C', 91163 'ửa', 11655 ' Vi', 26298 'ệ', 83 't',
```

This also fails on `master`.
Other than that, I think this PR is great - nice job!
I couldn't run the random tokenizer tests because I'm on Mac atm, but will give this a try later on my Linux box.
Will merge this after the CI is green
gg's intent is to merge after the CI is green... all checks have passed... merging
Oh, that one was not easy to fix. The core of the problem is a missing `if`: when the `ignore_merges` option is set, the whole word should first be looked up in the vocab, and BPE merges applied only if it is not found. The llama-3 `tokenizer.json` enables this option:

```json
"model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": true,
    ...
}
```

Not sure if we need to recover this variable from the GGUF file, or just hardcode it to `true`.

This is a patch to check that it fixes the problem (llama-3 and gpt-2 passed the brute force tests):

```diff
diff --git a/llama.cpp b/llama.cpp
index da7c0d9f..73deb462 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -12298,6 +12298,12 @@ struct llm_tokenizer_bpe {
             int index = 0;
             size_t offset = 0;
 
+            static const bool ignore_merges = true;
+            if (ignore_merges && vocab.token_to_id.find(word) != vocab.token_to_id.end()) {
+                symbols.emplace_back(llm_symbol{-1, -1, word.c_str(), word.size()});
+                offset = word.size();
+            }
+
             while (offset < word.size()) {
                 llm_symbol sym;
                 size_t char_len = std::min(word.size() - offset, (size_t) ::utf8_len(word[offset]));
```

Also I realized my brute-force test is wrong. My intention here is to check all vocab words, but I'm stripping whitespace... fix to come.
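For reference, a minimal sketch of what the corrected brute-force check could look like (not the PR's actual test code); `tokenize` and `detokenize` stand in for whatever wrappers the test binary provides around the llama.cpp API:

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Check that every vocab word round-trips through tokenize + detokenize.
int check_vocab(const std::vector<std::string> & vocab,
                const std::function<std::vector<int>(const std::string &)> & tokenize,
                const std::function<std::string(const std::vector<int> &)> & detokenize) {
    int failures = 0;
    for (const std::string & word : vocab) {
        // The fix: feed the word verbatim. Do NOT strip whitespace,
        // otherwise tokens with a leading space (like ' Việt') are never tested.
        if (detokenize(tokenize(word)) != word) {
            std::printf("failed: '%s'\n", word.c_str());
            failures++;
        }
    }
    return failures;
}
```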
Implementation of `unicode_regex_split_custom_llama3()`.
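The llama-3 pretokenizer regex is matched by hand rather than through a regex engine. To give a flavor of the approach (a simplified illustration, not the function's actual body), a single alternative such as `\p{N}{1,3}` reduces to a bounded scan over codepoint categories:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

bool cpt_is_number(uint32_t cp);  // assumed: e.g. unicode_cpt_flags(cp).is_number

// Emulate the "\p{N}{1,3}" alternative of the llama-3 split regex:
// starting at pos, count up to three consecutive numeric codepoints.
size_t match_number_run(const std::vector<uint32_t> & cpts, size_t pos) {
    size_t len = 0;
    while (len < 3 && pos + len < cpts.size() && cpt_is_number(cpts[pos + len])) {
        len++;
    }
    return len;
}
```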