[Feature Request] [stdlib] [proposal] Unicode Indexing for String and buffer idx for StringSlice in __getitem__ and __setitem__ #3246
Comments
(As discussed internally with @connorg):
You're right, I totally forgot that existed. I'll leave the issue open until we get the utf32 indexing done. If somebody else wants to tackle it, be my guest. I was thinking, if nobody takes it on by this weekend, to give it a go and try to set up something faster for counting utf8 continuation bytes with SIMD, since it is used in other places.
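For context, a minimal Python sketch of the counting in question (helper names are hypothetical): a UTF-8 continuation byte has the bit pattern 10xxxxxx, so the code point count is the byte count minus the continuation byte count. The SIMD approach mentioned above would vectorize this same test.

```python
def count_continuation_bytes(data: bytes) -> int:
    # A UTF-8 continuation byte has the bit pattern 0b10xxxxxx,
    # i.e. (b & 0xC0) == 0x80.
    return sum(1 for b in data if (b & 0xC0) == 0x80)

def codepoint_length(data: bytes) -> int:
    # Every code point has exactly one non-continuation (leading) byte.
    return len(data) - count_continuation_bytes(data)

assert codepoint_length("héllo🔥".encode("utf-8")) == 6
```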
@JoeLoser opened PR #3295, but there are some issues with other functions. I think it might be worth rethinking whether we really want utf32 indexing in a utf8-encoded String. It might be better to just build a PythonString that uses utf32. Or to just let indexing be by byte offset and provide an explicit method for Python-style indexing (this is my favorite one, tbh).
I implemented CrazyString and talk about the code point indexing solution for UTF-8 strings in this video: https://youtu.be/31Ka0bUTo2U (the code point indexing part starts at 20:57). If it is something the Mojo stdlib team is interested in, I can write a proposal and we could start implementing it. I personally think that a UTF-32 based solution can be valuable in some use cases, but a UTF-8 representation with an index, as implemented in PyPy and in CrazyString, has a better balance between space and computational efficiency.
Hi @mzaks, cool approach. Though I see a couple of problems with a generic String doing that:
In case you're interested, I finished working on PR #3529, which makes indexing faster. And btw, that flip-and-shift trick to count num_bytes is awesome; I put it in as a global helper.
It depends on the type of mutation. Concatenation of two strings needs a recompute for the right side. Removing from the end is also cheap. Removing from the front will trigger full re-indexing. Removing in the middle can be seen as concatenation of two strings where the left side is removing from the end. It would also be possible to compute the index only on demand, by sacrificing an entry in the index to be the marker. The index is not needed for iteration, so with lazy index creation most use cases might avoid the computation entirely.
The index entries are the byte offsets of every 32nd code point, so a string with byte_length smaller than 32 does not need an index at all.
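As a rough illustration of the scheme being described (a Python sketch with hypothetical names, not the actual CrazyString code): one index entry per 32 code points, each storing the absolute byte offset where that code point begins.

```python
INDEX_STRIDE = 32  # one index entry per 32 code points

def build_codepoint_index(data: bytes) -> list[int]:
    # index[k] holds the absolute byte offset of code point k * INDEX_STRIDE.
    index = []
    codepoints = 0
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # leading byte: a new code point starts here
            if codepoints % INDEX_STRIDE == 0:
                index.append(offset)
            codepoints += 1
    return index
```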
I updated the gist with a more complete CrazyString implementation and more test cases. Have a look at this test: https://gist.github.com/mzaks/78f7d38f63fb234dadb1dae11f2ee3ae#file-test_crazy_string-mojo-L281 There I test exactly what you asked; the indexing per span can be found here: https://gist.github.com/mzaks/78f7d38f63fb234dadb1dae11f2ee3ae#file-crazy_string-mojo-L382 As you can see in the code, it gets a bit more complicated when the step is not 1; otherwise it is the same as looking up two indices.
This needs to be measured: converting UTF-8 to UTF-32 will have a computational overhead on each instantiation and up to 4x the memory usage. And don't forget, although UTF-32 solves the problem of indexing by code point, this is often not enough: often we need to index by user-perceived characters, which might be grapheme clusters, i.e. compositions of multiple UTF-32 chars. Python does not support this kind of indexing, but Swift does. I witnessed questionable results in large data pipelines written in Python which do not take grapheme clusters into account. I think we should be able to do better in Mojo.
That sounds very well optimized, but it will need a very big adjustment effort. Also, the lazy version of index evaluation would mean that memory load & computation is involved every time a string needs to be sliced, and for small strings it might be faster to just count the continuation bytes (see PR #3529, which I think will be very fast). I think we can also mix both approaches.
This optimization sounds like it will only affect big strings, then. Reading the code leaves me with that impression as well.
Agree, we'll need to benchmark all approaches side by side very thoroughly, for different sizes, to be sure of the effects. I'm hoping PR #3523 gets merged for those purposes.
Therein lies the problem: written in Python XD. Nah, just kidding. I also have them in Python, but the queries themselves run on SQL, so no problem there. But I'd never use Python for big-data text processing; I'd go directly to numpy and do it manually, or use the SQL engine's native tools.
We could build separate functions for that, but we need to make
Right, and for this we don't need an index; we use the num_bytes function on the first code point byte. As a matter of fact, the already-implemented _StringSliceIter works exactly this way, without the need for an index.
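For illustration, a Python sketch of that index-free iteration (hypothetical names; the actual _StringSliceIter implementation differs): the leading byte of a UTF-8 sequence encodes the sequence length, so an iterator can hop from leading byte to leading byte.

```python
def num_bytes(leading_byte: int) -> int:
    # UTF-8 sequence length from its leading byte:
    # 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
    if leading_byte < 0x80:
        return 1
    if leading_byte >= 0xF0:
        return 4
    if leading_byte >= 0xE0:
        return 3
    return 2

def iter_codepoint_offsets(data: bytes):
    # Index-free iteration: hop from one leading byte to the next.
    offset = 0
    while offset < len(data):
        yield offset
        offset += num_bytes(data[offset])
```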
Generally a StringSlice does not own the buffer and is rather a lightweight instance, which should not have an index. Short strings (<32 bytes) will also not have an index, as it is trivial / not expensive to calculate the offsets on the fly.
It is rather the other way around: it's an optimization for small strings. Say we have a string of 200 bytes; we would need to reserve an index of at most 7 entries. Now if we go with Int as the type for an entry, which would be the simplest solution, we have 7 * 8 = 56 bytes of overhead, where we actually need just 7 bytes. 56 / 200 = 28%, whereas 7 / 200 = 3.5%.
You actually left me thinking about that yesterday, and I realized we have unnecessary branching there (see PR #3546). But you still need to count bytes for the iterator len (which only gets executed upon instantiation, AFAIK). So yeah, the impact of the index there will be negligible.
The problem with StringSlice not having an index would be that the slicing implementations would differ: String would have an index and StringSlice would not, but both need to index by unicode code points. So maybe we could add an optional index to StringSlice as well.
For me, more than ~50 bytes is big 🤷‍♂️. But anyhow, the effect of the index would also depend on the system's SIMD bitwidth, as in the implementation I did. I'm still thinking, but what if we just give the user the option to spend 2x the memory for the String and in exchange build a lookup table with the summed-up previous continuation bytes? (This one has no option for lazy evaluation, since it would be redundant.) It would be kind of interesting because it gives the trade-off of mutation being slower, but indexing would become O(1).
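A minimal Python sketch of this trade-off as I read it (hypothetical names, an assumption about the proposal rather than a concrete design): store, for every code point, the number of continuation bytes that precede it; the byte offset of code point k is then k plus that stored count.

```python
def build_offset_table(data: bytes) -> list[int]:
    # table[k] = number of continuation bytes before code point k,
    # so byte_offset(k) = k + table[k]; one entry per code point.
    table = []
    continuations = 0
    for b in data:
        if (b & 0xC0) == 0x80:
            continuations += 1
        else:
            table.append(continuations)
    return table

def byte_offset(table: list[int], k: int) -> int:
    # O(1) code point -> byte offset; the price is one table entry per
    # code point, which must be rebuilt on every mutation.
    return k + table[k]
```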
It is a question of API design for the type. I would argue we don't need code point based indexing for the slice. Code point indexing is often needed for more complex/abstract use cases; high-performance use cases like parsing are based on iteration, which is byte based, as we discussed above. So users can easily convert a StringSlice to a String if they do have complex operations, which is probably what they want anyway due to data ownership.
32 code points can occupy between 32 and 128 bytes. It is also imaginable to make this constant configurable or architecture dependent.
Strictly speaking, the 32 code point based indexing is also O(1), as N is not involved in the "tail computation". The space for indexing each code point is 2x only for N < 256; it's 3x for N < 2^16, 5x for N < 2^32, and 9x for N > 2^32.
Yes, I also like the idea of leaving StringSlice byte-offset indexed. That's why I opened this feature request; but see Joe's comment, where he does make the point that people who care that much could just use a dedicated utf32-based type.
Yes that would be a great Mojo use case 🔥
I don't really understand this paragraph. If you mean that UTF32 indexing is O(1), please correct me if I'm getting the algorithm wrong. If you have a 128-byte String, you'd need 4x 32 continuation byte counts for the index. When indexing, say, to unicode index 70, you need to load the 0-32 count slot and the 32-64 one and add them up; if they are 0, then offset directly, else load from byte 32 onwards and count continuation bytes for the offset. I don't see how this is O(1).
This is what I mean by a question of API: a StringSlice does not have to implement the code point based indexing.
Ah, now I see where the confusion comes from. In the index we don't store the count of bytes; we store the absolute offset from the beginning of the pointer. So in the case of indexing into the 70th code point, we look up the second element in the index, which gives us the offset of the 64th code point, and then we need to iterate over the remaining 6 code points. So there is no need for additions: we have an O(1) lookup to the nearest 32nd code point, and then we iterate through the remaining buffer, which can be at most 124 bytes (31 * 4). This is also why the offsets need to be represented as uint32 if the string is longer than 2^16 bytes.
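Putting this together with the earlier sketches (the hypothetical build_codepoint_index and num_bytes above), the lookup described here is one table access plus a bounded scan of at most 31 code points:

```python
def codepoint_to_byte_offset(data: bytes, index: list[int], k: int) -> int:
    # O(1) jump to the byte offset of the nearest preceding
    # 32-code-point boundary stored in the index...
    offset = index[k // INDEX_STRIDE]
    # ...then walk at most 31 code points (<= 124 bytes) to reach k.
    for _ in range(k % INDEX_STRIDE):
        offset += num_bytes(data[offset])
    return offset
```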
👍 I like that idea :)
Ahh ok, now I get it. I think this could seriously become its own type of string for large sequences.
This approach is actually not new; it is used in PyPy strings (https://mail.python.org/pipermail/pypy-dev/2016-March/014277.html). In CrazyString I provide a bool parameter to mark the type as code point indexed; we could, however, have an index type where the user can define whether they want byte, code point, or grapheme cluster based indexing. Might be too fancy though 😄.
What is your request?
When accessing a String index, Python uses unicode code points (utf32), so we should support the same accessing mechanism for String. Sadly, this will kill performance if we implement it as a for loop counting utf8 continuation bytes. Thus, I propose we let StringSlice be buffer indexed, i.e. doing item[3] will get you the fourth byte regardless of encoding, to provide an alternative for people who want better performance.
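For concreteness, a Python sketch of the naive per-access scan the request wants to avoid (hypothetical helper): every index operation walks the buffer from the start, making item access O(n) in the string length.

```python
def byte_index_of_codepoint(data: bytes, k: int) -> int:
    # Walk the whole buffer, skipping 0b10xxxxxx continuation bytes,
    # until the k-th leading byte is found: O(n) per access.
    seen = 0
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            if seen == k:
                return offset
            seen += 1
    raise IndexError("code point index out of range")
```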
What is your motivation for this change?
We currently only do buffer indexing in String, which is not conformant with Python.
Any other details?
No response