
[Feature Request] [stdlib] [proposal] Unicode Indexing for String and buffer idx for StringSlice in __getitem__ and __setitem__ #3246

Open
1 task done
martinvuyk opened this issue Jul 14, 2024 · 14 comments
Labels
enhancement New feature or request mojo Issues that are related to mojo mojo-repo Tag all issues with this label

Comments

@martinvuyk
Contributor

martinvuyk commented Jul 14, 2024

Review Mojo's priorities

What is your request?

When you index into a string in Python, the index refers to Unicode code points (UTF-32 semantics), so we should support the same accessing mechanism for String. Sadly, this will kill performance if we implement it as a for loop that counts UTF-8 continuation bytes.

Thus, I propose we let StringSlice be buffer indexed, i.e. item[3] will get you the fourth byte regardless of encoding, providing an alternative for people who want better performance.
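For illustration, this is the discrepancy in question, shown in plain Python (code point indexing on `str` versus byte indexing on the UTF-8 buffer):

```python
# Python strings index by Unicode code point; the underlying UTF-8 buffer
# indexes by byte. The two disagree as soon as a multi-byte character appears.
s = "héllo"              # 'é' encodes to two bytes in UTF-8 (0xC3 0xA9)
b = s.encode("utf-8")

assert s[1] == "é"       # code point indexing: the second character
assert b[1] == 0xC3      # byte indexing: first byte of 'é'
assert len(s) == 5 and len(b) == 6
```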

What is your motivation for this change?

We currently only do buffer indexing in String, which is not conformant with Python's behavior.

Any other details?

No response

@martinvuyk martinvuyk added enhancement New feature or request mojo-repo Tag all issues with this label labels Jul 14, 2024
Collaborator

(As discussed internally with @connorg):

This is a bit tricky. I see and agree with the argument, but I do worry about being inconsistent between these two fundamental types. If folks get used to codepoint-based indexing with String and then start to use StringSlice, they could easily be confused and write incorrect code without realizing it.

For folks who want to work on raw bytes, they can use StringSlice.as_bytes_slice()[idx] to be explicit about it.

@martinvuyk
Contributor Author

they can use StringSlice.as_bytes_slice()[idx] to be explicit about it

you're right, totally forgot that existed

I'll leave the issue open until we get the UTF-32 indexing done. If somebody else wants to tackle it, be my guest. If nobody takes it on by this weekend, I was thinking of giving it a go and trying to set up something faster for counting UTF-8 continuation bytes with SIMD, since it is used in other places.

@martinvuyk
Contributor Author

@JoeLoser opened PR #3295, but there are some issues with functions such as String.format and String.rfind, and others may have issues that simply aren't well tested. I'm worried that too many places assume String is indexed by byte offset directly.

I think it might be worth rethinking whether we really want UTF-32 indexing in a UTF-8 encoded String. It might be better to just build a PythonString that uses UTF-32, or to let indexing be by byte offset and provide an explicit method for Python-style indexing (this is my favorite option, tbh).

@mzaks
Contributor

mzaks commented Sep 24, 2024

I implemented CrazyString and talk about the code point indexing solution for UTF-8 strings in this video https://youtu.be/31Ka0bUTo2U (the code point indexing part starts at 20:57). If it is something the Mojo stdlib team is interested in, I can write a proposal and we could start implementing it. I personally think that a UTF-32 based solution can be valuable in some use cases, but a UTF-8 representation with an index, as implemented in PyPy and in CrazyString, has a better balance between space and computational efficiency.

@martinvuyk
Contributor Author

Hi @mzaks cool approach. Though I see a couple of problems for a generic String to do that:

  • It would only work well with static strings or you'd have to recompute the index on every mutating method
  • Why don't you just use 2 bits for every value? I don't quite understand the logic behind fn _index_byte_width()
  • The calculations and index are good for getting the Unicode length of a String without counting every time, but many operations need to index into the String by a certain number of those code points. So if you store the index, you'd have to load and sum index[start_idx: end_idx], which might not always coincide with a power of 2, so the SIMD vector would have to be masked dynamically as well. I'm not sure what performance impact that would have, though it'd probably be less than loading the whole String, and only for large strings.
  • Overall the overhead is maybe only worth it for large strings

In case you're interested, I finished working on PR #3529, which makes indexing faster. And btw, that flip-and-shift trick to count num_bytes is awesome; I added it as a global helper _utf8_first_byte_sequence_length in PR #3528.
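For reference, the first-byte trick works because the leading one bits of a UTF-8 leader byte encode the sequence length. A plain-Python sketch of the same idea (not the Mojo implementation from PR #3528):

```python
def utf8_first_byte_sequence_length(b: int) -> int:
    """Length of the UTF-8 sequence starting with first byte `b`.

    A leader byte has 0 leading one bits (ASCII) or 2..4 of them
    (multi-byte sequences); mapping 0 -> 1 yields the sequence length.
    """
    leading_ones = 8 - ((~b & 0xFF).bit_length())
    return leading_ones if leading_ones else 1

assert utf8_first_byte_sequence_length(ord("a")) == 1
assert utf8_first_byte_sequence_length("é".encode()[0]) == 2   # 0xC3
assert utf8_first_byte_sequence_length("€".encode()[0]) == 3   # 0xE2
assert utf8_first_byte_sequence_length("🔥".encode()[0]) == 4  # 0xF0
```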

@mzaks
Contributor

mzaks commented Sep 24, 2024

  • It would only work well with static strings or you'd have to recompute the index on every mutating method

It depends on the type of mutation. Concatenating two strings only needs a recompute for the right side. Removing from the end is also cheap. Removing from the front will trigger a full re-indexing. Removing in the middle can be seen as a concatenation of two strings where the left side had its end removed. It would also be possible to compute the index only on demand, by sacrificing an entry in the index to be the marker. The index is not needed for iteration, so with lazy index creation most use cases might avoid the computation entirely.

  • Why don't you just use 2 bits for every value? I don't quite understand the logic behind fn _index_byte_width()

The index entries are the offsets of every 32nd code point, so a string with byte_length smaller than 256 will have all entries representable in 1 byte; 2 bytes means we can index strings no longer than 2^16 bytes. If we don't want the dynamic approach, we could go with 4 bytes or even 8 bytes, but I think those few ops in the already non-trivial computation will not make a difference. Although, as always, it needs to be measured and profiled to be 100% sure.
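A minimal Python sketch of the scheme described above (hypothetical helper names; the real CrazyString code is in the linked gist):

```python
def build_index(data: bytes, stride: int = 32) -> list[int]:
    # Record the byte offset of every `stride`-th code point. A byte starts
    # a new code point unless it is a continuation byte (0b10xxxxxx).
    offsets, cp = [], 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            if cp % stride == 0:
                offsets.append(i)
            cp += 1
    return offsets

def entry_width(byte_length: int) -> int:
    # Each entry is a byte offset, so its width follows the buffer size.
    if byte_length < 2**8:
        return 1
    if byte_length < 2**16:
        return 2
    if byte_length < 2**32:
        return 4
    return 8

ascii_buf = ("a" * 40).encode()
multi_buf = ("é" * 40).encode()           # every code point takes 2 bytes
assert build_index(ascii_buf) == [0, 32]
assert build_index(multi_buf) == [0, 64]  # 32nd code point starts at byte 64
assert entry_width(len(ascii_buf)) == 1 and entry_width(300) == 2
```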

  • ... many operations need to index into the String by a certain amount of those codepoints ...

I updated the gist with more complete CrazyString implementation and more test cases. Have a look at this test:

https://gist.github.com/mzaks/78f7d38f63fb234dadb1dae11f2ee3ae#file-test_crazy_string-mojo-L281

There I test exactly what you asked, the indexing per span can be found here:

https://gist.github.com/mzaks/78f7d38f63fb234dadb1dae11f2ee3ae#file-crazy_string-mojo-L382

As you can see in the code, it gets a bit more complicated when the step is not 1; otherwise it's the same as looking up two indices.

  • Overall the overhead is maybe only worth it for large strings

This needs to be measured. Converting UTF-8 to UTF-32 will have a computational overhead on each instantiation and up to 4x the memory usage. And don't forget: although UTF-32 solves the problem of indexing by code point, this is often not enough. We often need to index by user-perceived characters, which might be grapheme clusters, i.e. compositions of multiple UTF-32 chars. Python does not support this kind of indexing, but Swift does. I witnessed questionable results in large data pipelines written in Python which do not take grapheme clusters into account. I think we should be able to do better in Mojo.
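The grapheme cluster point can be demonstrated in plain Python, where code point indexing can split a user-perceived character:

```python
import unicodedata

# A user-perceived character can span several code points: "é" written as
# 'e' plus a combining acute accent is one grapheme but two code points.
composed = "é"            # U+00E9, precomposed
decomposed = "e\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT

assert len(composed) == 1
assert len(decomposed) == 2       # code point indexing sees two items
assert decomposed[0] == "e"       # indexing can split the grapheme
assert unicodedata.normalize("NFC", decomposed) == composed
```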

@martinvuyk
Contributor Author

It depends on the type of mutation. Concatenation of two strings, need a recompute for the right side. Removing from the end is also cheap. Removing from the front will trigger full re-indexing. Removing in the middle can be seen as concatenation of two strings where left side is removing from the end. It also would be possible to compute index only on demand, by sacrificing an entry in the index to be the marker. Index is not needed for iteration, so it might be that with lazy index creation most of the use cases will avoid the computation at all.

That sounds very well optimized, but it will need a very big effort to adjust String to work like that. As for "Index is not needed for iteration": yes it is. You need to know how many bytes to skip on every character iteration.

Also, the lazy version of index evaluation would mean that memory load & computation is involved every time a string needs to be sliced and for small strings it might be faster to just count the continuation bytes (see PR #3529 that I think will be very fast). I think we can also mix both approaches.

The index entries are the offset of 32th code point, so a string with byte_length smaller than 256 will have all the entries representable in 1 byte, 2bytes means that we can index string not longer than 2^16 bytes, if we don't want the dynamic approach we could go with 4 bytes or even 8 bytes, but I think those few ops in the already non trivial computation will not make it a difference.

These optimizations sound like they will only affect big strings, then. Reading the code leaves me with that impression as well.

Although as always it needs to be measured and profiled to be 100% sure

Agree, we'll need to benchmark all approaches side to side very thoroughly for different sizes to be sure of the effects. I'm hoping PR #3523 gets merged for those purposes.

I witnessed questionable results in large data pipelines written in Python

Therein lies the problem: "written in Python" XD. Nah, just kidding, I also have them in Python, but the queries themselves run on SQL so no problem there. I'd never use Python for big-data text processing; I'd go directly to numpy and do it manually or use the SQL engine's native tools.

Python does not support this kind of indexing but Swift does

We could build separate functions for that, but we need to make the __getitem__ dunders behave exactly like Python's.

@mzaks
Contributor

mzaks commented Sep 25, 2024

... You need to know how many bytes to skip on every character iteration.

Right, and for this we don't need an index: we use the num_bytes function on the first code point byte. As a matter of fact, the already implemented _StringSliceIter works this way, without the need for an index.

Also, the lazy version of index evaluation would mean that memory load & computation is involved every time a string needs to be sliced and for small strings it might be faster to just count the continuation bytes (see PR #3529 that I think will be very fast). I think we can also mix both approaches.

Generally a StringSlice does not own the buffer and is rather a lightweight instance, which should not have an index.

Short strings (<32 bytes) will also not have an index, as it is trivial / not expensive to calculate those on the fly.

These optimizations sound like they will only affect big strings, then. Reading the code leaves me with that impression as well.

It is rather the other way around: it's an optimization for small strings. Say we have a string of 200 bytes; we would need to reserve an index of at most 7 entries. If we go with Int as the entry type, which is the simplest solution, we have 7*8 = 56 bytes of overhead, where we actually need just 7 bytes. 56 / 200 = 28%, whereas 7 / 200 = 3.5%.
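The arithmetic above checks out (with entries every 32 code points, a 200-byte string needs at most ceil(200/32) entries):

```python
# Index overhead for a 200-byte string, reproducing the numbers above.
byte_length = 200
entries = -(-byte_length // 32)   # ceil division: 7 entries at most

assert entries == 7
assert entries * 8 / byte_length == 0.28    # Int (8-byte) entries: 28%
assert entries * 1 / byte_length == 0.035   # 1-byte entries: 3.5%
```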

@martinvuyk
Contributor Author

Right and for this we don't need an index, we use the num_bytes function on the first code point byte. As a matter of fact the already implemented _StringSliceIter already works this way, without the need for an index.

You actually left me thinking about that yesterday, and I realized we have unnecessary branching there (see PR #3546), but you still need to count bytes for the iterator's len (which only gets executed upon instantiation AFAIK). So yeah, the impact of the index there will be negligible.

Generally a StringSlice does not own the buffer and is rather a lightweight instance, which should not have an index.
Short string (<32 bytes) will also not have an index as it is trivial / not expensive to calculate those on the fly.

The problem with StringSlice not having an index is that the slicing implementations would differ: String would have an index and StringSlice would not, yet both need to index by Unicode code points. So maybe we could add an optional index to StringSlice as well.

It is rather the other way around, it's an optimization for small strings. Say we have a string of 200 bytes, we would need to reserve an index of at most 7 entries. Now if we go with Int as a type for entry, which will be the simplest solution, we have 7*8 = 56 bytes of overhead, where we actually need just 7 bytes. 56 / 200 = 28% where 7 / 200 = 3.5%

For me, more than ~50 bytes is big 🤷‍♂️. But anyhow, the effect of the index also depends on the system's SIMD bitwidth: the implementation I did for _count_utf8_continuation_bytes() uses the maximum available width and then progressively uses lower bitwidths to load and count until less than a multiple of 8 is left. So for a system with a SIMD byte width of 64, counting 64 bytes is just 1 memory load, an &, a !=, and a reduce_add(). If we build an index every 32 bytes, we'd have to load the 2 consecutive nums (1 memory load) and sum them up. So yes, fewer operations, but we really need to measure the improvement to see at which point there is a payoff.
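As a scalar stand-in for the SIMD routine (the vector version applies the same mask-and-compare across a whole register and horizontally sums the matches, rather than looping byte by byte), counting continuation bytes looks like this in Python:

```python
def count_utf8_continuation_bytes(data: bytes) -> int:
    # A continuation byte has the bit pattern 0b10xxxxxx, i.e.
    # (byte & 0b1100_0000) == 0b1000_0000. The SIMD version performs this
    # mask/compare on a vector of bytes and reduces the matches with an add.
    return sum((b & 0xC0) == 0x80 for b in data)

assert count_utf8_continuation_bytes("hello".encode()) == 0
assert count_utf8_continuation_bytes("héllo".encode()) == 1
assert count_utf8_continuation_bytes("🔥".encode()) == 3  # 4-byte sequence
```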


I'm still thinking, but what if we just give the user the option to pay 2x the memory for the String and in exchange build a lookup table with the summed-up previous continuation bytes? (This one has no option for lazy evaluation since it would be redundant.) It would be kind of interesting because it trades slower mutation for O(1) indexing, as if it were native UTF-32. We could also try to pack the bits to spare some bytes, since we know the maximum possible values for each index.

@mzaks
Contributor

mzaks commented Sep 25, 2024

The problem with StringSlice not having an index would be that the implementation for slicing both will be different, String would have an index and StringSlice not but both need to index by unicode codepoints. So maybe we could add an optional index to StringSlice as well.

It is a question of API design for the type. I would argue we don't need code point based indexing for the slice. Code point indexing is often needed for more complex/abstract use cases, while high-performance use cases like parsing are based on iteration, which is byte based as we discussed above. So users can easily convert a StringSlice to a String if they do have complex operations, which is probably what they want anyway due to data ownership.

For me more than ~50 bytes is big 🤷‍♂️ . But anyhow, the effect of the index would also depend on system simd bitwidth: the implementation I did for _count_utf8_continuation_bytes() uses the maximum available width and then progressively uses lower bitwidth to load and count until less than a multiple of 8 is left. So for a system of byte width 64, it is just 1 memory load, an &, a !=, and reduce_add() to count 64 bytes. If we build an index for 32 bytes we'd have to load the 2 consecutive nums (1 mem load) and sum them up. So yes, less operations but we really need to measure the improvement to see at which point there is a payoff.

32 code points can occupy between 32 and 128 bytes. It is also imaginable to make this constant configurable or architecture dependent.

I'm still thinking, but what if we just give the user the option to have 2x the memory for the String but in exchange make a lookup table with the summed up previous continuation bytes ? (this one has no option for lazy evaluation since it would be redundant). It would be kind of interesting because it gives the trade-off of mutation being slower but indexing would become O(1) as if it were native UTF32. We could also try to pack the bits to spare some bytes since we know the maximum possible values for each index.

Strictly speaking, the 32-code-point based indexing is also O(1), as N is not involved in the "tail computation". The space overhead for indexing each code point is 2x only for N<256; it's 3x for N<2^16, 5x for N<2^32, and 9x for N>2^32.

@martinvuyk
Contributor Author

martinvuyk commented Sep 25, 2024

It is a question of API design for the type. I would argue we don't need a code point based indexing for the slice. Code point indexing is often needed for more complex/abstract use cases, high performance use cases like parsing are based on iterations which are byte based as we discussed above. So users can easily convert StringSlice to a string if they do have complex operations which is probably anyways what they want due to data ownership.

Yes, I also like the idea of leaving StringSlice byte-offset indexed. That's why I opened this feature request, but see Joe's comment: he does make a point that people who care that much could just use a Span[UInt8], and it does reduce potential confusion for users. The iterator for String gives you StringSlice; imagine a newbie trying to understand why __getitem__ behaves differently inside a for loop than outside :/ . Arguably, most people don't really read documentation, least of all dunder method documentation, which is hidden and you have to look for it.

32 code points can occupy between 32 and 128 bytes. It is also imaginable to make this constant configurable or architecture dependent.

Yes that would be a great Mojo use case 🔥

Strictly speaking the 32 code point based indexing is also O(1), as N is not involved in the "tail computation". The space for indexing each code point is 2x only for N<256 it's 3x for N<2^16, 5x for N<2^32 and 9x for N>2^32.

I don't really understand this paragraph, if you mean that UTF32 indexing is O(1) then yes that is what I meant by "make a lookup table with the summed up previous continuation bytes [...] but indexing would become O(1) as if it were native UTF32". If you mean indexing, then there is this:

Please correct me if I'm getting the algorithm wrong. If you have a 128-byte String, you'd need 4x 32-continuation-byte counts for the index. When indexing, say, to Unicode index 70, you need to load the 0-32 count slot and the 32-64 one and add them up; if they are 0, then offset directly, else load byte 32 onwards and count continuation bytes for the offset. I don't see how this is O(1), since it depends on n to get closest_lower_pow_2(n) to access the index counts (a variable O(1) number of times if it's already a power of 2), and the number of slots to sum also depends on n (so O(n)). So it really is a range of possible computation times (O(n + k1), O(n + k2)), which simplifies to O(n) for big n.

@mzaks
Contributor

mzaks commented Sep 25, 2024

It is a question of API design for the type. I would argue we don't need a code point based indexing for the slice. Code point indexing is often needed for more complex/abstract use cases, high performance use cases like parsing are based on iterations which are byte based as we discussed above. So users can easily convert StringSlice to a string if they do have complex operations which is probably anyways what they want due to data ownership.

Yes, I also like the idea of leaving StringSlice be byte offset indexed. That's why I opened this feature request, but see Joe's comment where he does make a point that people who care that much could just use a Span[UInt8] and it does reduce potential confusion by users. The iterator for String gives you StringSlice, imagine a newbie trying to understand why the __getitem__ behaves differently inside a for loop than outside :/ . Arguably most people don't really read documentation least of all dunder method documentation which is hidden and you have to look for it.

This is what I mean by a question of API: a StringSlice does not have to implement the __getitem__ method. It can implement only the iterator method plus the possibility to access the span, which implements __getitem__ by byte.

Strictly speaking the 32 code point based indexing is also O(1), as N is not involved in the "tail computation". The space for indexing each code point is 2x only for N<256 it's 3x for N<2^16, 5x for N<2^32 and 9x for N>2^32.

I don't really understand this paragraph, if you mean that UTF32 indexing is O(1) then yes that is what I meant by "make a lookup table with the summed up previous continuation bytes [...] but indexing would become O(1) as if it were native UTF32". If you mean indexing, then there is this:

Please correct me if I'm getting the algorithm wrong. If you have a 128 byte String you'd need 4x 32 continuation byte counts for the index. When indexing, say, to unicode index 70, you need to load the 0-32 count slot and the 32-64 and add them up, if they are 0, then offset directly, else load byte 32 onwards and count continuation bytes for offset. I don't see how this is O(1) since it depends on n to get closest_lower_pow_2(n) to access the index counts (variable O(1) times if it's already pow of 2), and the amount of slots to sum also depend on n (so O(n)), so it really is a range of possible computation time (O(n + k1), O(n + k2)) which ends up simplifying to O(n) for big n.

Ah, now I see where the confusion comes from. In the index we don't store the count of bytes; we store the absolute offset from the beginning of the pointer. So in the case of indexing into the 70th code point, we look up the second element in the index, which gives us the offset to the 64th code point, and then we need to iterate over the remaining 6 code points. So there is no need for additions: we have an O(1) lookup to the nearest 32nd code point and then we iterate through the remaining buffer, which can be at most 124 bytes (31 * 4). This is also why the offsets need to be represented as uint32 if the string is longer than 2^16 bytes.
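A plain-Python sketch of the lookup as described, with a hypothetical build_index helper that records the absolute byte offset of every 32nd code point:

```python
STRIDE = 32

def build_index(data: bytes) -> list[int]:
    # Absolute byte offsets of every STRIDE-th code point (a byte starts a
    # code point unless it is a 0b10xxxxxx continuation byte).
    offsets, cp = [], 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            if cp % STRIDE == 0:
                offsets.append(i)
            cp += 1
    return offsets

def codepoint_offset(data: bytes, index: list[int], n: int) -> int:
    # O(1) jump to the nearest indexed code point, then walk at most
    # STRIDE - 1 code points forward: a bounded tail, independent of len(data).
    i = index[n // STRIDE]
    for _ in range(n % STRIDE):
        i += 1
        while (data[i] & 0xC0) == 0x80:  # skip continuation bytes
            i += 1
    return i

buf = ("é" * 100).encode()      # each code point is 2 bytes
idx = build_index(buf)
assert idx == [0, 64, 128, 192]
# Code point 70: jump to the 64th (byte 128), then walk 6 code points.
assert codepoint_offset(buf, idx, 70) == 140
```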

@martinvuyk
Contributor Author

This is what I mean with question of API, a StringSlice does not have to implement the __getitem__ method it can only implement the iterator method and the possibility to access the span which implements __getitem__ by byte.

👍 I like that idea :)

In the index we don't store the count of bytes, we store the absolute offset from the beginning of the pointer.

Ahh, ok, now I get it. I think this could seriously become its own type of string for large sequences (BigString?); we could even add grapheme clusters to this indexing scheme... WDYT?

@mzaks
Contributor

mzaks commented Sep 25, 2024

Ahh, ok, now I get it. I think this could seriously become its own type of string for large sequences (BigString?); we could even add grapheme clusters to this indexing scheme... WDYT?

This approach is actually not new, it is used in PyPy string (https://mail.python.org/pipermail/pypy-dev/2016-March/014277.html)

In CrazyString I provide a bool parameter to mark the type as code point indexed, we could however have an index type where user can define if they want a byte, code point, or grapheme cluster based indexing. Might be too fancy though 😄.

martinvuyk referenced this issue Feb 2, 2025
bytes

MODULAR_ORIG_COMMIT_REV_ID: 7f87f1d40ee48279b20abac457d7bf3d2b326e5d
@ematejska ematejska added mojo-playground mojo Issues that are related to mojo labels Feb 26, 2025 — with Linear