[stdlib] [proposal] String, ASCII, Unicode, UTF, Graphemes #3988

Open: martinvuyk wants to merge 7 commits into modular:main from martinvuyk:string-ascii-unicode-utf-grapheme
Commits (all by martinvuyk):

- 3672938 String, ASCII, Unicode, UTF, Graphemes
- f0f0f34 fix detail
- 57ac861 fix detail
- 03847e1 fix detail
- 8b06d83 fix details
- fd3848a fix details
- 4e7c8e1 fix details
# String, ASCII, Unicode, UTF, Graphemes

This proposal attempts to unify the handling of these standards: providing good
ergonomics, defaults as Python-like as possible, and keeping the door open for
optimizations.

## String

`String` is currently an owning type that encodes its data as UTF-8, a
variable-length encoding in which a codepoint occupies 1 to 4 bytes. ASCII
text, which covers English, is 1 byte per character. As such, the defaults
should optimize for that character set, given that it is the most common on the
internet and in backend systems.
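
The variable byte widths can be checked with a short Python snippet (Python is
used for illustration here, since `str.encode` makes the encoding explicit):

```python
# UTF-8 width in bytes for characters from different codepoint ranges.
for ch in ("a", "é", "€", "🔥"):
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
# 'a' (ASCII) -> 1, 'é' -> 2, '€' -> 3, '🔥' -> 4
```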

## ASCII

ASCII is very amenable to optimization, and many tricks for it are well known.
The big problem with supporting only ASCII is that text in other encodings can
get corrupted.
## UTF standards

- UTF-8 encodes a codepoint as 1 to 4 code units of 8 bits
- UTF-16 encodes a codepoint as 1 or 2 code units of 16 bits
- UTF-32 encodes a codepoint as 1 code unit of 32 bits
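
For a concrete codepoint above the Basic Multilingual Plane, the code-unit
counts differ per encoding; a quick Python check:

```python
s = "🔥"  # U+1F525, above the BMP
utf8 = s.encode("utf-8")       # four 8-bit code units
utf16 = s.encode("utf-16-le")  # two 16-bit code units (a surrogate pair)
utf32 = s.encode("utf-32-le")  # one 32-bit code unit
print(len(utf8), len(utf16) // 2, len(utf32) // 4)  # 4 2 1
```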

### When slicing by unicode codepoint e.g. "🔥🔥🔥" (\U0001f525\U0001f525\U0001f525)

- UTF-8: 12 code units (12 bytes) long. The first byte of each fire emoji can
be used to determine the sequence length, and the bytes that follow are what
is known as continuation bytes. There are several approaches to achieve the
slicing; they can be explored with benchmarking later on.
- UTF-16: 6 code units long. The procedure is very similar to UTF-8.
- UTF-32: the fastest, since it is direct index access. This is no longer the
case when supporting graphemes.
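
One way to find codepoint boundaries in UTF-8 is to skip continuation bytes,
which all match the bit pattern `0b10xxxxxx`; a Python sketch of that check:

```python
def codepoint_starts(data: bytes) -> list[int]:
    # A byte starts a codepoint unless it is a continuation byte
    # (its top two bits are 0b10).
    return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

fire = "🔥🔥🔥".encode("utf-8")
print(len(fire), codepoint_starts(fire))  # 12 [0, 4, 8]
```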

## Graphemes

Graphemes are an extension in which unicode codepoints modify the end result
through concatenation with other codepoints. Visually they are shown as a
single unit, e.g. é can actually be comprised of `e` followed by `´`
(\x65\u0301).
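
The é example can be verified with Python's `unicodedata`: the composed and
decomposed forms are canonically equivalent but have different codepoint
counts.

```python
import unicodedata

composed = "\u00e9"        # é as a single precomposed codepoint
decomposed = "\x65\u0301"  # 'e' followed by a combining acute accent
print(len(composed), len(decomposed))  # 1 2
# NFC normalization recombines the pair into the single codepoint.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```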

Graphemes are more expensive to slice because one can't use faster algorithms;
the only option is skipping one unicode codepoint at a time and checking
whether the next codepoint extends the cluster *.

*: extended grapheme clusters are the full, feature-complete grapheme
standard. There is the possibility of implementing only a subset, but that
defeats the purpose of going through the hassle of supporting it in the first
place.

## Context

C, C++, and Rust use UTF-8 with no default support for graphemes.
### Swift

Swift is an interesting case study. For compatibility with Objective-C, they
went with UTF-16 and decided to support grapheme clusters by default. They
later changed the preferred encoding from UTF-16 to UTF-8. They have a
`Character` type that I think inspired our current `Char` type: a generic
representation of a character in any encoding, spanning anywhere from one
codepoint up to an entire grapheme cluster.
### Python

Python's string type currently behaves as UTF-32, so slicing and indexing are
simple and fast (but consume a lot of memory; they have some tricks to reduce
it). Graphemes are not supported by default. PyPy is implementing a UTF-8
version of Python strings, which keeps length state every x characters.
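
The effect is visible from Python itself: indexing counts codepoints, not
bytes, even though the UTF-8 serialization of the same text is longer.

```python
s = "a🔥b"
print(len(s), s[1])            # 3 🔥  (codepoint indexing, not byte indexing)
print(len(s.encode("utf-8")))  # 6    (1 + 4 + 1 bytes in UTF-8)
```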

### Mojo

Mojo aims to be close to Python yet faster and customizable, taking advantage
of heterogeneous hardware and modern type system features.
## Value vs. Reference

Our current `Char` type uses a u32 as storage; every time an iterator that
yields `Char` is used, an instance is parsed (into UTF-32) from the internal
UTF-8 encoded `StringSlice`.

The default iterator for `String` returns a `StringSlice`, which is a view
into the character in the UTF-8 encoded `StringSlice`. This is much more
efficient and adds no complexity to the type system or developer headspace.
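
The view-versus-owning distinction is the same one Python exposes with
`memoryview` versus a copying slice; a small sketch of why views are cheaper:

```python
data = bytearray(b"hello world")
view = memoryview(data)[:5]  # pointer + length into `data`, no copy
copy = bytes(data[:5])       # new allocation with its own bytes
data[0] = ord("H")           # mutate the underlying buffer
print(bytes(view), copy)     # b'Hello' b'hello'  (only the view sees the change)
```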

## Now, onto the Proposal

### Goals

#### Hold off on developing Char further and remove it from stdlib.builtin

`Char` is currently expensive to create and use compared to a `StringSlice`,
which is a view over the original data. There is also the problem that it
forces UTF-32 on the data, and those transformations are expensive.

We can revisit `Char` later on, making it take encoding into account. But in
its current state, the type adds more complexity than strictly necessary.
#### Full ASCII optimizations

If someone wishes to use a `String` as if it were ASCII-only, there should
either be a parameter that signifies that, or the stdlib/community should add
an `ASCIIString` type which has every optimization possible for such
scenarios.

#### Mostly ASCII optimizations

Many functions can make use of branch prediction, instruction and data
prefetching, and algorithmic tricks to speed up processing for languages that
are mostly ASCII while still keeping full unicode support.
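
A common shape for such a fast path is an OR-fold over the bytes: ASCII bytes
all have the high bit clear, so one reduction decides whether a chunk can take
the cheap path. A Python sketch (a real implementation would do this with SIMD
or word-at-a-time loads):

```python
def chunk_is_ascii(data: bytes) -> bool:
    # OR-fold all bytes; the chunk is pure ASCII iff the high bit never
    # gets set. Branch-free inside the loop, and it maps well onto SIMD.
    acc = 0
    for b in data:
        acc |= b
    return acc < 0x80

print(chunk_is_ascii(b"hello"), chunk_is_ascii("héllo".encode("utf-8")))
# True False
```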

#### Grapheme support

Grapheme support should exist but be opt-in due to its high compute cost.

### One concrete (tentative) way forward

With a clear goal in mind, this is a concrete (tentative) way forward.

#### Add parameters to String and StringSlice
```mojo
struct Encoding:
    alias UTF8 = 0
    alias UTF16 = 1
    alias UTF32 = 2
    alias ASCII = 3


struct Indexing:
    alias DIRECT = 0
    alias CODEPOINT = 1
    alias GRAPHEME = 2


alias ASCIIString = String[encoding=Encoding.ASCII, indexing=Indexing.DIRECT]


struct String[
    encoding: Encoding = Encoding.UTF8,
    indexing: Indexing = Indexing.CODEPOINT,
]:
    ...  # data is bitcasted to bigger DTypes when encoding is 16 or 32 bits


struct StringSlice[
    encoding: Encoding = Encoding.UTF8,
    indexing: Indexing = Indexing.CODEPOINT,
]:
    ...  # data is bitcasted to bigger DTypes when encoding is 16 or 32 bits
```

#### What this requires

First, we can add the parameters and a constraint on the supported encodings.

Then, we need to rewrite every function signature that makes two `String`s
interact with one another where the code doesn't require both to be the
defaults.

Many of those functions will need branches for the cases where the encodings
differ, plus added allocations for parsing. But the default same-encoding
paths will remain the same **(as long as the code is encoding- and
indexing-agnostic)**.

The implementations can be added over time, making liberal use of
`constrained`.
#### Adapt StringSliceIter

The default iterator should iterate in the way the `String` is parametrized.
The iterator could then have functions which return the other kinds of
iterators for ergonomics (or have constructors for each) *.

*: this is a workaround until we have generic iterator support.

e.g.
```mojo
data = String("123 \n\u2029🔥")
for c in data:
    ...  # StringSlice in the same encoding and indexing scheme by default

# Once we have Char developed enough
for c in iter(data).chars[Encoding.UTF8]():
    ...  # Char type instances (UTF-8 packed in a UInt32 ?)
for c in iter(data).chars[Encoding.UTF32]():
    ...  # Char type instances

# We could also have lazy iterators which return StringSlices according to the
# encoding and indexing parameters.
# In the case of Encoding.ASCII, unicode separators (\x85, \u2028, \u2029) can
# be ignored and thus the processing is much faster.
for c in iter(data).split():
    ...  # StringSlice lazily split by all whitespace
for c in iter(data).splitlines():
    ...  # StringSlice lazily split by all newline characters
```

---

**Review comment:** Do we have numbers on this? Both the creation cost and the
cost of different methods on it. Let's avoid optimizing things without data.

**Reply (martinvuyk):** I don't need numbers to know that the bit-shifting,
masking, and loops involved in decoding UTF-32 from UTF-8 are expensive
compared to a pointer and a length, which is what `Span` is and what
`StringSlice` uses underneath. Comparing a 16-byte SIMD vector is going to be
more expensive* than using count leading zeros (most CPUs have a specialized
circuit) and a bitwise OR (1 micro-op) with a comparison against the ASCII
maximum (see #3896).

*: In the context in which this function is used, the number of bytes in a
sequence is a prerequisite for a lot of follow-up code, so the throughput
advantage of SIMD is not realized given that its latency stalls the pipeline.
I have done benchmarking and found such cases in #3697 and #3528.

A pointer and a length is always going to be less expensive than transforming
data when an algorithmic throughput difference is not part of the equation.
That could be the case when, for example, transforming to another mathematical
plane to avoid solving differential equations, but it is not the case here
IMO.
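
The count-leading-zeros trick mentioned above determines a UTF-8 sequence's
length from its first byte alone: the number of leading one-bits is the byte
count (zero leading ones means a 1-byte ASCII character). A Python sketch
standing in for the hardware clz instruction:

```python
def utf8_seq_len(leading_byte: int) -> int:
    # Count leading one-bits by inverting the byte and measuring the bit
    # length of the result; hardware clz does this in one instruction.
    ones = 8 - ((leading_byte ^ 0xFF) & 0xFF).bit_length()
    return 1 if ones == 0 else ones

# 0x41 'A' -> 1 byte, 0xC3 starts a 2-byte sequence, 0xE2 a 3-byte, 0xF0 a 4-byte
print([utf8_seq_len(b) for b in (0x41, 0xC3, 0xE2, 0xF0)])  # [1, 2, 3, 4]
```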