diff --git a/proposals/string-ascii-unicode-utf-grapheme.md b/proposals/string-ascii-unicode-utf-grapheme.md
new file mode 100644
index 0000000000..619f0faf60
--- /dev/null
+++ b/proposals/string-ascii-unicode-utf-grapheme.md
@@ -0,0 +1,189 @@
+# String, ASCII, Unicode, UTF, Graphemes
+
+This proposal attempts to unify the handling of these standards, providing
+nice ergonomics and defaults that are as Python-like as possible, while
+keeping the door open for optimizations.
+
+## String
+
+`String` is currently an owning type which encodes its data in UTF-8, a
+variable-length encoding format in which a codepoint takes 1 to 4 bytes.
+ASCII text, which is where English text falls, is 1 byte per character. As
+such, the defaults should optimize for that character set, given that it is
+the most common on the internet and on backend systems.
+
+## ASCII
+
+ASCII is pretty optimizable; there are many known tricks. The big problem
+with supporting only ASCII is that text in other encodings can get corrupted.
+
+## UTF standards
+
+- UTF-8 is 1-4 sets of 8 bits long
+- UTF-16 is 1-2 sets of 16 bits long
+- UTF-32 is 1 set of 32 bits long
+
+### When slicing by Unicode codepoint, e.g. "🔥🔥🔥" (\U0001f525\U0001f525\U0001f525)
+
+- UTF-8: 12 sets (12 bytes) long. The first byte of each fire emoji encodes
+the length of its sequence, and the bytes that follow are known as
+continuation bytes. There are several approaches to achieve the slicing;
+they can be explored with benchmarking later on.
+- UTF-16: 6 sets long. The procedure is very similar to UTF-8.
+- UTF-32: 3 sets long. It is the fastest since slicing is direct index
+access. This is no longer the case when supporting graphemes.
+
+## Graphemes
+
+Graphemes are an extension which allows Unicode codepoints to modify the end
+result through concatenation with other codepoints. Visually they are shown
+as a single unit, e.g. é is actually comprised of `e` followed by `´`
+(\x65\u0301).
+
+Graphemes are more expensive to slice because one can't use the faster
+algorithms; the only option is to skip over codepoints one at a time,
+checking each one to see whether it extends the current cluster *.
+
+*: extended grapheme clusters are the full, feature-complete grapheme
+standard. There exists the possibility of implementing only a subset, but
+that defeats the purpose of going through the hassle of supporting graphemes
+in the first place.
+
+## Context
+
+C, C++, and Rust use UTF-8 with no default support for graphemes.
+
+### Swift
+
+Swift is an interesting case study. For compatibility with Objective-C,
+Swift went with UTF-16 and decided to support grapheme clusters by default.
+It recently changed the preferred encoding from UTF-16 to UTF-8. Swift has a
+`Character` type that I think inspired our current `Char` type; it is a
+generic representation of a character in any encoding, which can span from a
+single codepoint up to an arbitrarily long grapheme cluster.
+
+### Python
+
+Python currently uses UTF-32 for its string type, so slicing and indexing
+are simple and fast (but consume a lot of memory; CPython has some tricks to
+reduce that). Python does not support graphemes by default. PyPy is
+implementing a UTF-8 version of Python strings which keeps length state
+every x characters.
+
+### Mojo
+
+Mojo aims to stay close to Python while being faster and more customizable,
+taking advantage of heterogeneous hardware and modern type-system features.
+
+## Value vs. Reference
+
+Our current `Char` type uses a `UInt32` as storage; every time an iterator
+that yields `Char` is used, an instance is parsed (decoded into UTF-32) from
+the internal UTF-8 encoded `StringSlice`.
+
+The default iterator for `String` returns a `StringSlice`, which is a view
+into the character's bytes in the UTF-8 encoded `StringSlice`. This is much
+more efficient and adds no complexity to the type system or to developer
+headspace.
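+To make the cost difference concrete, here is a minimal sketch; the
+`chars()` iterator name is illustrative rather than a settled API:
+
+```mojo
+data = String("café")  # the "é" takes 2 bytes in UTF-8
+
+# Default iteration: each step yields a StringSlice view over the
+# existing UTF-8 bytes; no decoding, no allocation.
+for s in data:
+    ...  # s is a StringSlice pointing into data
+
+# Char iteration (illustrative name): each step must parse a 1-4 byte
+# UTF-8 sequence into a UTF-32 codepoint stored in a UInt32.
+for c in data.chars():
+    ...  # c is an owned Char decoded from those bytes
+```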
+## Now, onto the Proposal
+
+### Goals
+
+#### Hold off on developing Char further and remove it from stdlib.builtin
+
+`Char` is currently expensive to create and use compared to a `StringSlice`,
+which is a view over the original data. There is also the problem that it
+forces UTF-32 on the data, and those transformations are expensive.
+
+We can revisit `Char` later on, making it take encoding into account. But in
+its current state the type adds more complexity than strictly necessary.
+
+#### Full ASCII optimizations
+
+If someone wishes to use a `String` as if it were an ASCII-only string, then
+there should either be a parameter that signifies that, or the
+stdlib/community should add an `ASCIIString` type which has every
+optimization possible for such scenarios.
+
+#### Mostly ASCII optimizations
+
+Many functions can make use of branch prediction, instruction and data
+prefetching, and algorithmic tricks to make processing faster for languages
+which are mostly ASCII, while still keeping full Unicode support.
+
+#### Grapheme support
+
+Grapheme support should exist but be opt-in due to its high compute cost.
+
+### One concrete (tentative) way forward
+
+With a clear goal in mind, this is a concrete (tentative) way forward.
+
+#### Add parameters to String and StringSlice
+
+```mojo
+struct Encoding:
+    alias UTF8 = 0
+    alias UTF16 = 1
+    alias UTF32 = 2
+    alias ASCII = 3
+
+struct Indexing:
+    alias DIRECT = 0
+    alias CODEPOINT = 1
+    alias GRAPHEME = 2
+
+alias ASCIIString = String[encoding=Encoding.ASCII, indexing=Indexing.DIRECT]
+
+struct String[
+    encoding: Encoding = Encoding.UTF8,
+    indexing: Indexing = Indexing.CODEPOINT,
+]:
+    ...  # data is bitcast to bigger DTypes when the encoding is 16 or 32 bits
+
+struct StringSlice[
+    encoding: Encoding = Encoding.UTF8,
+    indexing: Indexing = Indexing.CODEPOINT,
+]:
+    ...  # data is bitcast to bigger DTypes when the encoding is 16 or 32 bits
+```
+
+#### What this requires
+
+First, we can add the parameters and constrain the supported encodings.
+
+Then, we need to rewrite the signature of every function that makes two
+`String`s interact with one another, wherever the code doesn't require both
+to be the defaults.
+
+Many of those functions will need branches for when the encodings differ,
+plus allocations for transcoding. But the default same-encoding paths will
+remain the same **(as long as the code is encoding- and indexing-agnostic)**.
+
+The implementations can be added over time, making liberal use of
+`constrained`.
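+As a sketch of what one of those branched signatures could look like under
+the parametrization above (`startswith` here, with its overloads and
+constraint message, is illustrative rather than a settled API):
+
+```mojo
+struct String[
+    encoding: Encoding = Encoding.UTF8,
+    indexing: Indexing = Indexing.CODEPOINT,
+]:
+    fn startswith(self, prefix: StringSlice[encoding, indexing]) -> Bool:
+        ...  # same-encoding fast path: plain byte comparison, no decoding
+
+    fn startswith[
+        other_encoding: Encoding, other_indexing: Indexing
+    ](self, prefix: StringSlice[other_encoding, other_indexing]) -> Bool:
+        # Mixed-encoding path: gate combinations that aren't implemented
+        # yet at compile time instead of failing at runtime.
+        constrained[
+            other_encoding == Encoding.UTF32,
+            "transcoding from this encoding is not implemented yet",
+        ]()
+        ...  # allocate, transcode `prefix` to `encoding`, reuse fast path
+```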
+#### Adapt StringSliceIter
+
+The default iterator should iterate according to how the `String` is
+parametrized. The iterator could then have functions which return the other
+types of iterators for ergonomics (or have a constructor for each) *.
+
+*: this is a workaround until we have generic iterator support.
+
+e.g.
+```mojo
+data = String("123 \n\u2029🔥")
+for c in data:
+    ...  # StringSlice in the same encoding and indexing scheme by default
+
+# Once we have Char developed enough:
+for c in iter(data).chars[Encoding.UTF8]():
+    ...  # Char type instances (UTF-8 packed in a UInt32?)
+for c in iter(data).chars[Encoding.UTF32]():
+    ...  # Char type instances
+
+# We could also have lazy iterators which return StringSlices according to
+# the encoding and indexing parameters.
+# In the case of Encoding.ASCII, the Unicode separators (\x85, \u2028,
+# \u2029) can be ignored, and thus the processing is much faster.
+for c in iter(data).split():
+    ...  # StringSlice lazily split by all whitespace
+for c in iter(data).splitlines():
+    ...  # StringSlice lazily split by all newline characters
+```
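+Finally, to illustrate what the `indexing` parameter would mean for
+subscripting (hypothetical semantics, to be settled during implementation):
+
+```mojo
+a = ASCIIString("hello")
+_ = a[1]  # Indexing.DIRECT: O(1) byte access, yields "e"
+
+u = String("🔥🔥🔥")
+_ = u[2]  # Indexing.CODEPOINT: O(n) scan over continuation bytes, "🔥"
+
+g = String[indexing=Indexing.GRAPHEME]("e\u0301!")
+_ = g[0]  # Indexing.GRAPHEME: one cluster of two codepoints, "é"
+```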