Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] [proposal] String, ASCII, Unicode, UTF, Graphemes #3988

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
189 changes: 189 additions & 0 deletions proposals/string-ascii-unicode-utf-grapheme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,189 @@
# String, ASCII, Unicode, UTF, Graphemes

The current proposal will attempt to unify the handling of the standards.
Providing nice ergonomics, as Python-like defaults as possible, and keeping the
door open for optimizations.

## String

String is currently an owning type which encodes string data in UTF-8, which is
a variable length encoding format. Data can be 1 byte up to 4 bytes long. ASCII
text, which is where English text falls, is 1 byte long. As such, the defaults
should optimize for that character set given it is most typical on the internet
and on backend systems.

## ASCII

Ascii is pretty optimizable, there are many known tricks. The big problem
with supporting only ASCII is that other encodings can get corrupted.

## UTF standards

- UTF-8 is 1-4 sets of 8 bits long
- UTF-16 is 1-2 sets of 16 bits long
- UTF-32 is 1 set of 32 bits long

### When slicing by unicode codepoint e.g. "🔥🔥🔥" (\U0001f525\U0001f525\U0001f525)

- UTF-8: 12 sets (12 bytes) long. The first byte of each fire can be used to
know the length, and the next bytes are what is known as a continuation byte.
There are several approaches to achieve the slicing, they can be explored with
benchmarking later on.
- UTF-16: 6 sets long. It's very similar in procedure to UTF-8.
- UTF-32: It is fastest since it's direct index access. This is not the case
when supporting graphemes.

## Graphemes

Graphemes are an extension which allows unicode codepoints to modify the end
result through concatenation with other codepoints. Visually they are shown as a
single unit e.g. é is actually comprised of `e` followed by `´` (\x65\u0301).

Graphemes are more expensive to slice because one can't use faster algorithms,
only skipping each unicode codepoint one at a time and checking the last byte
to see if it's an extended cluster *.

*: extended clusters is the full feature complete grapheme standard. There
exists the posibility of implementing only a subset, but that defeats the
purpose of going through the hassle of supporting it in the first place.

## Context

C, C++, and Rust use UTF-8 with no default support for graphemes.

### Swift

Swift is an interesting case study. For compatibility with Objective-C they went
for UTF-16. They decided to support grapheme clusters by default. They recently
changed the preferred encoding from UTF-16 to UTF-8. They have a `Character`
that I think inspired our current `Char` type, it is a generic representation of
a Character in any encoding, that can be from one codepoint up to any grapheme
cluster.

### Python

Python currently uses UTF-32 for its string type. So the slicing and indexing
is simple and fast (but consumes a lot of memory, they have some tricks to
reduce it). They do not support graphemes by default. Pypi is implementing a
UTF-8 version of Python strings, which keeps the length state every x
characters.

### Mojo

Mojo aims to be close to Python yet be faster and customizable, taking advantage
of heterogeneous hardware and modern type system features.

## Value vs. Reference

Our current `Char` type uses a u32 as storage, every time an iterator that
yields `Char`is used, an instance is parsed from the internal UTF-8 encoded
`StringSlice` (into UTF-32).

The default iterator for `String` returns a `StringSlice` which is a view into
the character in the UTF-8 encoded `StringSlice`. This is much more efficient
and does not add any complexity into the type system nor developer headspace.

## Now, onto the Proposal

### Goals

#### Hold off on developing Char further and remove it from stdlib.builtin

`Char` is currently expensive to create and use compared to a `StringSlice`
Copy link
Contributor

@gabrieldemarmiesse gabrieldemarmiesse Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have numbers on this? The creation but also using different methods on it. Let's avoid optimizing things without data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have numbers on this?

I don't need numbers to know that the bit-shifting, masking, and for loops going on in decoding utf32 from utf8 is expensive compared to a pointer and a length which is what Span is and what StringSlice uses underneath.

also using different methods on it

Comparing a 16 byte long SIMD vector is going to be more expensive* than using count leading zeros (most CPUs have a specialized circuit) and bitwise or-ing (1 micro-op) with a comparison to ascii-max (see #3896).

*: In the context in which this function is used, where the number of bytes for a sequence is a prerequisite for a lot of follow-up code, where the throughput advantage of SIMD is not realized given its latency stalls the pipeline. I have done benchmarking and found such cases in #3697 and #3528

Let's avoid optimizing things without data.

A pointer and a length is always going to be less expensive than transforming data, when an algorithmic throughput difference is not part of the equation. This could be the case for example when transforming to another mathematical plane to avoid solving differential equations. But it is not the case here IMO.

which is a view over the original data. There is also the problem that it forces
UTF-32 on the data, and those transformations are expensive.

We can revisit `Char` later on making it take encoding into account. But the
current state of the type makes it add more complexity than strictly necessary.

#### Full ASCII optimizations
If someone wishes to use a `String` as if it's an ASCII-only String, then there
should either be a parameter that signifies that, or the stdlib/community should
add an `ASCIIString` type which has all optimizations possible for such
scenarios.

#### Mostly ASCII optimizations
Many functions can make use of branch predition, instruction and data
prefetching, and algorithmic tricks to make processing faster for languages
which have mostly ASCII characters but still keep full unicode support.

#### Grapheme support
Grapheme support should exist but be opt-in due to their high compute cost.

### One concrete (tentative) way forward

With a clear goal in mind, this is a concrete (tentative) way forward.

#### Add parameters to String and StringSlice

```mojo
struct Encoding:
alias UTF8 = 0
alias UTF16 = 1
alias UTF32 = 2
alias ASCII = 3

struct Indexing:
alias DIRECT = 0
alias CODEPOINT = 1
alias GRAPHEME = 2

alias ASCIIString = String[encoding=Encoding.ASCII, indexing=Indexing.DIRECT]

struct String[
encoding: Encoding = Encoding.UTF8,
indexing: Indexing = Indexing.CODEPOINT,
]:
... # data is bitcasted to bigger DTypes when encoding is 16 or 32 bits

struct StringSlice[
encoding: Encoding = Encoding.UTF8,
indexing: Indexing = Indexing.CODEPOINT,
]:
... # data is bitcasted to bigger DTypes when encoding is 16 or 32 bits
```

#### What this requires

First, we can add the parameters and constraint on the supported encodings.

Then, we need to rewrite every function signature that makes two `String`s
interact with one another, where the code doesn't require both to be the
defaults.

Many of those functions will need to be branched when the encodings are
different and allocations for parsing added. But the default same-encoding
ones will remain the same **(as long as the code is encoding and
indexing-agnostic)**.

The implementations can be added with time making liberal use of `constrained`.

#### Adapt StringSliceIter

The default iterator should iterate in the way in which the `String` is
parametrized. The iterator could then have functions which return the other
types of iterators for ergonomics (or have constructors for each) *.

*: this is a workaround until we have generic iterator support.

e.g.
```mojo
data = String("123 \n\u2029🔥")
for c in data:
... # StringSlice in the same encoding and indexing scheme by default

# Once we have Char developed enough
for c in iter(data).chars[encoding.UTF8]():
... # Char type instances (UTF8 packed in a UInt32 ?)
for c in iter(data).chars[encoding.UTF32]():
... # Char type instances

# We could also have lazy iterators which return StringSlices according to the
# encoding and indexing parameter.
# In the case of Encoding.ASCII unicode separators (\x85, \u2028, \u2029) can
# be ignored and thus the processing is much faster.
for c in iter(data).split():
... # StringSlice lazily split by all whitespace
for c in iter(data).splitlines():
... # StringSlice lazily split by all newline characters
```
Loading