
'𠮷'.length should return 1 instead of 2 #1428

Closed
Cat-sushi opened this issue Feb 2, 2021 · 18 comments
Labels: request (Requests to resolve a particular developer problem)

Comments

@Cat-sushi commented Feb 2, 2021

I understand that this would be a massive breaking change, and I'm looking for a migration path, but I haven't found one yet.

As you can see, the current String.length is confusing and harmful, especially for non-English speakers.
I believe there is almost no correct usage of the current String.length, because we can't manipulate a String correctly with it.
The root cause is that a String is a sequence of ints representing code units, not code points, even at the surface of the API.
So, this proposal includes introducing Rune (not Runes) as an int representing a code point, and changing String to be a sequence of Runes, at least from the viewpoint of the API.
This proposal might also include deprecating the current operator [] of String and introducing a runeAt() method on String.
Don't misunderstand me: with this proposal, I'm not asserting that the internal representation of String should be a sequence of code points.
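
To illustrate the problem and the shape of the proposed API, here is a minimal sketch; runeAt() is the hypothetical method proposed above, not an existing String member:

    void main() {
      var s = '𠮷'; // One user-visible character, U+20BB7, outside the BMP.
      print(s.length);       // 2: counts UTF-16 code units (a surrogate pair).
      print(s.runes.length); // 1: counts code points.
      // Under this proposal, something like the following would hold:
      //   s.length == 1;          // length counted in code points
      //   s.runeAt(0) == 0x20BB7; // a Rune at a code-point index
    }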

Cat-sushi added the request label Feb 2, 2021
@Cat-sushi (Author) commented Feb 2, 2021

I hope the character code constants of #886 will be Runes, that is, ints extended by #42 or #1426.

@rakudrama (Member)

Look at https://pub.dev/packages/characters
You should be able to get the right result with '𠮷'.characters.length
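
A minimal sketch of that approach (assuming package:characters is added as a dependency):

    import 'package:characters/characters.dart';

    void main() {
      print('𠮷'.length);            // 2: UTF-16 code units.
      print('𠮷'.characters.length); // 1: grapheme clusters.
    }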

@rakudrama (Member)

@lrhn @kwalrath Lasse - do you think the String class documentation should mention the Characters package / class in the "See also" section?

@Cat-sushi (Author)

@rakudrama
The characters extension introduces Characters, which is an iterable of grapheme clusters, not a dedicated code point type.
My point is that Rune, an int representing a code point, should be a first-class citizen of the language, defined in dart:core, and that String should at least appear to be a sequence of Runes.

@Cat-sushi (Author)

Sorry, Rune is not mandatory; it could be replaced by plain int.
But '𠮷'.length is still confusing even after the introduction of the characters extension.

@rakudrama (Member)

@Cat-sushi
It is also a reasonable position that using individual code points / runes is incorrect for many of the same reasons that using UTF-16 code units is incorrect.

For example,
'🇬🇵'.length --> 4
'🇬🇵'.runes.length --> 2
'🇬🇵'.characters.length --> 1
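
For reference, a minimal runnable sketch of the first two counts, which also shows the individual code points making up the flag:

    void main() {
      var flag = '🇬🇵'; // Two regional-indicator code points form one flag.
      print(flag.length);       // 4
      print(flag.runes.length); // 2
      for (var rune in flag.runes) {
        print('U+${rune.toRadixString(16).toUpperCase()}'); // U+1F1EC, U+1F1F5
      }
    }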

How does your string.runeAt(i) differ from string.runes.elementAt(i)?

You say you are looking for a migration path. What is it that you are migrating, and what do you want to migrate to?
Would analyzer warnings about using String.length, String.[] and String.codeUnitAt be helpful in finding or avoiding incorrect uses?

@Cat-sushi (Author) commented Feb 2, 2021

The characters extension is a big step for non-English-speaking countries, and I very much appreciated its introduction.
Now, I'm looking at the next step.

The code point is the primitive of natural languages, though not necessarily of computers, and it is the only unit that can represent every single character in the world.
I'm talking about natural languages, so let's set emoji aside for now.
As you may know, the UTF-16 code unit has long been the primitive of major programming languages, including Dart, for performance reasons.
But I think that was a kind of million-dollar mistake, because length and operator [] of String can't even manipulate every single character correctly, in the era of the Internet.
I think Dart, as a modern, sophisticated, high-level, client-side language, deserves the introduction of the code point as a feature of the language core.

As I said, I don't have a migration path right now.
But, as I also said, almost all existing pieces of code are incorrect.

Would analyzer warnings about using String.length, String.[] and String.codeUnitAt be helpful in finding or avoiding incorrect uses?

Possibly, yes.

@Cat-sushi (Author) commented Feb 2, 2021

For your information, the character constants in proposal #886 are code points.
To stay consistent, String should be a sequence of code points, for this reason too.

@Cat-sushi (Author)

I hope #42 or #1426 provides a sophisticated solution, including a migration path.

@Cat-sushi (Author)

Cf. "Dart string manipulation done right 👉" by Tao Dong, on Medium.
It's true not only for emoji but also for Han characters.

@lrhn (Member) commented Feb 2, 2021

It's not actually true that "Code point is the primitive of natural language". That primitive is the grapheme cluster, "user-perceived characters" as the Unicode standard describes them.

Even writing é might yield one or two code points (because of the redundant pre-composed character alongside the e + combining accent), and I've seen some platforms (Mac, IIRC) produce the two-code-point version when just writing normal text. Caring about code points for text like that would be a mistake, just as grave as splitting a two-code-unit surrogate pair.
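
A minimal sketch of that pitfall (both strings render as é):

    void main() {
      var precomposed = '\u00E9';  // é as a single code point, U+00E9.
      var decomposed = 'e\u0301';  // e followed by a combining acute accent.
      print(precomposed.runes.length);  // 1
      print(decomposed.runes.length);   // 2: code points alone don't help here.
      print(precomposed == decomposed); // false, despite identical rendering.
    }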

So, as @rakudrama says, focusing on code points is simply not enough to solve the real problems that people see with non-trivial grapheme clusters. It's one step above code units, but not a significant step, because you still have all the same kinds of problems, just with slightly fewer cases each. The only real solution is grapheme clusters, which means characters.

Changing the length of String would be extremely breaking, wouldn't match other functions like indexOf or lastIndexOf (unless they changed too, which would be even more breaking), would be inefficient (you need to iterate the entire string to figure out its length), and there would be no way to make it efficient when compiled to JavaScript. So, we're not going to just do that.

The approach I'd go for is to use #1426 to provide a Characters-like API on top of String. Then you can opt in to using Characters instead of String by simply writing the type.

A String would still be a sequence of UTF-16 code units (because that's what it is in JavaScript, and we can't change that).

@Cat-sushi (Author)

Yes, I knew that.
But e (U+0065) and ́ (U+0301) are logically separable, and in addition there is the combined code point é (U+00E9).
On the other hand, 𠮷 (U+20BB7) is not separable, which makes it much more important that it be representable as a single code point supported by the language core.
I mean, the code point is not perfect, but it is much better, and good enough to represent characters in most cases.
It is not an all-or-nothing discussion; I prefer a better world with Dart.
And I must repeat that almost all pieces of code using length, operator [] or codeUnitAt() are already incorrect, regardless of migration.

By the way, what is the problem with JavaScript?
I think Google always proposes the right things to the JavaScript world, so I believe the obstacle, if it exists, will be removed by Google.

The approach I'd go for is to use #1426 to provide a Characters-like API on top of String.

Sounds good.

Then you can opt in to using Characters instead of String by simply writing the type.

Again, it should be part of the language core, because it is the primitive of natural languages.

@Cat-sushi (Author)

@lrhn

Changing the length of String would be extremely breaking, wouldn't match other functions like indexOf or lastIndexOf (unless they changed too, which would be even more breaking), would be inefficient (you need to iterate the entire string to figure out its length),

Do you think length, indexOf or lastIndexOf can manipulate natural languages correctly?
Don't ignore natural languages other than English any more; it is not politically correct (sorry, I'm joking).
But please imagine a world where English couldn't be manipulated with the functionality of the language core.
We have been oppressed for a long, long time.

Java introducing GC into the practical world regardless of its inefficiency was revolutionary, and Dart introducing sound null safety into the practical world regardless of its difficulty is also revolutionary.
I think it is time to introduce the code point into the language core to make the language more revolutionary.
This is the era of the Internet and Flutter; things have changed drastically since 2011!

@lrhn (Member) commented Feb 2, 2021

I agree that doing something to make natural language processing easier and less prone to pitfalls is a good idea.
I disagree completely that code points are relevant to that. You need to work with grapheme clusters. Anything below that on the abstraction scale will not suffice, and will just have the same problems as using code units.

So, if we change anything, it will be to provide a grapheme cluster based abstraction, not a code point based one.
And most likely it will be on top of, or in addition to, the current String because otherwise we break a lot of existing code.
And I do mean a lot.

Some of that existing code is correct (parsing JSON works fine by looking at code units, because every parsing-relevant character is a single code unit and anything else can only occur inside string contents; the same goes for parsing URIs, dates, integers, etc.). Some is just "good enough", and breaking it won't make people happier.

I don't believe that the String API is something we can change by a single breaking change, it's something that's going to take a very long adjustment and migration period, possibly through several layers of abstraction being added, and at the end, maybe we can deprecate the plain String. I'm not sure that will ever be possible. (This is something we have thought about a lot, it's not just me being a contrarian. I do want it to happen, but I also don't want to break something so fundamental without very good reason and an easy way to fix it).

@yumetodo commented Feb 2, 2021

Hello everybody. Recently, many programming languages have been trying to support Unicode. C++20 supports char8_t, ECMAScript supports a code point iterator, and C#, Go, and Dart are also taking up the challenge.


Maybe, as you already know, in Unicode there are 4 ways to count a string: bytes, code units, code points, and grapheme clusters.
ref: https://unicode.org/faq/char_combmark.html#7

These 4 ways are parallel and equal in standing. @lrhn says that code points are not a significant step. However, each of these ways is used for a different purpose, so "step" is not a suitable framing for this problem.

Code points can be used for logical separation in Unicode.

Of course, every approach that tries to express a character in a constant length fails to realize its great ambition. Supporting grapheme clusters is important. However, code point support is also important.

I think that code points are the minimal unit for counting a string logically. For example, suppose we try to detect grapheme cluster boundaries (a sketch of the first step follows the list below):

  1. Separate the string into code points.
  2. Investigate the code point category of each.
  3. Parse the code point categories.
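
As a minimal sketch of step 1, assuming steps 2 and 3 would consult the Unicode property tables and boundary rules of UAX #29 (not shown), iterating code points in Dart looks like this:

    void main() {
      var s = 'e\u0301𠮷'; // e + combining accent, then a non-BMP Han character.
      for (var rune in s.runes) {
        // Steps 2 and 3 would classify `rune` and apply the boundary rules.
        print('U+${rune.toRadixString(16).toUpperCase().padLeft(4, '0')}');
      }
      // Prints: U+0065, U+0301, U+20BB7
    }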

For any other manipulation related to Unicode (such as tolower, etc.), code points are important.

The reason string manipulation is difficult is that we need to switch between these 4 ways depending on the context.
Which is the most important unit for manipulating a string? The answer is: none. All 4 ways are equally important.

I don't have enough knowledge of exactly what API is lacking in Dart. However, the code point should also be a first-class citizen. Don't forget the "equally".

@lrhn (Member) commented Feb 2, 2021

It's absolutely correct that code points have their uses (which is also why we have String.runes to access them).

Every layer of abstraction has a reason and an impact (I often parse things directly from UTF-8 bytes read from disk instead of creating an intermediate string, which means I have to go through all the lower levels of abstraction myself—and sometimes that's just what you need to do for a particular problem).

Code points are the first level of abstraction which is free from representation artifacts. UTF-8 bytes and UTF-16 code units (stored as bytes in one of two possible byte orders) are choices of byte representations of code points.
The code points themselves (or, to be precise, the scalar values) are logical units which cannot be combined to form a larger unit of meaning. A grapheme cluster is not a thing in itself (in Unicode), it exists as a cluster of code points (although it can be canonicalized in different ways, so those different representations can be said to represent "the same thing", typically a single glyph, but that glyph is not itself a primitive Unicode concept).

Code points are indeed, in a way, more fundamental than code units or grapheme clusters. I'm not disputing that.

I am saying, and maintaining, that for the goal of handling "natural language" (which I take to mean "user-perceived characters", because breaking those is usually what people complain about getting wrong), grapheme clusters are the only thing that matters. As stated, looking at code points means you still have to do the grapheme cluster separation to know where the individual characters start. You're not done when you have the code points. You might be closer, but half a solution is not a solution.

Dart represents "strings" as sequences of UTF-16 code units. You can access the individual code units using .codeUnitAt. The length of a string object is the number of UTF-16 code units.

Then Dart provides multiple views on top of that:

  • A List<int> of code units (.codeUnits),
  • An Iterable<int> of code points (.runes), and
  • An Iterable<String> of grapheme clusters (.characters from package:characters).

They all matter (as @yumetodo says), and they are all available. For handling natural text, you need to use characters/grapheme clusters, not code points.
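
A minimal sketch of the three views side by side (assuming package:characters is available):

    import 'package:characters/characters.dart';

    void main() {
      var s = '🇬🇵';               // The flag example from above.
      print(s.codeUnits.length);  // 4: UTF-16 code units.
      print(s.runes.length);      // 2: code points (regional indicators).
      print(s.characters.length); // 1: grapheme clusters.
    }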

Changing String to be code point based, as this issue is asking for, is not going to fix natural text issues. It is going to affect performance of things where the code points don't matter (and those things exist too, say, like parsing JSON). It's going to be very hard to compile such an API efficiently to JavaScript, which is still a goal for Dart.

There are levels above grapheme clusters too, like the Unicode word boundary or line breaking algorithms. That's beyond the scope of what we're talking about here.

@Cat-sushi (Author)

The title of this issue is kind of sensational, but I don't deny the necessity of an API for UTF-16 code units.

I knew Dart has all these views of a string: UTF-16 code units, code points, and grapheme clusters (new), each defined at a different layer:

view                  class       defined layer                                           unit size (bytes)
code units of UTF-16  String      language core (dart:core), with special syntax support  2
code points           Runes       core API (dart:core), without syntax support            < 4
grapheme clusters     Characters  official API (package:characters), outside the SDK      not fixed
The exposure of UTF-16 code units is much higher than that of the others.
I think this situation is confusing and misleading, and it is code points that should be the default, with special (direct) syntax support.

I also think the code point (and the UTF-16 code unit) should have a dedicated type extending int, via #42 or #1426, like char8_t in C++20.
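
A hypothetical sketch of what such a dedicated type could look like; the Rune name and its members are illustrative only, and a real version under #42 or #1426 would presumably be a zero-cost view over int rather than a class:

    // Hypothetical Rune type; not part of any accepted proposal.
    class Rune {
      final int value;
      Rune(this.value) {
        if (value < 0 || value > 0x10FFFF) {
          throw ArgumentError.value(value, 'value', 'not a Unicode code point');
        }
      }
      @override
      String toString() => String.fromCharCode(value);
    }

    void main() {
      print(Rune(0x20BB7)); // 𠮷
    }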

@Cat-sushi (Author)

I understand that grapheme clusters are the right unit for correct natural language manipulation, but replacing String with Characters is not practical.

Thank you for the discussion.
I will post some other proposals.
