
'𠮷'.length should return 1 instead of 2 #1428

Closed
Cat-sushi opened this issue Feb 2, 2021 · 18 comments
Labels: request (Requests to resolve a particular developer problem)

Comments

@Cat-sushi commented Feb 2, 2021

I understand that this would be a massive breaking change, and I'm looking for a migration path, but I haven't found one yet.

As you can see, the current String.length is confusing and harmful, especially for non-English speakers.
I believe there is almost no correct usage of the current String.length, because we can't manipulate a String correctly with it.
The root cause is that a String is a sequence of ints representing code units, not code points, even at the surface of the API.
So, this proposal includes introducing Rune (not Runes) as an int representing a code point, and changing String to be a sequence of Runes, at least from the viewpoint of the API.
This proposal might also include deprecating the current operator [] of String and introducing a runeAt() method on String.
Don't misunderstand me: with this proposal, I'm not asserting that the internal representation of String should be a sequence of code points.
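
To illustrate the problem and the shape of the proposed API, here is a minimal sketch; runeAt() is the hypothetical method proposed above, not an existing String member:

    void main() {
      var s = '𠮷'; // One user-visible character, U+20BB7, outside the BMP.
      print(s.length);       // 2: counts UTF-16 code units (a surrogate pair).
      print(s.runes.length); // 1: counts code points.
      // Under this proposal, something like the following would hold:
      //   s.length == 1;          // length counted in code points
      //   s.runeAt(0) == 0x20BB7; // a Rune at a code-point index
    }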

Cat-sushi added the request label Feb 2, 2021
@Cat-sushi (Author) commented Feb 2, 2021

I hope the character code constants of #886 will be Runes, that is, ints extended by #42 or #1426.

@rakudrama (Member)

Look at https://pub.dev/packages/characters
You should be able to get the right result with '𠮷'.characters.length
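
A minimal sketch of that approach (assuming package:characters is added as a dependency):

    import 'package:characters/characters.dart';

    void main() {
      print('𠮷'.length);            // 2: UTF-16 code units.
      print('𠮷'.characters.length); // 1: grapheme clusters.
    }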

@rakudrama (Member)

@lrhn @kwalrath Lasse - do you think the String class documentation should mention the Characters package / class in the "See also" section?

@Cat-sushi (Author)

@rakudrama
The characters extension introduces Characters, which is an iterable of grapheme clusters, not a dedicated code point type.
My point is that Rune, an int representing a code point, should be a first-class citizen of the language, defined in dart:core, and that String should at least appear to be a sequence of Runes.

@Cat-sushi (Author)

Sorry, Rune is not mandatory; it could be replaced by plain int.
But '𠮷'.length is still confusing even after the introduction of the characters extension.

@rakudrama (Member)

@Cat-sushi
It is also a reasonable position that using individual code points / runes is incorrect for many of the same reasons that using UTF-16 code units is incorrect.

For example,
'🇬🇵'.length --> 4
'🇬🇵'.runes.length --> 2
'🇬🇵'.characters.length --> 1
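
For reference, a minimal runnable sketch of the first two counts, which also shows the individual code points making up the flag:

    void main() {
      var flag = '🇬🇵'; // Two regional-indicator code points form one flag.
      print(flag.length);       // 4
      print(flag.runes.length); // 2
      for (var rune in flag.runes) {
        print('U+${rune.toRadixString(16).toUpperCase()}'); // U+1F1EC, U+1F1F5
      }
    }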

How does your string.runeAt(i) differ from string.runes.elementAt(i)?

You say you are looking for a migration path. What is it that you are migrating, and what do you want to migrate to?
Would analyzer warnings about using String.length, String.[] and String.codeUnitAt be helpful in finding or avoiding incorrect uses?

@Cat-sushi (Author) commented Feb 2, 2021

The characters extension is a big step for non-English-speaking countries, and I very much appreciated its introduction.
Now, I'm looking at the next step.

The code point is the primitive of natural languages, though not necessarily of computers, and it is the only unit that can represent every single character in the world.
I'm talking about natural languages, so let's set emoji aside for now.
As you may know, the UTF-16 code unit has long been the primitive of major programming languages, including Dart, for performance reasons.
But I think that was a kind of million-dollar mistake, because length and operator [] of String can't even manipulate every single character correctly, in the era of the Internet.
I think Dart, as a modern, sophisticated, high-level, client-side language, deserves the introduction of the code point as a feature of the language core.

As I said, I don't have a migration path right now.
But, as I also said, almost all existing pieces of code are incorrect.

Would analyzer warnings about using String.length, String.[] and String.codeUnitAt be helpful in finding or avoiding incorrect uses?

Possibly, yes.

@Cat-sushi (Author) commented Feb 2, 2021

For your information, the character constants in proposal #886 are code points.
To stay consistent, String should be a sequence of code points, for this reason too.

@Cat-sushi (Author)

I hope #42 or #1426 provides a sophisticated solution, including a migration path.

@Cat-sushi (Author)

Cf. "Dart string manipulation done right 👉" by Tao Dong, on Medium.
It's true not only for emoji but also for Han characters.

@lrhn (Member) commented Feb 2, 2021

It's not actually true that "Code point is the primitive of natural language". That primitive is the grapheme cluster, "user-perceived characters" as the Unicode standard describes them.

Even writing é might yield one or two code points (because of the redundant pre-composed character alongside the e + combining accent), and I've seen some platforms (Mac, IIRC) produce the two-code-point version when just writing normal text. Caring about code points for text like that would be a mistake, just as grave as splitting a two-code-unit surrogate pair.
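
A minimal sketch of that pitfall (both strings render as é):

    void main() {
      var precomposed = '\u00E9';  // é as a single code point, U+00E9.
      var decomposed = 'e\u0301';  // e followed by a combining acute accent.
      print(precomposed.runes.length);  // 1
      print(decomposed.runes.length);   // 2: code points alone don't help here.
      print(precomposed == decomposed); // false, despite identical rendering.
    }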

So, as @rakudrama says, focusing on code points is simply not enough to solve the real problems that people see with non-trivial grapheme clusters. It's one step above code units, but not a significant step, because you still have all the same kinds of problems, just with slightly fewer cases each. The only real solution is grapheme clusters, which means characters.

Changing the length of String would be extremely breaking, wouldn't match other functions like indexOf or lastIndexOf (unless they changed too, which would be even more breaking), would be inefficient (you need to iterate the entire string to figure out its length), and there would be no way to make it efficient when compiled to JavaScript. So, we're not going to just do that.

The approach I'd go for is to use #1426 to provide a Characters-like API on top of String. Then you can opt in to using Characters instead of String by simply writing the type.

A String would still be a sequence of UTF-16 code units (because that's what it is in JavaScript, and we can't change that).

@Cat-sushi (Author)

Yes, I knew that.
But e (U+0065) and ́ (U+0301) are logically separable, and in addition there is the combined code point é (U+00E9).
On the other hand, 𠮷 (U+20BB7) is not separable, which makes it much more important that it be representable as a single code point supported by the language core.
I mean, the code point is not perfect, but it is much better, and good enough to represent characters in most cases.
It is not an all-or-nothing discussion; I prefer a better world with Dart.
And I must repeat that almost all pieces of code using length, operator [] or codeUnitAt() are already incorrect, regardless of migration.

By the way, what is the problem with JavaScript?
I think Google always proposes the right things to the JavaScript world, so I believe the obstacle, if it exists, will be removed by Google.

The approach I'd go for is to use #1426 to provide a Characters-like API on top of String.

Sounds good.

Then you can opt in to using Characters instead of String by simply writing the type.

Again, it should be part of the language core, because it is the primitive of natural languages.

@Cat-sushi (Author)

@lrhn

Changing the length of String would be extremely breaking, wouldn't match other functions like indexOf or lastIndexOf (unless they changed too, which would be even more breaking), would be inefficient (you need to iterate the entire string to figure out its length),

Do you think length, indexOf or lastIndexOf can manipulate natural languages correctly?
Don't ignore natural languages other than English any more; it is not politically correct (sorry, I'm joking).
But please imagine a world where English couldn't be manipulated with the functionality of the language core.
We have been oppressed for a long, long time.

Java introducing GC into the practical world regardless of its inefficiency was revolutionary, and Dart introducing sound null safety into the practical world regardless of its difficulty is also revolutionary.
I think it is time to introduce the code point into the language core to make the language more revolutionary.
This is the era of the Internet and Flutter; things have changed drastically since 2011!

@lrhn (Member) commented Feb 2, 2021

I agree that doing something to make natural language processing easier and less prone to pitfalls is a good idea.
I disagree completely that code points are relevant to that. You need to work with grapheme clusters. Anything below that on the abstraction scale will not suffice, and will just have the same problems as using code units.

So, if we change anything, it will be to provide a grapheme cluster based abstraction, not a code point based one.
And most likely it will be on top of, or in addition to, the current String because otherwise we break a lot of existing code.
And I do mean a lot.

Some of that existing code is correct (parsing JSON works fine by looking at code units, because every parsing-relevant character is a single code unit and anything else can only occur inside string contents; the same goes for parsing URIs, dates, integers, etc.). Some is just "good enough", and breaking it won't make people happier.

I don't believe that the String API is something we can change by a single breaking change, it's something that's going to take a very long adjustment and migration period, possibly through several layers of abstraction being added, and at the end, maybe we can deprecate the plain String. I'm not sure that will ever be possible. (This is something we have thought about a lot, it's not just me being a contrarian. I do want it to happen, but I also don't want to break something so fundamental without very good reason and an easy way to fix it).

@yumetodo commented Feb 2, 2021

Hello everybody. Recently, many programming languages have been trying to support Unicode. C++20 supports char8_t, ECMAScript supports a code point iterator, and C#, Go, and Dart are also taking up the challenge.


Maybe, as you already know, in Unicode there are 4 ways to count a string: bytes, code units, code points, and grapheme clusters.
ref: https://unicode.org/faq/char_combmark.html#7

These 4 ways are parallel and equal in standing. @lrhn says that code points are not a significant step. However, each of these ways is used for a different purpose, so "step" is not a suitable framing for this problem.

Code points can be used for logical separation in Unicode.

Of course, every approach that tries to express a character in a constant length fails to realize its great ambition. Supporting grapheme clusters is important. However, code point support is also important.

I think that code points are the minimal unit for counting a string logically. For example, suppose we try to detect grapheme cluster boundaries (a sketch of the first step follows the list below):

  1. Separate the string into code points.
  2. Investigate the code point category of each.
  3. Parse the code point categories.
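
As a minimal sketch of step 1, assuming steps 2 and 3 would consult the Unicode property tables and boundary rules of UAX #29 (not shown), iterating code points in Dart looks like this:

    void main() {
      var s = 'e\u0301𠮷'; // e + combining accent, then a non-BMP Han character.
      for (var rune in s.runes) {
        // Steps 2 and 3 would classify `rune` and apply the boundary rules.
        print('U+${rune.toRadixString(16).toUpperCase().padLeft(4, '0')}');
      }
      // Prints: U+0065, U+0301, U+20BB7
    }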

For any other manipulation related to Unicode (such as tolower, etc.), code points are important.

The reason string manipulation is difficult is that we need to switch between these 4 ways depending on the context.
Which is the most important unit for manipulating a string? The answer is: none. All 4 ways are equally important.

I don't have enough knowledge of exactly what API is lacking in Dart. However, the code point should also be a first-class citizen. Don't forget the "equally".

@lrhn (Member) commented Feb 2, 2021

It's absolutely correct that code points have their uses (which is also why we have String.runes to access them).

Every layer of abstraction has a reason and an impact (I often parse things directly from UTF-8 bytes read from disk instead of creating an intermediate string, which means I have to go through all the lower levels of abstraction myself—and sometimes that's just what you need to do for a particular problem).

Code points are the first level of abstraction which is free from representation artifacts. UTF-8 bytes and UTF-16 code units (stored as bytes in one of two possible byte orders) are choices of byte representations of code points.
The code points themselves (or, to be precise, the scalar values) are logical units which cannot be combined to form a larger unit of meaning. A grapheme cluster is not a thing in itself (in Unicode), it exists as a cluster of code points (although it can be canonicalized in different ways, so those different representations can be said to represent "the same thing", typically a single glyph, but that glyph is not itself a primitive Unicode concept).

Code points are indeed, in a way, more fundamental than code units or grapheme clusters. I'm not disputing that.

I am saying, and maintaining, that for the goal of handling "natural language" (which I take to mean "user-perceived characters", because breaking those is usually what people complain about getting wrong), grapheme clusters are the only thing that matters. As stated, looking at code points means you still have to do the grapheme cluster separation to know where the individual characters start. You're not done when you have the code points. You might be closer, but half a solution is not a solution.

Dart represents "strings" as sequences of UTF-16 code units. You can access the individual code units using .codeUnitAt. The length of a string object is the number of UTF-16 code units.

Then Dart provides multiple views on top of that:

  • A List<int> of code units (.codeUnits),
  • An Iterable<int> of code points (.runes), and
  • An Iterable<String> of grapheme clusters (.characters from package:characters).

They all matter (as @yumetodo says), and they are all available. For handling natural text, you need to use characters/grapheme clusters, not code points.
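
A minimal sketch of the three views side by side (assuming package:characters is available):

    import 'package:characters/characters.dart';

    void main() {
      var s = '🇬🇵';               // The flag example from above.
      print(s.codeUnits.length);  // 4: UTF-16 code units.
      print(s.runes.length);      // 2: code points (regional indicators).
      print(s.characters.length); // 1: grapheme clusters.
    }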

Changing String to be code point based, as this issue is asking for, is not going to fix natural text issues. It is going to affect performance of things where the code points don't matter (and those things exist too, say, like parsing JSON). It's going to be very hard to compile such an API efficiently to JavaScript, which is still a goal for Dart.

There are levels above grapheme clusters too, like the Unicode word boundary or line breaking algorithms. That's beyond the scope of what we're talking about here.

@Cat-sushi (Author)

The title of this issue is kind of sensational, but I don't deny the necessity of an API for UTF-16 code units.

I knew Dart has all these views of a string: UTF-16 code units, code points, and grapheme clusters (new), each defined at a different layer:

view                  class       defined layer                                           unit size (bytes)
code units of UTF-16  String      language core (dart:core), with special syntax support  2
code points           Runes       core API (dart:core), without syntax support            < 4
grapheme clusters     Characters  official API (package:characters), outside the SDK      not fixed
The exposure of UTF-16 code units is much higher than that of the others.
I think this situation is confusing and misleading, and it is code points that should be the default, with special (direct) syntax support.

I also think the code point (and the UTF-16 code unit) should have a dedicated type extending int, via #42 or #1426, like char8_t in C++20.
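
A hypothetical sketch of what such a dedicated type could look like; the Rune name and its members are illustrative only, and a real version under #42 or #1426 would presumably be a zero-cost view over int rather than a class:

    // Hypothetical Rune type; not part of any accepted proposal.
    class Rune {
      final int value;
      Rune(this.value) {
        if (value < 0 || value > 0x10FFFF) {
          throw ArgumentError.value(value, 'value', 'not a Unicode code point');
        }
      }
      @override
      String toString() => String.fromCharCode(value);
    }

    void main() {
      print(Rune(0x20BB7)); // 𠮷
    }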

@Cat-sushi (Author)

I understand that grapheme clusters are the right unit for correct natural language manipulation, but replacing String with Characters is not practical.

Thank you for the discussion.
I will post some other proposals.
