'𠮷'.length should return 1 instead of 2 #1428
Comments
Look at https://pub.dev/packages/characters
@rakudrama
Sorry,
@Cat-sushi For example, How does your You say you are looking for a migration path. What is it that you are migrating, and what do you want to migrate to?
The The code point is the primitive of natural languages, but not necessarily of computers; it is the only unit that can represent every single character in the world. As I said, I don't have a migration path right now.
Possibly, yes.
For your information, the character constant in proposal #886 is a code point.
Cf.
It's not actually true that "Code point is the primitive of natural language". That primitive is the grapheme cluster, "user-perceived characters" as the Unicode standard describes them. Even writing So, as @rakudrama says, focusing on code points is simply not enough to solve the real problems that people see with non-trivial grapheme clusters. It's one step above code units, but not a significant step, because you still have all the same kinds of problems, just with slightly fewer cases each. The only real solution is grapheme clusters, which means Changing the The approach I'd go for is to use #1426 to provide a A
Yes, I knew. By the way, what is the problem with JavaScript?
Sounds good.
Again, it should be part of the language core, because it is the primitive of natural languages.
Do you think Java introducing GC in the practical world regardless of its inefficiency is revolutionary, and Dart introducing sound null safety in the practical world regardless of its difficulty is also revolutionary.
I agree that doing something to make natural language processing easier and have fewer pitfalls is a good idea. So, if we change anything, it will be to provide a grapheme cluster based abstraction, not a code point based one. Some of that existing code is correct (parsing JSON works fine by looking at code units because there is no parsing-relevant character which is not a single code unit, only string contents, same for parsing URIs, dates, integers, etc.). Some is just "good enough" and breaking it won't make people happier. I don't believe that the
Hello everybody. Recently many programming languages have been trying to support Unicode. C++20 supports char8_t, ECMAScript supports a code point iterator, and C#, Go, and Dart are also taking on the challenge. As you may already know, Unicode offers four ways to count a string: bytes, code units, code points, and grapheme clusters. These four ways are parallel and equally valid. @irhn says that code points are not a significant step. However, each of these ways serves a different purpose, so "step" is not a suitable framing for this problem. Code points can be used for logical separation in Unicode. Of course, every approach that tried to express a character in a constant length has failed to realize that ambition. Supporting grapheme clusters is important. However, code point support is also important. I think code points are the minimal unit for counting a string logically. For example, say we are trying to detect a grapheme cluster boundary.
Any other manipulation related to Unicode (such as What makes string manipulation difficult is that we need to switch among these four ways depending on the context. I don't know exactly which APIs Dart is lacking. However, the code point should also be a first-class citizen. Don't forget the "equally".
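To make the "four ways to count" concrete, here is a minimal Dart sketch that counts the same string three of those ways using only the core libraries. The sample string is an illustrative choice, not something from the thread. The fourth count, grapheme clusters, needs the `characters` package linked earlier in the thread and is therefore only mentioned in a comment.

```dart
import 'dart:convert' show utf8;

void main() {
  // 'a' is one byte in UTF-8; '𠮷' (U+20BB7) lies outside the Basic
  // Multilingual Plane, so it takes 4 UTF-8 bytes and 2 UTF-16 code units.
  const s = 'a𠮷';

  print(utf8.encode(s).length); // 5 UTF-8 bytes (1 + 4)
  print(s.codeUnits.length); // 3 UTF-16 code units (1 + 2)
  print(s.runes.length); // 2 code points

  // Grapheme clusters are not available in dart:core; with
  // package:characters added as a dependency, s.characters.length
  // would give that count.
}
```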
It's absolutely correct that code points have their uses (which is also why we have Every layer of abstraction has a reason and an impact (I often parse things directly from UTF-8 bytes read from disk instead of creating an intermediate string, which means I have to go through all the lower levels of abstraction myself, and sometimes that's just what you need to do for a particular problem). Code points are the first level of abstraction that is free from representation artifacts. UTF-8 bytes or UTF-16 code units (stored as bytes in one of two possible byte orders) are choices of byte representations of code points. Code points are indeed, in a way, more fundamental than code units or grapheme clusters. I'm not disputing that. I am saying, and maintaining, that for the goal of handling "natural language" (which I take to mean "user perceived characters", because breaking those is usually what people complain about getting wrong), grapheme clusters are the only thing that matters. As stated, looking at code points means you still have to do the grapheme cluster separation to know where the individual characters start. You're not done when you have the code points. You might be closer, but half a solution is not a solution. Dart represents "strings" as sequences of UTF-16 code units. You can access the individual code units using Then Dart provides multiple views on top of that:
They all matter (as @yumetode says), and they are all available. For handling natural text, you need to use characters/grapheme clusters, not code points. Changing There are levels above grapheme clusters too, like the Unicode word boundary or line breaking algorithms. That's beyond the scope of what we're talking about here.
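As a rough illustration of why counting code points still does not give user-perceived characters, the sketch below builds a string that a reader sees as two characters: an accented letter written as a combining sequence, and a flag. The sample string is an assumption chosen for this example. It uses only the `codeUnits` and `runes` views from `dart:core`; the grapheme cluster view comes from the `characters` package and appears only in a comment.

```dart
void main() {
  // 'e' + U+0301 COMBINING ACUTE ACCENT renders as one accented letter.
  // U+1F1E9 + U+1F1F0 (regional indicators D and K) render as one flag.
  const s = 'e\u0301\u{1F1E9}\u{1F1F0}';

  // UTF-16 code units: 1 + 1 + 2 + 2.
  print(s.codeUnits.length); // 6

  // Code points: still twice what a reader would count.
  print(s.runes.length); // 4

  // With package:characters as a dependency, s.characters.length
  // would report 2, matching the user-perceived characters.
}
```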
The title of this issue is kind of sensational, but I won't deny the necessity of the API for UTF-16 code units. I know Dart has all the views of a string: UTF-16 code units, code points, and grapheme clusters (new), each defined at its own layer.
The exposure of UTF-16 code units is much higher than that of the others. I also think the code point (and the UTF-16 code unit) should have a dedicated type extending
I understand that grapheme clusters are what matters for correct natural language manipulation, but replacing Thank you for the discussion.
I understand that this is a kind of massive breaking change, and I'm looking for a migration path, but haven't found one yet.
As you can see, the current `String.length` is confusing and harmful, especially for non-English speakers.
I believe there are almost no correct usages of the current `String.length`, because we can't manipulate a `String` correctly with it.
The root cause is that `String` is a sequence of `int`s representing code units, not code points, even at the surface of the API.
So this proposal includes introducing `Rune` (not `Runes`) as an `int` representing a code point, and changing `String` to be a sequence of `Rune`s, at least from the API's point of view.
This proposal might also include deprecating the current `operator []` of `String` and introducing `runeAt()` on `String`.
Don't misunderstand me: with this proposal I am not asserting that the internal representation of `String` should be a sequence of code points.
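For reference, the behavior the issue title describes can be reproduced with a few lines of current Dart. This is only a sketch of today's `dart:core` API; the proposed `Rune` type and `runeAt()` do not exist and are not used here.

```dart
void main() {
  const s = '𠮷'; // U+20BB7, stored as the surrogate pair D842 DFB7

  // length counts UTF-16 code units, so the surrogate pair counts twice.
  print(s.length); // 2

  // runes iterates over code points instead.
  print(s.runes.length); // 1

  // operator [] returns a String holding a single code unit,
  // which here is a lone high surrogate rather than the character.
  print(s[0].codeUnitAt(0).toRadixString(16)); // d842

  // The first (and only) code point.
  print(s.runes.first.toRadixString(16)); // 20bb7
}
```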