Commit
`__len__()` methods to return length in bytes

MODULAR_ORIG_COMMIT_REV_ID: 7f87f1d40ee48279b20abac457d7bf3d2b326e5d
1 comment on commit e9ce5dc
Hi, I don't agree with this assessment and the subsequent decision. This is not the case in languages with non-ASCII characters. Python counts in Unicode code points because most of its APIs return offsets in Unicode code points. That is not only because the underlying encoding is UTF-32; it's also because people working in non-English languages think in terms of Unicode characters, not bytes.
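To make the byte versus code point distinction concrete, here is a small Python sketch. It is not taken from the commit or from any standard library under discussion, and the `lengths` helper is a hypothetical name used only for illustration. It compares what Python's `len()` reports for a `str` with the size of the same text encoded as UTF-8.

```python
# Illustrative sketch: code point count vs. UTF-8 byte count.
# `lengths` is a hypothetical helper, not an API from the commit.

def lengths(text):
    """Return (code point count, UTF-8 byte count) for `text`."""
    return len(text), len(text.encode("utf-8"))


samples = [
    "hello",             # ASCII only: both counts agree
    "h\u00e9llo",        # "héllo": one 2-byte character
    "\u65e5\u672c\u8a9e",  # "日本語": three 3-byte characters
    "\U0001F525",        # fire emoji: one 4-byte character
]

for s in samples:
    codepoints, utf8_bytes = lengths(s)
    print(f"{s!r}: {codepoints} code points, {utf8_bytes} UTF-8 bytes")
```

The two counts only agree for pure ASCII input, which is why the choice of default matters most for non-English text.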
We could just implement a specialized string-like type (e.g. `RawString`) that has these kinds of optimizations, but I strongly disagree with deviating so much from Python's string by default. Imposing this kind of complexity on the end user is something I strongly dislike about other low-level languages. IMO the door should be open for performance, but the default should be simplicity.

PS: Allowing these kinds of optimizations when working on raw bytes is why I've been pushing so much for PR #3548.
For months I've been working towards making several functions return Unicode code points like Python does (having them work by byte offset will most likely also break a lot of Python code). This started before issue #3246 and got serious with issue #3526. I've been slowly changing every place I could find that assumed indexing is by byte offset.
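As a rough illustration of how byte offsets could break code written against Python semantics, the sketch below searches the same text twice in plain Python: once as a `str` (code point offsets) and once as UTF-8 `bytes` (byte offsets). The `find_offsets` helper and the sample text are assumptions made for this example only.

```python
# Illustrative sketch: the same search yields different offsets depending
# on whether the text is indexed by code point or by UTF-8 byte.
# `find_offsets` is a hypothetical helper for this example.

def find_offsets(text, needle):
    """Return (code point offset, UTF-8 byte offset) of `needle` in `text`."""
    codepoint_offset = text.find(needle)
    byte_offset = text.encode("utf-8").find(needle.encode("utf-8"))
    return codepoint_offset, byte_offset


text = "na\u00efve caf\u00e9 menu"  # "naïve café menu"
cp_off, byte_off = find_offsets(text, "menu")

# Slicing the str with the code point offset gives the expected result.
print(text[cp_off:])

# Reusing the byte offset as if it were a code point offset silently
# slices at the wrong position instead of raising an error.
print(text[byte_off:])
```

The failure mode is quiet: nothing raises, the slice is simply shifted by the number of extra bytes that precede the match.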
@ConnorGray, @JoeLoser, @lsh I strongly dislike this decision, and it feels very one-sided. I think we can find some middle ground by having either a parameterized way to indicate the indexing type or a new type that indexes by byte offset. IMO `Span` already covers this use case (and building a `StringSlice` from a `Span` is practically free).