-
-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify that size limits are in bytes #1234
Conversation
In almost all instances we are already talking about UTF-8 encoded ASCII. As such this is not a behavioural change apart from in theory legacy identifiers, that never were part of the spec. fixes matrix-org#1001 Signed-off-by: Nicolas Werner <[email protected]>
I probably missed some and some of these are basically reundant, because it says "Needs to be [a-z]" or similar right in front of it, but some people still disagree and bytes should be clear enough (UTF-8 encoding is implicit by the spec already). |
(I also probably missed some) |
Signed-off-by: Nicolas Werner <[email protected]>
This looks like it should be an MSC for a future room version |
It depends on how the spec clarification turns out, but in general this shouldn't break anything, considering in pretty much all instances it restricts the character set to ASCII before talking about codepoints or characters. In which case bytes and characters are equivalent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just ftr: we're looking into these to determine if they're all able to be changed to bytes as some are not immediately obvious or likely require an MSC on their own. However, some do look valid, which is good.
It will take a while to go through all of these - please be patient and understanding with us.
From a pure spec perspective these would be fine, imo, since usually the charset is constrained to ASCII before it is talked about codepoints. Now if that works in reality with implementations in the wild is a different story, so take the time you need. |
Some of the usages of "characters" and such look very deliberate is all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[@deepbluev7] some of these are basically reundant
yes.
[@turt2live] Some of the usages of "characters" and such look very deliberate is all.
also this.
I feel like I'd be happier reviewing this if we removed the noise from places where characters and bytes are the same thing (possibly to a separate PR), so that we can better focus on the places it makes a difference.
In general, where we are actually changing things, I don't think its sufficient to simply s/characters/bytes/ without specifying which encoding we are talking about. Generally we need to say "the UTF-8 encoding must not exceed 255 bytes" or words to that effect.
Also: it would be good to add something like this to each change:
{{% changed-in v="1.10" %}} Clarified that length limit is in bytes, not codepoints.
description: Unique ID for the message, used for idempotence. Arbitrary | ||
utf8 string, of maximum length 32 codepoints. | ||
utf8 string, of maximum length 32 bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a pre-existing problem, but message_id
should not be a UTF-8 string (or rather, it does not need to be one and should not be interpreted as one). It should be a regular unicode string, just like all the other strings in matrix. Assuming it is then sent over an HTTP request which uses a JSON body with UTF-8 encoding, it will be encoded as UTF-8.
In particular: if it were a UTF-8 string, then bytes such as 0xC0
would be invalid. In practice, 0xC0
is a valid value in this string; it is "LATIN CAPITAL LETTER A WITH GRAVE".
In short, this text needs rewording, rather than simply s/codepoints/bytes/:
description: Unique ID for the message, used for idempotence. Arbitrary | |
utf8 string, of maximum length 32 codepoints. | |
utf8 string, of maximum length 32 bytes. | |
description: Unique ID for the message, used for idempotence. Arbitrary | |
string, whose UTF-8 encoding must not exceed 32 bytes. |
@deepbluev7 are you still interested in working on this? |
In almost all instances we are already talking about UTF-8 encoded ASCII. As such this is not a behavioural change apart from in theory legacy identifiers, that never were part of the spec.
fixes #1001
Signed-off-by: Nicolas Werner [email protected]
See also:
matrix-org/synapse#13710
matrix-org/gomatrixserverlib#338
Preview: https://pr1234--matrix-spec-previews.netlify.app