Abstract

This document proposes a format for human-readable IDs (specifically, room aliases) within Matrix.

Background

UTF-8 is the dominant character encoding for Unicode on the web. However, using Unicode as the character set for human-readable IDs is troublesome. Many distinct characters appear identical to each other but would produce different IDs. In addition, there are non-printable characters which are not rendered at all for the end user. This creates an opportunity for phishing/spoofing of IDs, commonly known as a homograph attack.

Web browsers encountered this problem when Internationalized Domain Names (IDNs) were introduced. A variety of checks were put in place to protect users. If an address failed a check, the raw punycode was displayed instead to disambiguate the address.

The only human-readable IDs currently in Matrix are room aliases. Room aliases look like #localpart:domain. These aliases point to opaque, non-human-readable room IDs. An alias can be changed to point at a different room ID at any time.

Proposal

Room aliases have the format:

#localpart:domain

As with other identifiers using the common identifier format, the domain is a server name - in this case, the server hosting this alias, which may be contacted to resolve the alias to a room ID. The domain may be an internationalized domain name, encoded using punycode. When displaying the alias to users, Matrix clients may optionally decode any punycode-encoded parts of the domain to Unicode.
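As an illustration of the format above, the following minimal sketch splits an alias into its parts and decodes punycode labels for display. The function names split_alias and display_domain are hypothetical, and the decoding relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules; a real client may well want different behaviour.

```python
def split_alias(alias: str) -> tuple[str, str]:
    """Split '#localpart:domain' into (localpart, domain)."""
    if not alias.startswith("#"):
        raise ValueError("room alias must start with the '#' sigil")
    # Split on the first ':' only: the localpart may not contain ':',
    # but the domain may (for example, an explicit port).
    localpart, sep, domain = alias[1:].partition(":")
    if not sep or not localpart or not domain:
        raise ValueError("room alias must look like #localpart:domain")
    return localpart, domain


def display_domain(domain: str) -> str:
    """Decode punycode ('xn--') labels of a domain for display only."""
    host, sep, port = domain.partition(":")
    labels = []
    for label in host.split("."):
        if label.lower().startswith("xn--"):
            try:
                label = label.encode("ascii").decode("idna")
            except UnicodeError:
                pass  # leave undecodable labels in their raw form
        labels.append(label)
    return ".".join(labels) + sep + port
```

The decoded form is intended for display only; the punycode-encoded domain remains the one used to contact the server.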

The localpart is a UTF-8-encoded, NFC-normalised Unicode string. The following constitute invalid localparts for room aliases:

XXX: we need to figure out which of these things to actually forbid:

- invalid UTF-8
  - invalid byte sequences
  - UTF-16 surrogates U+D800 to U+DFFF
  - code points above U+10FFFF
  - overlong encodings
- strings not in NFC
- characters forbidden by NAMEPREP https://tools.ietf.org/html/rfc3491#section-5 ?
- strings which contain any of the 107 blacklisted characters listed at http://kb.mozillazine.org/Network.IDN.blacklist_chars ?
- strings which do not meet the bidi requirements https://tools.ietf.org/html/rfc5893 ? https://tools.ietf.org/html/rfc3454#section-6 ?
- Things from more than one language? [After stripping ", 0-9, +, -, [, ], _ and the space character, it MUST NOT contain characters from more than one language, as defined by the exemplar characters on http://cldr.unicode.org/ ]
- strings whose first character is a Unicode combining mark?
- strings which include the DISALLOWED code points in RFC5892. (This includes a lot of things which didn't exist in 2010, like emoji, so I don't think we should take this list as-is.)
- Complicated rules about CONTEXTO or CONTEXTJ code points in RFC5892.
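Only some of the candidate rules above can be checked with a standard library alone. The following is a minimal sketch of three of them (invalid UTF-8, strings not in NFC, and a leading combining mark), assuming the localpart arrives as raw bytes. The function name localpart_problems is hypothetical, and the remaining rules (NAMEPREP, the blacklist, bidi, RFC5892) are deliberately not covered.

```python
import unicodedata


def localpart_problems(raw: bytes) -> list[str]:
    """Return the names of the checked rules that a raw localpart breaks."""
    try:
        # Python's strict UTF-8 decoder rejects invalid byte sequences,
        # encoded surrogates, code points above U+10FFFF and overlong forms.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return ["invalid UTF-8"]

    problems = []
    if unicodedata.normalize("NFC", text) != text:
        problems.append("not in NFC")
    # General categories Mn, Mc and Me are the combining marks.
    if text and unicodedata.category(text[0]).startswith("M"):
        problems.append("starts with a combining mark")
    return problems
```

For example, the overlong byte sequence b"\xc0\xaf" is reported as invalid UTF-8, while "e" followed by U+0301 decodes cleanly but is reported as not being in NFC.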

The total length of the (UTF-8-encoded) room alias, including the sigil and the server name, must not exceed 255 characters.
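A length check might be sketched as below. It assumes that the limit is counted in bytes of the UTF-8 encoding rather than in code points; that reading is an assumption, not something the text settles. The names MAX_ALIAS_LENGTH and alias_length_ok are hypothetical.

```python
MAX_ALIAS_LENGTH = 255  # assumed to be measured in UTF-8 bytes


def alias_length_ok(alias: str) -> bool:
    """Check the full alias, including the '#' sigil and the server name."""
    return len(alias.encode("utf-8")) <= MAX_ALIAS_LENGTH
```

Under this reading, an alias made of 130 two-byte characters plus a short server name is rejected even though it contains fewer than 255 code points.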

Servers should not allow clients to create aliases which are considered invalid according to any of the above rules. Servers should also reject attempts to resolve such aliases.

Provided an alias is valid, the following rules should be followed to normalise an alias for storage and lookup:

- Normalise to NFKC
- Remove characters listed in https://tools.ietf.org/html/rfc3454#appendix-B.1
- Case-map according to https://tools.ietf.org/html/rfc3454#appendix-B.2
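The three steps above can be sketched with Python's standard stringprep module, which exposes the RFC 3454 tables as in_table_b1 and map_table_b2. The function name normalise_localpart is hypothetical, the steps are applied in the order listed here, and the stringprep tables are based on Unicode 3.2, so newer characters are handled only approximately.

```python
import stringprep
import unicodedata


def normalise_localpart(localpart: str) -> str:
    """Normalise an already-validated localpart for storage and lookup."""
    # Step 1: normalise to NFKC.
    text = unicodedata.normalize("NFKC", localpart)
    # Step 2: remove characters "commonly mapped to nothing" (table B.1).
    text = "".join(ch for ch in text if not stringprep.in_table_b1(ch))
    # Step 3: case-map each character according to table B.2.
    return "".join(stringprep.map_table_b2(ch) for ch in text)
```

With this sketch, "Foo", the fullwidth form of "FOO", and "f" + U+00AD (soft hyphen) + "oo" all normalise to "foo", so they would resolve to the same stored alias.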

Rationale

Each ID is split into segments (localpart/domain) around the :. For this reason, : is a reserved character and cannot appear in a localpart. The 107 blacklisted characters are used to prevent non-printable characters and spaces from being used. The decision to ban characters from more than one language matches the behaviour of Google Chrome for IDN handling. This protects against common homograph attacks such as ebаy.com (Cyrillic "a", rest is Latin), which would always fail the check. Even so, there are limitations: for example, сахар is entirely Cyrillic, whereas the visually identical caxap is entirely Latin, so each passes the check on its own.
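The homograph examples above can be illustrated with a rough sketch. It approximates "language" by the first word of each letter's Unicode character name (for example LATIN or CYRILLIC), which is a stand-in chosen for brevity: it is not the CLDR exemplar-character test described in the proposal, and it would wrongly flag legitimate mixtures such as Japanese kana with kanji. The function names scripts_used and is_mixed_script are hypothetical.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Collect a crude script label for every letter in the text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def is_mixed_script(text: str) -> bool:
    """True if the letters in the text come from more than one script."""
    return len(scripts_used(text)) > 1
```

Under this approximation, "eb\u0430y" (with a Cyrillic "а") is flagged as mixed, while the all-Cyrillic "\u0441\u0430\u0445\u0430\u0440" and the all-Latin "caxap" both pass, which is exactly the limitation noted above.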
