This document proposes a format for human-readable IDs (specifically, room aliases) within Matrix.
UTF-8 is the dominant character encoding for Unicode on the web. However, using Unicode as the character set for human-readable IDs is troublesome. Many distinct characters appear identical to each other but would produce different IDs. In addition, there are non-printable characters which are not visibly rendered to the end user. This creates an opportunity for phishing/spoofing of IDs, commonly known as a homograph attack.
Web browsers encountered this problem when Internationalized Domain Names (IDNs) were introduced. A variety of checks were put in place in order to protect users. If an address failed the checks, the raw punycode would be displayed to disambiguate the address.
The only human-readable IDs currently in Matrix are Room Aliases. Room aliases
look like #localpart:domain. These aliases point to opaque, non-human-readable
room IDs. These pointers can change to point at a different room ID at any
time.
Room aliases have the format:
#localpart:domain
As with other identifiers using the common identifier format, the domain is
a server name - in this case, the server hosting this alias, which may be
contacted to resolve the alias to a room ID. The domain may be an
internationalized domain name, encoded using punycode. When displaying the
alias to users, Matrix clients may optionally decode any punycode-encoded parts
of the domain to Unicode.
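As an illustration of the optional decoding step, here is a minimal sketch using Python's built-in "idna" codec. The function name display_domain is hypothetical, and the handling of a trailing port is an assumption about how server names may be written; a client would likely also want to apply homograph checks before showing the decoded form.

```python
def display_domain(domain: str) -> str:
    """Decode punycode (xn--) labels of a domain for display purposes."""
    # Assumption: the server name may carry a port suffix such as ":8448".
    host, sep, port = domain.partition(":")
    try:
        decoded = host.encode("ascii").decode("idna")
    except UnicodeError:
        # Leave names that cannot be decoded untouched.
        return domain
    return decoded + sep + port


print(display_domain("xn--mnchen-3ya.de"))
```

Labels without the xn-- prefix pass through unchanged, so plain ASCII domains are displayed as-is.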
The localpart is a UTF-8-encoded, NFC-normalised Unicode string. The
following constitute invalid localparts for room aliases:
XXX: we need to figure out which of these things to actually forbid:
- invalid UTF-8
  - invalid byte sequences
  - UTF-16 surrogates U+D800 to U+DFFF
  - code points above U+10FFFF
  - overlong encodings
- strings not in NFC
- characters forbidden by NAMEPREP https://tools.ietf.org/html/rfc3491#section-5 ?
- strings which contain any of the 107 blacklisted characters listed at http://kb.mozillazine.org/Network.IDN.blacklist_chars ?
- strings which do not meet the bidi requirements https://tools.ietf.org/html/rfc5893 ? https://tools.ietf.org/html/rfc3454#section-6 ?
- Things from more than one language? ["After stripping ``"``, ``0-9``, ``+``, ``-``, ``[``, ``]``, ``_``, and the space character `` `` it MUST NOT contain characters from more than one language, defined by the exemplar characters on http://cldr.unicode.org/"]
- strings whose first character is a Unicode combining mark?
- strings which include the DISALLOWED code points in RFC5892. (This includes a lot of things which didn't exist in 2010, like emoji, so I don't think we should take this list as-is.)
- Complicated rules about CONTEXTO or CONTEXTJ code points in RFC5892.
The total length of the (UTF-8 encoded) room alias, including the sigil and the server name, must not exceed 255 bytes.
Servers should not allow clients to create aliases which are considered invalid according to any of the above rules. Servers should also reject attempts to resolve such aliases.
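A server-side check for a few of the less contentious candidate rules might look like the following sketch. It covers only UTF-8 validity, NFC form, the leading combining mark rule and the length limit; the NAMEPREP, blacklist, bidi and RFC5892 rules are not attempted. The function and constant names are hypothetical, and treating the 255 limit as a count of UTF-8 bytes is an assumption.

```python
import unicodedata

MAX_ALIAS_BYTES = 255  # assumed to apply to the UTF-8 encoded alias as a whole


def is_plausible_alias(raw: bytes) -> bool:
    """Apply a subset of the candidate validity rules to a raw alias."""
    if len(raw) > MAX_ALIAS_BYTES:
        return False
    try:
        # Strict decoding rejects invalid byte sequences, encoded surrogates,
        # code points above U+10FFFF and overlong encodings.
        alias = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    if not alias.startswith("#"):
        return False
    localpart, sep, domain = alias[1:].partition(":")
    if not sep or not localpart or not domain:
        return False
    # The localpart must already be in NFC.
    if unicodedata.normalize("NFC", localpart) != localpart:
        return False
    # Candidate rule: the first character must not be a combining mark.
    if unicodedata.category(localpart[0]).startswith("M"):
        return False
    return True


print(is_plausible_alias("#matrix:example.org".encode("utf-8")))
```

The same function could be used both when a client tries to create an alias and when a resolution request arrives.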
Provided an alias is valid, the following rules should be followed to normalise an alias for storage and lookup:
- Normalise to NFKC
- Remove characters listed in https://tools.ietf.org/html/rfc3454#appendix-B.1
- Case-map according to https://tools.ietf.org/html/rfc3454#appendix-B.2
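The three normalisation steps can be sketched with Python's standard unicodedata and stringprep modules. This is only an illustration: the function name normalise_localpart is hypothetical, and the stringprep tables follow RFC 3454, which is tied to Unicode 3.2, so newer characters are not covered by the B.1 and B.2 tables.

```python
import stringprep
import unicodedata


def normalise_localpart(localpart: str) -> str:
    """Normalise a valid localpart for storage and lookup."""
    # Step 1: normalise to NFKC.
    text = unicodedata.normalize("NFKC", localpart)
    # Step 2: remove characters listed in RFC 3454 appendix B.1.
    text = "".join(ch for ch in text if not stringprep.in_table_b1(ch))
    # Step 3: case-map each character according to RFC 3454 appendix B.2.
    return "".join(stringprep.map_table_b2(ch) for ch in text)


print(normalise_localpart("Ma\u00adtrix"))
```

Under these rules, aliases that differ only in case, in compatibility forms such as fullwidth letters, or in invisible B.1 characters such as the soft hyphen map to the same stored key.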
Each ID is split into segments (localpart/domain) around the : character. For
this reason, : is a reserved character and cannot be a localpart character.
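Because the localpart can never contain the separator, splitting on the first occurrence is unambiguous even if the domain itself contains one (for example, an explicit port, which is an assumption here). A minimal sketch, with the hypothetical helper name split_alias:

```python
from typing import Tuple


def split_alias(alias: str) -> Tuple[str, str]:
    """Split a room alias into (localpart, domain) around the first ':'."""
    if not alias.startswith("#"):
        raise ValueError("room alias must start with the '#' sigil")
    localpart, sep, domain = alias[1:].partition(":")
    if not sep or not localpart or not domain:
        raise ValueError("room alias must have the form #localpart:domain")
    return localpart, domain


print(split_alias("#matrix:example.org:8448"))
```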
The 107 blacklisted characters are used to prevent non-printable characters and
spaces from being used. The decision to ban characters from more than one
language matches the behaviour of Google Chrome for IDN handling. This protects
against common homograph attacks such as ebаy.com (Cyrillic "а", rest is
Latin), which would always fail the check. Even so, there are limitations: for
example, сахар is entirely Cyrillic, whereas caxap is entirely Latin, so both
pass the check despite looking nearly identical.
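The following sketch illustrates both the check and its limitation. It is a rough stand-in for the rule described above, not an implementation of it: instead of CLDR exemplar characters it approximates the script of each letter by the first word of its Unicode character name, which is an assumption that breaks down for writing systems that legitimately mix scripts. The function names are hypothetical.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Return a rough set of scripts used by the letters in text."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits and punctuation are ignored in this sketch
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "CYRILLIC SMALL LETTER A" gives "CYRILLIC"
            scripts.add(name.split()[0])
    return scripts


def is_single_script(text: str) -> bool:
    """True if all letters appear to come from one script."""
    return len(scripts_used(text)) <= 1


# Mixed Latin/Cyrillic is flagged, but whole-script lookalikes are not.
print(is_single_script("eb\u0430y"))
print(is_single_script("\u0441\u0430\u0445\u0430\u0440"))
print(is_single_script("caxap"))
```

Catching whole-script lookalikes such as the last two strings would need a separate confusables comparison, which is outside what the single-language rule can provide.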