Abstract

This document proposes a format for human-readable IDs (specifically, room aliases) within Matrix.

Background

UTF-8 is the dominant character encoding for Unicode on the web. However, using Unicode as the character set for human-readable IDs is troublesome. Many distinct characters appear identical to each other but produce different IDs. In addition, some characters are non-printable and are not rendered for the end user at all. This creates an opportunity for phishing or spoofing of IDs, commonly known as a homograph attack.

Web browsers encountered this problem when Internationalized Domain Names (IDNs) were introduced. A variety of checks were put in place to protect users. If an address failed a check, the raw punycode was displayed instead, to disambiguate the address.

The only human-readable IDs currently in Matrix are room aliases. Room aliases look like #localpart:domain. These aliases point to opaque, non-human-readable room IDs. An alias can be changed to point at a different room ID at any time.

Proposal

Room aliases have the format:

#localpart:domain

As with other identifiers using the common identifier format, the domain is a server name: in this case, the server hosting this alias, which may be contacted to resolve the alias to a room ID. The domain may be an internationalized domain name, encoded using punycode. When displaying the alias to users, Matrix clients may optionally decode any punycode-encoded parts of the domain to Unicode.
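As an illustration of the format above, the following is a minimal sketch of splitting an alias and decoding a punycode domain for display. The function names `split_alias` and `display_domain` are hypothetical, and the sketch assumes Python's built-in `idna` codec is an acceptable decoder; it does not handle IPv6 literal server names.

```python
from typing import Tuple


def split_alias(alias: str) -> Tuple[str, str]:
    """Split '#localpart:domain' into (localpart, domain)."""
    if not alias.startswith("#"):
        raise ValueError("room alias must start with the '#' sigil")
    # Split on the first ':' only: ':' is reserved in the localpart,
    # but the domain may carry a port (e.g. 'example.org:8448').
    localpart, sep, domain = alias[1:].partition(":")
    if not sep or not localpart or not domain:
        raise ValueError("room alias must look like #localpart:domain")
    return localpart, domain


def display_domain(domain: str) -> str:
    """Decode punycode labels for display; fall back to the raw domain."""
    host, sep, port = domain.partition(":")
    try:
        # Python's built-in 'idna' codec decodes 'xn--' labels.
        decoded = host.encode("ascii").decode("idna")
    except UnicodeError:
        # Not decodable: show the domain exactly as received.
        return domain
    return decoded + sep + port


localpart, domain = split_alias("#books:xn--bcher-kva.example")
print(localpart, display_domain(domain))
```

Falling back to the raw domain on a decoding error keeps the display step optional, in line with the "may optionally decode" wording.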

The localpart is a UTF-8-encoded, NFC-normalised Unicode string. The following constitute invalid localparts for room aliases:

XXX: we need to figure out which of these things to actually forbid:

The total length of the (UTF-8 encoded) room alias, including the sigil and the server name, must not exceed 255 characters.

Servers should not allow clients to create aliases which are considered invalid according to any of the above rules. Servers should also reject attempts to resolve such aliases.
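Two of the checks mentioned above, NFC normalisation of the localpart and the overall length limit, can be sketched as follows. This is only an illustration: `check_alias` is a hypothetical name, the sketch assumes the 255 limit is measured on the UTF-8 encoded form (that is, in bytes), and it does not cover the character rules that are still to be decided.

```python
import unicodedata

# Limit from the text; assumed here to count UTF-8 encoded bytes.
MAX_ALIAS_LENGTH = 255


def check_alias(alias: str) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    # Length covers the whole alias: sigil, localpart, ':' and server name.
    if len(alias.encode("utf-8")) > MAX_ALIAS_LENGTH:
        problems.append("alias exceeds 255 bytes when UTF-8 encoded")

    # The localpart sits between the '#' sigil and the first ':'.
    localpart = alias[1:].partition(":")[0]
    if unicodedata.normalize("NFC", localpart) != localpart:
        problems.append("localpart is not NFC-normalised")

    return problems


# 'e' followed by U+0301 COMBINING ACUTE ACCENT is not in NFC form;
# the precomposed U+00E9 is.
print(check_alias("#cafe\u0301:example.org"))
print(check_alias("#caf\u00e9:example.org"))
```

A server following the text would refuse both to create and to resolve an alias for which such a check reports a problem.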

Provided an alias is valid, the following rules should be followed to normalise an alias for storage and lookup:

Rationale

Each ID is split into segments (localpart/domain) around the :. For this reason, : is a reserved character and cannot be a localpart character. The 107 blacklisted characters are used to prevent non-printable characters and spaces from being used.

The decision to ban mixing characters from more than one language matches the behaviour of Google Chrome for IDN handling. This protects against common homograph attacks such as ebаy.com (Cyrillic "a", rest is Latin), which would always fail the check. Even so, there are limitations. For example, сахар is entirely Cyrillic, whereas caxap is entirely Latin, so both pass the check even though they look almost identical.
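The mixed-script check and its limitation can be illustrated with a rough sketch. Python's standard library does not expose the Unicode Script property, so this sketch approximates a character's script by the first word of its Unicode name (for example "LATIN" or "CYRILLIC"). That heuristic, and the names `scripts_used` and `is_mixed_script`, are assumptions made for illustration; they are not how Chrome or any Matrix implementation performs the check, and the heuristic would wrongly flag legitimate combinations such as Japanese text that mixes kana and kanji.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Approximate the set of scripts in text via Unicode name prefixes."""
    scripts = set()
    for ch in text:
        # Only letters are considered; digits and punctuation are ignored.
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ", 1)[0])
    return scripts


def is_mixed_script(text: str) -> bool:
    """True when letters from more than one (approximate) script appear."""
    return len(scripts_used(text)) > 1


# 'eb' + CYRILLIC SMALL LETTER A + 'y': mixed, so the check fails it.
print(is_mixed_script("eb\u0430y"))
# Entirely Cyrillic and entirely Latin lookalikes: neither is mixed.
print(is_mixed_script("\u0441\u0430\u0445\u0430\u0440"), is_mixed_script("caxap"))
```

The last line shows the limitation described above: each lookalike string uses a single script, so a mixed-script check alone cannot tell them apart.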