From 6215072fd3b6cbe182c405846c7f07c16dd2927d Mon Sep 17 00:00:00 2001
From: Tulir Asokan
Date: Mon, 1 May 2023 13:18:59 +0300
Subject: [PATCH 1/7] Clarify that arbitrary unicode is allowed in user/room IDs and room aliases.

Signed-off-by: Tulir Asokan
---
 .../newsfragments/1506.clarification | 1 +
 content/appendices.md | 18 ++++++++++++++++--
 2 files changed, 17 insertions(+), 2 deletions(-)
 create mode 100644 changelogs/appendices/newsfragments/1506.clarification

diff --git a/changelogs/appendices/newsfragments/1506.clarification b/changelogs/appendices/newsfragments/1506.clarification
new file mode 100644
index 000000000..41ef5ac48
--- /dev/null
+++ b/changelogs/appendices/newsfragments/1506.clarification
@@ -0,0 +1 @@
+Clarify that arbitrary unicode is allowed in user/room IDs and room aliases.
diff --git a/content/appendices.md b/content/appendices.md
index 52940aa6c..bc0962ef6 100644
--- a/content/appendices.md
+++ b/content/appendices.md
@@ -598,6 +598,13 @@ character set:
 
     extended_user_id_char = %x21-39 / %x3B-7E ; all ASCII printing chars except :
 
+##### User IDs over federation
+
+Due to a lack of validation in original Matrix homeserver implementations,
+the localpart of user IDs over federation may contain any valid unicode
+codepoints except `:`. A future spec change may create a new room version
+to disallow such user IDs.
+
 ##### Mapping from other character sets
 
 In certain circumstances it will be desirable to map from a wider
@@ -645,6 +652,10 @@ Room IDs are case-sensitive. They are not meant to be
 human-readable. They are intended to be treated as fully opaque strings
 by clients.
 
+The localpart of a room ID (`opaque_id` above) may contain any valid
+unicode codepoints except `:`, but it is recommended to only include
+ASCII letters and digits when generating them.
+
 #### Room Aliases
 
 A room may have zero or more aliases. A room alias has the format:
@@ -655,8 +666,11 @@ The `domain` of a room alias is the [server name](#server-name) of the
 homeserver which created the alias. Other servers may contact this
 homeserver to look up the alias.
 
-Room aliases MUST NOT exceed 255 bytes (including the `#` sigil and the
-domain).
+The localpart of a room alias may contain any valid unicode codepoints
+except `:`.
+
+Room aliases MUST NOT exceed 255 bytes as UTF-8 (including the `#` sigil
+and the domain).
 
 #### Event IDs
 
From 459634f0f18fbd48658aa30c32e3a1f26b032923 Mon Sep 17 00:00:00 2001
From: Travis Ralston
Date: Wed, 9 Aug 2023 12:12:47 -0600
Subject: [PATCH 2/7] Clarify historical ID set further

---
 content/appendices.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/content/appendices.md b/content/appendices.md
index eabc60062..42512e157 100644
--- a/content/appendices.md
+++ b/content/appendices.md
@@ -611,17 +611,19 @@ characters permitted in user ID localparts. There are currently active
 users whose user IDs do not conform to the permitted character set, and
 a number of rooms whose history includes events with a `sender` which
 does not conform. In order to handle these rooms successfully, clients
-and servers MUST accept user IDs with localparts from the expanded
-character set:
+and servers MUST accept user IDs with localparts consisting of any legal
+unicode codepoint except for `:`, including zero characters. Localparts
+MUST be valid UTF-8 sequences.
 
-    extended_user_id_char = %x21-39 / %x3B-7E ; all ASCII printing chars except :
+Servers SHOULD NOT produce user IDs with localparts outside of the following
+character set, and SHOULD NOT forward such user IDs to clients when referenced
+outside the context of an event. For example, device list updates from "invalid"
+user IDs would be dropped by the receiving server.
 
-##### User IDs over federation
+    extended_user_id_char = %x21-39 / %x3B-7E ; all ASCII printing chars except :
 
-Due to a lack of validation in original Matrix homeserver implementations,
-the localpart of user IDs over federation may contain any valid unicode
-codepoints except `:`. A future spec change may create a new room version
-to disallow such user IDs.
+A future room version may prevent users using a historical character set
+from participating. Use of the historical character set is *deprecated*.
 
 ##### Mapping from other character sets
 
From 97381fddb1b9f7e9a6fc0d8e9d34723b635b91e1 Mon Sep 17 00:00:00 2001
From: Tulir Asokan
Date: Wed, 28 Feb 2024 00:59:34 +0200
Subject: [PATCH 3/7] Apply suggestions from code review

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
---
 content/appendices.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/appendices.md b/content/appendices.md
index 42512e157..10a356523 100644
--- a/content/appendices.md
+++ b/content/appendices.md
@@ -612,7 +612,7 @@ users whose user IDs do not conform to the permitted character set, and
 a number of rooms whose history includes events with a `sender` which
 does not conform. In order to handle these rooms successfully, clients
 and servers MUST accept user IDs with localparts consisting of any legal
-unicode codepoint except for `:`, including zero characters. Localparts
+unicode codepoint except for `:`, including the empty string. Localparts
 MUST be valid UTF-8 sequences.
 
 Servers SHOULD NOT produce user IDs with localparts outside of the following
@@ -674,7 +674,7 @@ by clients.
 
 The localpart of a room ID (`opaque_id` above) may contain any valid
 unicode codepoints except `:`, but it is recommended to only include
-ASCII letters and digits when generating them.
+ASCII letters and digits (`A-Z`, `a-z`, `0-9`) when generating them.
 
 #### Room Aliases
 
From 9b75333162dbbf64bf04a513c8de033fa2f17b51 Mon Sep 17 00:00:00 2001
From: Tulir Asokan
Date: Wed, 28 Feb 2024 01:30:15 +0200
Subject: [PATCH 4/7] Clarify allowed control characters

---
 content/appendices.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/content/appendices.md b/content/appendices.md
index 10a356523..52b149ae1 100644
--- a/content/appendices.md
+++ b/content/appendices.md
@@ -612,8 +612,8 @@ users whose user IDs do not conform to the permitted character set, and
 a number of rooms whose history includes events with a `sender` which
 does not conform. In order to handle these rooms successfully, clients
 and servers MUST accept user IDs with localparts consisting of any legal
-unicode codepoint except for `:`, including the empty string. Localparts
-MUST be valid UTF-8 sequences.
+unicode codepoint except for `:` and `NUL` (U+0000), including other control
+characters and the empty string. Localparts MUST be valid UTF-8 sequences.
 
 Servers SHOULD NOT produce user IDs with localparts outside of the following
 character set, and SHOULD NOT forward such user IDs to clients when referenced
@@ -673,8 +673,9 @@ human-readable. They are intended to be treated as fully opaque strings
 by clients.
 
 The localpart of a room ID (`opaque_id` above) may contain any valid
-unicode codepoints except `:`, but it is recommended to only include
-ASCII letters and digits (`A-Z`, `a-z`, `0-9`) when generating them.
+unicode codepoints, including control characters, except `:` and `NUL`
+(U+0000), but it is recommended to only include ASCII letters and
+digits (`A-Z`, `a-z`, `0-9`) when generating them.
 
 #### Room Aliases
 
From 9facd55eec370becaefc5c0537eb5151c1f50780 Mon Sep 17 00:00:00 2001
From: Tulir Asokan
Date: Tue, 14 Jan 2025 20:05:38 +0200
Subject: [PATCH 5/7] Update content/appendices.md

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
---
 content/appendices.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/content/appendices.md b/content/appendices.md
index 32b138f5e..0946e30fa 100644
--- a/content/appendices.md
+++ b/content/appendices.md
@@ -615,10 +615,11 @@ and servers MUST accept user IDs with localparts consisting of any legal
 unicode codepoint except for `:` and `NUL` (U+0000), including other control
 characters and the empty string. Localparts MUST be valid UTF-8 sequences.
 
-Servers SHOULD NOT produce user IDs with localparts outside of the following
-character set, and SHOULD NOT forward such user IDs to clients when referenced
-outside the context of an event. For example, device list updates from "invalid"
-user IDs would be dropped by the receiving server.
+User IDs with localparts containing characters outside the range U+0021 to U+007E, or with
+an empty localpart, are considered non-compliant. For current room versions, servers must
+still accept events using such user IDs over federation; however they SHOULD NOT forward
+such user IDs to clients when referenced outside the context of an event. For example,
+device list updates from non-compliant user IDs would be dropped by the receiving server.
 
     extended_user_id_char = %x21-39 / %x3B-7E ; all ASCII printing chars except :
 
From 01dd6df626b9f22bcc44f17b5a4cb5578c9bdc31 Mon Sep 17 00:00:00 2001
From: Tulir Asokan
Date: Tue, 14 Jan 2025 20:06:08 +0200
Subject: [PATCH 6/7] Remove definition which isn't referenced anywhere

The paragraph above has it built-in now
---
 content/appendices.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/content/appendices.md b/content/appendices.md
index 0946e30fa..04ed40523 100644
--- a/content/appendices.md
+++ b/content/appendices.md
@@ -621,8 +621,6 @@ still accept events using such user IDs over federation; however they SHOULD NOT
 such user IDs to clients when referenced outside the context of an event. For example,
 device list updates from non-compliant user IDs would be dropped by the receiving server.
 
-    extended_user_id_char = %x21-39 / %x3B-7E ; all ASCII printing chars except :
-
 A future room version may prevent users using a historical character set
 from participating. Use of the historical character set is *deprecated*.
 
From 30dd7cf76db7ba8d01ee8fe2b0291c53d8f59e64 Mon Sep 17 00:00:00 2001
From: Tulir Asokan
Date: Wed, 22 Jan 2025 12:25:27 +0200
Subject: [PATCH 7/7] Apply suggestions from code review

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
---
 content/appendices.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/appendices.md b/content/appendices.md
index 04ed40523..7cf8e315a 100644
--- a/content/appendices.md
+++ b/content/appendices.md
@@ -612,8 +612,8 @@ users whose user IDs do not conform to the permitted character set, and
 a number of rooms whose history includes events with a `sender` which
 does not conform. In order to handle these rooms successfully, clients
 and servers MUST accept user IDs with localparts consisting of any legal
-unicode codepoint except for `:` and `NUL` (U+0000), including other control
-characters and the empty string. Localparts MUST be valid UTF-8 sequences.
+non-surrogate Unicode code points except for `:` and `NUL` (U+0000), including other control
+characters and the empty string.
 
 User IDs with localparts containing characters outside the range U+0021 to U+007E, or with
 an empty localpart, are considered non-compliant. For current room versions, servers must
@@ -672,7 +672,7 @@ human-readable. They are intended to be treated as fully opaque strings
 by clients.
 
 The localpart of a room ID (`opaque_id` above) may contain any valid
-unicode codepoints, including control characters, except `:` and `NUL`
+non-surrogate Unicode code points, including control characters, except `:` and `NUL`
 (U+0000), but it is recommended to only include ASCII letters and
 digits (`A-Z`, `a-z`, `0-9`) when generating them.
 
@@ -689,8 +689,8 @@ The `domain` of a room alias is the [server name](#server-name) of the
 homeserver which created the alias. Other servers may contact this
 homeserver to look up the alias.
 
-The localpart of a room alias may contain any valid unicode codepoints
-except `:`.
+The localpart of a room alias may contain any valid non-surrogate Unicode codepoints
+except `:` and `NUL`.
 
 The length of a room alias, including the `#` sigil and the domain, MUST NOT exceed 255 bytes.
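
Reviewer note: the user ID localpart rules as they stand after patch 7/7 can be sketched as a three-way classification. This is only an illustration under stated assumptions, not code from the spec or from any homeserver; the names `LocalpartStatus` and `classify_localpart` are hypothetical. It covers only what the patched text states: `:`, `NUL`, and surrogate code points are never acceptable; an empty localpart or one containing anything outside U+0021 to U+007E must still be accepted over federation but counts as non-compliant.

```python
from enum import Enum


class LocalpartStatus(Enum):
    """Hypothetical labels for the three outcomes described in the patch."""
    REJECTED = "rejected"            # contains ':', NUL, or a surrogate
    NON_COMPLIANT = "non_compliant"  # must be accepted, but is historical
    IN_RANGE = "in_range"            # non-empty, all chars in U+0021..U+007E


def classify_localpart(localpart: str) -> LocalpartStatus:
    """Classify a user ID localpart per the patched appendix text (a sketch)."""
    for ch in localpart:
        cp = ord(ch)
        # ':' separates localpart from domain; NUL and surrogates are excluded.
        if ch == ":" or cp == 0x0000 or 0xD800 <= cp <= 0xDFFF:
            return LocalpartStatus.REJECTED
    # The empty string and anything outside U+0021..U+007E is non-compliant,
    # although servers must still accept it in events over federation.
    if localpart == "" or any(not (0x21 <= ord(ch) <= 0x7E) for ch in localpart):
        return LocalpartStatus.NON_COMPLIANT
    return LocalpartStatus.IN_RANGE
```

A server following the patched text might use the `NON_COMPLIANT` result to decide whether to forward a user ID to clients outside the context of an event, for instance when handling device list updates. The stricter grammar for newly created user IDs is defined elsewhere in the appendix and is deliberately not modelled here.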