Update 'qualifiers' rule in core spec #382 #398

johnmhoran · 2025-02-26T03:11:06Z

Reference: #382

@pombredanne @jkowalleck @mprpic @matt-phylum I've updated the qualifiers rule in the core spec's "Rules for each purl component" section, still need to

revisit the how-to-build and how-to-parse sections
review relevant issues/PRs to determine if any qualifiers-related items need to be addressed in faq.rst
add a few qualifiers-related tests (those that already exist all look good, but I'll add a few "is_invalid": true tests for the encoding details I've included in the qualifiers update).

Turning to these ^ in the morning.

Reference: #382 Signed-off-by: John M. Horan <[email protected]>

PURL-SPECIFICATION.rst

jkowalleck · 2025-02-26T08:51:42Z

PURL-SPECIFICATION.rst

+
+      - The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
+        '_' (period, dash and underscore).
+      - A ``key`` MUST start with an ASCII letter.


A key MUST start with an ASCII letter.

this is a change.
old:

A key cannot start with a number

the new spec would prevent keys like _foo, -bar, .bazz

I think this is a change I like and that makes sense, and I surmise that there are no sane qualifiers out there that do start with _, -, or .. For the sake of backward compatibility, we could either:

keep the original, updated: - A ``key`` MUST NOT start with a number.

keep this improved form: - A ``key`` MUST start with an ASCII letter.

Unless there is a strong group of PURLs in the wild that exist, I would prefer the improved form.

Given the comments thus far ^, I'm keeping as is unless/until I hear otherwise.

jkowalleck · 2025-02-26T08:55:02Z

PURL-SPECIFICATION.rst

+      - The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
+        '_' (period, dash and underscore).
+      - A ``key`` MUST start with an ASCII letter.
+      - A ``key`` MUST NOT be percent-encoded.


A key MUST NOT be percent-encoded.

I think this is wrong, and against other existing spec

purl-spec/PURL-SPECIFICATION.rst

Lines 245 to 248 in 8040ff0

It is OK to percent-encode ``purl`` components otherwise except for the ``type``.

Parsers and builders must always percent-decode and percent-encode ``purl``

components and component segments as explained in the "How to parse" and "How to

build" sections.

I think a key can be percent-encoded. and at some points, it must be percent encoded.
Otherwise, if percent-encoding MUST NOT happen, then i could not choose certain keys.
examples: foo&bar[]=bazz -> foo%26bar%5B%5D -- at least the & MUST be percent-encoded.

I might be wrong, please help me understand.

foo&bar[]=bazz is not a valid key according to the preceding rules. The allowed characters shouldn't require percent encoding but I don't see why the spec would forbid percent encoding.

The current spec already spells this out clearly, so this is not changing anything:
- A ``key`` must NOT be percent-encoded

Now since the % sign is not allowed in a key, technically, this is implied already and we could remove this sentence entirely. I like this to be explicit though.

Given the comments thus far ^, I'm keeping as is unless/until I hear otherwise.

matt-phylum · 2025-02-26T13:14:27Z

PURL-SPECIFICATION.rst

+      - The ``value`` MUST be composed only of the following characters, encoded
+        as described below and in keeping with RFC 3986 Section 2.2:


This is confusing and self contradictory. "The value MUST be composed only of the following characters", but then the following text defines a set containing all characters.

Thanks @matt-phylum -- I don't think it's confusing but in any event the revised update will be simplified and clarified.

matt-phylum · 2025-02-26T13:19:46Z

PURL-SPECIFICATION.rst

+
+           .. code-block:: none
+
+               '!', '$', '&', ''', '(', ')', '*', '+', ',', ';', '='


The previous item lists characters that should not be encoded (MUST NOT for canonical PURLs), but this item lists some characters that MUST be encoded and some characters that should not be encoded together.

Thanks, will look forward to your comments on the revised update once I've pushed it.

matt-phylum · 2025-02-26T13:37:24Z

PURL-SPECIFICATION.rst

+      - The ``value`` MUST be composed only of the following characters, encoded
+        as described below and in keeping with RFC 3986 Section 2.2:
+
+          "If data for a URI component would conflict with a reserved character's
+          purpose as a delimiter, then the conflicting data must be percent-encoded
+          before the URI is formed."
+          https://datatracker.ietf.org/doc/html/rfc3986#section-2.2
+
+        1. **All US-ASCII characters defined as "unreserved"** in RFC 3986 Section 2.3
+           (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3):
+
+           .. code-block:: none
+
+               'A'-'Z', 'a'-'z', '0'-'9', '-', '.', '_', '~'
+
+           - These 66 characters do not need to be percent-encoded.
+
+        2. **All US-ASCII characters defined as "sub-delims"**, a subset of
+           the "reserved" characters, in RFC 3986 Section 2.2
+           (https://datatracker.ietf.org/doc/html/rfc3986#section-2.2):
+
+           .. code-block:: none
+
+               '!', '$', '&', ''', '(', ')', '*', '+', ',', ';', '='
+
+           - The '&' MUST be percent-encoded to avoid being incorrectly parsed
+             as a separator between multiple key-value pairs. See "How to parse
+             a purl string in its components" ("Split the qualifiers on '&'.
+             Each part is a key=value pair").
+
+           - The other 10 characters do not need to be percent-encoded, including
+             the '=' -- the parser will not mistakenly treat a '=' in the value
+             as a separator because it splits each key-value pair just once,
+             from left-to-right, on the first '=' it encounters, and thus there
+             is no conflict:
+
+             .. code-block:: none
+
+                 - For each pair, split the key=value once from left on '=':
+                   - The key is the lowercase left side
+                   - The value is the percent-decoded right side
+
+        3. **Four additional US-ASCII characters** identified in the "query"
+           definition in RFC 3986 Section 3.4 (https://datatracker.ietf.org/doc/html/rfc3986#section-3.4)
+           and the "pchar" definition in RFC 3986 Appendix A (https://datatracker.ietf.org/doc/html/rfc3986#appendix-A):
+
+           .. code-block:: none
+
+               ':', '@', '/', '?'
+
+           - The '?' MUST be percent-encoded to avoid being incorrectly parsed
+             as a ``qualifiers`` separator -- in the right-to-left parsing
+             (see "How to parse a purl string in its components"), an unencoded
+             '?' in the ``value`` would be the first '?' encountered by the
+             parser and incorrectly treated as a ``qualifiers`` separator.
+
+           - The other three characters do not need to be percent-encoded.
+
+        4. **All other US-ASCII characters**.
+
+           .. code-block:: none
+
+               - 33 control characters (ASCII codes 0-31 and 127)
+
+               - the 14 US-ASCII characters not covered in the preceding groups of US-ASCII characters:
+
+                 ' ' [space], '"', '#', '%', '<', '>', '[', '\', ']', '^', '`', '{', '|', '}'
+
+           - Each of these 47 US-ASCII characters MUST be percent-encoded.
+
+        5. **Any character that is not a US-ASCII character**
+           (i.e., characters with ASCII code > 127 and non-ASCII characters).
+
+           - All such characters MUST be UTF-8 encoded and then percent-encoded.


This is way too complicated to be one bullet point in the qualifiers section.

The encoding rules are:

The ASCII control characters 0 through 31 and 127 and any non-ASCII character greater than or equal to 128 MUST always be encoded in any component of a PURL.

% which MUST always be encoded in any percent encoded string.

The additional characters " < > SHOULD always be encoded in any component of a PURL.

The additional characters # @ ? MUST be encoded for qualifier values.

Plus should also be encoded to avoid interoperability problems: #261

Will be clarified in the update.

matt-phylum · 2025-02-26T13:38:34Z

PURL-SPECIFICATION.rst

+             '?' in the ``value`` would be the first '?' encountered by the
+             parser and incorrectly treated as a ``qualifiers`` separator.
+
+           - The other three characters do not need to be percent-encoded.


This contradicts preexisting rules:

purl-spec/PURL-SPECIFICATION.rst

Lines 238 to 241 in 8040ff0

- the '@' ``version`` separator must be encoded as ``%40`` elsewhere

- the '?' ``qualifiers`` separator must be encoded as ``%3F`` elsewhere

- the '=' ``qualifiers`` key/value separator must NOT be encoded

- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere

Will be clarified in the update.

matt-phylum · 2025-02-26T13:50:29Z

PURL-SPECIFICATION.rst

+  - ``purl`` parsers MUST return an error when the ``key`` or ``value`` contains
+    a prohibited character.


This is incorrect and incompatible with the preexisting spec.

In the unescaped form, no characters are prohibited, so you could have a valid PURL pkg:generic/name?key=%00 and the parser must handle this (or maybe it returns an error because it's string type can't represent null characters).

In the escaped form, some characters are escaped to avoid problems unrelated to the URL. http://example.com/?key=a "value" and http://example.com/?key=a%20%22value%22 are the same, but if you write Go to http://example.com/?key=a "value" or <a href="http://example.com/?key=a "value""> it doesn't work. At least I'm pretty sure why it's done. These characters have no meaning to the URL or PURL parsers so whether they are escaped or not doesn't make a difference to the parsing result.

Seems to me that inclusion of a prohibited character could only be "normalized" by removing/discarding such a character, which does not sound to me like mere normalization. I look to feedback from @pombredanne and others on this point.

Maybe I'm misunderstanding. There are no characters that are prohibited in the unencoded form of a qualifier value. In the encoded form, there are characters that are listed above as requiring escaping that don't actually require escaping in order for the PURL to be parsed successfully. If it can be unambiguously parsed and the result can be formatted into a PURL with the same meaning, I don't think that should be a "MUST return an error" condition. The characters do not need to be removed or discarded.

Hmmm. I must admit that I don't yet have a good understanding of when a violation of the spec should be normalized (allowing the parsing process to continue) vs. treating a spec violation as invalid (i.e., "is_invalid": true, raising an error/exception and thus halting the parsing process).

I'd be interested in reading any authorities/resources you could point me to on this important topic -- and in all events I'll happily defer to whatever approach you, @pombredanne, @jkowalleck, @mprpic and others in the community ultimately agree to.

matt-phylum · 2025-02-26T13:53:08Z

PURL-SPECIFICATION.rst

+
+                 ' ' [space], '"', '#', '%', '<', '>', '[', '\', ']', '^', '`', '{', '|', '}'
+
+           - Each of these 47 US-ASCII characters MUST be percent-encoded.


This is a change to the PURL canonicalization rules and will cause problems for anyone who is using canonicalized PURLs as keys.

Will be clarified in the update.

johnmhoran · 2025-02-26T20:15:17Z

@jkowalleck @matt-phylum Thanks very much for your comments. After touching base with @pombredanne, I'm going to give some thought to a substantially simplified approach re permitted characters and required/prohibited encoding and will update this PR accordingly. This will include:

a simpler encoding rule:
- the unreserved characters MUST NOT be percent-encoded and
- all other ASCII and non-ASCII characters MUST be percent-encoded when not used as separators/delimiters.
a reference for guidance to the relevant Wikipedia page(s) rather than RFC 3986 to keep things simple.

This next draft will be similar to the language I proposed early last week in the "Remove section on Character Encoding" PR -- see #389 (comment)

pombredanne

Thanks!
Here is some partial feedback. You may also want to consider moving the encoding details to an encoding section after all, that will be referenced by other components? I wonder also if we can streamline this section, as basically all non "unreserved" ASCII chars should be percent encoded

PURL-SPECIFICATION.rst

pombredanne · 2025-02-26T18:06:43Z

PURL-SPECIFICATION.rst

+  - The '?' separator is not part of the ``qualifiers`` component.
+  - The ``qualifiers`` component is a query string composed of one or more ``key=value``
+    pairs, each of which is separated by an ampersand '&'. A ``key`` and ``value``
+    are separated by the equal '=' character.


Suggested change

are separated by the equal '=' character.

- A ``key`` and ``value`` MUST be separated by the equal '=' character.

Change made.

PURL-SPECIFICATION.rst

pombredanne · 2025-02-26T18:15:34Z

PURL-SPECIFICATION.rst

+  - The '&' separator is not part of the ``key`` or the ``value``.
+  - The '=' separator is not part of the ``key`` or the ``value``.
+  - The ``key`` MUST be unique among the keys of the ``qualifiers`` string.
+  - The ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``


Suggested change

- The ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``

- A ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``

pombredanne · 2025-02-26T18:24:22Z

PURL-SPECIFICATION.rst

+
+      - The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
+        '_' (period, dash and underscore).
+      - A ``key`` MUST start with an ASCII letter.


I think this is a change I like and that makes sense, and I surmise that there are no sane qualifiers out there that do start with _, -, or .. For the sake of backward compatibility, we could either:

keep the original, updated: - A ``key`` MUST NOT start with a number.

keep this improved form: - A ``key`` MUST start with an ASCII letter.

Unless there is a strong group of PURLs in the wild that exist, I would prefer the improved form.

pombredanne · 2025-02-26T18:30:57Z

PURL-SPECIFICATION.rst

+      - The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
+        '_' (period, dash and underscore).
+      - A ``key`` MUST start with an ASCII letter.
+      - A ``key`` MUST NOT be percent-encoded.


The current spec already spells this out clearly, so this is not changing anything:
- A ``key`` must NOT be percent-encoded

Now since the % sign is not allowed in a key, technically, this is implied already and we could remove this sentence entirely. I like this to be explicit though.

Reference: #382 Signed-off-by: John M. Horan [email protected]

johnmhoran · 2025-02-28T23:59:19Z

@pombredanne @jkowalleck @matt-phylum and other colleagues: As long as we're still trying to figure out what we intend the rules to permit, require, prohibit etc., what do you all think about this language I proposed a few weeks ago in PR 389? This could fit in the Character encoding section.

A permitted character (as defined for each purl component in the "Rules for each purl component" section) MUST be percent-encoded unless it is:

(1) an Unreserved Character as defined in RFC 3986 section 2.3 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3) or

(2) expressly defined in this PURL-SPECIFICATION.rst as

    (a) a purl separator (and only when used as such a separator) or

    (b) a permitted character in one or more specifically identified purl components.

Reference: #382 Signed-off-by: John M. Horan <[email protected]>

johnmhoran · 2025-03-03T01:32:59Z

@pombredanne @jkowalleck @matt-phylum @mprpic I've just pushed an update with a simplified "qualifiers" subsection and a revised "Character encoding" section. Many of the "MUST"s etc. are mere placeholders (e.g., purl parsers MUST return an error when the key or value contains a prohibited character) until we reach agreement on what we want the various rules to provide.

(Please don't hesitate to include proposed language where you see a need for corrections, clarifications and the like.)

Update 'qualifiers' rule in core spec #382

1c1fb31

Reference: #382 Signed-off-by: John M. Horan <[email protected]>

johnmhoran added PURL core specification Format and syntax that define PURL (excludes PURL type definitions) PURL qualifiers component PURL encoding Ecma specification Work on the core specification labels Feb 26, 2025

jkowalleck reviewed Feb 26, 2025

View reviewed changes

matt-phylum reviewed Feb 26, 2025

View reviewed changes

pombredanne requested changes Feb 26, 2025

View reviewed changes

Merge branch 'main' into 382-update-qualifiers-component

b621d65

Reference: #382 Signed-off-by: John M. Horan [email protected]

Simplify 'qualifiers', revise 'Character encoding' #382

17685b2

Reference: #382 Signed-off-by: John M. Horan <[email protected]>

johnmhoran requested review from jkowalleck, matt-phylum and pombredanne March 3, 2025 16:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update 'qualifiers' rule in core spec #382 #398

Update 'qualifiers' rule in core spec #382 #398

johnmhoran commented Feb 26, 2025

jkowalleck Feb 26, 2025

pombredanne Feb 26, 2025

johnmhoran Feb 27, 2025

jkowalleck Feb 26, 2025

matt-phylum Feb 26, 2025

pombredanne Feb 26, 2025

johnmhoran Feb 27, 2025

matt-phylum Feb 26, 2025

johnmhoran Feb 27, 2025

matt-phylum Feb 26, 2025

johnmhoran Feb 27, 2025

matt-phylum Feb 26, 2025 •

edited

Loading

johnmhoran Feb 27, 2025

matt-phylum Feb 26, 2025

johnmhoran Feb 27, 2025

matt-phylum Feb 26, 2025

johnmhoran Feb 27, 2025

matt-phylum Feb 27, 2025

johnmhoran Feb 27, 2025

matt-phylum Feb 26, 2025

johnmhoran Feb 27, 2025

johnmhoran commented Feb 26, 2025

pombredanne left a comment

pombredanne Feb 26, 2025

johnmhoran Feb 27, 2025

pombredanne Feb 26, 2025

johnmhoran Feb 27, 2025

pombredanne Feb 26, 2025

pombredanne Feb 26, 2025

johnmhoran commented Feb 28, 2025

johnmhoran commented Mar 3, 2025

	It is OK to percent-encode ``purl`` components otherwise except for the ``type``.
	Parsers and builders must always percent-decode and percent-encode ``purl``
	components and component segments as explained in the "How to parse" and "How to
	build" sections.

		- The ``value`` MUST be composed only of the following characters, encoded
		as described below and in keeping with RFC 3986 Section 2.2:


		.. code-block:: none

		'!', '$', '&', ''', '(', ')', '*', '+', ',', ';', '='

	- the '@' ``version`` separator must be encoded as ``%40`` elsewhere
	- the '?' ``qualifiers`` separator must be encoded as ``%3F`` elsewhere
	- the '=' ``qualifiers`` key/value separator must NOT be encoded
	- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere

		- ``purl`` parsers MUST return an error when the ``key`` or ``value`` contains
		a prohibited character.


		' ' [space], '"', '#', '%', '<', '>', '[', '\', ']', '^', '`', '{', '\|', '}'

		- Each of these 47 US-ASCII characters MUST be percent-encoded.

	are separated by the equal '=' character.
	- A ``key`` and ``value`` MUST be separated by the equal '=' character.

	- The ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``
	- A ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``

Update 'qualifiers' rule in core spec #382 #398

Are you sure you want to change the base?

Update 'qualifiers' rule in core spec #382 #398

Conversation

johnmhoran commented Feb 26, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matt-phylum Feb 26, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

johnmhoran commented Feb 26, 2025

pombredanne left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

johnmhoran commented Feb 28, 2025

johnmhoran commented Mar 3, 2025

matt-phylum Feb 26, 2025 •

edited

Loading