Clarify wording in spec for character groups #604

wooorm · 2019-09-10T19:09:39Z

Problem

The part of the spec that defines character groups uses labels that are inconsistent: there’s whitespace and there’s unicode whitespace, compared to ascii punctuation and punctuation.

The rest of the spec uses additional groups, or potentially confusing names. For example, taking the leaf blocks:

thematic breaks can contain (not at the start) spaces and tabs but may not contain non-whitespace characters except for the marker, leaving room for non-space-or-tab whitespace?
ATX headings must have a space character after the opening #, but leading and trailing whitespace is stripped from the content, so #\talpha is invalid but # \talpha is fine?
Setext heading underlines can have trailing spaces, but not other trailing whitespace?
Link reference definition labels can include any number of line endings (see “Newlines inside link labels?”, #586)
Blank lines only include spaces or tabs, not other whitespace

Maybe there are very good reasons for those, but I feel that (a) as someone implementing the spec, it would help to streamline the names that are used, and (b) as a user, it would help to have fewer different types of whitespace.

Solution

The unicode groups (“unicode whitespace” and “punctuation”) are only referenced for emphasis/importance. Maybe they can be moved down? That would make it more clear that whitespace / punctuation is about ASCII, as defined “above”.

Maybe it’s also a good idea to not include line endings in whitespace. There are several cases where “whitespace” is used, but line endings cannot occur (e.g., GH-586, although the rest of the link reference definition spec is very good at mentioning that one line ending is allowed).
That way the spec can be explicit about “space or tab characters”, “whitespace”, “white space or line endings”, etc.

If this is of interest, I can work on this!

jgm · 2019-09-11T04:05:06Z

The unicode groups (“unicode whitespace” and “punctuation”) are only referenced for emphasis/importance. Maybe they can be moved down?

You mean move these to the Emphasis section? I'm not sure. Actually I think there's some advantage having Unicode Whitespace defined in the same place as Whitespace; that helps you see that there is a distinction being made between them.

Maybe it’s also a good idea to not include line endings in whitespace.

Maybe so. But one would have to be very careful not to break anything in making this change. It might also be worth treating FormFeed specially, and not making it whitespace -- I don't remember if there was a solid reason for that.

#\talpha is invalid but # \talpha is fine?

Well, actually you can use a tab there. Tabs are covered by this passage:

Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

Since this is a context where whitespace helps define block structure, the tab acts as if it were expanded to spaces. Now, I agree, it would be much better if the spec were much more explicit about tabs, throughout. The reason it isn't is historical: originally we assumed a preprocessed source in which tabs had already been expanded to spaces. #386 recommends an overhaul of the spec, being explicit about tabs instead of relying on this passage. If you're interested in doing that, it would be welcome.
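
As a rough illustration of the quoted passage, here is a minimal Python sketch of how the width of a tab depends on the column where it starts. The helper name `virtual_width` and the `TAB_STOP` constant are made up for this example; they are not taken from the spec or from any implementation.

```python
TAB_STOP = 4  # tab stop named in the quoted passage


def virtual_width(run, start_col=0):
    """Return how many columns a run of spaces/tabs covers.

    `run` may only contain spaces and tabs; `start_col` is the
    zero-based column at which the run begins.
    """
    col = start_col
    for ch in run:
        if ch == "\t":
            # A tab advances to the next multiple of TAB_STOP.
            col += TAB_STOP - (col % TAB_STOP)
        elif ch == " ":
            col += 1
        else:
            raise ValueError(f"not a space or tab: {ch!r}")
    return col - start_col


# In "#\talpha" the tab starts at column 1, right after the "#".
print(virtual_width("\t", start_col=1))
```

On this reading, the tab in #\talpha starts at column 1 and counts as three columns of whitespace, so it can stand in for the space after the opening #.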

Setext heading underlines can have trailing spaces, but not other trailing whitespace?

See above.

thematic breaks can contain (not at the start) spaces and tabs but may not contain non-whitespace characters except for the marker, leaving room for non-space-or-tab whitespace?

Agreed, that's a bit of a wart that could be cleaned up.

wooorm · 2019-09-11T07:05:03Z

You mean move these to the Emphasis section? I'm not sure.

Fine too! And how about updating the names, such as:

ASCII whitespace / Unicode whitespace; ASCII punctuation / Unicode punctuation
Whitespace / Unicode whitespace; punctuation / Unicode punctuation

not include line endings in whitespace

Maybe so. But one would have to be very careful not to break anything in making this change. It might also be worth treating FormFeed specially, and not making it whitespace -- I don't remember if there was a solid reason for that.

Is there a reason line tabulation and form feed are included in whitespace at all? If those weren’t there, it would be easier to disambiguate between “spaces or tabs”, “line endings”, or “whitespace” (being both).
Also: line tab isn’t part of unicode whitespace, but form feed is 🤔
Having line tab and form feed in there also leaves the question whether they can indent things, like tabs.
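
The mismatch between the two groups can be made concrete with a small Python sketch. The two definitions below are written down as I understand them from the spec text under discussion at the time of this thread, so treat them as an assumption rather than a quotation; the names `WHITESPACE` and `is_unicode_whitespace` are made up for the example.

```python
import unicodedata

# Assumed "whitespace character" group: space, tab, newline,
# line tabulation (U+000B), form feed (U+000C), carriage return.
WHITESPACE = {" ", "\t", "\n", "\x0b", "\x0c", "\r"}


def is_unicode_whitespace(ch):
    """Assumed "Unicode whitespace character" group: any code point in
    general category Zs, plus tab, carriage return, newline, form feed."""
    return unicodedata.category(ch) == "Zs" or ch in {"\t", "\r", "\n", "\x0c"}


# Characters that are "whitespace" but not "Unicode whitespace".
only_in_whitespace = sorted(ch for ch in WHITESPACE if not is_unicode_whitespace(ch))
print([f"U+{ord(ch):04X}" for ch in only_in_whitespace])
```

Under these assumed definitions, line tabulation is the only character in the first group that is missing from the second, while form feed is in both.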

Tabs are covered by this passage:

Right, I suspected that, but because other places explicitly name the tab, it leaves room open to wonder what should happen if it isn’t. And line tab and form feed make this more confusing.

Thanks for the context!

jgm · 2019-09-12T00:48:43Z

Is there a reason line tabulation and form feed are included in whitespace at all?

I've been trying to remember that. None I can think of at the moment. I'd be inclined to eject them.

But this would make "ASCII whitespace" problematic, since one might assume this label to apply to all ASCII whitespace characters.

Also: line tab isn’t part of unicode whitespace, but form feed is

Not sure why. That seems irrational.

Anyway, I think a general cleanup in the area would make sense, but it should include the space/tab issue noted above.

wooorm · 2019-09-12T08:23:10Z

To summarise, are we agreeing on:

Use the names whitespace / Unicode whitespace; punctuation / Unicode punctuation
Whitespace would be spaces, tabs, carriage return, line feed
We revisit every place that space / tab / whitespace is used and carefully decide what can be used: space characters, space characters or tab characters, or whitespace characters (potentially including up to X line endings)

I can work on it.

Anyway, I think a general cleanup in the area would make sense, but it should include the space/tab issue noted above.

Are you talking about solving GH-386 together with the above summary, or..?

I’d also like to suggest using a separate word for “space” if it’s about the expanded size. E.g., take an ATX heading (“The opening # character may be indented 0-3 spaces”) in combination with a block quote (“the character > together with a following space”), as in >\t# Alpha, where 1 “space” of the tab is used for the blockquote, and three for the heading.
Perhaps (a) space size / space or (b) space / space character?

jgm · 2019-09-12T15:18:15Z

I'm not agreed on "ASCII punctuation" -> "punctuation." I think specifying ASCII is important; too many people will misunderstand if it's not explicit there.

As for "whitespace", it's not the best word, but "ASCII whitespace" also seems wrong if we're excluding e.g. FF. I'm not sure it's bad just to take it as a technical term.

carefully decide what can be used

I think in most cases this is already decided; the work would mainly be replacing talk of spaces with talk of space or tabs, and this could get difficult or confusing in cases where only part of a tab may be used (list indentation for example, or #386).

Crissov · 2019-09-13T06:17:48Z

“basic whitespace” := whitespace characters in US-ASCII / ISO 646 / C0 / Basic Latin block? They are all in Unicode and in almost all other encodings.
“whitespace” := every character with certain Unicode properties.

wooorm · 2019-10-01T18:16:04Z

Rethinking this, I now prefer to be explicit: ASCII punctuation and Unicode punctuation.

For whitespace, if we have Unicode whitespace, and line endings, and are dropping line tab and form feed from whitespace, that only leaves spaces and tabs. In which case we can also be explicit everywhere whether just spaces, or spaces and tabs, (or even line endings) are allowed. This would resolve the whitespace issue.
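
A minimal Python sketch of what such explicit groups could look like as predicates follows. All function names are hypothetical, and the "Unicode punctuation" rule (ASCII punctuation plus anything in a Unicode `P*` general category) is my assumption about the spec's definition at the time, not a quotation from it.

```python
import string
import unicodedata


def is_space(ch):
    return ch == " "


def is_space_or_tab(ch):
    return ch in {" ", "\t"}


def is_line_ending_char(ch):
    # A line ending is LF, CR, or CR followed by LF; this checks one character.
    return ch in {"\n", "\r"}


def is_ascii_punctuation(ch):
    # Python's string.punctuation is the set of 32 ASCII punctuation characters.
    return ch in set(string.punctuation)


def is_unicode_punctuation(ch):
    # Assumption: ASCII punctuation, or any general category starting with "P".
    return is_ascii_punctuation(ch) or unicodedata.category(ch).startswith("P")


# Form feed no longer falls into any of the whitespace groups sketched here.
print(is_space_or_tab("\x0c"), is_line_ending_char("\x0c"))
```

With groups this small, each rule in the spec can name exactly which of them it accepts instead of relying on a catch-all “whitespace”.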

jgm · 2019-10-01T18:31:59Z

Sounds reasonable to me.

wooorm · 2020-05-23T17:49:49Z

Closed by GH-618.

wooorm mentioned this issue Oct 2, 2019

Clarify wording in spec for character groups #618

Merged

wooorm closed this as completed May 23, 2020
