Clarify wording in spec for character groups #604
Comments
You mean move these to the Emphasis section? I'm not sure. Actually I think there's some advantage having Unicode Whitespace defined in the same place as Whitespace; that helps you see that there is a distinction being made between them.
Maybe so. But one would have to be very careful not to break anything in making this change. It might also be worth treating FormFeed specially, and not making it whitespace -- I don't remember if there was a solid reason for that.
Well, actually you can use a tab there. Tabs are covered by this passage:
Since this is a context where whitespace helps define block structure, the tab acts as if it were expanded to spaces. Now, I agree, it would be much better if the spec were much more explicit about tabs, throughout. The reason it isn't is historical: originally we assumed a preprocessed source in which tabs had already been expanded to spaces. #386 recommends an overhaul of the spec, being explicit about tabs instead of relying on this passage. If you're interested in doing that, it would be welcome.
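As a rough illustration of "the tab acts as if it were expanded to spaces", here is a minimal sketch of how an implementation might measure indentation without rewriting the source. The tab stop of 4 is my reading of the spec; the function name `leading_indent_width` is hypothetical and not taken from any reference implementation.

```python
TAB_STOP = 4  # assumption: tab stop of 4 columns, per my reading of the spec


def leading_indent_width(line: str) -> int:
    """Return the column width of the leading run of spaces and tabs.

    A tab is not replaced in the text; it only advances the column
    counter to the next multiple of TAB_STOP.
    """
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            column += TAB_STOP - (column % TAB_STOP)
        else:
            break
    return column
```

Under this sketch, a single tab and two spaces followed by a tab both count as four columns of indentation, which is why a tab can stand in wherever the spec currently talks only about spaces. It does not handle the harder case mentioned in this thread, where only part of a tab is consumed (list indentation, #386).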
See above.
Agreed, that's a bit of a wart that could be cleaned up.
Fine too! And how about updating the names, such as:
Is there a reason
Right, I suspected that, but because other places explicitly name the tab, it leaves room open to wonder what should happen if it isn’t. And line tab and form feed make this more confusing. Thanks for the context!
I've been trying to remember that. None I can think of at the moment. I'd be inclined to eject them. But this would make "ASCII whitespace" problematic, since one might assume this label to apply to all ASCII whitespace characters.
Not sure why. That seems irrational. Anyway, I think a general cleanup in the area would make sense, but it should include the space/tab issue noted above.
To summarise, are we agreeing on:
I can work on it.
Are you talking about solving GH-386 together with the above summary, or..? I’d also like to suggest using a separate word for “space” if it’s about the expanded size. E.g., take an ATX heading:
I'm not agreed on "ASCII punctuation" -> "punctuation." I think specifying ASCII is important; too many people will misunderstand if it's not explicit there. As for "whitespace", it's not the best word, but "ASCII whitespace" also seems wrong if we're excluding e.g. FF. I'm not sure it's bad just to take it as a technical term.
I think in most cases this is already decided; the work would mainly be replacing talk of spaces with talk of space or tabs, and this could get difficult or confusing in cases where only part of a tab may be used (list indentation for example, or #386).
“basic whitespace” := whitespace characters in US-ASCII / ISO 646 / C0 / Basic Latin block? They are all in Unicode and in almost all other encodings.
Rethinking this, I now prefer to be explicit: ASCII punctuation and Unicode punctuation. For whitespace, if we have Unicode whitespace, and line endings, and are dropping line tab and form feed from whitespace, that only leaves spaces and tabs. In which case we can also be explicit everywhere whether just spaces, or spaces and tabs, (or even line endings) are allowed. This would resolve the whitespace issue.
Sounds reasonable to me.
Closed by GH-618.
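To make the proposed grouping concrete, here is a hedged sketch of the character predicates it implies. All function names are hypothetical. The exact membership of "Unicode whitespace" and "Unicode punctuation" is my assumption based on one reading of the spec (Zs plus tab, line feed, form feed, carriage return; and the Unicode P categories plus ASCII punctuation) and may differ between spec versions.

```python
import string
import unicodedata

# The 32 ASCII punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
ASCII_PUNCTUATION = frozenset(string.punctuation)

# Assumption: control characters counted as Unicode whitespace in addition to Zs.
_UNICODE_WS_CONTROLS = frozenset({"\t", "\n", "\f", "\r"})


def is_space_or_tab(ch: str) -> bool:
    """Only U+0020 and U+0009; what remains once line tab and form feed are dropped."""
    return ch == " " or ch == "\t"


def is_line_ending_char(ch: str) -> bool:
    """A character that can form a line ending (LF, CR, or CR followed by LF)."""
    return ch == "\n" or ch == "\r"


def is_unicode_whitespace(ch: str) -> bool:
    """Any Zs code point, or one of the listed control characters (assumed set)."""
    return ch in _UNICODE_WS_CONTROLS or unicodedata.category(ch) == "Zs"


def is_ascii_punctuation(ch: str) -> bool:
    return ch in ASCII_PUNCTUATION


def is_unicode_punctuation(ch: str) -> bool:
    """ASCII punctuation, or any code point in a Unicode P* category (assumed set)."""
    return is_ascii_punctuation(ch) or unicodedata.category(ch).startswith("P")
```

With explicit names like these, each rule in the spec can say exactly which predicate it relies on, rather than leaving "whitespace" open to interpretation.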
Problem
The part of the spec that defines character groups uses labels that are inconsistent: there’s whitespace and there’s Unicode whitespace, compared to ASCII punctuation and punctuation.
The rest of the spec uses additional groups, or potentially confusing names. For example, taking the leaf blocks:
#, but leading and trailing whitespace is stripped from the content, so #\talpha is invalid but # \talpha is fine?
Maybe there are very good reasons for those, but I feel that (a) as someone implementing the spec, it would help to streamline the names that are used, and (b) as a user, it would help to have fewer different types of whitespace.
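For reference, a minimal sketch of an ATX heading opener check, assuming (as the maintainer's reply in this thread indicates) that a tab is accepted after the opening # sequence just like a space. The pattern and the name `atx_heading_level` are illustrative only, not the reference implementation.

```python
import re

# Up to 3 spaces of indentation, 1 to 6 '#' characters, then either
# at least one space or tab, or the end of the line.
ATX_OPENER = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+|$)")


def atx_heading_level(line: str):
    """Return the heading level (1 to 6), or None if the line is not an ATX heading."""
    match = ATX_OPENER.match(line.rstrip("\r\n"))
    return len(match.group(1)) if match else None
```

Under that assumption both #\talpha and # \talpha are level-1 headings, while #alpha is not a heading at all.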
Solution
The Unicode groups (“Unicode whitespace” and “punctuation”) are only referenced for emphasis/importance. Maybe they can be moved down? That would make it clearer that whitespace / punctuation is about ASCII, as defined “above”.
Maybe it’s also a good idea to not include line endings in whitespace. There are several cases where “whitespace” is used, but line endings cannot occur (e.g., GH-586, although the rest of the link reference definition spec is very good at mentioning that one line ending is allowed).
That way the spec can be explicit about “space or tab characters”, “whitespace”, “white space or line endings”, etc.
If this is of interest, I can work on this!