Emphasis and East Asian text #208
Thanks for doing this! I think we could simplify this considerably by defining "punctuation character" (for purposes of the spec) so that it simply excludes East Asian punctuation characters. This would really simplify the clauses in the spec for emphasis, since we'd avoid complicated logical constructions like (punctuation and not east asian). It would also make the code slightly more efficient (one test rather than two -- though perhaps the compiler is smart enough to optimize away this difference). What do you think?
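Thanks for the comment.
I thought the same at first, but such a modification could not handle many cases using underscores (_). Anyway, for EA writers it is a fact that EA punctuation should be handled differently from Western punctuation. Another point is that some punctuation marks are shared between EA and Western text, e.g. “, ”. They cannot be excluded.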
+++ IKEDA Soji [Jun 27 17 07:19 ]:
>> I think we could simplify this considerably by defining "punctuation
>> character" (for purposes of the spec) so that it simply excludes
>> East Asian punctuation characters.
>> This would really simplify the clauses in the spec for emphasis,
>> since we'd avoid complicated logical constructions like (punctuation
>> and not east asian).
>
> I thought the same at first, but such modification could not handle
> many cases using underscores (_). Anyway, for EA writers it is real
> that EA punctuations should be handled in different way from Western
> ones.

Can you give a specific example of a case where you think
what I suggest wouldn't work? I think I can do it in a way
that is logically equivalent to yours, but simpler both
in the spec and the program.

> Another point is that some punctuations are shared among EA and
> Western, e.g. “, ”. They cannot be excluded.

Yes, the idea would be to define 'punctuation character'
to include these but exclude east-asian-only punctuation.
Ok here. In following texts,

Example 1

    猫は*「のどか」*という。
    猫は_「のどか」_という。

Current master:

    <p>猫は*「のどか」*という。</p>
    <p>猫は_「のどか」_という。</p>

Excluding EA punctuations:

    <p>猫は<em>「のどか」</em>という。</p>
    <p>猫は_「のどか」_という。</p>

Expected (with this PR):

    <p>猫は<em>「のどか」</em>という。</p>
    <p>猫は<em>「のどか」</em>という。</p>

Example 2

    猫は*「のどか」*という。犬は*名がない*。
    猫は_「のどか」_という。犬は_名がない_。

Current master:

    <p>猫は*「のどか」<em>という。犬は</em>名がない*。</p>
    <p>猫は_「のどか」<em>という。犬は_名がない</em>。</p>

Excluding EA punctuations:

    <p>猫は<em>「のどか」</em>という。犬は<em>名がない</em>。</p>
    <p>猫は_「のどか」_という。犬は_名がない_。</p>

Expected (with this PR):

    <p>猫は<em>「のどか」</em>という。犬は<em>名がない</em>。</p>
    <p>猫は<em>「のどか」</em>という。犬は_名がない_。</p>
Excluding these from Western punctuation will not affect Western text, because a space before/after punctuation is ordinary in Western texts:

    The␣cat␣is␣named␣*“Nodoka”*.

On the other hand, including them in EA punctuation will help with formatting EA text, because spaces before/after punctuation are unnatural in EA texts:

    猫は*“のどか”*という。
    猫は␣*“のどか”*␣という。 --- unnatural

So I think it would be better for them to belong to EA punctuation.
Just checking back in here; do we think we might be able to move forward with the suggestion in this PR?
Of course I agree. Please let me know if there is anything I should do.
489926e
to
9a24044
Compare
Hi, any updates on this PR? I think lots of projects are waiting for the update in upstream. :) Thanks!
Not always. Examples:
@jgm, in this PR the interaction between punctuation and emphasis matters. Are your examples affected (I haven’t confirmed)?
My point was just that there might be unexpected consequences to treating these characters like non-punctuation, and that it isn't the case that they're never flanked by punctuation characters. It's hard to survey ahead of time all the cases that might arise, but here's one for concreteness:
If the double quotes get treated as non-punctuation for purposes of determining flankingness, then the final |
My PR does not treat LEFT/RIGHT DOUBLE QUOTATION MARKs as non-punctuations, but treats them as EA punctuations. In fact, even if my modification was applied:
Sorry for the misunderstanding.

    left_flanking = numdelims > 0 && !cmark_utf8proc_is_space(after_char) &&
                    (!cmark_utf8proc_is_punctuation(after_char) ||
                     cmark_utf8proc_is_eastasian_punctuation(after_char) ||
                     cmark_utf8proc_is_space(before_char) ||
                     cmark_utf8proc_is_punctuation(before_char));
    right_flanking = numdelims > 0 && !cmark_utf8proc_is_space(before_char) &&
                     (!cmark_utf8proc_is_punctuation(before_char) ||
                      cmark_utf8proc_is_eastasian_punctuation(before_char) ||
                      cmark_utf8proc_is_space(after_char) ||
                      cmark_utf8proc_is_punctuation(after_char));

Simplifying a bit (EDIT: sorry, first version was completely wrong):

Left flanking:
Right flanking:
The effect of this part of the rule is to make it strictly easier to count as left-flanking and right-flanking, in the cases where a left-flanking run is followed by EA punctuation or a right-flanking run is preceded by EA punctuation. So there won't be examples of the sort I was trying to give, where your rule fails to count something as left- or right-flanking that the original rule does.

Your rule may, however, count some delimiter runs as BOTH left and right flanking where the original rule only has one flankingness. To deal with that, you also modify the rules for "can open" and "can close". The current rule says that a delimiter run that is both left and right flanking can open emphasis when the before char is punctuation. Your rule loosens that up to: when the before char is punctuation or the after char is EA punctuation. This ensures that, in every case where your rule makes a formerly left and not-right flanking delimiter run both left and right flanking, if it could open/close emphasis before it will still be able to open/close emphasis.

However, there could still be changes due to the fact that it could now close emphasis when it couldn't before. So, one kind of example to look for is a case where a delimiter run that formerly could only open emphasis can now both open and close, and gives bad results for that reason. I will think about whether there are realistic examples of this sort.

But, just to make a general comment, one thing I dislike about the proposed change is that it makes an already fairly complicated rule, which I could (barely) keep in my head, even more complicated and hard to think about. That is the reason I've found it difficult to get convinced that this change should be made. It's not by itself a reason to reject the change, but I haven't yet been convinced that the change won't have unanticipated consequences.
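For anyone who wants to poke at the rule, here is a minimal C sketch of the flanking test quoted earlier in this thread, together with the loosened can-open/can-close condition for `_` as described in this comment. It is not cmark's code. The classifier functions are tiny hypothetical stand-ins for the `cmark_utf8proc_*` helpers, the `numdelims > 0` check is left out (a non-empty run is assumed), the `ea` flag is only an illustration device for switching between the current rule and the proposed one, and the closing condition is assumed to mirror the opening one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy classifiers (assumptions; far smaller than real Unicode tables). */
static bool is_space(uint32_t c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' ||
           c == 0x00A0 || c == 0x3000;
}

/* A few East Asian marks, plus U+201C/U+201D, which the PR treats as EA. */
static bool is_ea_punct(uint32_t c) {
    return (c >= 0x3001 && c <= 0x3003) ||  /* 、 。 〃 */
           (c >= 0x3008 && c <= 0x3011) ||  /* 〈 〉 《 》 「 」 『 』 【 】 */
           c == 0x201C || c == 0x201D;      /* “ ” */
}

/* ASCII punctuation plus the East Asian marks above. */
static bool is_punct(uint32_t c) {
    return (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
           (c >= 0x5B && c <= 0x60) || (c >= 0x7B && c <= 0x7E) ||
           is_ea_punct(c);
}

/* ea = false: current rule; ea = true: adds the PR's extra disjunct. */
bool left_flanking(uint32_t before, uint32_t after, bool ea) {
    return !is_space(after) &&
           (!is_punct(after) || (ea && is_ea_punct(after)) ||
            is_space(before) || is_punct(before));
}

bool right_flanking(uint32_t before, uint32_t after, bool ea) {
    return !is_space(before) &&
           (!is_punct(before) || (ea && is_ea_punct(before)) ||
            is_space(after) || is_punct(after));
}

/* '_' opens if left-flanking and either not right-flanking, or preceded
 * by punctuation, or (proposed rule only) followed by EA punctuation. */
bool underscore_can_open(uint32_t before, uint32_t after, bool ea) {
    return left_flanking(before, after, ea) &&
           (!right_flanking(before, after, ea) || is_punct(before) ||
            (ea && is_ea_punct(after)));
}

/* Mirror image of the opening condition (an assumption, see lead-in). */
bool underscore_can_close(uint32_t before, uint32_t after, bool ea) {
    return right_flanking(before, after, ea) &&
           (!left_flanking(before, after, ea) || is_punct(after) ||
            (ea && is_ea_punct(before)));
}

/* Walks through the delimiters of 猫は_「のどか」_という。 (Example 1). */
void check_example_1(void) {
    const uint32_t HA = 0x306F, LQ = 0x300C, RQ = 0x300D, TO = 0x3068;
    assert(!left_flanking(HA, LQ, false));      /* current rule: literal */
    assert(left_flanking(HA, LQ, true));        /* proposed rule */
    assert(underscore_can_open(HA, LQ, true));
    assert(!right_flanking(RQ, TO, false));
    assert(underscore_can_close(RQ, TO, true));
}
```

In this sketch the run between は and 名 in 犬は_名がない_ is both left- and right-flanking with no punctuation on either side, so `underscore_can_open` stays false under both settings, which is consistent with the expected output listed for Example 2.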
Here's an (admittedly artificial) example where we'd see a difference, if I'm not mistaken:
With the proposed rule, the second
whereas currently we get
Unless I've made a mistake in thinking about it...
Another case:
I'll investigate your simplified rule afterward (but I want to confirm: It is equivalent to my rule, isn't it?).
What is the reason for the "unique flankingness" requirement? To me, flankingness seems to have been introduced only to describe the behavior of the parser (without consideration of EA context).
It is natural that modifying the rules will change behavior. We have to modify the rules if they can't handle texts as we expect. I can't decide whether the changes to existing texts will be acceptable or not. There seem to be these options:
I'll add corresponding examples with East Asian context. For example,
will be handled by current master properly (Note:
However, example above is a lucky case. Perhaps this sentence is understandable without
will be rendered with current master as:
I think it is hard to accept this result for writers. As a workaround, for example, we might recommend writers to markup such as:
This will be rendered as:
The result is readable, if readers ignore an ugly space. However, it may not be easy to justify forcing writers to insert unusual spaces that do not appear in plain text without markup. Note: My PR will not solve all problems with current master: it cannot handle markup as complex in East Asian context as in Western context. In fact, since the example above is slightly complex, it will be rendered with my PR as:
However, from the viewpoint of East Asian writers, it will improve the current behavior a lot.
Yes, my simplified rephrasing was meant to be equivalent to your proposal. (Just to help me think about it more clearly.)

Thinking outside of the box a bit: instead of having two distinct classes of punctuation characters, would it work to treat East Asian characters in general (including both EA punctuation and EA non-punctuation characters) as equivalent to punctuation for determining flankingness and can-open/can-close? That is: the rules would all be the same as they are, except that "punctuation" would be interpreted as including Western punctuation characters plus ALL EA characters. (Obviously, one might want a better name for this broad class than "punctuation," but that's a detail.)

This would keep the simpler logic of the current rules, and it would guarantee that nothing changes in the interpretation of Western texts.
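A rough C sketch of this "broad class" idea may help when testing it against the examples in this thread. Everything here is an assumption made for illustration: `is_east_asian` is a hypothetical, very coarse range check, `is_western_punct` covers ASCII only, and the flanking and `_` rules are the current ones with the broad class substituted for "punctuation".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_space(uint32_t c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' ||
           c == 0x00A0 || c == 0x3000;
}

/* ASCII punctuation only (assumption; real code would use Unicode data). */
static bool is_western_punct(uint32_t c) {
    return (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
           (c >= 0x5B && c <= 0x60) || (c >= 0x7B && c <= 0x7E);
}

/* Hypothetical, coarse notion of "East Asian character". */
static bool is_east_asian(uint32_t c) {
    return (c >= 0x3001 && c <= 0x30FF) ||  /* CJK punctuation, kana */
           (c >= 0x4E00 && c <= 0x9FFF) ||  /* CJK unified ideographs */
           (c >= 0xAC00 && c <= 0xD7A3);    /* Hangul syllables */
}

/* The broad class that replaces "punctuation" in the existing rules. */
static bool is_broad_punct(uint32_t c) {
    return is_western_punct(c) || is_east_asian(c);
}

bool left_flanking(uint32_t before, uint32_t after) {
    return !is_space(after) &&
           (!is_broad_punct(after) || is_space(before) ||
            is_broad_punct(before));
}

bool right_flanking(uint32_t before, uint32_t after) {
    return !is_space(before) &&
           (!is_broad_punct(before) || is_space(after) ||
            is_broad_punct(after));
}

/* Current '_' rule, unchanged except for the broad class. */
bool underscore_can_open(uint32_t before, uint32_t after) {
    return left_flanking(before, after) &&
           (!right_flanking(before, after) || is_broad_punct(before));
}

/* Checks the delimiters of 猫は*「のどか」*という。 and a Western run. */
void check_broad_class(void) {
    assert(left_flanking(0x306F, 0x300C));   /* run between は and 「 */
    assert(right_flanking(0x300D, 0x3068));  /* run between 」 and と */
    assert(left_flanking(' ', 'b'));         /* opening run of " *b*" */
    assert(!right_flanking(' ', 'b'));
}
```

One consequence visible in the sketch: a run that sits between two East Asian characters is both left- and right-flanking, and `underscore_can_open` is true for it, so `_` between は and 名 would be allowed to open emphasis under this reading.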
Just wondering, is there any progress on this? All CJK projects based on CommonMark have been stuck on it for years.
Maybe this issue can be seen better from a different perspective. At least I have always found using the left-flanking and right-flanking terms confusing and I always easily got lost in them when thinking about some particular complicated input example. Eventually I started to use in my head an alternative wording which (I believe) is 100%-equivalent to the current specs wording. It may be spelled as follows:
(If you prefer code, MD4C uses this alternative wording internally.) I post this because it might be easier to come up with a solution in this wording, if we just add more rules to the score calculations above. IMHO, it could perhaps even solve the issue with the ambiguous punctuation noted in earlier comments. E.g. something like
At least, it can easily be seen that this wouldn't change anything for Western text, and the people who (unlike me) understand EA languages and their needs can play it safe as long as they propose rules which require EA characters on both sides of the run. Divide et impera.
Although this PR works for Japanese and Chinese text (please note that Korean text uses "Western" punctuation marks), it does not solve a related but slightly different issue in Korean text reported here (github/javascript-tutorial, #2040). Koreans expect This Korean-text issue may be resolved by adding one more condition to @jgm's simple rule in this comment: Right flanking:
although it will break nested emphases more severely. By the way, I think a better way to solve CJK-related emphasis issues is to introduce a new syntax |
It seems that the issue on emphasizing Korean texts has not been reported before. I posted this issue in https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491 as a comment.
Sorry I haven't had time to think over this issue.
…mmar (#16)

Chinese and Japanese content usually do _not_ include spaces between formatted and unformatted segments of a single phrase, such as `**{value}**件の投稿`. But this is technically not valid `strong` formatting according to the CommonMark spec, since the right flank of the ending delimiter is a non-space Unicode character. See more information in the CommonMark discussion here:
https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491/5
commonmark/cmark#208

Because this library is explicitly intended to support many languages including most Asian languages, we are adding an extension to the Markdown rules to accommodate these situations.

The following tests assert that the special cases for East Asian languages function in a logically-similar way to Western languages. The tests for this change are pretty small, as I'm not fluent in anything near CJK and have purely gone off of suggestions and forums to enumerate these. Most importantly, `**{value}**件の投稿`, is now treated as a **bold** `value` followed by plain text, rather than being completely ignored.
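The `**{value}**件の投稿` case can be checked by hand against the current CommonMark flanking rule. The sketch below is only an illustration of that rule, not code from this library or from cmark: the classifiers are hypothetical ASCII-only stand-ins, and the start of the line is modelled as a newline, since the spec counts the line boundary as whitespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_space(uint32_t c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* ASCII punctuation only (assumption; enough for '{' and '}'). */
static bool is_punct(uint32_t c) {
    return (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
           (c >= 0x5B && c <= 0x60) || (c >= 0x7B && c <= 0x7E);
}

/* Current CommonMark rule for a left-flanking delimiter run. */
bool left_flanking(uint32_t before, uint32_t after) {
    return !is_space(after) &&
           (!is_punct(after) || is_space(before) || is_punct(before));
}

/* Current CommonMark rule for a right-flanking delimiter run. */
bool right_flanking(uint32_t before, uint32_t after) {
    return !is_space(before) &&
           (!is_punct(before) || is_space(after) || is_punct(after));
}

/* Checks both ** runs in the string **{value}**件の投稿. */
void check_value_placeholder(void) {
    const uint32_t KEN = 0x4EF6;              /* 件 */
    assert(left_flanking('\n', '{'));         /* opening ** at line start */
    assert(!right_flanking('}', KEN));        /* closing ** cannot close */
    assert(right_flanking('5', KEN));         /* **5**件 would close fine */
}
```

The closing run fails because it is preceded by punctuation (`}`) and followed by a character that is neither whitespace nor punctuation. With a plain digit in place of the placeholder, as in `**5**件の投稿`, the same run is right-flanking under the current rule.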
Discussions:
There are three commits:
I realised that the change will introduce some ambiguity, but I don't think it is actually a problem.
Rule 6:
is not
Rule 7:
is not