
Emphasis with CJK punctuation #650

Open · ptmkenny opened this issue May 26, 2020 · 205 comments

ptmkenny commented May 26, 2020

Hi, I encountered some strange behavior when using CJK full-width punctuation and trying to add emphasis.

Original issue here

Example punctuation that causes this issue:

。!?、

To my mind, all of these should work as emphasis, but some do and some don't:

**テスト。**テスト

**テスト**。テスト

**テスト、**テスト

**テスト**、テスト

**テスト?**テスト

**テスト**?テスト

[screenshot: cjk_punctuation_nospace_commonmark]

I'm not sure if this is the spec as intended, but in Japanese, as a general rule there are no spaces in sentences, which leads to the following kind of problem when parsing emphasis.

In English, this is emphasized as expected:

This is **what I wanted to do.** So I am going to do it.

But the same sentence emphasized in the same way in Japanese fails:

これは**私のやりたかったこと。**だからするの。

[screenshot: whatiwanted_markdown_emphasis]

tats-u commented Nov 13, 2023

This and the above issues are caused by the change in #618, which first appears in the 0.30 spec.

https://spec.commonmark.org/0.30/changes

A left-flanking delimiter run is a delimiter run that is (1) not followed by Unicode whitespace, and either (2a) not followed by a Unicode punctuation character, or (2b) followed by a Unicode punctuation character and preceded by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

  1. A single * character can open emphasis iff (if and only if) it is part of a left-flanking delimiter run.
  2. A single * character can close emphasis iff it is part of a right-flanking delimiter run.
  3. A double ** can open strong emphasis iff it is part of a left-flanking delimiter run.
  4. A double ** can close strong emphasis iff it is part of a right-flanking delimiter run.
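
To make the quoted rules concrete, here is a minimal Python sketch of the flanking test under the 0.30 wording (the helper names and the use of the empty string for a line boundary are mine, not from the spec):

import string
import unicodedata

def _is_ws(ch):
    # CommonMark "Unicode whitespace": Zs category, tab, LF, FF, or CR;
    # "" stands in for the beginning/end of the line, which also counts as whitespace
    return ch == "" or ch in "\t\n\x0c\r" or unicodedata.category(ch) == "Zs"

def _is_punct(ch):
    # CommonMark 0.30 "Unicode punctuation": ASCII punctuation or Pc/Pd/Pe/Pf/Pi/Po/Ps
    return ch != "" and (ch in string.punctuation
                         or unicodedata.category(ch) in
                         {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"})

def left_flanking(before, after):
    return (not _is_ws(after)) and (
        not _is_punct(after) or _is_ws(before) or _is_punct(before))

def right_flanking(before, after):
    return (not _is_ws(before)) and (
        not _is_punct(before) or _is_ws(after) or _is_punct(after))

# In 「**テスト。**テスト」 the closing ** has before = '。' (punctuation) and
# after = 'テ' (neither whitespace nor punctuation), so right_flanking('。', 'テ')
# is False and that ** cannot close strong emphasis.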

The definition of left- and right-flanking emphasis for * and ** must use ASCII punctuation characters instead of Unicode ones.

https://v1.mdxjs.com/

does not exhibit this problem, so it is remark, which MDX v2+ depends on, that is affected.

wooorm (Contributor) commented Nov 13, 2023

Again, there is no change in 618. That PR is just about words, terminology.

MDX 1 did not follow CM correctly and had other bugs.

Can you please read what I say, and please stop spamming, and actually contribute?

tats-u commented Nov 13, 2023

MDX 1 did not follow CM correctly and had other bugs.

The extension by MDX is not the culprit.

https://codesandbox.io/s/remark-playground-wmfor?file=/package.json


This problem does not reproduce with remark-parse v7 either.

https://prettier.io/playground/#N4Igxg9gdgLgprEAuEAqVhT00DTmg8qMHYMgZtGBSKoOGmgQAzrEl6A7EYM2xZIANCBAA4wCW0AzsqAEMATkIgB3AArCEfFAIA2YgQE8+LAEZCBYANZwYAZQEBbOABlOUOMgBmCnnA1bd+g222WA5shhCAro4gDsacPv6BPF7ycACKfhDwtvaBAFY8AB4GUbHxiUh28g4sAI65cBKibLIgAjwAtFZwACbNzCC+ApzyXgDCEMbGAsg18vJtkVCe0QCCML6c6n7wEnBCFlZJhYEAFjDG8gDq25zwPO5gcAYyJ5wAbifKw2A8aiC3AQCSUC2wBmBCnA402+BhgymimyKIDYogcBy0bGGMLgDiEt2sLEsqJgFQEnkGkMC7iEqOGgyEOia4igbRhlhgB04TRg22QAA4AAwsIRwUqcHm4-FDfLJFgwATqRnM1lIABMLD8DgAKhLZAUoXBjOpmi0mmYBJM-Hi4AAxCBCQZzLzDARLCAgAC+DqAA

It does not reproduce in the latest Prettier (which uses remark-parse v8) either.

That PR is just about words, terminology.

This means the change at least deserves credit for making it clear that this part of the specification is a terrible one that should be revised. Old versions of remark-parse were based on an older, more ambiguous specification and consequently avoided this problem.

tats-u commented Nov 13, 2023

https://spec.commonmark.org/0.29/

A punctuation character is an ASCII punctuation character or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.

You are right. I'm sorry. I will look for another version.

tats-u commented Nov 13, 2023

I finally found that the current broken definitions were introduced in 0.14.

https://spec.commonmark.org/0.14/changes

https://spec.commonmark.org/0.13/

I will investigate why they were introduced.

tats-u commented Nov 13, 2023

https://github.com/commonmark/commonmark-spec/blob/0.14/changelog.spec.txt

  • Improved rules for emphasis and strong emphasis. This improves parsing of emphasis around punctuation. For background see http://talk.commonmark.org/t/903/6. The basic idea of the change is that if the delimiter is part of a delimiter clump that has punctuation to the left and a normal character (non-space, non-punctuation) to the right, it can only be an opener. If it has punctuation to the right and a normal character (non-space, non-punctuation) to the left, it can only be a closer. This handles cases like
**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**

and

**foo "*bar*" foo**

http://talk.commonmark.org/t/903/6

There are some good ideas here. It looks hairy, but if I understand correctly, the basic idea is fairly simple:

  1. Strings of * or _ are divided into “left flanking” and “right flanking,” based on two things: the character immediately before them and the character immediately after.
  2. Left-flanking delimiters can open emphasis, right flanking can close, and non-flanking delimiters are just regular text.
  3. A delimiter is left-flanking if the character to the left has a lower rank than the character to the right, according to the following ranking: spaces and newlines are 0, punctuation (unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po, Sc, Sk, Sm or So) is 1, the rest 2. And similarly a delimiter is right-flanking if the character to the left has a higher rank than the character to the right.

Note

I replaced the link with a cached copy from the Wayback Machine.
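
The ranking idea in the quoted post can be written compactly; a sketch (the category sets are taken verbatim from the quote):

import unicodedata

def rank(ch):
    # 0 = space/newline, 1 = punctuation or symbol (the quoted categories), 2 = everything else
    if ch in " \n":
        return 0
    if unicodedata.category(ch) in {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
                                    "Sc", "Sk", "Sm", "So"}:
        return 1
    return 2

def flanking(before, after):
    # left-flanking if the left rank is lower than the right rank; right-flanking if higher
    return rank(before) < rank(after), rank(before) > rank(after)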

I conclude that this problem was caused by a lack of consideration for Chinese and Japanese by @jgm and the author of vfmd (@roop, or possibly @akavel).

tats-u commented Nov 13, 2023

I would like to ask them why they included non-ASCII punctuation characters and why ASCII punctuation characters alone were not considered sufficient.

tats-u commented Nov 13, 2023

I found the commit containing the initial definition in the spec of vfmd:

vfmd/vfmd-spec@7b53f05

@roop seems to live in India, which may be why he added non-ASCII punctuation characters, but the trouble is that I do not know Hindi at all. I wonder whether a space is always adjacent to punctuation characters in that language, as in European ones.

@vassudanagunta

@tats-u dude, here and in your comments on #618 you come off as arrogant and very disrespectful. You make absolutist claims and then frequently correct yourself because it turns out you didn't do your homework. You need to have the humility to realize that your perception that "something broke or is broken" might have to do with you not understanding one or more of the following (I don't have the time to figure out which ones; the responsibility is on you):

  • your specific perspective, which may not be universal, which may miss the forest for the single tree that you are most focused on
  • the problem, if there actually is one, might be downstream of CommonMark, in the tool you are using
  • if CommonMark is involved:
    • the facts, the history, or the priorities of CommonMark
    • the impossible expectation that CommonMark can be all things to all people.
    • the difficulty in maintaining a spec where many users expect it to work how they want it without understanding

A more reasoned, respectful and helpful approach would be to have a discussion with other people who are affected by what you claim is broken, including the makers and other users of the downstream tool that you claim is now broken. Diagnose the problem with them, assuming they agree with you that there is a problem, before making a claim that the source of the problem is upstream in CommonMark.

If it turns out that you are alone in this, that should tell you something.

wooorm (Contributor) commented Nov 14, 2023

@tats-u This issue is still open, so indeed it is looking for a solution. It is also something I have heard from others.

However, it is not easy to solve.
Many languages do use whitespace.
No languages use only ASCII.
Not using unicode would harm many users, too.

There are also legitimate cases where you do want to use an asterisk or underscore but don’t want it to result in emphasis/strong. Also in East-Asian languages.

One idea I have, that could potentially help emphasis/strong, is the Unicode line breaking algorithm: https://unicode.org/reports/tr14/.
It has to be researched, but it might come up with line breaking points that are better indicators than solely relying on whitespace/punctuation.
It might also be worse.

tats-u commented Nov 14, 2023

@vassudanagunta I got too angry at that time; I now think it went over the line. I wish GitHub provided a draft-comment feature out of the box so I could post everything at once instead of editing or adding follow-up comments.

the problem, if there actually is one, might be downstream of CommonMark, in the tool you are using

Let me say the problem is not specific to any single framework. It can be reproduced in the most popular JS Markdown frameworks, remark (unified) and markdown-it. The remark-related issues I raised were closed immediately on the grounds that the behavior follows the spec.


the impossible expectation that CommonMark can be all things to all people.

I never expected that. This is why I am now looking into the background and the impact of my proposed changes.

the difficulty in maintaining a spec where many users expect it to work how they want it without understanding

It looks like a lot of work to study the impact of breaking changes and decide whether or not to apply them.

many users expect it to work how they want it without understanding

Due to this problem, it became necessary for me (us) to tell all Japanese (and some Chinese) Markdown writers to refrain from surrounding whole sentences with **, to use JSX <strong>, or to compromise by adding an extra space after the full-width punctuation mark if they are going to continue with additional sentences.

<!-- What would you feel if Markdown would not recognize ** here as <strong> if you remove 4 or 5 spaces?   -->
**Don't surround the whole sentence with the double-asterisk without adding extra spaces!**      The Foobar language which is spoken by most CommonMark maintainers use as many as 6 spaces to split sentences.

the facts, the history

This is what I have been looking into by digging through the Git history, changelogs, and test cases.

the priorities of CommonMark

It is not surprising that the maintainers and you give this problem a lower priority, since it does not affect any European language, all of which put a space next to punctuation or parentheses.
I got angry because, given the background of this problem, I assumed that Japanese and Chinese were not even seen as third-class citizens in the Markdown world. (The change that caused this problem assumes that all languages put a space next to punctuation or parentheses.)

If it turns out that you are alone in this, that should tell you something.

I clearly doubt this.
You should know that many users of specific languages (and they are not minor ones!) are, or are going to be, hurt by this problem.


@wooorm I apologize again at this time for my anger and for being too militant in my remarks.


My humble suggestions and comments on them:

  • Revert the concept of left- and right-flanking to what it was before 0.14 (not including 0.14 itself)
    • Old remark v8 used in Prettier, which is said to violate the CM 0.14+ spec, correctly parses the cases presented in the changelog of CM v0.14.
    • I would like to know, and would have to investigate, the impact of this change because it is a breaking change
  • Left- and right-flanking + ASCII punctuation (Unicode punctuation can be used in other parts)
    • In addition to the issues you mentioned, the combination with a link, **[製品ほげ](./product-foo)**と**[製品ふが](./product-bar)**をお試しください, still cannot be parsed as expected. A compromise solution
  • Left- and right-flanking + exclude Chinese- and Japanese-related punctuation from the list
    • Some users use ( ) without an adjacent space. A compromise solution

Many languages do use whitespace.

I know. It is the background of this problem.

There are also legitimate cases where you do want to use an asterisk or underscore but don’t want it to result in emphasis/strong. Also in East-Asian languages.

I have looked for such cases and their frequency. Escaping them does not modify the rendered content itself, but I am fed up with having to modify the content by adding an extra space, or having to rely on the inline raw JSX tag (<strong>), to avoid this problem; it puts shackles on Markdown's expressive power.

Unicode line breaking algorithm

I will look into it later. (I do not have high expectations for it either.)

Crissov (Contributor) commented Nov 15, 2023

Checking the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po and Ps, U+3001 Ideographic Comma and U+3002 Ideographic Full Stop are of course included in what Commonmark considers punctuation marks, which are all treated alike.

For its definitions of flanking, CM could start to handle Open/Start Ps (e.g. () and Initial Pi (“) differently than Close/End Pe ()) and Final Pf (”), and both differently than the rest of Connector Pc (_), Dash Pd (-) and Other Po. However, this could only (somewhat) help with brackets and quotation marks or in contexts where they are present, since the characters in question are all part of that last category Po, which is the largest and most diverse by far.

Possibly affected Examples are, for instance: 363, 367+368, 371+372, 376 and 392–394.
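
The opening/closing distinction can be read directly off the general category; a minimal sketch (the helper name is mine):

import unicodedata

def bracket_side(ch):
    cat = unicodedata.category(ch)
    if cat in ("Ps", "Pi"):
        return "opening"   # e.g. ( 「 “
    if cat in ("Pe", "Pf"):
        return "closing"   # e.g. ) 」 ”
    if cat in ("Pc", "Pd", "Po"):
        return "other"     # _ - and the CJK marks 。 、, which stay in Po
    return None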

tats-u commented Nov 26, 2023

@Crissov

Possibly affected Examples are, for instance: 363, 367+368, 371+372, 376 and 392–394.

I checked the raised test cases. 367 is the most affected of them.
I wonder how many Markdown writers use nested <em> in the casual documents Markdown is suited for, and whether we can ask users to combine * and _, or use the raw <em> powered by MDX, if they want to nest <em>.
CJK languages do not use italic. They use emphasis marks (https://en.wikipedia.org/wiki/Emphasis_mark), brackets (「」), or quotes (“”) to emphasize words.
Emphasizing the parentheses in that case may be less natural for humans, but it is a simpler specification and makes the behavior easier to predict.
Japanese and Chinese do not use the _-based syntax because it has too many restrictions, so 371 does not matter. You can keep the current behavior for _.
The other raised cases are not affected.

However, there are some cases that were not raised but are more important. I am not convinced by test case 378 (a**"foo"**\n → left as is).
We may as well treat the ** in it as <strong>.
Making text bold is popular even in Chinese and Japanese, and ** is used much more frequently than *.
MDN says that <em> can be nested but does not say that <strong> is also nested.
It would be appreciated if the behavior of ** were changed first. It is the highest priority for Chinese and Japanese.

handle Open/Start Ps (e.g. () and Initial Pi (“) differently than Close/End Pe ()) and Final Pf (”), and both differently than the rest of Connector Pc (_), Dash Pd (-) and Other Po.

Doesn't it mean that the ** in 単語と**[単語と](word-and)**単語 is going to be treated as <strong> by that change?


FYI, according to https://hypestat.com/info/github.com, one in six GitHub visitors lives in China or Japan. That percentage cannot be ignored or underestimated.

wooorm (Contributor) commented Dec 4, 2023

CJK languages do not use italic.

<em> elements have a default styling in HTML (italic), but you can change that. You can add 「」 before/after if you want, with CSS. Markdown does not dictate italic.

MDN says that <em> can be nested but does not say that <strong> is also nested.

The “Permitted content: Phrasing content” bit allows it for both.

That percentage cannot be ignored or underestimated.

I don’t think anybody is underestimating that.
You can’t ignore all existing markdown users either, though, and break them.

Practically, this is also open source, which implies that somebody has to do the work for free here, probably because they think it’s fun or important to do. And then folks working on markdown parsers need to do it too. To illustrate, GitHub hasn’t really done anything in the last 3 years (just security vulnerabilities / the new fancy footnotes feature).

jgm (Member) commented Dec 4, 2023

Getting emphasis right in markdown (especially nested emphasis) is very difficult. Changing the existing rules without messing up cases that currently work is highly nontrivial.

For what it's worth, my rationalized syntax djot has simpler rules for emphasis, gives you what you want in the above Japanese example, and allows you to use braces to clarify nesting in cases where it's unclear, e.g. {*foo{*bar*}*}. It might be worth a look.

tats-u commented Dec 11, 2023

<em> elements have a default styling in HTML (italic), but you can change that. You can add 「」 before/after if you want, with CSS.

This is technically possible but not practical or necessary. It is much easier and faster to type "「" and "」" directly from the keyboard, and you cannot copy brackets rendered via ::before and ::after as part of the text.

Markdown does not dictate italic.

Almost all introductions to Markdown for newcomers, including the following, say that * is for italic.

I do not know of any SaaS products in Japan that customize the style of <em>.

The current behavior of CommonMark forces newbies in China or Japan to try to decipher its spec. The spec is for developers of Markdown parsers, not for users, except for experts.

CommonMark has now grown to the point where it can steer the largest Markdown implementations (remark, markdown-it, goldmark (used by Hugo), commonmarker (possibly used by GitHub), and so on) from behind the scenes. We may well lobby to revise its specification. (Unenforceable, of course!)

It would not be difficult to create a new specification of Markdown, but it is difficult to give it sufficient influence.

This is why I had tried to do away with the left- and right-flanking rules, but I have recently found a more convincing plan.

Under my plan, we only have to change:

  • The definitions of (2a) & (2b) in the left- and right-flanking delimiter run
  • Examples 352 & 379, which should not occur in English and many other languages that are not affected by this problem, because a space is usually adjacent to punctuation in them.

Getting emphasis right in markdown (especially nested emphasis) is very difficult. Changing the existing rules without messing up cases that currently work is highly nontrivial.

We do not have to change anything else. I hope most Chinese and Japanese users can be convinced by this. Also, you can continue to nest <em> and <strong> in languages other than Chinese or Japanese as you can today. (We rarely need that feature in these languages.) This will not break the vast majority of existing documents, as long as they were written without abusing the details of the spec.

I don’t think anybody is underestimating that.

I am a little relieved to hear that. I apologize for the misunderstanding.

You can’t ignore all existing markdown users either, though, and break them.

Abolishing the left- and right-flanking rules entirely would affect too many documents. However, the new plan will not affect most existing documents, except for ones that abuse the details of the spec. Do you mean that those are also included in "all existing" ones?
In the first place, this feature is just an Easter egg, so a small modification to it should be acceptable. I would appreciate it if you could provide some links to well-known sites that describe Markdown for intermediate-level users and mention <em> & <strong> nesting, if you have time. I could not find one.

I suggest two new terms, "punctuation run preceded by space" and "punctuation run followed by space".

  • "... preceded ..." means: a sequence of Unicode punctuation characters preceded by Unicode whitespace
  • "... followed ..." means: a sequence of Unicode punctuation characters followed by Unicode whitespace

(2a) and (2b) would be changed as follows:

  • A left-flanking delimiter run is a delimiter run that is (1) not followed by Unicode whitespace, and either (2a) preceded by Unicode whitespace, or (2b) not the first characters in a punctuation run followed by space. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
  • A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) followed by Unicode whitespace, or (2b) not the last characters in a punctuation run preceded by space. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

This change treats punctuation characters that are not adjacent to a space as normal letters. To see whether a "**" works as intended, one need only check the nearest whitespace and the punctuation characters around it. It makes it possible to parse all of the following as intended:

**これは太字になりません。**ご注意ください。

カッコに注意**(太字にならない)**文が続く場合に要警戒。

**[リンク](https://example.com)**も注意。(画像も同様)

先頭の**`コード`も注意。**

**末尾の`コード`**も注意。

Also, we can parse even the following English as intended:

You should write “John**'s**” instead.

We do not concatenate very many punctuation characters in practice, so we do not have to search more than a dozen or so (e.g. 16) punctuation characters for a space before or after the target delimiter run.


To check whether the delimiter run is "the last characters in a punctuation run preceded by space" (without using a cache):

flowchart TD
    Next{"Is the<br>next character<br>an Unicode punctuation<br>chracter?"}
    Next--> |YES| F["<code>return false</code>"]
    Next--> |NO| Init["<code>current =</code><br>(previous character)<br><code>n =</code><br>(Length of delimiter run)"]
    Init--> Exceed{"<code>n >= 16</code>?"}
    Exceed--> |YES| F
    Exceed --> |NO| Previous{"What type is <code>current</code>?"}
    Previous --> |Not punctuation or space| F
    Previous --> |Space| T["<code>return true</code>"]
    Previous --> |Unicode punctuation| Iter["<code>n++<br>current =</code><br>(previous character)"]
    Iter --> Exceed
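
A minimal Python sketch of this flowchart (the function and predicate names are mine; run_start/run_end delimit the delimiter run inside text):

import string
import unicodedata

def is_unicode_whitespace(ch):
    return ch in "\t\n\x0c\r" or unicodedata.category(ch) == "Zs"

def is_unicode_punctuation(ch):
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")

MAX_LOOKBEHIND = 16  # the "dozen or so" cut-off mentioned above

def is_last_of_punct_run_preceded_by_space(text, run_start, run_end):
    # If the next character is punctuation, the delimiter run is not the *last*
    # part of the punctuation run.
    if run_end < len(text) and is_unicode_punctuation(text[run_end]):
        return False
    n = run_end - run_start   # length of the delimiter run
    i = run_start - 1         # position of the previous character
    while True:
        if n >= MAX_LOOKBEHIND:
            return False      # give up after a bounded look-behind
        if i < 0 or is_unicode_whitespace(text[i]):
            return True       # the punctuation run is preceded by space (or line start)
        if not is_unicode_punctuation(text[i]):
            return False      # preceded by a normal letter instead of a space
        n += 1
        i -= 1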

Under the current spec, to non-expert users, especially in China or Japan, "*" and "**" sometimes appear to be abandoning their duties. We must not make non-expert users write Markdown in fear of this hidden behavior.

Crissov (Contributor) commented Feb 1, 2024

0.31 changes the wording slightly, but as far as I can tell this does not change flanking behavior at all.

A Unicode punctuation character is …

  • old:

    an [ASCII punctuation character] or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.

  • new:

    a character in the Unicode P (punctuation) or S (symbol) general categories.

tats-u commented Feb 4, 2024

The change made the situation even worse.
The following sentences can no longer be parsed properly.

税込**¥10,000**で入手できます。

正解は**④**です。

The only small improvements are that the condition is easier to explain to beginners (we can now use the single word “symbols”) and that it is more consistent with ASCII punctuation characters.

jgm (Member) commented Feb 4, 2024

This particular change was not intended to address this issue; it was just intended to make things more consistent.

@tats-u I am sorry, I have not yet had time to give your proposal proper consideration.

tats-u commented Feb 5, 2024

This particular change was not intended to address this issue; it was just intended to make things more consistent.

I guessed as much, but as a result it did cause a breaking change and broke some documents (far fewer than those affected by 0.14, though), which is the kind of regression you have most feared and cared about.
This change can serve as a basis for determining what kinds of breaking changes will be acceptable in the future.

In the first place, we do not have easy access to convincing, practical examples that show how legitimate the controversial parts of the specification and its changes are; we can easily find only examples that are designed purely for testing and have no real-world meaning (e.g. *$*a. and *$*alpha.).

What is needed is something like:

Price: ****10 per month (note: you cannot pay in US$!)

I have not yet had time to give your proposal proper consideration.

FYI, you do not have to evaluate how to optimize the algorithm in the above flowchart; it is naive and can be optimized later. All I want you to do first is to evaluate how acceptable the breaking changes brought by my revision are. It might be better for me to make a PoC to make that easier.

jgm (Member) commented Feb 5, 2024

To be honest, I didn't anticipate these breaking changes, and I would have thought twice about the change if I had.

Having a parser to play with that implements your idea would make it easier to see what its consequences would be. (Ideally, a minimally altered cmark or commonmark.js.) It's also important to have a plan that can be implemented without significantly degrading the parser's performance. But my guess is that if it's just a check that has to be run once for each delimiter + punctuation run, it should be okay.

tats-u commented Sep 23, 2024

I found Yijing symbols other than the Yijing Hexagram Symbols:
U+2630 (☰)–U+2637 (☷) (the Yijing trigram symbols) in the Miscellaneous Symbols block.
For consistency with the trigram symbols, I want to remove the hexagram symbols from the CJK list.

tats-u commented Sep 23, 2024

Also, if the character following * is one of U+3030 (〰︎), U+303D (〽︎), U+3297 (㊗︎), or U+3299 (㊙︎), we should look at the character after it to check whether it is an emoji or a CJK text character.

VS-16 (U+FE0F) is required for it to be recognized as an emoji there.

3030 FE0F     ; Basic_Emoji                  ; wavy dash                                                      # E0.6   [1] (〰️)
303D FE0F     ; Basic_Emoji                  ; part alternation mark                                          # E0.6   [1] (〽️)
3297 FE0F     ; Basic_Emoji                  ; Japanese “congratulations” button                              # E0.6   [1] (㊗️)
3299 FE0F     ; Basic_Emoji                  ; Japanese “secret” button                                       # E0.6   [1] (㊙️)

https://unicode.org/Public/emoji/16.0/emoji-sequences.txt

jgm (Member) commented Sep 23, 2024

For consistency with the trigram symbols, I want to remove the hexagram symbols from the CJK list.

Want to create a PR for this?

if the character following * is one of U+3030 (〰︎), U+303D (〽︎), U+3297 (㊗︎), or U+3299 (㊙︎), we should look at the character after it to check whether it is an emoji or a CJK text character.

I'd love to avoid making the rules too complex. This suggestion goes in that direction.

tats-u commented Sep 23, 2024

Want to create a PR for this?

Sure, reverting is sufficient.

I'd love to avoid making the rules too complex. This suggestion goes in that direction.

It is very unlikely that we would have to look more than two code points past * to determine whether a character is an emoji. Unicode and emoji are now too complex. I hate that those emoji were integrated into normal text symbols in Unicode.

e.g. 1️⃣: (U+0031 (ASCII digit 1) U+FE0F (VS-16) U+20E3 (combining enclosing keycap))
This emoji is very irregular (it starts with an ASCII digit), but you only have to look at up to the second code point to check whether it is an emoji.

tats-u commented Sep 23, 2024

commonmark/cmark#564

tats-u commented Oct 5, 2024

I found a spec bug caused by the combination of a keycap emoji and * in Markdown:

まず、*️⃣を押してください。番号を入力したら、もう一度*️⃣を押してください。
<p>まず、<em>️⃣を押してください。番号を入力したら、もう一度</em>️⃣を押してください。</p>

https://spec.commonmark.org/dingus/?text=%E3%81%BE%E3%81%9A%E3%80%81*%EF%B8%8F%E2%83%A3%E3%82%92%E6%8A%BC%E3%81%97%E3%81%A6%E3%81%8F%E3%81%A0%E3%81%95%E3%81%84%E3%80%82%E7%95%AA%E5%8F%B7%E3%82%92%E5%85%A5%E5%8A%9B%E3%81%97%E3%81%9F%E3%82%89%E3%80%81%E3%82%82%E3%81%86%E4%B8%80%E5%BA%A6*%EF%B8%8F%E2%83%A3%E3%82%92%E6%8A%BC%E3%81%97%E3%81%A6%E3%81%8F%E3%81%A0%E3%81%95%E3%81%84%E3%80%82

002A FE0F 20E3                                         ; fully-qualified     # *️⃣ E2.0 keycap: *
002A 20E3                                              ; unqualified         # *⃣ E2.0 keycap: *

https://unicode.org/Public/emoji/latest/emoji-test.txt
https://www.unicode.org/emoji/charts/emoji-style.txt

Of course we can hotfix this by adding \ before *️⃣, but it's a pitfall for non-experts.

jgm (Member) commented Oct 5, 2024

@tats-u because this doesn't have to do with CJK specifically, perhaps you should raise it in a separate issue.

wooorm (Contributor) commented Oct 7, 2024

It’s #646

tats-u commented Oct 7, 2024

because this doesn't have to do with CJK specifically

Both the keycap form of * and CJK emoji require looking ahead up to 2 code points past * to convince us that they are not a keycap or CJK text symbols.

  • * ㊙︎ VS16 (*㊙️)
  • * VS16 Keycap (*️⃣)

It’s #646

I see. We might be able to fix both at the same time.

tats-u commented Oct 13, 2024

It looks like the keycap is more difficult because of **foo**️⃣.
The 3rd * (just before *️⃣) should be only right-flanking, but the last (4th) * (the 1st code point of *️⃣) should be neither-flanking.
A VS-16 right after a *, unlike the keycap, does not break the concept of the flanking delimiter run. (Update: do we only have to exclude such a * from the delimiter run?)
However, this does not change the fact that we need the ability to peek at the next two characters to fix either of these problems.
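
A minimal sketch of the two-code-point peek discussed above (the code point values come from the emoji data quoted earlier; the function names are mine):

VS16 = "\uFE0F"    # variation selector-16
KEYCAP = "\u20E3"  # combining enclosing keycap

def is_keycap_asterisk(text, i):
    # True if the '*' at text[i] is actually the first code point of *️⃣ (002A FE0F 20E3)
    return text[i] == "*" and text[i + 1:i + 3] == VS16 + KEYCAP

def has_emoji_presentation(text, i):
    # True if the character at text[i] (e.g. ㊙ U+3299) is followed by VS-16,
    # i.e. it requests emoji presentation rather than text presentation
    return text[i + 1:i + 2] == VS16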

tats-u commented Jan 13, 2025

I have just published a markdown-it plugin: https://www.npmjs.com/package/markdown-it-cj-friendly

tats-u commented Jan 13, 2025

UAX #11 East Asian Width might be better than managing our own ranges ourselves.

  • CJK characters are those whose EAW is H, F, or W and that are not emoji (the EAW of many emoji is W)
  • Korean characters are those with Script = Hangul, plus U+20A9 (the halfwidth won sign)
  • Variation selectors are not covered by EAW
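
A rough per-character sketch of this idea using only the standard library (the Hangul ranges are an approximation of Script=Hangul taken from Scripts.txt, and the emoji exclusion is omitted here):

import unicodedata

HANGUL_RANGES = [
    (0x1100, 0x11FF), (0x302E, 0x302F), (0x3131, 0x318E), (0x3200, 0x321E),
    (0x3260, 0x327E), (0xA960, 0xA97C), (0xAC00, 0xD7A3), (0xD7B0, 0xD7C6),
    (0xD7CB, 0xD7FB), (0xFFA0, 0xFFBE), (0xFFC2, 0xFFC7), (0xFFCA, 0xFFCF),
    (0xFFD2, 0xFFD7), (0xFFDA, 0xFFDC),
]

def is_cjk_candidate(ch):
    # East Asian Width H, F, or W (an emoji exclusion would still have to be applied)
    if unicodedata.east_asian_width(ch) in ("W", "F", "H"):
        return True
    cp = ord(ch)
    # Hangul script or the halfwidth won sign
    return cp == 0x20A9 or any(lo <= cp <= hi for lo, hi in HANGUL_RANGES)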

jgm (Member) commented Jan 13, 2025

I like that suggestion. We could create a script that produces code directly from EastAsianWidth.txt.

jgm (Member) commented Jan 13, 2025

import re
import sys
import os
import urllib.request

# Download Unicode data files if not present
unicode_files = [
    'EastAsianWidth.txt',
    'emoji-data.txt',
    'Scripts.txt'
]

unicode_version = '16.0.0'
base_url = f'https://www.unicode.org/Public/{unicode_version}/ucd/'

for file in unicode_files:
    if not os.path.exists(file):
        url = base_url + file
        print(f"Downloading {file}...")
        urllib.request.urlretrieve(url, file)
    else:
        print(f"{file} already exists. Skipping download.")

# cjk_codepoints is an array 0..F0000
# each element is a Boolean; True if it's cjk
size = 0xF0000
cjk_codepoints = [False] * size

# first we iterate through EastAsianWidth.txt, setting code points to
# true if they have width H, F, or W:
with open('EastAsianWidth.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:..([0-9A-F]+))?\s*;\s*[HFW]', line)
        if match:
            x = int(match.group(1), 16)
            y = (match.group(2) and int(match.group(2), 16)) or x
            for z in range(x,y):
                cjk_codepoints[z] = True

# second, we set Hangul code points to True + U+20A9
with open('Scripts.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:..([0-9A-F]+))?\s*;\s*Hangul', line)
        if match:
            x = int(match.group(1), 16)
            y = (match.group(2) and int(match.group(2), 16)) or x
            for z in range(x,y):
                cjk_codepoints[z] = True
cjk_codepoints[0x20A9] = True

# third, we iterate through emoji-data.txt, setting code points to
# false if they are in these ranges.
with open('emoji-data.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:..([0-9A-F]+))?\s*;', line)
        if match:
            x = int(match.group(1), 16)
            y = (match.group(2) and int(match.group(2), 16)) or x
            for z in range(x,y):
                cjk_codepoints[z] = False

# finally, we iterate through our array to create a sequence of ranges
range_open = False
for z in range(0,size):
    if cjk_codepoints[z]:
        if not range_open:
           sys.stdout.write("%x.." % z)
           range_open = True
    else:
        if range_open:
            sys.stdout.write("%x\n" % (z - 1))
            range_open = False

yields:

CJK ranges in Hex
1100..11fe
20a9..20a9
268a..268e
2e80..2e98
2e9b..2ef2
2f00..2fd4
2ff0..2ffe
3001..3002
3012..3012
301e..301e
3021..3028
302a..302c
302e..302e
3031..3034
3036..3036
3038..3039
3041..3095
3099..3099
309b..309b
309d..309d
30a1..30f9
30fc..30fd
3105..312e
3131..318d
3190..3190
3192..3194
3196..319e
31a0..31be
31c0..31e4
31f0..31fe
3200..321d
3220..3228
322a..3246
3251..325e
3260..327e
3280..3288
328a..32af
32b1..32be
32c0..32fe
3300..33fe
3400..4dbe
4dc0..4dfe
4e00..9ffe
a000..a013
a016..a48b
a490..a4c5
a960..a97b
ac00..d7a2
d7b0..d7c5
d7cb..d7fa
f900..fa6c
fa6e..fa6e
fa70..fad8
fada..fafe
fe10..fe15
fe31..fe31
fe33..fe33
fe45..fe45
fe49..fe4b
fe4d..fe4e
fe50..fe51
fe54..fe56
fe5f..fe60
fe64..fe65
fe6a..fe6a
ff01..ff02
ff05..ff06
ff0e..ff0e
ff10..ff18
ff1a..ff1a
ff1c..ff1d
ff1f..ff1f
ff21..ff39
ff41..ff59
ff64..ff64
ff66..ff6e
ff71..ff9c
ff9e..ff9e
ffa0..ffbd
ffc2..ffc6
ffca..ffce
ffd2..ffd6
ffda..ffdb
ffe0..ffe0
ffe5..ffe5
ffe9..ffeb
ffed..ffed
16fe0..16fe0
16ff0..16ff0
17000..187f6
18800..18afe
18b00..18cd4
18d00..18d07
1aff0..1aff2
1aff5..1affa
1affd..1affd
1b000..1b0fe
1b100..1b121
1b150..1b151
1b164..1b166
1b170..1b2fa
1d300..1d355
1d360..1d375
1f200..1f200
1f210..1f231
1f23a..1f23a
1f240..1f247
1f30c..1f30c
1f30e..1f312
1f315..1f315
1f318..1f31c
1f31e..1f31e
1f32f..1f32f
1f331..1f331
1f333..1f333
1f34a..1f34b
1f34f..1f350
1f37b..1f37b
1f37f..1f37f
1f3c4..1f3c9
1f3e3..1f3e4
1f407..1f408
1f40b..1f40b
1f40e..1f40e
1f410..1f410
1f412..1f416
1f429..1f42a
1f464..1f465
1f46b..1f46b
1f46d..1f46d
1f4ac..1f4ad
1f4b5..1f4b5
1f4b7..1f4b7
1f4eb..1f4eb
1f4ed..1f4ef
1f4f4..1f4f5
1f4f7..1f4f8
1f502..1f503
1f507..1f509
1f514..1f515
1f52b..1f52b
1f52d..1f52d
1f55b..1f55b
1f600..1f600
1f606..1f606
1f608..1f608
1f60d..1f611
1f614..1f61b
1f61e..1f61f
1f625..1f625
1f627..1f627
1f62b..1f62d
1f62f..1f62f
1f633..1f636
1f640..1f640
1f644..1f644
1f680..1f680
1f682..1f682
1f685..1f689
1f68b..1f690
1f693..1f698
1f69a..1f69a
1f6a1..1f6a3
1f6a5..1f6a6
1f6ad..1f6ad
1f6b1..1f6b2
1f6b5..1f6b6
1f6b8..1f6b8
1f6be..1f6c0
1f6d0..1f6d0
1f6d5..1f6d5
1f6dc..1f6dc
1f6f6..1f6f6
1f6f8..1f6fa
1f90c..1f90c
1f90f..1f90f
1f918..1f918
1f91e..1f91f
1f927..1f927
1f92f..1f930
1f932..1f932
1f93e..1f93f
1f94b..1f94c
1f94f..1f94f
1f95e..1f95e
1f96b..1f96b
1f970..1f972
1f976..1f976
1f978..1f97b
1f97f..1f97f
1f984..1f984
1f991..1f991
1f997..1f997
1f9a2..1f9a2
1f9a4..1f9a4
1f9aa..1f9aa
1f9ad..1f9ad
1f9af..1f9af
1f9b9..1f9b9
1f9bf..1f9c0
1f9c2..1f9c2
1f9ca..1f9cc
1f9cf..1f9cf
1f9e6..1f9e6
1fa73..1fa74
1fa77..1fa77
1fa7a..1fa7a
1fa82..1fa82
1fa86..1fa86
1fa88..1fa88
1fa8f..1fa8f
1fa95..1fa95
1faa8..1faa8
1faac..1faac
1faaf..1faaf
1fab6..1fab6
1faba..1faba
1fabd..1fabf
1fac2..1fac2
1fac5..1fac5
1facf..1facf
1fad6..1fad6
1fad9..1fad9
1fadb..1fadb
1fadf..1fadf
1fae7..1fae8
1faf6..1faf6
20000..2a6de
2a6e0..2a6fe
2a700..2b738
2b73a..2b73e
2b740..2b81c
2b81e..2b81e
2b820..2cea0
2cea2..2ceae
2ceb0..2ebdf
2ebe1..2ebee
2ebf0..2ee5c
2ee5e..2f7fe
2f800..2fa1c
2fa1e..2fa1e
2fa20..2fffc
30000..31349
3134b..3134e
31350..323ae
323b0..3fffc
CJK ranges in decimal
4352..4606
8361..8361
9866..9870
11904..11928
11931..12018
12032..12244
12272..12286
12289..12290
12306..12306
12318..12318
12321..12328
12330..12332
12334..12334
12337..12340
12342..12342
12344..12345
12353..12437
12441..12441
12443..12443
12445..12445
12449..12537
12540..12541
12549..12590
12593..12685
12688..12688
12690..12692
12694..12702
12704..12734
12736..12772
12784..12798
12800..12829
12832..12840
12842..12870
12881..12894
12896..12926
12928..12936
12938..12975
12977..12990
12992..13054
13056..13310
13312..19902
19904..19966
19968..40958
40960..40979
40982..42123
42128..42181
43360..43387
44032..55202
55216..55237
55243..55290
63744..64108
64110..64110
64112..64216
64218..64254
65040..65045
65073..65073
65075..65075
65093..65093
65097..65099
65101..65102
65104..65105
65108..65110
65119..65120
65124..65125
65130..65130
65281..65282
65285..65286
65294..65294
65296..65304
65306..65306
65308..65309
65311..65311
65313..65337
65345..65369
65380..65380
65382..65390
65393..65436
65438..65438
65440..65469
65474..65478
65482..65486
65490..65494
65498..65499
65504..65504
65509..65509
65513..65515
65517..65517
94176..94176
94192..94192
94208..100342
100352..101118
101120..101588
101632..101639
110576..110578
110581..110586
110589..110589
110592..110846
110848..110881
110928..110929
110948..110950
110960..111354
119552..119637
119648..119669
127488..127488
127504..127537
127546..127546
127552..127559
127756..127756
127758..127762
127765..127765
127768..127772
127774..127774
127791..127791
127793..127793
127795..127795
127818..127819
127823..127824
127867..127867
127871..127871
127940..127945
127971..127972
128007..128008
128011..128011
128014..128014
128016..128016
128018..128022
128041..128042
128100..128101
128107..128107
128109..128109
128172..128173
128181..128181
128183..128183
128235..128235
128237..128239
128244..128245
128247..128248
128258..128259
128263..128265
128276..128277
128299..128299
128301..128301
128347..128347
128512..128512
128518..128518
128520..128520
128525..128529
128532..128539
128542..128543
128549..128549
128551..128551
128555..128557
128559..128559
128563..128566
128576..128576
128580..128580
128640..128640
128642..128642
128645..128649
128651..128656
128659..128664
128666..128666
128673..128675
128677..128678
128685..128685
128689..128690
128693..128694
128696..128696
128702..128704
128720..128720
128725..128725
128732..128732
128758..128758
128760..128762
129292..129292
129295..129295
129304..129304
129310..129311
129319..129319
129327..129328
129330..129330
129342..129343
129355..129356
129359..129359
129374..129374
129387..129387
129392..129394
129398..129398
129400..129403
129407..129407
129412..129412
129425..129425
129431..129431
129442..129442
129444..129444
129450..129450
129453..129453
129455..129455
129465..129465
129471..129472
129474..129474
129482..129484
129487..129487
129510..129510
129651..129652
129655..129655
129658..129658
129666..129666
129670..129670
129672..129672
129679..129679
129685..129685
129704..129704
129708..129708
129711..129711
129718..129718
129722..129722
129725..129727
129730..129730
129733..129733
129743..129743
129750..129750
129753..129753
129755..129755
129759..129759
129767..129768
129782..129782
131072..173790
173792..173822
173824..177976
177978..177982
177984..178204
178206..178206
178208..183968
183970..183982
183984..191455
191457..191470
191472..192092
192094..194558
194560..195100
195102..195102
195104..196604
196608..201545
201547..201550
201552..205742
205744..262140

How does that compare with our current ranges?

tats-u commented Jan 14, 2025

-             for z in range(x,y):
+             for z in range(x,y + 1):
  • There are some code points whose EAW is not defined. If they are surrounded by non-emoji-and-EAW∈{H,F,W} ranges, they should be treated the same way. e.g. U+3097–U+3098 are not defined but U+3096 & U+3099 are EAW = W and not emoji, so they should not be removed from the range.
  • We should not remove single codepoint "unqualified-emoji" (without U+FE0F) from the range. https://unicode.org/Public/emoji/latest/emoji-test.txt
    • 〰 (U+3030)
    • 〽 (U+303D)
    • 🈂 (U+1F202)
    • 🈷 (U+1F237)
    • ㊗ (U+3297)
    • ㊙ (U+3299)

We have excluded Korean but may have to add it to the range, especially for right flanking.

https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491/6

*스크립트(script)*라고

The following was translated from a Japanese example I came up with:

패키지를 발행하려면 **`npm publish`**를 실행하십시오.

@tats-u

This comment was marked as outdated.

@tats-u

This comment was marked as outdated.

tats-u commented Jan 15, 2025

Source
import re
import sys
import os
import urllib.request
from typing import Literal


def download(base_url: str, file: str) -> None:
    if not os.path.exists(file):
        url = base_url + file
        print(f"Downloading {file}...")
        urllib.request.urlretrieve(url, file)
    else:
        print(f"{file} already exists. Skipping download.")

# Download Unicode data files if not present
uax_files = [
    'EastAsianWidth.txt',
    'Scripts.txt',
    'Blocks.txt'
]

unicode_version_upto_minor = '16.0'
unicode_full_version = 'latest' if unicode_version_upto_minor == 'latest' else f'{unicode_version_upto_minor}.0'
uax_base_url = f'https://www.unicode.org/Public/{unicode_full_version}/ucd/'

for file in uax_files:
    download(uax_base_url, file)

emoji_base_url = f"https://www.unicode.org/Public/emoji/{unicode_version_upto_minor}/"

download(f"{uax_base_url}emoji/", 'emoji-data.txt')

emoji_files = [
    'emoji-test.txt',
    'emoji-sequences.txt'
]

for file in emoji_files:
    download(emoji_base_url, file)

# cjk_codepoints is an array 0..F0000
# each element is a Boolean; True if it's cjk
size = 0x110000
cjk_codepoints: list[bool | None] = [None] * size

# first we iterate through EastAsianWidth.txt, setting code points to
# true if they have width H, F, or W:
with open('EastAsianWidth.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*([HFWAN]a?)\s+', line)
        if match:
            first = int(match.group(1), 16)
            last = (match.group(2) and int(match.group(2), 16)) or first
            isEaw = match.group(3) in {'H', 'F', 'W'}
            for z in range(first,last + 1):
                cjk_codepoints[z] = isEaw

# second, we set Hangul code points to True + U+20A9
with open('Scripts.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*Hangul', line)
        if match:
            first = int(match.group(1), 16)
            last = (match.group(2) and int(match.group(2), 16)) or first
            for z in range(first,last + 1):
                cjk_codepoints[z] = True
# cjk_codepoints[0x20A9] = True

unqualified_single_codepoint_emojis: set[int] = set()

with open("emoji-test.txt", "r") as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)\s+;\s+unqualified\s+', line)
        if match:
            unqualified_single_codepoint_emojis.add(int(match.group(1), 16))

# third, we iterate through emoji-data.txt, setting code points to
# false if they are in these ranges.
with open('emoji-data.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;', line)
        if match:
            first = int(match.group(1), 16)
            last = (match.group(2) and int(match.group(2), 16)) or first
            for z in range(first,last + 1):
                if z not in unqualified_single_codepoint_emojis:
                    cjk_codepoints[z] = False


def print_range_start(start: int) -> None:
    print(f"- {start:04X}", end='')

def print_range_end(start: int, end: int) -> None:
    start_char = chr(start)
    if start == end:
        print(f" ({start_char})")
    else:
        print(f"..{end:04X} ({start_char}..{chr(end)})")

# IVS
for z in range(0xE0100, 0xE01EF + 1):
    cjk_codepoints[z] = True

print("## Ranges\n")

# finally, we iterate through our array to create a sequence of ranges
range_start: int | None = None
paused_range_end: int | None = None
for z in range(0,size):
    match cjk_codepoints[z]:
        case True:
            if range_start is None:
                print_range_start(z)
                range_start = z
            elif paused_range_end is not None:
                paused_range_end = None
        case False:
            if range_start is not None:
                if paused_range_end is not None:
                    print_range_end(range_start, paused_range_end)
                    paused_range_end = None
                else:
                    print_range_end(range_start, z - 1)
                range_start = None
        case None:
            if range_start is not None and paused_range_end is None:
                paused_range_end = z - 1

def get_coverage_txt(first: int, last: int) -> Literal["All"] | Literal["Some"] | Literal["None"] | Literal["Undefined"]:
    has: bool | None = None
    for z in range(first, last + 1):
        result = cjk_codepoints[z]
        if result is not None:
            if has is None:
                has = result
            elif has != result:
                return "Some"

    match has:
        case True:
            return "All"
        case False:
            return "None"
        case None:
            return "Undefined"

def get_coverage_statistics(first: int, last: int) -> tuple[int, int]:
    has = 0
    total = 0
    for z in range(first, last + 1):
        result = cjk_codepoints[z]
        if result is not None:
            if result:
                has += 1
            total += 1
    return has, total


coverage_dict: dict[Literal["All"] | Literal["Some"] | Literal["None"] | Literal["Undefined"], list[tuple[int, int, str]]] = {
    "All": [],
    "Some": [],
    "None": [],
    "Undefined": []
}

with open("Blocks.txt", "r") as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)\.\.([0-9A-F]+); ([^ ].*)$', line)
        if match:
            first = int(match.group(1), 16)
            last = int(match.group(2), 16)
            name = match.group(3)
            coverage = get_coverage_txt(first, last)
            coverage_dict[coverage].append((first, last, name))

block_url_prefix = "https://www.compart.com/en/unicode/block/U+"

for coverage, ranges in coverage_dict.items():
    if len(ranges) == 0:
        continue
    print(f"\n## {coverage}\n")
    if coverage == "Some":
        for first, last, name in ranges:
            has, total = get_coverage_statistics(first, last)
            percent = f"{round(has / total * 100, 2)}%"
            print(f"- [{name}]({block_url_prefix}{first:04X}) ({first:04X}..{last:04X}) ({has}/{total}) ({percent})")
    else:
        if coverage == "None":
            print("<details>\n")
        for first, last, name in ranges:
            print(f"- [{name}]({block_url_prefix}{first:04X}) ({first:04X}..{last:04X})")
        if coverage == "None":
            print("\n</details>")

Ranges

  • 1100..11FF (ᄀ..ᇿ)
  • 20A9 (₩)
  • 2329..232A (〈..〉)
  • 268A..268F (⚊..⚏)
  • 2E80..303E (⺀..〾)
  • 3041..3247 (ぁ..㉇)
  • 3250..A4C6 (㉐..꓆)
  • A960..A97C (ꥠ..ꥼ)
  • AC00..D7FB (가..ퟻ)
  • F900..FAFF (豈..﫿)
  • FE10..FE19 (︐..︙)
  • FE30..FE6B (︰..﹫)
  • FF01..FFEE (!..○)
  • 16FE0..1B2FB (𖿠..𛋻)
  • 1D300..1D376 (𝌀..𝍶)
  • 1F200 (🈀)
  • 1F202 (🈂)
  • 1F210..1F219 (🈐..🈙)
  • 1F21B..1F22E (🈛..🈮)
  • 1F230..1F231 (🈰..🈱)
  • 1F237 (🈷)
  • 1F23B (🈻)
  • 1F240..1F248 (🉀..🉈)
  • 20000..3FFFD (𠀀..𿿽)
  • E0100..E01EF (󠄀..󠇯)

All

Some

None

tats-u commented Jan 15, 2025

I'm almost completely satisfied with the ranges.

Do you all agree that Korean should also be taken into account? If so, I will have to rename the suffix of my package(s) from -cj-friendly to -cjk-friendly and republish it (them).

Korean uses spaces to split phrases rather than words.

jgm (Member) commented Jan 15, 2025

Can you explain what "Ranges", "All", "Some", "None" mean in the above?

I'm not sure about Korean because I don't know a thing about it. I have to defer to others here about what would be best. The question is whether it is generally whole phrases that are emphasized. I imagine not but I don't know what counts as a phrase.

tats-u commented Jan 15, 2025

Ranges → Those of CJK
All → All characters in the block are treated as CJK
Some → Some in the block are CJK but not all
None → Non-CJK blocks

A Korean noun or verb is frequently followed by one or a few postpositional words (e.g. postpositional particles) within a phrase.

**[링크](https://example.kr/)**만을 강조하고 싶다.

I want to emphasize only **this [link](https://example.kr/)**.

For English or other European language speakers, a phrase can be something like "in foo", "Bar is", or "Please do baz". Although I know little about Korean, I don't think that it's as general as in English or that Korean needs nested emphasis.

As you can imagine, you will not suffer from this spec bug in Korean, unlike in Chinese or Japanese, if you want to emphasize whole sentences. This is the main reason why I excluded Korean from the range.

jgm (Member) commented Jan 15, 2025

So, am I right that "Ranges" is all that a spec or implementation needs to be concerned with here? Is there some wording relating these ranges to the Unicode blocks that you would suggest, or is it best just to give the ranges explicitly?

tats-u commented Jan 16, 2025

So, am I right that "Ranges" is all that a spec or implementation needs to be concerned with here?

Yes. It's just a computation result and a snapshot for those who pursue performance or zero dependencies, or who are interested in which characters are actually treated as CJ(K).

It contains some unassigned code points (e.g. U+2FFFF), but implementations can choose to include or exclude them.

Is there some wording relating these ranges to the Unicode blocks that you would suggest

No. Blocks can be used just for verification of the correctness of the range. It is not directly related to a new spec draft.

it best just to give the ranges explicitly

What I suggested is just a computation result and a snapshot as of Unicode 16.

Implementations can directly convert it to a conditional expression in an if statement or something similar, or check EAW, Script, and emojiness of surrounding characters.
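
As a sketch of the "conditional expression" option, the snapshot can be turned into a sorted range table and binary-searched (only the first few ranges from the list above are shown):

from bisect import bisect_right

CJK_RANGES = [
    (0x1100, 0x11FF), (0x20A9, 0x20A9), (0x2329, 0x232A), (0x268A, 0x268F),
    (0x2E80, 0x303E), (0x3041, 0x3247), (0x3250, 0xA4C6),
    # ... remaining ranges from the snapshot ...
]
STARTS = [first for first, _ in CJK_RANGES]

def is_cjk_codepoint(cp):
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= CJK_RANGES[i][1]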

tats-u commented Jan 16, 2025

I found a Korean user's tweet expressing dissatisfaction:

https://x.com/pyrasis/status/1567317306153508865

또 다른 문제점은 MDX의 markdown 볼드 문법(** **)을 처리였습니다. **안녕**하세요.는 괜찮았는데, **안녕(hello)**하세요.와 같이 특수문자가 들어가면 볼드처리가 되지 않는 문제가 있었습니다. 이 부분도 어쩔 수 없이 <b>안녕(hello)</b>하세요.로 대체하였습니다.

Another issue was handling MDX's markdown bold syntax (** **). While **안녕**하세요. (**He**llo.) worked fine, **안녕(hello)**하세요. had problems with bold formatting due to special characters. Unfortunately, this part had to be replaced with <b>안녕(hello)</b>하세요..

I will choose to take Korean into account at least in my package(s).

My suggestion (full):

"CJK character other than (without) variation selector" SHALL be a character whose code point:

  • satisfies at least one of:
    • satisfies both of:
      • EAW is H, F, or W
      • is not a fully-qualified emoji consisting of a single code point
    • Script is Hangul

tats-u commented Jan 18, 2025

A new package with Korean support has just arrived: https://www.npmjs.com/package/markdown-it-cjk-friendly
Spec diff based on CommonMark spec and tips: https://github.com/tats-u/markdown-cjk-friendly/blob/main/specification.md

The previous package is now deprecated.

tats-u commented Jan 26, 2025

remark plugin & micromark extension have just been released:
