Emphasis with CJK punctuation #650
This and the above issues are caused by the change in #618. It was introduced only in the v0.30 spec. https://spec.commonmark.org/0.30/changes
The definition of left- and right-flanking emphasis for * and ** must use ASCII punctuation characters instead of Unicode ones. … does not cause such a problem, so remark, on which MDX v2+ depends, is affected. |
Again, there is no change in 618. That PR is just about words, terminology. MDX 1 did not follow CM correctly and had other bugs. Can you please read what I say, and please stop spamming, and actually contribute? |
The extension by MDX is not the culprit: https://codesandbox.io/s/remark-playground-wmfor?file=/package.json As of …, it is not reproduced in … The latest Prettier (uses …)
This means the change deserves credit for making it clear that this part of the specification is terrible and should be revised. Old … |
https://spec.commonmark.org/0.29/
You are right. I'm sorry. I will look for another version. |
I finally found that the current broken definitions were introduced in 0.14. https://spec.commonmark.org/0.14/changes https://spec.commonmark.org/0.13/ I will investigate why they were introduced. |
https://github.com/commonmark/commonmark-spec/blob/0.14/changelog.spec.txt
http://talk.commonmark.org/t/903/6
Note: I replaced the link with a cached copy from the Wayback Machine. I conclude that this problem was caused by a lack of consideration for Chinese and Japanese by … |
I would like to ask them why they included non-ASCII punctuation characters and why ASCII punctuation characters alone are not sufficient. |
I will blame https://github.com/vfmd/vfmd-spec/blob/gh-pages/specification.md later. The test cases in vfmd considered only ASCII punctuation. |
I found the commit containing the initial definition in the spec of vfmd:
|
@tats-u dude, here and in your comments on #618 you come off as arrogant and very disrespectful. You make absolutist claims and then frequently correct yourself because it turns out you didn't do your homework. You need to have the humility to realize that your perception that "something broke or is broken" might have to do with you not understanding one or more of the following (I don't have the time to figure out which ones; the responsibility is on you):
A more reasoned, respectful and helpful approach would be to have a discussion with other people who are affected by what you claim is broken, including the makers and other users of the downstream tool that you claim is now broken. Diagnose the problem with them, assuming they agree with you that there is a problem, before making a claim that the source of the problem is upstream in CommonMark. If it turns out that you are alone in this, that should tell you something. |
@tats-u This issue is still open, so indeed it is looking for a solution. It is also something I have heard from others. However, it is not easy to solve. There are also legitimate cases where you do want to use an asterisk or underscore but don’t want it to result in emphasis/strong. Also in East-Asian languages. One idea I have, that could potentially help emphasis/strong, is the Unicode line breaking algorithm: https://unicode.org/reports/tr14/. |
@vassudanagunta I got too angry at that time. I now think it was over the line.
Let me say these are not bugs in the individual frameworks. This problem can be reproduced in the most major JS Markdown frameworks, remark (unified) and markdown-it. Remark-related issues that I have raised were closed immediately on the grounds that the behavior follows the spec.
I never have. This is why I have looked into the background and the impact of my proposed changes now.
It looks like a lot of work to study the impact of breaking changes and decide whether or not to apply them.
Due to this problem, it became necessary for me (us) to tell all Japanese (and some Chinese) Markdown writers to refrain from surrounding whole sentences with `**`: <!-- How would you feel if Markdown did not recognize the ** here as <strong> once you removed 4 or 5 spaces? -->
**Don't surround the whole sentence with the double-asterisk without adding extra spaces!** The Foobar language, which is spoken by most CommonMark maintainers, uses as many as 6 spaces to split sentences.
This is what I have looked into by digging through the Git history, change logs, and test cases.
It is not surprising that you and the maintainers give this problem a lower priority, since it does not affect any European language, all of which put a space next to punctuation or parentheses.
I strongly doubt this. @wooorm I apologize again for my earlier anger and for being too militant in my remarks. My humble suggestions, and comments on them:
I know. It is the background of this problem.
I have looked for such cases and their frequency. Escaping them does not modify the rendered content itself, but I am fed up with having to modify the content by adding an extra space or by depending on the inline raw JSX tag (…
I will look into it later. (I do not expect it either) |
Checking the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po and Ps: U+3001 Ideographic Comma and U+3002 Ideographic Full Stop are of course included in what CommonMark considers punctuation marks, which are all treated alike. For its definitions of flanking, CM could start to handle Open/Start … differently. Possibly affected examples are, for instance: 363, 367+368, 371+372, 376 and 392–394. |
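As a quick side check (my own illustration, not from the thread), Python's `unicodedata` module confirms that these characters fall into the same Po bucket as ASCII punctuation:

```python
import unicodedata

# U+3001 IDEOGRAPHIC COMMA and U+3002 IDEOGRAPHIC FULL STOP report the same
# general category, Po (Other Punctuation), as ASCII "," and ".", so the
# spec's "Unicode punctuation character" definition treats them all alike.
for ch in ("\u3001", "\u3002", ",", "."):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
```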
I checked the raised test cases. 367 is the most affected of them. However, there are some that were not raised but are more important. I am not convinced by test case 378 (…
Does it not mean that …? FYI, according to https://hypestat.com/info/github.com, one in six visitors to GitHub lives in China or Japan. That share cannot be ignored or underestimated. |
The “Permitted content: Phrasing content” bit allows it for both.
I don’t think anybody is underestimating that. Practically, this is also open source, which implies that somebody has to do the work for free here, probably because they think it’s fun or important to do. And then folks working on markdown parsers need to do it too. To illustrate, GitHub hasn’t really done anything in the last 3 years (just security vulnerabilities / the new fancy footnotes feature). |
Getting emphasis right in markdown (especially nested emphasis) is very difficult. Changing the existing rules without messing up cases that currently work is highly nontrivial. For what it's worth, my rationalized syntax djot has simpler rules for emphasis, gives you what you want in the above Japanese example, and allows you to use braces to clarify nesting in cases where it's unclear, e.g. |
This is technically possible but not practical or necessary. It is much easier and faster to type "「" & "」" from the keyboard directly, and you cannot copy these brackets in
Almost all introductions to Markdown for newbies, including the following, say that …
I do not know of SaaSes in Japan that customize the style of … The current behavior of CommonMark forces newbies in China or Japan to try to decipher its spec, which is written for developers of Markdown parsers, not for users other than experts. CommonMark has now grown to the point where it steers the largest Markdown implementations (remark, markdown-it, goldmark (used by Hugo), commonmarker (possibly used by GitHub), and so on) from behind the scenes. We may well lobby to revise its specification (unenforceable of course, though!). It would not be difficult to create a new specification of Markdown, but it is difficult to give it sufficient influence. This is why I had tried to get rid of the left- and right-flanking rules, but I have recently found a more convincing plan. Under my plan we only have to change the following:
We do not have to change anything else. I hope most Chinese and Japanese users can be convinced by it. Also, you can continue to nest …
I am a little relieved to hear that. I apologize for the misunderstanding.
It would affect too many documents if the left- & right-flanking rules were abolished. However, the new plan will not affect most existing documents, except for ones that abuse the fine details of the spec. Do you mean that those are also included in "all existing" ones? I suggest the new terms "punctuation run preceded by space" & "punctuation run followed by space".
(2a) and (2b) are going to be changed like the following:
This change treats punctuation characters that are not adjacent to a space as normal letters. To see if the …:

**これは太字になりません。**ご注意ください。
カッコに注意**(太字にならない)**文が続く場合に要警戒。
**[リンク](https://example.com)**も注意。(画像も同様)
先頭の**`コード`も注意。**
**末尾の`コード`**も注意。

Also, we can parse even the following English as intended:

You should write “John**'s**” instead.

In practice we do not concatenate very many punctuation characters, so we do not have to search more than a dozen or so (e.g. 16) punctuation characters for a space before or after the target delimiter run. To check whether a delimiter run consists of "the last characters in a punctuation run preceded by space" (without using a cache):

```mermaid
flowchart TD
Next{"Is the<br>next character<br>an Unicode punctuation<br>chracter?"}
Next--> |YES| F["<code>return false</code>"]
Next--> |NO| Init["<code>current =</code><br>(previous character)<br><code>n =</code><br>(Length of delimiter run)"]
Init--> Exceed{"<code>n >= 16</code>?"}
Exceed--> |YES| F
Exceed --> |NO| Previous{"What type is <code>current</code>?"}
Previous --> |Not punctuation or space| F
Previous --> |Space| T["<code>return true</code>"]
Previous --> |Unicode punctuation| Iter["<code>n++<br>current =</code><br>(previous character)"]
Iter --> Exceed
```
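A rough Python sketch of the check in the flowchart above (my own illustration, not part of the proposal; the function name, the treatment of start-of-text as a space, and the use of `unicodedata` are assumptions):

```python
import unicodedata

MAX_SCAN = 16  # cap suggested above; real parsers may tune or drop this


def is_unicode_punctuation(ch: str) -> bool:
    # General category P*; note that spec 0.31 would also count symbols (S*).
    return unicodedata.category(ch).startswith("P")


def ends_punctuation_run_preceded_by_space(text: str, start: int, end: int) -> bool:
    """True if the delimiter run text[start:end] forms the last characters of a
    punctuation run whose nearest preceding non-punctuation character is a space
    (start of text is treated like a space, as the spec does for line starts)."""
    # If the character right after the run is punctuation, the run is not the
    # *last* part of the punctuation run.
    if end < len(text) and is_unicode_punctuation(text[end]):
        return False
    n = end - start   # length of the delimiter run
    i = start - 1     # walk backwards from the character before the run
    while n < MAX_SCAN:
        if i < 0 or text[i].isspace():
            return True   # reached a space (or start of text)
        if not is_unicode_punctuation(text[i]):
            return False  # a letter or digit breaks the punctuation run
        n += 1
        i -= 1
    return False          # give up after scanning MAX_SCAN characters
```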
In the current spec, to non-advanced users, especially in China or Japan, … |
0.31 changes the wording slightly, but as far as I can tell this does not change flanking behavior at all.
|
The change made the situation even worse.
The only small improvements are that it is easier to explain the condition to beginners (we can now use the single word “symbols”) and that it is more consistent with the treatment of ASCII punctuation characters. |
This particular change was not intended to address this issue; it was just intended to make things more consistent. @tats-u I am sorry, I have not yet had time to give your proposal proper consideration. |
I guessed so, but as a result it did cause a breaking change and broke some documents (far fewer than those affected by 0.14, though), which is the kind of regression you have mostly feared and cared about. In the first place, we cannot easily get at convincing and practical examples that show how legitimate controversial parts of specifications and changes are; we can easily find only ones that are designed purely for testing and carry no meaning (e.g. …). What is needed is something like: Price: **€**10 per month (note: you cannot pay in US$!)
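To spell out why this example regresses under 0.31 (my own illustration, not from the thread): € has general category Sc, so once symbols are folded into "Unicode punctuation", the closing `**` is preceded by punctuation and followed by a digit, which disqualifies it from being right-flanking:

```python
import unicodedata

prev_ch, next_ch = "€", "1"  # characters around the closing ** in "**€**10"

print(unicodedata.category("€"))  # "Sc" (currency symbol)


def right_flanking(prev: str, nxt: str, is_punct) -> bool:
    # CommonMark: not preceded by whitespace, and either not preceded by
    # punctuation or followed by whitespace/punctuation.
    return (not prev.isspace()) and (not is_punct(prev) or nxt.isspace() or is_punct(nxt))


punct_0_30 = lambda ch: unicodedata.category(ch).startswith("P")          # punctuation only
punct_0_31 = lambda ch: unicodedata.category(ch).startswith(("P", "S"))   # punctuation or symbol

print(right_flanking(prev_ch, next_ch, punct_0_30))  # True  -> the ** can close, strong works
print(right_flanking(prev_ch, next_ch, punct_0_31))  # False -> the ** stay literal
```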
FYI, you do not have to evaluate how to optimize the algorithm in the above flowchart; it is deliberately naive and can be optimized later. All I want you to do first is evaluate how acceptable the breaking changes brought by my revision are. It might be better for me to make a PoC to make that easier. |
To be honest, I didn't anticipate these breaking changes, and I would have thought twice about the change if I had. Having a parser to play with that implements your idea would make it easier to see what its consequences would be. (Ideally, a minimally altered cmark or commonmark.js.) It's also important to have a plan that can be implemented without significantly degrading the parser's performance. But my guess is that if it's just a check that has to be run once for each delimiter + punctuation run, it should be okay. |
I found Yijing symbols other than Yijing Hexagram Symbols. |
Also, … followed by VS-16 (U+FE0F) is required to be recognized as an emoji there.
|
Want to create a PR for this?
I'd love to avoid making the rules too complex. This suggestion goes in that direction. |
Sure, reverting is sufficient.
It is very unlikely that we have to look ahead more than two code points. E.g. 1️⃣ is U+0031 (ASCII digit 1) + U+FE0F (VS-16) + U+20E3 (combining enclosing keycap). |
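For reference (a quick check of my own), the code points of the keycap emoji can be listed directly:

```python
import unicodedata

# "1️⃣" is DIGIT ONE + VARIATION SELECTOR-16 + COMBINING ENCLOSING KEYCAP.
for ch in "1\uFE0F\u20E3":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```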
I found a spec bug caused by the combination of a keycap emoji and
https://unicode.org/Public/emoji/latest/emoji-test.txt Of course we can hotfix this by adding |
@tats-u because this doesn't have to do with CJK specifically, perhaps you should raise it in a separate issue. |
It’s #646 |
Both of the keycap of
I see. We might be able to fix both at the same time. |
Looks like the keycap is more difficult due to |
I have just published a markdown-it plugin: https://www.npmjs.com/package/markdown-it-cj-friendly |
UAX #11 East Asian Width might be better than managing our own ranges ourselves.
|
I like that suggestion. We could create a script that produces code directly from EastAsianWidth.txt. |
```python
import re
import sys
import os
import urllib.request

# Download Unicode data files if not present
unicode_files = [
    'EastAsianWidth.txt',
    'emoji-data.txt',
    'Scripts.txt'
]
unicode_version = '16.0.0'
base_url = f'https://www.unicode.org/Public/{unicode_version}/ucd/'
for file in unicode_files:
    if not os.path.exists(file):
        url = base_url + file
        print(f"Downloading {file}...")
        urllib.request.urlretrieve(url, file)
    else:
        print(f"{file} already exists. Skipping download.")

# cjk_codepoints is an array 0..F0000
# each element is a Boolean; True if it's cjk
size = 0xF0000
cjk_codepoints = [False] * size

# first we iterate through EastAsianWidth.txt, setting code points to
# true if they have width H, F, or W:
with open('EastAsianWidth.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:..([0-9A-F]+))?\s*;\s*[HFW]', line)
        if match:
            x = int(match.group(1), 16)
            y = (match.group(2) and int(match.group(2), 16)) or x
            for z in range(x,y):
                cjk_codepoints[z] = True

# second, we set Hangul code points to True + U+20A9
with open('Scripts.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:..([0-9A-F]+))?\s*;\s*Hangul', line)
        if match:
            x = int(match.group(1), 16)
            y = (match.group(2) and int(match.group(2), 16)) or x
            for z in range(x,y):
                cjk_codepoints[z] = True
cjk_codepoints[0x20A9] = True

# third, we iterate through emoji-data.txt, setting code points to
# false if they are in these ranges.
with open('emoji-data.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:..([0-9A-F]+))?\s*;', line)
        if match:
            x = int(match.group(1), 16)
            y = (match.group(2) and int(match.group(2), 16)) or x
            for z in range(x,y):
                cjk_codepoints[z] = False

# finally, we iterate through our array to create a sequence of ranges
range_open = False
for z in range(0,size):
    if cjk_codepoints[z]:
        if not range_open:
            sys.stdout.write("%x.." % z)
            range_open = True
    else:
        if range_open:
            sys.stdout.write("%x\n" % (z - 1))
            range_open = False
```

yields:

CJK ranges in Hex
CJK ranges in decimal
How does that compare with our current ranges? |
```diff
- for z in range(x,y):
+ for z in range(x,y + 1):
```
We have excluded Korean but may have to add it to the range, especially for right-flanking. https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491/6 (e.g. *스크립트(script)*라고) The following was generated by translating an example of mine from Japanese ("To publish the package, run **`npm publish`**."): 패키지를 발행하려면 **`npm publish`**를 실행하십시오. |
Source

```python
import re
import sys
import os
import urllib.request
from typing import Literal


def download(base_url: str, file: str) -> None:
    if not os.path.exists(file):
        url = base_url + file
        print(f"Downloading {file}...")
        urllib.request.urlretrieve(url, file)
    else:
        print(f"{file} already exists. Skipping download.")


# Download Unicode data files if not present
uax_files = [
    'EastAsianWidth.txt',
    'Scripts.txt',
    'Blocks.txt'
]
unicode_version_upto_minor = '16.0'
unicode_full_version = 'latest' if unicode_version_upto_minor == 'latest' else f'{unicode_version_upto_minor}.0'
uax_base_url = f'https://www.unicode.org/Public/{unicode_full_version}/ucd/'
for file in uax_files:
    download(uax_base_url, file)

emoji_base_url = f"https://www.unicode.org/Public/emoji/{unicode_version_upto_minor}/"
download(f"{uax_base_url}emoji/", 'emoji-data.txt')
emoji_files = [
    'emoji-test.txt',
    'emoji-sequences.txt'
]
for file in emoji_files:
    download(emoji_base_url, file)

# cjk_codepoints covers code points 0..10FFFF
# each element is True if it's CJK, False if it's not, None if undetermined
size = 0x110000
cjk_codepoints: list[bool | None] = [None] * size

# first we iterate through EastAsianWidth.txt, setting code points to
# true if they have width H, F, or W:
with open('EastAsianWidth.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*([HFWAN]a?)\s+', line)
        if match:
            first = int(match.group(1), 16)
            last = (match.group(2) and int(match.group(2), 16)) or first
            isEaw = match.group(3) in {'H', 'F', 'W'}
            for z in range(first, last + 1):
                cjk_codepoints[z] = isEaw

# second, we set Hangul code points to True + U+20A9
with open('Scripts.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*Hangul', line)
        if match:
            first = int(match.group(1), 16)
            last = (match.group(2) and int(match.group(2), 16)) or first
            for z in range(first, last + 1):
                cjk_codepoints[z] = True
# cjk_codepoints[0x20A9] = True

unqualified_single_codepoint_emojis: set[int] = set()
with open("emoji-test.txt", "r") as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)\s+;\s+unqualified\s+', line)
        if match:
            unqualified_single_codepoint_emojis.add(int(match.group(1), 16))

# third, we iterate through emoji-data.txt, setting code points to
# false if they are in these ranges.
with open('emoji-data.txt', 'r') as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;', line)
        if match:
            first = int(match.group(1), 16)
            last = (match.group(2) and int(match.group(2), 16)) or first
            for z in range(first, last + 1):
                if z not in unqualified_single_codepoint_emojis:
                    cjk_codepoints[z] = False


def print_range_start(start: int) -> None:
    print(f"- {start:04X}", end='')


def print_range_end(start: int, end: int) -> None:
    start_char = chr(start)
    if start == end:
        print(f" ({start_char})")
    else:
        print(f"..{end:04X} ({start_char}..{chr(end)})")


# IVS
for z in range(0xE0100, 0xE01EF + 1):
    cjk_codepoints[z] = True

print("## Ranges\n")

# finally, we iterate through our array to create a sequence of ranges
range_start: int | None = None
paused_range_end: int | None = None
for z in range(0, size):
    match cjk_codepoints[z]:
        case True:
            if range_start is None:
                print_range_start(z)
                range_start = z
            elif paused_range_end is not None:
                paused_range_end = None
        case False:
            if range_start is not None:
                if paused_range_end is not None:
                    print_range_end(range_start, paused_range_end)
                    paused_range_end = None
                else:
                    print_range_end(range_start, z - 1)
                range_start = None
        case None:
            if range_start is not None and paused_range_end is None:
                paused_range_end = z - 1


def get_coverage_txt(first: int, last: int) -> Literal["All"] | Literal["Some"] | Literal["None"] | Literal["Undefined"]:
    has: bool | None = None
    for z in range(first, last + 1):
        result = cjk_codepoints[z]
        if result is not None:
            if has is None:
                has = result
            elif has != result:
                return "Some"
    match has:
        case True:
            return "All"
        case False:
            return "None"
        case None:
            return "Undefined"


def get_coverage_statistics(first: int, last: int) -> tuple[int, int]:
    has = 0
    total = 0
    for z in range(first, last + 1):
        result = cjk_codepoints[z]
        if result is not None:
            if result:
                has += 1
            total += 1
    return has, total


coverage_dict: dict[Literal["All"] | Literal["Some"] | Literal["None"] | Literal["Undefined"], list[tuple[int, int, str]]] = {
    "All": [],
    "Some": [],
    "None": [],
    "Undefined": []
}
with open("Blocks.txt", "r") as file:
    for line in file:
        match = re.match(r'^([0-9A-F]+)\.\.([0-9A-F]+); ([^ ].*)$', line)
        if match:
            first = int(match.group(1), 16)
            last = int(match.group(2), 16)
            name = match.group(3)
            coverage = get_coverage_txt(first, last)
            coverage_dict[coverage].append((first, last, name))

block_url_prefix = "https://www.compart.com/en/unicode/block/U+"
for coverage, ranges in coverage_dict.items():
    if len(ranges) == 0:
        continue
    print(f"\n## {coverage}\n")
    if coverage == "Some":
        for first, last, name in ranges:
            has, total = get_coverage_statistics(first, last)
            percent = f"{round(has / total * 100, 2)}%"
            print(f"- [{name}]({block_url_prefix}{first:04X}) ({first:04X}..{last:04X}) ({has}/{total}) ({percent})")
    else:
        if coverage == "None":
            print("<details>\n")
        for first, last, name in ranges:
            print(f"- [{name}]({block_url_prefix}{first:04X}) ({first:04X}..{last:04X})")
        if coverage == "None":
            print("\n</details>")
```

Ranges
All
Some
None
|
I'm almost completely satisfied with the ranges. Do you all agree that Korean should also be taken into account? If so, I have to rename the suffix of my package(s). Korean uses spaces to split not words but phrases. |
Can you explain what "Ranges", "All", "Some", "None" mean in the above? I'm not sure about Korean because I don't know a thing about it. I have to defer to others here about what would be best. The question is whether it is generally whole phrases that are emphasized. I imagine not but I don't know what counts as a phrase. |
Ranges → the CJK ranges themselves. All / Some / None → Unicode blocks whose code points are all, partly, or not at all covered by those ranges. A Korean noun or verb is frequently followed by one or a few postpositional words (e.g. a postpositional particle) within a phrase. **이 [링크](https://example.kr/)**만을 강조하고 싶다.
I want to emphasize only **this [link](https://example.kr/)**. For English or other European language speakers, a phrase can be something like "in foo", "Bar is", or "Please do baz". Although I know little about Korean, I don't think that it's as general as in English or that Korean needs nested emphasis. As you can imagine, you will not suffer from this spec bug in Korean, unlike in Chinese or Japanese, if you only want to emphasize whole sentences. This is the main reason why I have excluded Korean from the range. |
So, am I right that "Ranges" is all that a spec or implementation needs to be concerned with here? Is there some wording relating these ranges to the Unicode blocks that you would suggest, or is it best just to give the ranges explicitly? |
It contains some unassigned code points (e.g. U+2FFFF), but implementations can choose to include or exclude them.
No. Blocks can be used just for verification of the correctness of the range. It is not directly related to a new spec draft.
What I suggested is just a computed result and a snapshot as of Unicode 16. Implementations can directly convert it to a conditional expression in an … |
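As a rough illustration of that idea (the ranges below are only a small hand-picked subset of the generated list, not the full result), such a conditional could look like:

```python
# Hypothetical helper: True if the code point should get the CJK-friendly
# flanking treatment. Only a few well-known ranges are shown here; a real
# implementation would paste in the full generated range list.
def is_cjk(cp: int) -> bool:
    return (
        0x3000 <= cp <= 0x303F       # CJK Symbols and Punctuation
        or 0x3040 <= cp <= 0x30FF    # Hiragana and Katakana
        or 0x4E00 <= cp <= 0x9FFF    # CJK Unified Ideographs
        or 0xF900 <= cp <= 0xFAFF    # CJK Compatibility Ideographs
        or 0xFF00 <= cp <= 0xFF60    # Fullwidth Forms (fullwidth punctuation etc.)
        or 0x20000 <= cp <= 0x2FFFF  # CJK Unified Ideographs Extensions B and later
    )


print(is_cjk(ord("。")))  # True
print(is_cjk(ord(".")))   # False
```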
I found a Korean user's tweet expressing dissatisfaction: https://x.com/pyrasis/status/1567317306153508865
↓
I will choose to take Korean into account at least in my package(s). My suggestion (full):
|
A new package with Korean support has just arrived: https://www.npmjs.com/package/markdown-it-cjk-friendly The previous package is now deprecated. |
remark plugin & micromark extension have just been released: |
Hi, I encountered some strange behavior when using CJK full-width punctuation and trying to add emphasis.
Original issue here
Example punctuation that causes this issue:
。!?、
To my mind, all of these should work as emphasis, but some do and some don't:
I'm not sure if this is the spec working as intended, but in Japanese, as a general rule, there are no spaces within sentences, which leads to the following kind of problem when parsing emphasis.
In English, this is emphasized as expected:
This is **what I wanted to do.** So I am going to do it.
But the same sentence emphasized in the same way in Japanese fails:
これは**私のやりたかったこと。**だからするの。
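For anyone who wants to reproduce this quickly, here is a small script using the markdown-it-py package (my own illustration; any CommonMark-compliant parser should show the same contrast):

```python
from markdown_it import MarkdownIt  # pip install markdown-it-py

md = MarkdownIt()  # CommonMark-compliant defaults

print(md.render("This is **what I wanted to do.** So I am going to do it."))
# Expected: <p>This is <strong>what I wanted to do.</strong> So I am going to do it.</p>

print(md.render("これは**私のやりたかったこと。**だからするの。"))
# Expected: the ** remain literal, because the closing run is preceded by the
# full-width full stop 。 (Unicode punctuation) and followed by a letter,
# so it is not right-flanking and cannot close the emphasis.
```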