Comments and aliases problem #78

Open
jsbien opened this issue Feb 28, 2016 · 5 comments

Comments

@jsbien

jsbien commented Feb 28, 2016

The page U+018D states

The character is also known as reversed Polish-hook o.

However, there is no such formal alias.

On the other hand Fileformat.Info contains the following comments:

reversed Polish-hook o
archaic phonetic for labialized alveolar fricative
recommended spellings U+007A U+02B7 or U+007A U+032B

Looks like the first comment was converted into an alias while the other ones have been skipped.

I don't think this is correct :-( How are the comments imported, and how are they processed? Shouldn't they be displayed just as "Unicode comments"?

@Boldewyn
Contributor

It's a verbatim copy from this file in the Unicode standard:

http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt

I don't know where Unicode got the alias name for this one. When I google it, the most prominent results are your mails to the Unicode mailing list ;-)

In said file, lines prefixed with = are alias names as of the standard (like the "reversed Polish-hook o"), while others (like the * prefixed) are other informative data, comments, ... (I opened #49 in order to add this missing info, too.)
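To illustrate the distinction, here is a minimal Python sketch of how such prefixed lines could be separated. It is not the importer used on this site. The `SAMPLE` text is an assumption modeled on the comments quoted above, and `parse_entries` is a hypothetical helper that handles only the `=` and `*` prefixes and ignores everything else.

```python
# Assumed sample in the style of NamesList.txt: a code point line, then
# tab-indented lines whose first character marks the kind of annotation.
SAMPLE = (
    "018D\tLATIN SMALL LETTER TURNED DELTA\n"
    "\t= reversed Polish-hook o\n"
    "\t* archaic phonetic for labialized alveolar fricative\n"
    "\t* recommended spellings U+007A U+02B7 or U+007A U+032B\n"
)


def parse_entries(text):
    """Collect '=' aliases and '*' notes per code point (hypothetical helper)."""
    entries = {}
    current = None
    for line in text.splitlines():
        if not line or line[0] in "@;":
            # Blank lines, headers and file comments end the current entry.
            current = None
            continue
        if line[0] != "\t":
            # A line that starts in column 0 opens a new character entry.
            code, _, name = line.partition("\t")
            current = {"name": name, "aliases": [], "notes": []}
            entries[code] = current
        elif current is not None:
            body = line.strip()
            if body.startswith("= "):
                current["aliases"].append(body[2:])
            elif body.startswith("* "):
                current["notes"].append(body[2:])
            # Any other prefix is skipped in this sketch.
    return entries


entries = parse_entries(SAMPLE)
print(entries["018D"]["aliases"])
print(entries["018D"]["notes"])
```

Keeping aliases and notes in separate lists would make it possible to display the `*` lines on their own instead of dropping them.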

I found many of those quite useful to get a generic idea of the character (see, e.g., the low ASCII control characters, or the guillemets), so I embedded the aliases in the character description.

This works quite well for almost all characters. This is the first one where the alias seems off.

The best way to fix this would be to file an upstream issue with Unicode. A changed NamesList.txt would automatically lead to a fix here. Have you tried that already, by chance, after asking on the mailing list last year?

If that doesn't lead to a result, we could add additional info between the character description and the Wikipedia entry to describe why the alias is problematic. If you write one or two sentences, I'd add them as the file codepoints.net/data/U+018D.en.md.

As a last resort I could hotfix the database to remove the alias, but I'd rather stick to the standard as closely as possible. (Also, the alias might sneak in again in a later import.)

@jsbien
Author

jsbien commented Feb 28, 2016

Let's make haste slowly.

I've checked NameAliases.txt only; I was not aware (or forgot) that aliases are also defined in NamesList.txt. I will file an issue with Unicode, perhaps after discussing the problem on the Unicode list (it was on my TODO list already).

I'm glad you plan to handle the other information from the file as well.

Some time in the future I would like to include the information from
https://bitbucket.org/jsbien/unicode4polish/wiki/codes/U+018D_LATIN_SMALL_LETTER_TURNED_DELTA
However, I don't have a clear idea of how to do it in an elegant and extensible way.

@jsbien
Author

jsbien commented Mar 13, 2016

Thanks to the thread about NamesList.txt on the Unicode list, I've come to the conclusion that we have to distinguish formal aliases (from NameAliases.txt) from informal aliases (from NamesList.txt). So instead of

The character is also known as reversed Polish-hook o.

we should have something like

The Unicode version ??? mentions an informal alias: reversed Polish-hook o.

I've added the Unicode version because, as far as I understand, the annotations are not stable and may vanish.
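A rough Python sketch of the two wordings, only to make the proposal concrete. The function `alias_sentences` and its parameters are hypothetical, the fallback sentence without a version is my own placeholder wording, and `"X.Y"` stands in for a version number that I have left open above.

```python
def alias_sentences(formal, informal, version=None):
    """Build display sentences, keeping formal and informal aliases apart.

    formal:   aliases taken from NameAliases.txt
    informal: '=' annotations taken from NamesList.txt
    version:  Unicode version of the imported NamesList.txt, if known
    """
    sentences = []
    for alias in formal:
        sentences.append(f"The character is also known as {alias}.")
    for alias in informal:
        if version:
            sentences.append(
                f"The Unicode version {version} mentions an informal alias: {alias}."
            )
        else:
            # Placeholder wording (an assumption) for when no version is recorded.
            sentences.append(f"NamesList.txt mentions an informal alias: {alias}.")
    return sentences


# "X.Y" is a stand-in, not a real version number.
for sentence in alias_sentences([], ["reversed Polish-hook o"], version="X.Y"):
    print(sentence)
```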

@Boldewyn
Contributor

Since I parse the data anew with every new version, the informal alias would then vanish here, too. So I guess we could leave that out. Apart from that, I very much like the idea of rewording it like this.

@jsbien
Author

jsbien commented Mar 25, 2016

What about

The character is also called a reversed Polish-hook

versus e.g.

The character is also known as SYRIAC SUBLINEAR COLON SKEWED LEFT

The second example is an official alias. As for the first one, nobody knows the character as a reversed Polish-hook o, especially as there is no such thing as a Polish-hook; even in English the diacritic mark is called ogonek. The name is just the individual usage of the author of a perhaps obsolete book on phonology.

Moreover, I don't like vanishing information. I would very much appreciate a note stating in or since which versions the comment appeared. The proposed wording allows for it, e.g.

The character is also called a reversed Polish-hook (in Unicode 4.1.0 and later versions)
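One possible way to derive such a note is sketched below in Python. It assumes the annotations from several NamesList.txt versions have been collected into a dictionary. The `history` data is invented purely for illustration and does not reflect which versions actually contain the annotation; `first_seen` and `version_key` are hypothetical helpers.

```python
def version_key(version):
    """Turn '4.1.0' into (4, 1, 0) so that versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def first_seen(history):
    """Map each alias to the earliest version in which it appears.

    history: dict of version string -> list of informal aliases in that version
    """
    result = {}
    for version in sorted(history, key=version_key):
        for alias in history[version]:
            # setdefault keeps the first (earliest) version recorded.
            result.setdefault(alias, version)
    return result


# Invented placeholder data for one code point, NOT verified against Unicode.
history = {
    "5.0.0": ["reversed Polish-hook o"],
    "4.0.0": [],
    "4.1.0": ["reversed Polish-hook o"],
}

for alias, version in first_seen(history).items():
    print(f"The character is also called a {alias} "
          f"(in Unicode {version} and later versions)")
```

The sketch records only the first appearance; the "and later versions" part would still need a check that the annotation is present in every later version.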
