Actually several related issues:
1. мың and миллиард are tagged NUM num everywhere, while миллион is NUM num in some cases and NOUN n in others.
2. млрд. and трлн. are tagged NOUN abbr everywhere, while млн. is tagged NUM num in some cases and NOUN abbr in others.
3. (a)
4 2 2 NUM num NumType=Card 5 compound _ _
5 миллиард миллиард NUM num NumType=Card 6 compound _ _
6 300 300 NUM num NumType=Card 7 compound _ _
7 миллион миллион NUM num NumType=Card 8 nummod _ _
8 теңгеден теңге NOUN n Case=Abl 10 nmod _ _
9 астам астам ADJ adj _ 10 amod _ _
10 қаржы қаржы NOUN n Case=Nom 11 obj _ _
vs (b)
3 4,3 4,3 NUM num NumType=Card 4 nummod _ _
4 мыңнан мың NUM num Case=Abl|NumType=Card,Ord 6 nmod _ _
5 астам астам ADJ adj _ 6 amod _ _
6 шақырымды шақырым NOUN n Case=Acc 7 obj _ _
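A quick way to find such inconsistencies across a treebank might look like the sketch below. It assumes tab-separated 10-column CoNLL-U input; the function name `inconsistent_tags` is hypothetical, and the second sample row is made up purely to illustrate a NOUN n reading of миллион (it is not taken from the treebank).

```python
from collections import defaultdict


def inconsistent_tags(conllu_text):
    """Return lemmas that occur with more than one (UPOS, XPOS) pair."""
    seen = defaultdict(set)
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        lemma, upos, xpos = cols[2], cols[3], cols[4]
        seen[lemma].add((upos, xpos))
    return {lemma: tags for lemma, tags in seen.items() if len(tags) > 1}


# Row 1 and row 3 come from example (a); row 2 is an invented illustration.
sample = "\n".join([
    "7\tмиллион\tмиллион\tNUM\tnum\tNumType=Card\t8\tnummod\t_\t_",
    "2\tмиллион\tмиллион\tNOUN\tn\tCase=Nom\t3\tnmod\t_\t_",
    "5\tмиллиард\tмиллиард\tNUM\tnum\tNumType=Card\t6\tcompound\t_\t_",
])

for lemma, tags in inconsistent_tags(sample).items():
    print(lemma, sorted(tags))
```

Grouping by lemma rather than by surface form means inflected forms such as мыңнан are counted together with мың.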
Hereby I suggest:
1. to tag all of мың, миллион, миллиард, триллион, млн., млрд. and трлн. as NUM num. For the latter three, apertium-kaz & co can be modified to output <abbr> as a secondary tag, i.e. млн\.? --> <num><abbr>. Since there are abbreviated nouns, abbreviated numerals, etc., I think it makes sense to make <abbr> a secondary tag for known abbreviations, especially in the context of UD annotation:
[quote https://universaldependencies.org/u/pos/all.html#sym-symbol]
Strings that consists entirely of alphanumeric characters are not symbols but they may be proper nouns: 130XE, DC10; others may be tagged PROPN (rather than SYM) even if they contain special characters: DC-10. Similarly, abbreviations for single words are not symbols but are assigned the part of speech of the full form. For example, Mr. (mister), kg (kilogram), km (kilometer), Dr (Doctor) should be tagged nouns. Acronyms for proper names such as UN and NATO should be tagged as proper nouns.
[unquote]
but also, generally speaking, knowing the POS of the unabbreviated form is considered helpful for applications.
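The proposed tagging rule could be sketched roughly as follows. This is only an illustration of the idea, not how apertium-kaz is implemented: the function name `numeral_tags` and the lookup set are hypothetical, and the rule works on lemmas, so inflected forms would have to be lemmatised first.

```python
import re

# Abbreviated large numerals, with an optional trailing period (cf. млн\.?).
ABBR_NUMERAL = re.compile(r"^(млн|млрд|трлн)\.?$")

# Unabbreviated large numerals that should all be tagged alike.
FULL_NUMERALS = {"мың", "миллион", "миллиард", "триллион"}


def numeral_tags(lemma):
    """Return Apertium-style tags for a large-numeral lemma, or None."""
    lemma = lemma.lower()
    if ABBR_NUMERAL.match(lemma):
        # <abbr> is secondary: the primary tag is that of the full form.
        return "<num><abbr>"
    if lemma in FULL_NUMERALS:
        return "<num>"
    return None  # not a large numeral; leave to the regular analyser


for word in ["мың", "миллион", "млн.", "млрд", "трлн.", "теңге"]:
    print(word, numeral_tags(word))
```

Keeping `<num>` first means downstream tools that only read the primary tag would see the abbreviations as numerals, in line with the UD guideline quoted above.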
2. to handle all numerical constructions like the above as compounds (i.e. as done in 3a). In other words, a flat chain of compounds, with the rightmost element being the head and receiving nummod, nmod, or whatever relation applies.
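As a rough sketch of what this would mean for the HEAD and DEPREL columns: each numeral attaches to the next one as compound, and the rightmost numeral attaches to the external head with the relation that fits the context. The function name `chain_numerals` and its parameters are hypothetical.

```python
def chain_numerals(numeral_ids, head_id, last_rel="nummod"):
    """Return (id, head, deprel) triples for a chain of numeral tokens.

    numeral_ids: token IDs of the numeral sequence, left to right.
    head_id:     ID of the token the whole numeral expression modifies.
    last_rel:    relation for the rightmost numeral (e.g. nummod or nmod).
    """
    triples = []
    for position, token_id in enumerate(numeral_ids):
        if position < len(numeral_ids) - 1:
            # Non-final numerals attach to the following numeral.
            triples.append((token_id, numeral_ids[position + 1], "compound"))
        else:
            # The rightmost numeral carries the external relation.
            triples.append((token_id, head_id, last_rel))
    return triples


# Example (a): 2 миллиард 300 миллион -> теңгеден
print(chain_numerals([4, 5, 6, 7], head_id=8))

# Example (b) re-annotated under the proposal: 4,3 мыңнан -> шақырымды
print(chain_numerals([3, 4], head_id=6, last_rel="nmod"))
```

For example (a) this reproduces the existing annotation; for example (b) the only change is that 4,3 would attach to мыңнан as compound instead of nummod.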
General context: #17
UPDATE: note that in UD there is the Abbr feature: https://universaldependencies.org/u/feat/Abbr.html
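On the CoNLL-U side, marking an abbreviated numeral could then amount to adding `Abbr=Yes` to the FEATS column. A minimal sketch follows, assuming the usual `Name=Value|Name=Value` encoding with `_` for an empty column and features sorted alphabetically, ignoring case; the helper name `add_feature` is hypothetical.

```python
def add_feature(feats, name, value):
    """Add or overwrite one feature in a CoNLL-U FEATS string."""
    if feats == "_":
        pairs = {}
    else:
        # Split "A=B|C=D" into {"A": "B", "C": "D"}.
        pairs = dict(item.split("=", 1) for item in feats.split("|"))
    pairs[name] = value
    # Sort by feature name, case-insensitively.
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{key}={val}" for key, val in ordered)


print(add_feature("NumType=Card", "Abbr", "Yes"))
print(add_feature("_", "Abbr", "Yes"))
```

With this, млн. could be annotated as NUM num with `Abbr=Yes|NumType=Card`, which keeps the POS of the full form and still records that the token is an abbreviation.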