In this documentation, we first explain the general principles of MSP, and then present the feature set and show how to convert a regular UD treebank to MSP, using English as an example.
Words have long been an essential concept in the definition of treebanks in Universal Dependencies (UD), since the first stage in their construction is delimiting words in the language at hand. This reflects the common view in theoretical linguistics of words as the dividing line between syntax, the grammatical module of word combination, and morphology, that is, word construction.
We suggest defining the content-function boundary to differentiate 'morphological' from 'syntactic' elements. In our morpho-syntactic data structure, content words are represented as separate nodes on a dependency graph, even if they share a whitespace-separated word, and both function words and morphemes contribute morphology-style features to characterize the nodes.
Delimiting syntactically relevant words becomes much more complicated the less isolating a language is. Thus, this operation, which is as simple as breaking the text on white spaces for English, is borderline impossible for polysynthetic languages, in which a single word may be composed of several lexemes that have predicate-argument relations. This reflects the fact that despite the presumed role of words in contemporary linguistics, there is no consensus on a coherent cross-lingual definition of words. We will thus avoid (most) theoretical debates on word boundaries, and resolve many of the word-segmentation inconsistencies that occur in UD, either across languages, e.g., Japanese is treated as isolating and Korean as agglutinative, even though they are very similar typologically, or across treebanks of the same language, e.g., the different treebanks for Hebrew segment and attribute different surface forms to clitics.
The central divide in an MS graph is between content words (or morphemes) and function words (or morphemes). Content words form the nodes, while the information from function words is represented as features modifying the content nodes.
Morphosyntactic annotation will bring the trees of very different languages much closer together and thus enable new typological studies. In isolating languages, the data will make explicit the MS features that are expressed periphrastically. In polysynthetic languages, the addition of MS features to content words will expose the argument structure even when it is encapsulated in a single word. Morpho-syntactic data will be more inclusive towards languages that are currently treated unnaturally, most prominently noun-incorporating languages. Morpho-syntactic models will be able to parse sentences in more languages and enable better cross-lingual studies.

We will add the morphosyntactic features in a new 11th column called `MS-FEATS`. The original CoNLL-U file can thus be recovered by simply dropping this column, while the morphosyntactic tree can be built by dropping all the nodes that do not have MS-FEATS defined for them.
The format for morpho-syntactic parsing data is a simple extension of UD's CoNLL-U format. It adds a single column of morpho-syntactic features (named `MS-FEATS`) to every UD node that contains a content word. UD nodes that contain function words should have an empty (i.e. `_`) MS-FEATS value. The new column is added last, after the MISC column. The other CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC) are defined exactly as in UD.
Morpho-syntactic features (MS features), the key characteristic of morpho-syntactic dependency trees, are modelled after the morphological features in UD and may be viewed as a generalization of them. As in UD, the features are an alphabetically ordered set of name-value pairs separated by pipes, of the structure `Name1=Value1|Name2=Value2`.¹ Most feature names and values are equivalent to those in UD, for example `Gender=Masc`, `Voice=Pass`, etc.
However, MS features also differ from morphological features in a few important respects:

- The features are only defined for content nodes (see below).
- Function words should not have MS features. All the information they convey should be expressed as features on the relevant content node.
  - Note: since the file format is a modified version of UD's CoNLL-U, function words still appear in the final output; their MS-FEATS column should be `_`. This is in contrast with content words that happen to have no MS features, which should contain an orphan pipe `|`.
- The features are defined not only by morphemes but by any grammatical function marker, be it a morpheme or a word. So the content node go in will go should bear the feature `Tense=Fut`.
- All applicable features should be marked on the respective content nodes, even if expressed by non-concatenative means (as long as they are grammatical). E.g., the node go in did you go? should be marked with `Mood=Ind;Int` even though the interrogative mood is expressed mostly by word order.
- Features should be applied only to their relevant node. In other words, no agreement features are needed; in a phrase like he goes, only he should bear `Number=Sing|Person=3`, and goes should have only `Tense=Pres` (and other features if relevant).
- The feature structure is not flat. In other words, feature values are not necessarily single strings. They can contain (see the parsing sketch after this list):
  - a list of values separated by a semicolon, for example `Aspect=Perf;Prog` on the verb of the English clause I have been walking
  - a negation of a value, for example `Mood=not(Pot)` on the Turkish verb yürüyemez ("he can't walk"), where the negation refers to the ability²
  - a conjunction of values. This mechanism is to be used only in cases of explicit conjunction of grammatical constructions; for example, `Case=and(Cnd,Temp)` is the manifestation of the English phrase if and when when connecting two clauses (see below for discussion of the `Case` feature)
  - a disjunction of values, e.g. `Tense=or(Fut,Past)`
- If a feature includes multiple values in any kind of order or structure, they should be ordered alphabetically, in accordance with the general UD guidelines.
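Since values can carry this internal structure, reading the MS-FEATS column takes slightly more than a plain split on pipes. Below is a minimal parsing sketch in Python; the function name and the returned (operator, values) representation are our own choices for illustration, not part of the format:

```python
import re

def parse_ms_feats(ms_feats):
    """Parse an MS-FEATS string into {name: (operator, values)} pairs.

    operator is "list", "and", "or", "not", or None for a plain single value.
    "_" (function word) yields None; an orphan pipe "|" yields an empty dict.
    """
    if ms_feats in ("_", ""):          # function words carry no MS features
        return None
    if ms_feats == "|":                # content word with no MS features
        return {}
    feats = {}
    for pair in ms_feats.split("|"):
        name, value = pair.split("=", 1)
        m = re.fullmatch(r"(and|or|not)\((.*)\)", value)
        if m:
            feats[name] = (m.group(1), m.group(2).split(","))
        elif ";" in value:
            feats[name] = ("list", value.split(";"))
        else:
            feats[name] = (None, [value])
    return feats

print(parse_ms_feats("Aspect=Perf;Prog|Mood=not(Pot)"))
# {'Aspect': ('list', ['Perf', 'Prog']), 'Mood': ('not', ['Pot'])}
```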
The mapping from morpho-syntactic constructions to features does not have to be one-to-one. In cases where several constructions have the exact same meaning (e.g., they differ only in geographic distribution, register or personal preference), it is perfectly suitable to assign the same feature combination to all of them. For example, in Spanish, both comiera and comiese will be assigned `Aspect=Imp|Mood=Sub|Tense=Past|VerbForm=Fin` (remember that the agreement features should appear only on the relevant argument).
The categories of words to be "consumed" into MS features are usually: auxiliaries, determiners, adpositions, conjunctions and subordinators, and some particles. These categories may not neatly correspond to UD POS tags. Some clearly do, like auxiliaries (POS tag `AUX`), while others, like `DET`, may also include contentful words, like all and every. Some POS tags like `ADV` mix many contentful words (nicely, rapidly, often, etc.) with a few that serve as conjunctions (when, then, etc.), and in rare cases the same word may be considered functional or contentful depending on the context.³
quick link: inventory of relation features
Since the MS features are a generalization of UD's morphological features, their types and possible values are also highly similar to those of UD's features. Therefore, for most features, the list in UD is sufficient to characterize content nodes in MS trees as well.

The most prominent exception to this is the expansion of the `Case` feature. Originally, the `Case` feature characterized the relation between a predicate and its argument, almost always a nominal, but for MS trees its role is expanded in two ways. First, in line with the principle of independence from word boundaries, in MS trees this feature corresponds to traditional case morphemes as well as adpositions (these usually have `case` as their DEPREL in UD trees) and coverbs, where such exist. The inclusion of adpositions in determining the `Case` feature expands the set of cases possible in almost any language. Nominals in German, for example, now have an elative case (indicating motion from the inside of the argument), expressed by the combination of the synthetic dative case and the periphrastic preposition aus.
The second expansion of the `Case` feature is that in MS trees it is also used to characterize predicate-predicate relations; hence it is applicable to verbal nodes as well, and it also "consumes" conjunctions and subordinators. So fell in I cried until I fell asleep and today in It is true until today will both get `Case=Ttr`, because both are marked by the function word until.
In general, the same function word/morpheme combination should be mapped to the same `Case` value, even if it serves multiple functions. For example, the Swahili preposition na should be mapped only to `Case=Conj` even when it serves the function of introducing the agent of a passive verb.
"inventory.md"
details a set of universal values for the Case
feature. These feature does not cover
all possible relations, and in some cases when there are adpositions or conjunctions that
do not correspond to any of the features, the value of the respective feature should be
the canonical citation form of the function word transliterated into latin letters in
quotation marks.
A mapping from adpositions and conjunctions to the features in "inventory.md" should be created as part of the annotation process. Note that the mapping does not have to be one-to-one.
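For instance, a fragment of such a mapping for English might look as follows. The entries are limited to values illustrated elsewhere in these guidelines, and `case_feat_map` is the name used in the English conversion notes below:

```python
# A hypothetical fragment of an English adposition/conjunction-to-Case mapping,
# in the spirit of the case_feat_map used in the English conversion notes below.
case_feat_map = {
    "until": "Ttr",   # terminative: "I cried until I fell asleep"
    "from": "Abl",    # ablative: "books to choose from"
    "if": "Cnd",      # conditional: "if and when" -> Case=and(Cnd,Temp)
    "when": "Temp",   # temporal
    # A marker with no matching inventory value keeps its citation form in
    # quotation marks (assuming here, for illustration, that "despite" has none).
    "despite": '"despite"',
}
```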
Content nodes, for which morpho-syntactic features are to be defined, are all words or morphemes from open classes (like nouns, verbs and adjectives) that do not convey a grammatical modification of another word.⁴ These content words should form a morpho-syntactic tree, and when converting UD data this is automatically true in most cases, because UD designates content words as heads, so they directly relate to one another (see exceptions below).
Note that copulas are not content words. In sentences with a copula, refer to the nominal as the predicate and tag it with the features expressed by the copula.
For example, in the sentence the quick brown fox jumps over the lazy dog there are 6 content words (quick, brown, fox, jump, lazy, dog) and 3 function words (the, over, the).
In compounds or headless expressions, i.e., cases where one of the `fixed`, `flat` or `goeswith` DEPRELs is used, all words are judged together as either content or function. Usually such cases will be contentful, but sometimes a fixed expression can be a multi-word adposition, for example as well as and because of.
In addition to words from open classes, content nodes also include all arguments and predicates in the sentence. The implications of this are twofold:
- Pronouns should always be represented as nodes with MS features, regardless of your theoretical position on whether pronouns are contentful or a mere bundle of features.
- Arguments that do not appear explicitly in a sentence but are expressed implicitly (i.e., by agreement of their predicate) should also be represented by their own node. However, this node lacks the FORM and LEMMA fields and is therefore an abstract node. Abstract nodes should appear after the node from which they inherit their features and should have a special ID of the form X.1, X.2, etc.
The most common use-case of abstract nodes is when pronouns are dropped. For example, in Basque, the UD nodes:
4 ziurtatu ziurtatu VERB _ Aspect=Perf|VerbForm=Part 0 root _ _
5 zuten edun AUX _ Mood=Ind|Number[abs]=Sing|Number[erg]=Plur|Person[abs]=3|Person[erg]=3|Tense=Past|VerbForm=Fin 4 aux _ ReconstructedLemma=Yes
should be tagged as:
4 ziurtatu ziurtatu VERB _ Aspect=Perf|VerbForm=Part 0 root _ _ Aspect=Perf|Mood=Ind|Tense=Past|VerbForm=Fin
5 zuten edun AUX _ Mood=Ind|Number[abs]=Sing|Number[erg]=Plur|Person[abs]=3|Person[erg]=3|Tense=Past|VerbForm=Fin 4 aux _ ReconstructedLemma=Yes _
5.1 _ _ _ _ _ 4 nsubj _ _ Case=Erg|Number=Plur|Person=3
5.2 _ _ _ _ _ 4 obj _ _ Case=Abs|Number=Sing|Person=3
Note that node 5 now doesn't have MS-feats (the last column) and it will therefore be dropped from the MS tree.
This example underlines that abstract nodes may be viewed as a replacement for feature layering. The advantage of this mechanism is that it equates the representation of agreement morphemes, clitics and full pronouns, and removes the need to decide which is which.
The same mechanism is used whenever an argument is missing from the clause as an independent word but is expressed by other means; it is not used when an argument is dropped for pragmatic reasons or is otherwise not detectable from the surface forms. For example, the annotation of the Japanese sentence 宣言したのだ ("(he) proclaimed") should not contain an abstract node for the non-existent subject, although one is understood.
Abstract nodes are also to be used when the argument is outside the clause.
Abstract nodes are also to be used in simple gaps, when there are function words referring to some missing argument. For example, a phrase like books to choose from, should be annotated as:
4 books book NOUN NN Number=Plur 2 obj _ _ Number=Plur
5 to to PART TO _ 6 mark _ _ _
6 choose choose VERB VB VerbForm=Inf 4 acl _ _ VerbForm=Inf
7 from from ADP IN _ 6 obl _ _ _
7.1 _ _ _ _ _ 6 obl _ _ Case=Abl
So node 7.1 is created to carry the feature of the function word from.
Note that this strategy is not suitable when the missing element has non-missing arguments, for example in the phrase Jon ate bananas and Mary apples. In these cases, usually characterized by the `orphan` DEPREL, the addition of an abstract node would require adjusting the HEAD column, and this is beyond the current scope of this campaign, in which we only add an MS-FEATS column at the end of each line. Our suggestion is therefore not to tag sentences with the `orphan` DEPREL, at least not for this shared task.
- create a new column
- decide what your content words and your function words are (by UPOS and incoming and outgoing deprels)
- go through your function words and classify them according to UPOS, relation, and possibly lemma
- for each of these categories, figure out the morphosyntactic features and place them on the head content word (see the sketch below)
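A minimal, self-contained sketch of these four steps is given below. The token dicts, the CONTENT_UPOS set and the toy feature maps are illustrative assumptions only, standing in for the full English rules described next:

```python
# Toy classification and feature maps; real conversions need the full
# per-language rules (e.g. eng_relations.py for English).
CONTENT_UPOS = {"NOUN", "PROPN", "PRON", "NUM", "VERB", "ADJ", "ADV"}
AUX_FEATS = {"will": {"Tense": "Fut"}, "not": {"Polarity": "Neg"}}
CASE_FEATS = {"until": {"Case": "Ttr"}, "from": {"Case": "Abl"}}

def add_ms_feats(sentence):
    # steps 1+2: create the new column and split content from function words
    ms = {t["id"]: ({} if t["upos"] in CONTENT_UPOS else None) for t in sentence}
    for t in sentence:
        # step 3: classify function words by deprel and lemma
        if t["deprel"] == "aux":
            feats = AUX_FEATS.get(t["lemma"], {})
        elif t["deprel"] in ("case", "mark", "cc"):
            feats = CASE_FEATS.get(t["lemma"], {})
        else:
            continue
        # step 4: place the resulting features on the head content word
        if ms.get(t["head"]) is not None:
            ms[t["head"]].update(feats)
    return ms

sent = [{"id": 1, "lemma": "will", "upos": "AUX", "head": 2, "deprel": "aux"},
        {"id": 2, "lemma": "go", "upos": "VERB", "head": 0, "deprel": "root"}]
print(add_ms_feats(sent))   # {1: None, 2: {'Tense': 'Fut'}}
```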
- parataxis, reparandum and punct relations are left alone
- fixed relation: the words are combined into a temporary lemma to look up in the relevant map in 'eng_relations.py'
- nominal lemmas: NOUN, PROPN, PRON, NUM; verbal lemmas: VERB
- keep existing features for all nominals; for verbs they are not kept, because the Tense may be wrong
- aux-children: all AUX and PART children that are not "'s"
- Mood: default Ind, or Int if an aux-child comes before a subj-child and (any child is a "?" or the annotator so decides), or Cnd if the annotator so decides
- Polarity: default Pos, Neg if there is a "not" child
- VerbForm: default Fin, except if there is a "to" aux-child
- if there is only one "do"-child, copy its tense to the verb
if there is a "be"-child
- if there is one "be"-child of a verb:
- if the head verb is VerbForm Ger, or VerbForm Part and Tense Pres, set Aspect to Prog
- if the head verb is VerbForm Part and Tense Past, set Voice to Pass
- if there are two "be"-children, set Aspect to Prog, and if it's a verb, set Voice to Pass otherwise Voice Active if there are no auxiliaries left, copy the remaining TAM feats from the "higher" auxiliary if aux are not only be and not, and the VerbForm is not inf: the higher be, if there were two, is the first one that does not end in -ing copy Tense from higher be if exists if the head is not a verb, copy Mood and VerbForm from higher be if exists
if there is a "get"-child, set Voice to Pass, and if aux lemmas do not contain anything other than get and not, copy Tense and VerbForm from the "get"-child
if there is a "have"-child
- if the head is not a verb or the head has VerbForm Part and Tense Past: add Perf to Aspect
- if there are no auxiliaries left, copy the remaining feats from the have
if there is a "will"-child, set Tense to Fut
if there is a "would"-child, let the annotator decide:
- Mood Conditional (conditional)
- Tense Past and add Prosp to Aspect (future in the past)
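A partial Python sketch of the "be"/"get"/"have"/"will" rules above follows. It assumes `head_feats` holds the head's UD feature dict and `aux_lemmas` the lemmas of its aux-children; do-support, "would" and the higher-be subtleties are left to the notes above:

```python
def tam_feats(head_feats, aux_lemmas):
    ms = {}
    past_part = head_feats.get("VerbForm") == "Part" and head_feats.get("Tense") == "Past"
    pres_part = (head_feats.get("VerbForm") == "Ger" or
                 (head_feats.get("VerbForm") == "Part" and head_feats.get("Tense") == "Pres"))
    n_be = aux_lemmas.count("be")
    if n_be == 1 and pres_part:
        ms["Aspect"] = "Prog"          # e.g. "is walking"
    elif n_be == 1 and past_part:
        ms["Voice"] = "Pass"           # e.g. "was walked"
    elif n_be == 2:
        ms["Aspect"] = "Prog"          # e.g. "is being walked"
        ms["Voice"] = "Pass"
    if "get" in aux_lemmas:
        ms["Voice"] = "Pass"           # get-passive
    if "have" in aux_lemmas and past_part:
        # add Perf, keeping Prog if already set (yielding e.g. Perf;Prog)
        ms["Aspect"] = ";".join(sorted({"Perf", *ms.get("Aspect", "").split(";")} - {""}))
    if "will" in aux_lemmas:
        ms["Tense"] = "Fut"
    return ms

print(tam_feats({"VerbForm": "Part", "Tense": "Past"}, ["have", "be"]))
# {'Voice': 'Pass', 'Aspect': 'Perf'}   e.g. "has been walked"
```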
- Modality (see the sketch after this list):
  - can --> Pot
  - could --> Pot, plus an annotator decision between adding Cnd to Mood or setting Tense to Past
  - may or might --> Prms
  - shall or should --> Des
  - must --> Nec
  - not --> set the previous modality value to not(modality), and delete the Polarity feature
  - concatenate all modalities
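A small sketch of the modality lookup above. The annotator-decision cases (e.g. could) are only partly covered, and the assumption that the concatenated values make up a Mood value follows the not(Pot) notation introduced earlier:

```python
MODALITY = {"can": "Pot", "could": "Pot", "may": "Prms", "might": "Prms",
            "shall": "Des", "should": "Des", "must": "Nec"}

def modal_mood(modal_lemmas):
    """Combine the modal (and 'not') aux lemmas of a verb into one value."""
    values = []
    for lemma in modal_lemmas:
        if lemma == "not" and values:
            values[-1] = "not({})".format(values[-1])   # negate the previous modality
        elif lemma in MODALITY:
            values.append(MODALITY[lemma])
    return ";".join(values)

print(modal_mood(["can", "not"]))   # not(Pot)
print(modal_mood(["may"]))          # Prms
```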
- Case features: relevant are all children with a case, mark, or cc relation, or whose lemma is in case_feat_map. Remove all children that are PART, unless they are "'s". The case features are then looked up in the feature map.
- for verbs:
  - copy existing features if they have not been set
  - set Voice to Act if not otherwise set
  - if a finite verb has no nsubj child, create an abstract nsubj node; all UD slots will be empty except for DEPREL and HEAD, and its Number, Person and Gender are taken from the available agreement features (e.g. on the verb or its auxiliaries, as in the Basque example above)
- for nominals, adjectives and adverbs:
  - assign features for the determiner, if any (see the sketch after this list):
    - a: Definite Ind
    - the: Definite Def
    - another: Definite Ind
    - no: Definite Ind and Polarity Neg
    - this: Dem Prox
    - that: Dem Dist
  - for adjectives and adverbs:
    - if there is a 'more' child, set Degree to Cmp
    - if there is a 'most' child, set Degree to Sup
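A sketch of the determiner and degree lookups listed above; lemmas outside these toy maps simply contribute nothing in this simplified version:

```python
DET_FEATS = {
    "a": {"Definite": "Ind"},
    "the": {"Definite": "Def"},
    "another": {"Definite": "Ind"},
    "no": {"Definite": "Ind", "Polarity": "Neg"},
    "this": {"Dem": "Prox"},
    "that": {"Dem": "Dist"},
}
DEGREE_FEATS = {"more": {"Degree": "Cmp"}, "most": {"Degree": "Sup"}}

def nominal_feats(child_lemmas):
    """Collect features contributed by determiner and degree children."""
    feats = {}
    for lemma in child_lemmas:
        feats.update(DET_FEATS.get(lemma, {}))
        feats.update(DEGREE_FEATS.get(lemma, {}))
    return feats

print(nominal_feats(["no"]))     # {'Definite': 'Ind', 'Polarity': 'Neg'}
print(nominal_feats(["most"]))   # {'Degree': 'Sup'}
```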
Footnotes

1. The feature set is unordered in theory, but in practice the features are ordered alphabetically by feature name, just to keep the annotations consistent.
2. This is in contrast with the verb yürümebilir (literally "he is able to not walk", i.e., he may not walk), where the negation pertains to the verb itself and which should therefore be tagged as Mood=Pot|Polarity=Neg.
3. Compare the word then in the sentence if you want, then I'll do it (functional) to the same word in I didn't know what to do, then I understood (where then stands for "after some time" and is hence contentful).
4. In most languages, content nodes are equivalent to words. However, in some noun-incorporating languages, open-class nouns can appear as morphemes concatenated to another content node, the verb.