respiranto edited this page Sep 20, 2020 · 16 revisions

Discussion on (FreeDict) TEI

Notes:

  • Points 1-9 are directly taken from an ML thread, numbered A1-A9 there.
    • Points 10-X also stem from around that thread.

1) TEI Lex-0.

  • Questions

    • What about the TEI Lex-0 standard?
    • Should it be followed?
  • Examples

    • a) <gram type="gender"/> instead of <gen/>.
    • b) <usg> with @type (and possibly @norm)
  • Potential advantages

    • good, fixed list of usg types (see this comparison table)
      • The useful @types textType and attribute have no equivalents in the TEI Guidelines' suggested values.
        • textType examples: bibl., poet., admin., journalese
        • attribute examples: derog., euph.
    • Requirement to fully annotate with @xml:id and @xml:lang
  • Further questions:

    • Should textType and attribute just be borrowed from TEI Lex-0?
    • Where to annotate with @xml:id and @xml:lang?
  • Answers

    • The FreeDict conversion style sheets do not support TEI Lex-0. (FreeDict TEI is in parts incompatible with TEI Lex-0)
    • "It all boils down to somebody reading the document, defining our specific requirements and potentially modification and implementing it." / @shumenda
    • The TEI Lex-0 guidelines may be used in addition wherever they do not contradict the FreeDict or TEI guidelines.
    • TEI Lex-0 is meant to encode retrodigitized dictionaries including presentational information, while FreeDict TEI is not concerned with such.
    • Consider switching someday to another (related) standard: ISO LMF-4
  • See also: this thread on the mailing list.
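
For illustration, examples a) and b) above might look as follows. This is only a sketch, not taken from the thread: the element contents ("n", "f", "poet.") are assumptions chosen for the example, and the use of a typed <gram> for the part of speech follows TEI Lex-0 as far as understood here.

```xml
<!-- Style currently used in FreeDict TEI -->
<gramGrp>
  <pos>n</pos>
  <gen>f</gen>
</gramGrp>

<!-- a) TEI Lex-0 style: a typed <gram> instead of <gen> -->
<gramGrp>
  <gram type="pos">n</gram>
  <gram type="gender">f</gram>
</gramGrp>

<!-- b) <usg> with @type, here the textType value discussed above -->
<usg type="textType">poet.</usg>
```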

2) Verb & Transitivity annotation.

  • Status quo

    • In a HowTo, it is suggested to use v,vt,vi,vti, i.e., merge all such information into a single token.
    • In an example, there is "vtr", which would also adhere to TEI Lex-0, in contrast to the former.
  • Questions: How to annotate transitivity information?

  • Answer: The use of subc is strongly recommended.
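
A minimal sketch of what the subc recommendation could look like for a transitive verb. The values "v" and "trans" are assumptions for the example (compare "v/trans" in point 23); the thread does not fix a vocabulary here.

```xml
<gramGrp>
  <!-- part of speech and transitivity kept in separate elements,
       instead of a merged token such as "vt" -->
  <pos>v</pos>
  <subc>trans</subc>
</gramGrp>
```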

3) IPA Pronunciation.

  • Question: How can I enrich my dictionary with pronunciation, as annotated in <pron> tags?

  • Answer: Unless already present, phonetic information is added by the standard build process (using make) via the teiaddphonetics script, which internally uses espeak[-ng].
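
For reference, a hand-written sketch of a form carrying a <pron> element. The headword and transcription are assumptions for the example, not actual output of the script.

```xml
<form>
  <orth>house</orth>
  <!-- IPA transcription; expected to be generated during the build if absent -->
  <pron>haʊs</pron>
</form>
```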

4) Normalization of usage annotations

  • Question: Should usage annotations (the content of <usg> tags) be normalized?

    • different languages (e.g. "[Sprw.]" ~ "[prov.]")
    • same language (e.g. "[coll.]" ~ "[slang]")
  • Notes:

    • Recommended by TEI Lex-0.
    • The use of @norm in <usg> might render this less of an issue.
  • Sub-questions

  • Answers

    • An ontology should be defined.
      • Questions:
        • Similar to / linked to shared/FreeDict_ontology.xml?
          • This seems to only allow linking equivalent annotations in different languages, however not "coll." and "slang" (if these should even be considered equivalent).
        • Where to find documentation on writing such an ontology?
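
As a sketch of the @norm idea mentioned in the notes: the element content keeps the source's wording, while @norm carries a normalized label. Which normalized labels to use is exactly the open question above, so "prov." is only an assumed value.

```xml
<!-- German source label, normalized to an assumed common label -->
<usg norm="prov.">Sprw.</usg>
<!-- English source label that already matches the assumed common label -->
<usg norm="prov.">prov.</usg>
```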

5) Quantified (or similar) usage annotations

  • Examples

    • "mainly Am."
    • "bes. Süddt.", "especially Am."
  • Question

    • How to represent the determiner ("mainly", "bes.", ...)?
  • Notes

    • TEI Lex-0 suggests a separate attribute, but not which (there is a TODO in the doc).
      • None of the <usg> annotations really fit, maybe @subtype?
  • Answer

    • Likely the easiest: <usg type="hint">mainly Am.</usg>

6) Regional / dialect / language annotations.

  • classes of such annotation

    • a) dialect
      • Ex.: "[Br.]", "[Am.]", "[Ös.]", "[Sächs.]"
      • distinction from b) partially unclear (e.g., "Am.")
    • b) Region or country
      • Ex.: "[South Africa]", "[Hessen]", "[Berlin]", "[Wien]"
    • c) Language
      • Ex.: "[French]", "[Lat.]"
  • Questions

    • How to annotate/distinguish the above classes?
  • Notes

    • TEI Lex-0: usg[@type="geographic"]: "marker which identifies the place or region where a lexical unit is mainly used"
      • Matches b), potentially partly a).
  • Answers

    • a), b): usg[@type="geo"]
    • c): usg[@type="lang"]
    • Alternatively: Craft a new type and document it in the header (usg types may be freely chosen according to the TEI Guidelines).
      • Also consider adopting such a new type in the FreeDict guidelines.
      • Use plain text but name the tag and attribute name explicitly.
    • Consider using a list of languages (e.g., this).
  • Example of notes in the header (<p> elements and lists are both fine):

<notesStmt>
  <note type="status">small</note> <!-- mandatory for our DB -->
  <note xml:lang="de"> <!-- can be freely chosen -->
    <list><item>blah</item></list>
  </note>
</notesStmt>
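
The answers for classes a) to c) could be sketched as below. The annotation texts are taken from the examples in this point, with the brackets dropped as presentational (see point 12).

```xml
<!-- a) dialect and b) region or country -->
<usg type="geo">Am.</usg>
<usg type="geo">Wien</usg>

<!-- c) language -->
<usg type="lang">Lat.</usg>
```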

7) Abbreviations.

  • Cases

    • a) Headwords which are abbreviations.
      • rare
    • b) Annotated on headwords.
  • Question: How to represent in TEI?

  • Notes

    • The TEI Guidelines contain an example with both <form type="abbrev"> and <form type="full">, in the same <entry>.
    • The TEI Guidelines also offer <abbr> and <expan>, possibly grouped in <choice>.
      • These seem to be rather intended for encoding inside of prose.
  • Answers

    • An entry should only contain a single form tag.
    • An entry/form may contain a nested form[@type="abbrev"] element.
    • In the case of a standalone abbreviation, the corresponding form element right below entry should be annotated with @type="abbrev".
      • potential issue: Shouldn't the topmost form elements have @type="lemma"?
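
A sketch of the two answers above, under the assumption that translations are encoded as cit[@type="trans"] as elsewhere on this page. The headwords and translations are made up for the example.

```xml
<!-- b) abbreviation annotated on a headword: nested form -->
<entry>
  <form>
    <orth>United Nations</orth>
    <form type="abbrev">
      <orth>UN</orth>
    </form>
  </form>
  <sense>
    <cit type="trans"><quote>Vereinte Nationen</quote></cit>
  </sense>
</entry>

<!-- a) standalone abbreviation: @type on the form right below entry -->
<entry>
  <form type="abbrev">
    <orth>UN</orth>
  </form>
  <sense>
    <cit type="trans"><quote>UNO</quote></cit>
  </sense>
</entry>
```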

8) entry/sense/gramGrp vs entry/gramGrp

  • Answer: Both are fine (also in parallel).
    • Consider putting gramGrp inside form when it also occurs in sense.

9) Header

9.1) fileDesc/publicationStmt/license

  • Question: Currently <availability> is suggested and used exclusively (for licensing information). Why not <license>?

  • Answer: The style sheets do not permit <license>; validation would hence fail.

    • Consider changing this in a future style sheet update.
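
A minimal sketch of licensing information given via <availability>. The publisher and license text are placeholders, not a statement about any actual dictionary.

```xml
<publicationStmt>
  <publisher>FreeDict</publisher>
  <availability status="free">
    <!-- placeholder license statement; replace with the dictionary's actual license -->
    <p>Licensed under the GNU General Public License, version 3 or later.</p>
  </availability>
</publicationStmt>
```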

9.2) Date of imported source

  • Q: Where to annotate a date specific to the source that the final TEI was imported from?

  • A: Annotate within sourceDesc.

    • Q: As plain text?

9.3) fileDesc/publicationStmt/pubPlace

  • HowTo: <ref>https://freedict.org/</ref>

  • (example) TEI: <ref target="http://freedict.org/">http://freedict.org/</ref>

  • A: The HowTo is right.

9.6) [imported dictionaries] fileDesc/editionStmt/edition (version)

  • Question: What to use when the TEI output is both influenced by a source's version and an importer's version?

  • Answers

    • Whatever works or seems logical.
    • Options: srcver.importerver | date | srcver | srcver.date

9.7) [imported dictionaries] fileDesc/titleStmt/editor

  • Q: Set author of importer as editor?
    • TEI Guidelines: "[...] acting as editor, compiler, translator, etc."
  • A: Permitted.

10) Q: Should xr/ref have content, or can a @target suffice?

A: Content!
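
In other words, something like the following sketch, where the @type value and the target id are assumptions for the example:

```xml
<xr type="syn">
  <!-- the reference carries readable content, not only a @target -->
  <ref target="#house">house</ref>
</xr>
```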

11) Grouping of homographs

  • Options:

    • superEntry/entry
    • entry/sense
    • entry/hom
    • entry/entry
      • illegal in (FreeDict) TEI, suggested in TEI Lex-0.
  • Q: Is superEntry ok?

    • A: No. "It doesn't seem necessary at all and is on its way out, in general." / @bansp
    • A: Not handled by stylesheets. Also, hom is ignored.
  • Q: [imported dictionaries] What if it is not clear from the source whether two homographs qualify as senses of the same word?

    • Note: The "Ding" dictionary contains many words repeatedly, usually with (close to) identical meaning.

12) Presentational information

  • Examples: "{v}" - the braces, ";", "~" - for references
  • A: Drop

13) "to" prefix for verbs

  • A: Drop.

14) Multiple genders

  • Ex.: "Avis {m,n}" (German)
  • A: Two <gen> in a single gramGrp.
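
For the "Avis {m,n}" example, this could be sketched as follows. The <pos> element and the value spellings are assumptions for the example.

```xml
<gramGrp>
  <pos>n</pos>
  <!-- one <gen> per gender -->
  <gen>m</gen>
  <gen>n</gen>
</gramGrp>
```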

15) collocates' case

  • How to encode "{+Gen.}", indicating that an object in the genitive case should follow?
  • Option: <colloc>[+ Gen.]</colloc> (where "Gen." might be changed to something else)
    • Derived from TEI Lex-0
    • [] is not very nice.
    • Likely use a non-language-specific case-abbreviation (i.e., "gen")

16) Encoding of plain text annotations on headwords (and translations)

  • Examples:

    • "bread (baked in an oven)"
    • "bread (wheat product)"
  • Options:

    • <note>
    • <usg> -- @type="hint"?
      • Usually used for more specific usages, e.g. "Am.", "med.".

17) Collocates' encoding

  • Cases:

    • a) case information: "wegen {+Gen.}"
      • see 15) above
    • b) auxiliary words representing an object
      • b.1) suffixing: "eat sth."
      • b.2) prefixing: "etw. essen"
      • b.3) alternatives: "notify sth./sb."
      • b.4) several: "give sth. to sb."
        • potentially both prefixing and suffixing
    • c) specific word(s)
      • c.1) suffixing: "dismounting (of a machine)"
      • c.2) prefixing
      • c.3) combinations
    • d) combinations of a), b), c)
  • Available tags

    • <colloc> (occurs in <gramGrp>)
    • <usg type="colloc">
    • attribute @subtype="left"?
  • Answers

    • For a), see 15) above.
  • Proposed answers:

    • b): <colloc>. This is grammar information.
    • c): <usg type="colloc">. This is not grammar information.
    • location: @subtype="left" resp. "right".
    • order: keep both <colloc> and <usg type="colloc"> in the original order.
      • Keeping the order of the union of both is impossible with the given suggestion, but things like "(of a machine)" are supposed to be optional anyways.
    • b.3) (alternatives)
      • i) group in <choice> or similar.
      • ii-iv) see below
      • iii) conflicts with several subsequent <colloc>s
<form><!-- ii) -->
  <orth>notify</orth>
  <gramGrp><colloc>sth.</colloc></gramGrp>
  <form type="alternate">
     <orth>notify</orth>
     <gramGrp><colloc>sb.</colloc></gramGrp>
  </form>
</form>
<!-- OR iii) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth.</colloc>
    <colloc>sb.</colloc>
  </gramGrp>
</form>
<!-- OR iv) -->
<form>
  <orth>notify</orth>
  <gramGrp>
    <colloc>sth./sb.</colloc>
  </gramGrp>
</form>
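
The proposed answers for b), c) and location could be sketched as below. This only illustrates the proposal: whether @subtype validates on both elements against the FreeDict schema has not been checked here, and where the <usg> should be placed is left open.

```xml
<!-- b.2) prefixing auxiliary word, "etw. essen": grammar information -->
<gramGrp>
  <colloc subtype="left">etw.</colloc>
</gramGrp>

<!-- c.1) suffixing specific words, "dismounting (of a machine)": not grammar information -->
<usg type="colloc" subtype="right">of a machine</usg>
```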

18) Grouping of annotations

  • Consider "[formal/Am.]" vs. "[formal] [Am.]".

    • The former indicates a disjunction, the latter a conjunction of the two annotations.
    • Also possible with grammar annotations.
  • Q: How to differentiate?

  • Options:

    • a) Don't.
    • b) For grammar annotations: Several gramGrps.
    • c) Literal retaining of the slash (or similar separator).
      • May prevent setting a common @type (such as in the example above).
    • d) Something like <choice> for disjunctions.

19) Q: Which content should grammar elements have?

  • Options
    • Short English forms from shared/FreeDict_ontology.xml
    • Anything, but link to that ontology, as done in eng-pol.tei.

20) Alternatives in a headword or translation (</>)

  • Example: "biological breakdown/degradation"

  • Q: How to encode

  • Options:

    • literally
    • derive two distinct headwords/translations
      • headwords:
        • link with xr/ref
        • sub-form with @type="alternate" or similar.
      • translations: separate cit elements
    • Something else (e.g. something like choice)
      • likely only an option for translations.
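
For translations, the "two distinct translations" option applied to the example above would be, as a sketch:

```xml
<sense>
  <!-- "biological breakdown/degradation" split into two translations -->
  <cit type="trans"><quote>biological breakdown</quote></cit>
  <cit type="trans"><quote>biological degradation</quote></cit>
</sense>
```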

21) Q: May/should translations have examples?

  • It's common to have an example for a headword, together with a translation.

    • Question is, what about examples particular to the translation.
  • Likely realisation: <cit type="trans"><quote /><cit type="example" /></cit>
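
Spelled out, the likely realisation might look like this. The German translation and example phrase are made up for illustration.

```xml
<cit type="trans">
  <quote>Brot</quote>
  <!-- example particular to the translation, nested inside it -->
  <cit type="example">
    <quote>ein Laib Brot</quote>
  </cit>
</cit>
```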

22) Q: Are entries without translations permitted?

  • A: Only if they contain any information within a sense, such as a reference (<ref>).
    • gramGrp or inflected forms alone are insufficient.

23) Q: What about several subc?

  • Cases

    • a) same main part: "v/trans" + "v/intr"
      • Example: "essen {vt;vi}"
    • b) different main part (awkward): "v/trans" + "pron/rel"
  • Options

    • a.1) One pos followed by several subc.
    • *.2) Two pairs of pos, subc
    • *.3) two gramGrp
    • *.4) only (two) pos, content e.g. "vt".
    • a.5) trans/intr
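
To make the options easier to compare, here is a sketch of a.1) and *.3) for "essen {vt;vi}". The values "v", "trans" and "intr" are assumptions for the example.

```xml
<!-- a.1) one pos followed by several subc -->
<gramGrp>
  <pos>v</pos>
  <subc>trans</subc>
  <subc>intr</subc>
</gramGrp>

<!-- *.3) two gramGrp, each with its own pos and subc -->
<gramGrp>
  <pos>v</pos>
  <subc>trans</subc>
</gramGrp>
<gramGrp>
  <pos>v</pos>
  <subc>intr</subc>
</gramGrp>
```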

24) form @type: infl vs. inflected

  • Status quo
    • ML, Wiki, lg1-lg2.tei: infl
    • TEI Guidelines, TEI Lex-0: inflected

25.1) usg inside form[@type="inflected"]?

25.2) usg inside colloc?

26) Inflected forms for translations?

  • Q: Should inflected forms only be annotated on the source side (entries), or also on translations (cit[@type="trans"])?

    • If so, how?
  • A: Likely yes.

    • How: Unsure.