ucto creates invalid folia #77

Open
kosloot opened this issue Jan 27, 2020 · 2 comments

@kosloot
Contributor

kosloot commented Jan 27, 2020

Given the attached file issue77.xml.txt, ucto creates invalid FoLiA: UIT.xml.txt
The command was:
ucto --passthru issue77.xml UIT.xml

>foliavalidator UIT.xml 
VALIDATION ERROR on full parse by library (stage 2/3), in UIT.xml
ParseError: FoLiA exception in handling of <s> @ line 47 (in parent <p> @ parent line 44) : [DeclarationError] Processor ucto.1 is used for annotationtype SENTENCE, set None, but has no corresponding <annotator> referring to it from the annotations declaration block!

SIDENOTE: folialint doesn't complain about this file. That is reported separately as LanguageMachines/libfolia#42

issue77.xml.txt
UIT.xml.txt

@kosloot
Contributor Author

kosloot commented Jan 28, 2020

I think there are several issues here.

  1. When using passthru, it may not be correct for ucto to assign a Sentence and Words to the second paragraph. @proycon what should --passthru do here? The documentation states:
    Don't tokenize, but perform input decoding and simple token role detection
  2. But a similar problem arises when we use ucto -Lnld issue77.xml UIT.xml
    In that case ucto creates a new sentence with processor ucto.1 but uses the old sentence-annotation from the input. It should add an extra sentence-annotation referring to ucto.1.
    If the answer to 1. is 'OK, just add a sentence and a word', then the same would hold when using the "passthru" set.
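
To make the mismatch in point 2 concrete, here is a minimal Python sketch of the kind of check the validator error describes: a processor used on an `<s>` element must also be named by an `<annotator>` inside the `sentence-annotation` declaration. The `SAMPLE` document and the function name `undeclared_sentence_processors` are hypothetical and heavily simplified (no `xml:id` attributes, no sets, no provenance block), so this is an illustration under those assumptions, not what foliavalidator or ucto actually do.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"

# Hypothetical, stripped-down fragment (NOT the attached issue77.xml and
# not valid FoLiA): the declaration names processor "p0", but the
# sentence was created by "ucto.1".
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0">
  <metadata>
    <annotations>
      <sentence-annotation>
        <annotator processor="p0"/>
      </sentence-annotation>
    </annotations>
  </metadata>
  <text>
    <p>
      <s processor="ucto.1"/>
    </p>
  </text>
</FoLiA>"""


def undeclared_sentence_processors(xml_text):
    """Return processors used on <s> elements that no <annotator> in a
    sentence-annotation declaration refers to (sets are ignored here)."""
    root = ET.fromstring(xml_text)
    ns = {"f": FOLIA_NS}
    declared = {
        annotator.get("processor")
        for annotator in root.findall(
            "f:metadata/f:annotations/f:sentence-annotation/f:annotator", ns
        )
    }
    used = {
        s.get("processor")
        for s in root.iter("{%s}s" % FOLIA_NS)
        if s.get("processor")
    }
    return sorted(used - declared)


print(undeclared_sentence_processors(SAMPLE))
```

Adding an `<annotator processor="ucto.1"/>` to the declaration makes the returned list empty, which corresponds to the "extra sentence-annotation" fix suggested above.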

@kosloot
Contributor Author

kosloot commented Feb 5, 2020

Point 2 is (for now) resolved by 'adopting' the annotations already present in the input. This produces correct FoLiA, but the question remains whether this is the best solution.

Maybe we should reject such input. But there are use cases where annotations are declared (and sometimes NOT used at all).
We could also make ucto assign its own segmentation set in such cases. But this also has some troublesome consequences.

For now I suggest we stick with this half-baked solution, though I am a bit worried about it.
