Combining --multilang and paragraph-level annotations #45

Open
jelmervdl opened this issue Nov 3, 2023 · 0 comments

jelmervdl commented Nov 3, 2023

I've been trying to keep --multilang working (with it, CLD2 splits a document into up to three documents, each with its own language label) while adding classifiers and JSON support. But should we?

Does anyone use --multilang? I think in HPLT we want to avoid breaking up documents, so it won't be used by us.

How is --multilang supposed to work with the --identify-paragraphs option? The current implementation treats each split-off document as a document of its own, so you can only reproduce these stand-off annotations if you use the exact same langid, so that the split happens in exactly the same place. This sounds like a bug to me.

Do we want to keep multilang support when adding other paragraph level annotations, such as the block element name (or tag) that delineated that paragraph? It's a bit more cumbersome to implement since the break-up boundaries of the langid chunks are whatever CLD2 makes them, not the paragraph boundaries that warc2text introduces when parsing HTML.
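
To make the boundary mismatch concrete, here is a minimal sketch. It is a toy model, not warc2text or CLD2 code: the offsets, the `Span` type and the `overlapping_chunks` helper are all made up for illustration. It assumes both paragraphs and langid chunks can be described as half-open character ranges over the same extracted text.

```python
from typing import List, Tuple

# Half-open [start, end) character offsets into the extracted text.
# Hypothetical representation, not what warc2text stores internally.
Span = Tuple[int, int]


def overlapping_chunks(paragraph: Span, chunks: List[Span]) -> List[int]:
    """Return the indices of the langid chunks that overlap a paragraph."""
    p_start, p_end = paragraph
    return [
        i
        for i, (c_start, c_end) in enumerate(chunks)
        if c_start < p_end and p_start < c_end
    ]


# Made-up example: three paragraphs found while parsing the HTML ...
paragraphs: List[Span] = [(0, 40), (40, 70), (70, 100)]
# ... and two language chunks whose boundary falls inside paragraph 1.
chunks: List[Span] = [(0, 55), (55, 100)]

assignment = [overlapping_chunks(p, chunks) for p in paragraphs]

for index, hits in enumerate(assignment):
    note = " (straddles a chunk boundary)" if len(hits) > 1 else ""
    print(f"paragraph {index} -> chunks {hits}{note}")
```

In this toy case paragraph 1 belongs to both chunks, so any per-paragraph annotation (paragraph id, tag name) would have to be either duplicated across the split documents or attached to only one of them, and a different langid would move the boundary and change that outcome.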

Related to #35.
