Updating offsets when a text resource is altered #26
I see the challenge. This is indeed an important question. If an […]
If a text changes you'd indeed have to transfer annotations from the old […]
A mechanism for finding such identical text selections (i.e. computing […]
Currently, however, this is still too limited for what you want, as it […]
The good news is that I'm already looking into ways to achieve this, […]
That is very interesting. It's good to hear you've done some research into […]
Good to hear you're working on this! Having a solution directly integrated into STAM would be the best. Our solution starts with https://github.com/OpenPecha/Toolkit/blob/master/openpecha/blupdate.py and uses DMP as a backup. We didn't get as far as you with the format architecture, but the fundamentals are very similar: we called text resources "base layers" and annotation datastores "annotation layers" (think of layers in Photoshop), and we save annotations in YAML files rather than JSON. This makes it quite a natural transition for us. Our colleague @eroux from BDRC came up with the CCTV approach in blupdate.py, so I'm sure he'll be interested to hear what you think and whether you have a better solution.
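To make the parallel concrete, here is a minimal, hypothetical sketch of a stand-off "annotation layer" over a "base layer" in Python; it is not OpenPecha's or STAM's actual schema, and all class and field names are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int     # character offset into the base layer text
    end: int       # exclusive end offset
    payload: dict  # e.g. {"type": "NER", "tag": "PERSON"}

@dataclass
class AnnotationLayer:
    base_layer_id: str                               # which text resource the offsets refer to
    annotations: list = field(default_factory=list)

# The base layer is just the raw text; an annotation layer only stores offsets
# into it, so any edit to the base text invalidates those offsets unless they
# are remapped, which is exactly the problem discussed in this issue.
base_text = "The Heart Sutra ..."
layer = AnnotationLayer(
    base_layer_id="heart-sutra-v1",
    annotations=[Annotation(4, 15, {"type": "title"})],  # base_text[4:15] == "Heart Sutra"
)
```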
Ah, thanks, that's good to hear! In our tests, solutions based on the Smith-Waterman or Needleman-Wunsch algorithms don't perform very well, while Myers' diff does, even on pretty large files. There seems to be a Rust rewrite of the library we're using in Python/C at https://crates.io/crates/diffmatchpatch, perhaps that could be a new option in […]
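For reference, here is a minimal sketch, assuming the Python diff-match-patch package, of how a single character offset can be remapped across an edit with diff_main and diff_xIndex; the sample strings are invented for illustration:

```python
# pip install diff-match-patch
from diff_match_patch import diff_match_patch

old = "The swift brown fox jumps over the lazy dog."
new = "The speedy brown fox leaps over the lazy dog!"

dmp = diff_match_patch()
diffs = dmp.diff_main(old, new)  # Myers-style diff: a list of (op, text) tuples

# diff_xIndex maps a location in the old text to the equivalent location in
# the new text, which is essentially the "offset update" problem of this issue.
old_offset = old.index("brown")                    # 10
new_offset = dmp.diff_xIndex(diffs, old_offset)    # 11
print(new[new_offset:new_offset + len("brown")])   # "brown"
```

As far as I can tell, offsets that fall inside a deleted region snap to the position of the deletion in the new text, so annotations on removed passages would still need a fallback strategy.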
Thanks for the link, good to see it's already ported to Rust even. That might indeed make a very good option to implement into […]
One of the main challenges our project faces is that we have multiple copies of the same text resource with varying degrees of cleanliness and annotation. For instance, we may have 50 instances of the Heart Sutra, where the cleanest one has no TOC annotations while a very dirty version has great NER tags. In some cases we might also only have a poor-quality text resource that is being proofread and annotated over the course of a year.
Our goal is to be able to combine the best aspects of all resources and annotations at any given time.
In other words, we see STAM as the pivot format that will link Buddhist data in archives like BDRC, SuttaCentral or CBETA and websites like 84000 and pecha.org, which means that we will have to update, split and merge text resources and annotations on a regular basis.
We are also putting together training datasets for the monlam.ai project, which also requires annotation transfer. For instance, our MT model currently suffers from a lot of typos in our dataset of 2 million aligned sentences, and we need to transfer the segment annotations to the cleaner versions of these texts that we are currently producing.
A couple of years ago, our team came up with an "annotation transfer" or "base text update" mechanism combining our CCTV algorithm with Google's Diff Match Patch package.
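Not knowing the internals of the CCTV step, here is only a rough sketch of such a transfer pass using diff-match-patch alone, assuming annotations are plain (start, end) character spans; transfer_spans and the sample strings are hypothetical:

```python
from diff_match_patch import diff_match_patch

def transfer_spans(old_text, new_text, spans):
    """Remap (start, end) character spans from old_text to new_text.

    Spans whose text was entirely deleted collapse to zero length; those are
    returned separately so a fallback (fuzzy matching, or a CCTV-style
    contextual search) can handle them.
    """
    dmp = diff_match_patch()
    diffs = dmp.diff_main(old_text, new_text)
    dmp.diff_cleanupSemantic(diffs)  # optional: merges edits into more human-sized chunks
    remapped, lost = [], []
    for start, end in spans:
        new_start = dmp.diff_xIndex(diffs, start)
        new_end = dmp.diff_xIndex(diffs, end)
        if new_end > new_start:
            remapped.append((new_start, new_end))
        else:
            lost.append((start, end))
    return remapped, lost

# Hypothetical usage: word segments annotated on a noisy transliteration,
# transferred to a corrected version of the same line.
old = "bodhisatva avalokiteshvara praticing deep prajnaparamita"
new = "bodhisattva avalokiteshvara practicing deep prajnaparamita"
spans = [(0, 10), (11, 26), (27, 36)]  # offsets into the old text
print(transfer_spans(old, new, spans))
```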
What would be your approach to tackle this challenge with STAM?