Running SLTev on ESIC (interpretation corpus) #67
Comments
Hi Barry, when you want to evaluate SLT files (MT with timestamps), you need the OStt (OSt with timestamps) and reference files. The number of complete segments in the OStt and the reference must be equal.
Yes, it means candidates (MT, SLT, ...) can have a different segmentation from the reference. But in this case, the OStt and the reference have different segmentations, which is not correct. Please note that the OStt and reference files are gold. If possible, please share some of your evaluation files with me so I can help further. Thanks,
Hi Mohammed, do the OStt and reference files need to match for the latency and flicker calculations? BLEU could use docAsWhole. I'm attaching the files below (I had to give them .txt extensions).
OStt: en.OSt.man.orto.txt.OStt.txt
slt: en.OSt.man.orto.txt.slt.txt
best
Hi, there are two ways to solve this problem: what do you think about them? Which one is better? Also, I am going to add some scripts and the main points for converting candidate types to each other. For example, I am going to add the following main points: what do you think about them? Are they useful? Best,
Hi Mohammed, so for solution 1, this would just evaluate the output as MT? If we converted each document to a single segment, then this would work, since the corpus is document-aligned. But we could then just do this directly with sacrebleu? I think solution 2 is more useful, but that means we have to solve how to define flicker and latency when the OStt C-segments and the reference segments do not match, is that correct? I am not sure what the solution is; I would have to think about it. For the conversion tools, is the MT-to-SLT tool similar to what I asked about in #33? Yes, that would be useful. I think @sukantasen has a script for that. The other direction is less important. best
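For illustration, here is a minimal sketch of what an MT-to-SLT converter could look like. It assumes a time-stamped candidate format that marks complete segments with lines of the form `C <start> <end> <text>`; both that line format and the fixed per-segment duration are assumptions for this sketch, not the actual SLTev specification.

```python
def mt_to_slt(mt_segments, seg_duration=3.0):
    """Convert plain MT segments to pseudo-SLT lines by attaching
    synthetic timestamps (seg_duration seconds per segment).

    The 'C <start> <end> <text>' line format is an assumption here,
    not the confirmed SLTev format.
    """
    slt_lines = []
    start = 0.0
    for seg in mt_segments:
        end = start + seg_duration
        slt_lines.append(f"C {start:.3f} {end:.3f} {seg}")
        start = end
    return slt_lines
```

Synthetic timestamps like these would make the file parseable but would of course carry no real latency information, so they would only be meaningful for quality scores.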
Hi Barry,
I have updated SLTev to address this issue; you can upgrade to version v1.2.2 (pip install --upgrade SLTev).
I will add them in the next version. Best,
Hi Mohammed, that sounds good! For "docAsWhole", does that mean that the whole test corpus is treated as a single document? Is there any way to tell SLTev that the corpus is made up of a number of smaller documents? best
Hi Barry,
Yes, it concatenates all the test corpus segments into one document.
I do not understand your meaning exactly. But in the second BLEU score calculation (using MWERsegmenter), SLTev resegments the candidate segments according to the reference segments. There is no way to tell SLTev that the corpus is made up of multiple documents. Best,
This is what I thought.
I think using a blank line makes SLTev complex. I think you want to get scores for each document separately, yes? Proposed idea: for example, suppose there are three files, test.slt, test.ostt, and test.ref, and test.ref contains 3 documents (e.g. [1,2,3]). step1: step2: What do you think about this idea?
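The splitting step behind this proposal could be sketched roughly as follows. The `<doc-boundary>` marker name is purely hypothetical, chosen here only to illustrate splitting one multi-document file into per-document parts; the actual separator token SLTev settles on may differ.

```python
def split_documents(lines, marker="<doc-boundary>"):
    """Split a flat list of segment lines into per-document lists,
    using `marker` lines as boundaries.

    The marker name is hypothetical; it stands in for whatever
    special token is chosen to separate documents.
    """
    docs, current = [], []
    for line in lines:
        if line.strip() == marker:
            if current:
                docs.append(current)
            current = []
        else:
            current.append(line)
    if current:
        docs.append(current)
    return docs
```

Each of the resulting per-document segment lists could then be written to its own temporary file and evaluated separately, which is essentially the two-step idea described above.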
Hi Mohammed, the dev set of the ESIC corpus is aligned at the document level, but not at the sentence level. It contains 28 documents. It is a corpus of interpretation, so it should be ideal for evaluating simultaneous SLT, and so we want to use SLTev. Evaluating all 28 documents separately seems quite awkward for the user. Adding a special token between documents could work. best
Hi Barry, could you please share an example of the dev set with me?
What scores would you expect to evaluate? The next idea (for calculating the quality score): for example, if there is a file with 28 documents. Best,
Hi Mohammed, I linked to a document above, and it's from the ESIC corpus: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3719
Yes, this seems fine. We could prepare the data this way, or SLTev could convert it internally. But I am not sure why calculating delay is complex? Each of the 28 documents is a speech, and there is an OStt file with timestamps for each speech. best
Hi Barry,
Thanks.
SLTev will convert them internally. Let me summarise the messages above; if you agree with them, please confirm and I will start to implement them. A. Adding a file converter to SLTev
B. Adding a multi-document support module
Thanks,
Hi Mohammad, this sounds good. Just a couple of questions:
What is this needed for?
In our case, the timestamps start from 0 for each document, so will this work as normal? best
Hi,
Yes, SLTeval needs the OStt files for evaluation, and in some cases there is no OStt file, so we need to convert the SLT files to MT in order to evaluate them with MTeval.
Unfortunately, they are not normal and we cannot use them as such, so we need to calculate delay scores for each document separately. But just a question: how should we display the delay scores? Do we need to print scores for each document? Best,
Hi Mohammad, I meant that delay will work as normal within each document. I agree that we need to find some way of combining them, and I think the default should be a mean. best
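The mean-based combination suggested here can be sketched in a few lines (unweighted mean; weighting each document by its duration or segment count would be an alternative design choice if longer speeches should count more):

```python
def combine_delays(per_doc_delays):
    """Combine per-document delay scores into one corpus-level score
    using an unweighted mean, the default suggested in this thread."""
    if not per_doc_delays:
        raise ValueError("no documents to combine")
    return sum(per_doc_delays) / len(per_doc_delays)
```

For reporting, printing both the per-document delays and this combined mean would let users see outlier documents without losing a single headline number.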
Hi Barry, thanks for the good discussion and feedback. Thanks,
Hi Barry, I have prepared a multi-document evaluation version of SLTev; please upgrade to version 1.2.3. You can use the following files as samples: docs.ref
docs.ostt
docs.slt
usage:
Thanks,
Great, thanks, we will try it. @sukantasen
Hi
We are trying to test text-to-text translation on the ESIC corpus. The problem is that ESIC is document-aligned, but not sentence-aligned. The documents are segmented, but the number of segments does not match between source and target, so SLTev throws an error. Yet the documentation states that "segmentation can differ from the reference one".
How should this case be handled in SLTev?
best
Barry