
different p-values compared to other tools #25

Open
dimitarsh1 opened this issue Jun 18, 2019 · 0 comments

Hi,

I typically run multeval for BLEU and TER but haven't assessed statistical significance so far. Now that I actually need it, I find it (1) difficult to grasp what exactly multeval computes (I checked issue #8, which clarifies somewhat what is going on) and (2) hard to run it 'correctly'.

1. By (1) I mean: according to Koehn's paper (https://www.aclweb.org/anthology/W04-3250), I would assume you draw different samples from the sys1 and sys2 outputs, score them against the reference, and assess the differences; if in 95% of the resamples the scores favour one of the systems, the difference is statistically significant. Or am I getting it wrong? Furthermore, I compared multeval to mteval with the same number of samples and shuffles, and the scores are completely different.
2. Maybe this all comes from me not running multeval correctly. I have one reference and the output of two MT systems. Since multeval complains when there is only one variant each for system 1 and the baseline, I use copies: e.g. for system 1 I pass sys1.test.out and sys1.test.out.copy (which are identical). Is this a good way to invoke multeval?
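For concreteness, here is my understanding of Koehn's paired bootstrap as code. This is only a sketch of how I read the paper, not multeval's actual implementation; it assumes per-sentence metric scores are available and averages them (an approximation, since corpus BLEU is not a simple average of sentence scores):

```python
import random

def paired_bootstrap(scores1, scores2, n_samples=1000, seed=0):
    """Sketch of paired bootstrap resampling (Koehn, 2004).

    scores1, scores2: per-sentence metric scores for sys1 and sys2
    on the same test set, in the same order.
    Returns the fraction of resamples in which sys1 scores higher
    than sys2.
    """
    assert len(scores1) == len(scores2)
    rng = random.Random(seed)
    n = len(scores1)
    wins1 = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement (paired:
        # the same indices are used for both systems).
        idx = [rng.randrange(n) for _ in range(n)]
        s1 = sum(scores1[i] for i in idx) / n
        s2 = sum(scores2[i] for i in idx) / n
        if s1 > s2:
            wins1 += 1
    return wins1 / n_samples
```

Under this reading, if `paired_bootstrap(...)` returns at least 0.95, sys1 is significantly better than sys2 at p < 0.05. Is that the computation multeval performs, or does it do something different?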

Thanks.
Cheers,
Dimitar
