Examples where >1 targets? #813

Open

Muennighoff opened this issue Aug 8, 2022 · 2 comments

@Muennighoff

Since you changed the signature of apply() to return a list of targets instead of a single string, can you point me to some datasets that actually use multiple targets?
Is randomly picking one the best way to get back to a single string?
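
For reference, here is roughly what I'm doing right now (a minimal sketch; the dataset and template name are just examples, and the `random.choice` call is the workaround I'm asking about):

```python
import random

from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# Any dataset/template pair works here; ag_news is just an example.
example = load_dataset("ag_news", split="test")[0]
template = DatasetTemplates("ag_news")["classify_question_first"]

# On the eval-hackathon branch, apply() returns the rendered input
# together with a *list* of targets rather than a single string.
input_str, targets = template.apply(example)

# Current workaround: collapse the list back to one string at random.
target = random.choice(targets)
```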

Muennighoff changed the title from "Examples of where >1 targets?" to "Examples where >1 targets?" on Aug 8, 2022
@stephenbach
Member

Hey @Muennighoff ! Unless I'm misremembering, this is only changed on the eval-hackathon branch. It was a feature requested by the eval team, and maybe @cjlovering or @jordiclive can point you to some examples. My understanding is that it's motivated by generation datasets that have multiple valid targets. We're not 100% decided that the current API in eval-hackathon will be merged as-is into main. For example, we could keep the current API as-is (but return just the first target when there are multiple) and add another method that returns all targets. Open to suggestions!
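
Roughly what I'm picturing (the method names and internals here are hypothetical, just to make the option concrete):

```python
class Template:
    # ...existing template code...

    def apply(self, example):
        # Same contract as main: [input, single target]. When a template
        # yields several targets, just take the first one.
        input_str, targets = self._render_all(example)  # hypothetical helper
        return [input_str, targets[0]]

    def apply_all_targets(self, example):
        # Hypothetical companion method returning every valid target.
        input_str, targets = self._render_all(example)
        return [input_str, targets]
```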

@jordiclive
Contributor

@Muennighoff, GEM/web_nlg and GEM/wiki_auto_asset_turk are examples of datasets with multiple references. For example, the test_asset split of GEM/wiki_auto_asset_turk has 10 references per example.

Yes, the reasoning is that in NLG a single reference is often unreliable, so many test sets are designed with multiple references, and multi-reference metrics should be used with them. Multi-reference metric support (for BLEU, ROUGE, SARI) was also added to the BigScience EH because of this. Other NLG datasets intended as multi-reference test sets include E2E and ToTTo, but I'm not sure whether those were implemented.
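
For example, sacrebleu already scores against multiple references; it takes one reference stream per reference position, transposed relative to per-example reference lists. A minimal sketch with toy data:

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]

# Per-example references, as a multi-ref test set would provide them.
references_per_example = [
    ["the cat sat on the mat", "a cat was sitting on the mat"],
]

# sacrebleu expects refs[i][j] = i-th reference for the j-th hypothesis,
# so transpose the per-example lists into reference streams.
ref_streams = [list(stream) for stream in zip(*references_per_example)]

bleu = sacrebleu.corpus_bleu(hypotheses, ref_streams)
print(bleu.score)
```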

For promptsource main, I would suggest supporting multiple references for these datasets. Choosing one fixed reference, or even a random one, makes results incomparable across analyses.
