Separate out "must pass" and accuracy assessment tests #54

Open · ahalterman opened this issue Nov 10, 2018 · 3 comments
@ahalterman (Member) commented Nov 10, 2018
We're in a netherworld right now of intermingled unit tests and accuracy assessment tests. Some tests measure whether the program is functioning at all, while others are more like measures of real-world performance. It would be really useful to separate these out: UP should have a test suite that new changes must pass 100% before they can be merged, and a separate, GSR-based set that measures changes in expected performance.
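A minimal sketch of one way to enforce the split, assuming a pytest-based suite (the marker names and test names are illustrative, not UniversalPetrarch's actual layout):

```python
# conftest.py -- hypothetical; registers the two marker categories
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "must_pass: gating tests; any failure blocks a merge")
    config.addinivalue_line(
        "markers", "accuracy: GSR-based performance tests; tracked but not gating")

# test_coder.py -- hypothetical test names
import pytest

@pytest.mark.must_pass
def test_simple_verb_coding():
    ...  # a validation-suite case that must always code correctly

@pytest.mark.accuracy
def test_gsr_english():
    ...  # scored against the English GSR set
```

CI could then run `pytest -m must_pass` as the merge gate and `pytest -m accuracy` as a separate, report-only job.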

@PTB-OEDA (Member) commented Nov 10, 2018 via email

@ahalterman (Member, Author) commented Nov 12, 2018

I think the tests here are the must-pass, since that's the purpose they served in TAB, Petr1, Petr2, etc. There are also some sentences in a separate file here that are must-pass. I also thought there were unit tests for the individual methods in UniversalPetrarch but I'm not finding them (example from Mordecai). Any commit to the code should keep a 100% pass rate on all of these.

"Accuracy assessment" would mostly the the GSRs, along with any sentences in the "Test Suite" that we've decided are no longer "must pass". The phrase info in #44 would also be in this category. Any change to the code should at a minimum not decrease accuracy on these records.

Regarding English vs. Spanish vs. Arabic, the first set should ideally be language-agnostic. The second set will of course be language specific, but we've already got all of those finished and separated out for each language.
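As a rough illustration of the "at a minimum not decrease accuracy" rule, assuming per-language GSR scores get dumped to JSON (the file names and score format here are assumptions, not anything in the repo):

```python
# check_accuracy.py -- illustrative sketch of a regression gate
import json

def assert_no_regression(results_path="gsr_results.json",
                         baseline_path="gsr_baseline.json"):
    """Fail if any language's GSR accuracy drops below the stored baseline."""
    with open(results_path) as f:
        results = json.load(f)    # e.g. {"english": 0.82, "spanish": 0.74}
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = {lang: (baseline[lang], score)
                   for lang, score in results.items()
                   if score < baseline.get(lang, 0.0)}
    assert not regressions, f"Accuracy regressed (baseline, new): {regressions}"
```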

@philip-schrodt (Contributor) commented
In the KEDS/TABARI/PETR-1 lineage, everything in the validation suite (eventually about 250 cases for TABARI and PETR-1) was a "must-pass." Most of these pre-dated the widespread use of GitHub, and certainly of the combination of automated testing and commits, but during development and any subsequent changes the programs were expected to run cleanly through all of the cases with a 100% pass rate before those changes were considered okay. Most of the cases are, effectively, unit tests produced when various features (e.g. patterns, compounds) were being developed; the remainder are quirky cases we encountered over the years that caused the program to freeze or crash under odd syntactic situations. None of these were GSRs -- in fact a lot of them are completely artificial and are not even grammatically correct English -- and after the first years of the KEDS project (the work that resulted in the 1994 ISQ and AJPS papers), we didn't have sufficient funding to produce GSRs (plus there were the IP issues).

As best I can tell, Clayton started with generating formal unit tests -- that is, tests oriented to very specific functions -- in PETR-2, then transitioned to something closer to the TABARI validation approach (lots of artificial, though now grammatical, sentences clearly designed to test very specific functions), and finally there are a few Gigaword cases with real news articles. But, like everything in PETR-2, this was never completed in any sort of comprehensive fashion: PETR-2 is more of a proof-of-concept than a fully functioning coder.
