Separate out "must pass" and accuracy assessment tests #54

Open · ahalterman opened this issue Nov 10, 2018 · 3 comments
@ahalterman (Member) commented Nov 10, 2018
We're in a netherworld right now of intermingled unit tests and accuracy assessment tests. Some tests measure whether the program is functioning at all, while others are more like measures of real-world performance. It would be really useful to separate these out: UP should have a test suite that new changes must pass 100% before they can be merged, and a separate, GSR-based set that measures changes in expected performance.
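A minimal sketch of one way to enforce the split, assuming a pytest-based suite (the marker names and test names are illustrative, not UniversalPetrarch's actual layout):

```python
# conftest.py -- hypothetical; registers the two marker categories
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "must_pass: gating tests; any failure blocks a merge")
    config.addinivalue_line(
        "markers", "accuracy: GSR-based performance tests; tracked but not gating")

# test_coder.py -- hypothetical test names
import pytest

@pytest.mark.must_pass
def test_simple_verb_coding():
    ...  # a validation-suite case that must always code correctly

@pytest.mark.accuracy
def test_gsr_english():
    ...  # scored against the English GSR set
```

CI could then run `pytest -m must_pass` as the merge gate and `pytest -m accuracy` as a separate, report-only job.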

@PTB-OEDA (Member) commented Nov 10, 2018 via email

@ahalterman (Member, Author) commented Nov 12, 2018

I think the tests here are the must-pass, since that's the purpose they served in TAB, Petr1, Petr2, etc. There are also some sentences in a separate file here that are must-pass. I also thought there were unit tests for the individual methods in UniversalPetrarch but I'm not finding them (example from Mordecai). Any commit to the code should keep a 100% pass rate on all of these.

"Accuracy assessment" would mostly the the GSRs, along with any sentences in the "Test Suite" that we've decided are no longer "must pass". The phrase info in #44 would also be in this category. Any change to the code should at a minimum not decrease accuracy on these records.

Regarding English vs. Spanish vs. Arabic, the first set should ideally be language-agnostic. The second set will of course be language specific, but we've already got all of those finished and separated out for each language.
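As a rough illustration of the "at a minimum not decrease accuracy" rule, assuming per-language GSR scores get dumped to JSON (the file names and score format here are assumptions, not anything in the repo):

```python
# check_accuracy.py -- illustrative sketch of a regression gate
import json

def assert_no_regression(results_path="gsr_results.json",
                         baseline_path="gsr_baseline.json"):
    """Fail if any language's GSR accuracy drops below the stored baseline."""
    with open(results_path) as f:
        results = json.load(f)    # e.g. {"english": 0.82, "spanish": 0.74}
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = {lang: (baseline[lang], score)
                   for lang, score in results.items()
                   if score < baseline.get(lang, 0.0)}
    assert not regressions, f"Accuracy regressed (baseline, new): {regressions}"
```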

@philip-schrodt (Contributor) commented
In the KEDS/TABARI/PETR-1 lineage, everything in the validation suite (eventually about 250 cases for TABARI and PETR-1) was a "must-pass." Most of these pre-dated the widespread use of GitHub, and certainly of the combination of automated testing and commits, but during development and any subsequent changes the programs were expected to run cleanly through all of the cases with a 100% pass rate before those changes were considered okay. Most of the cases are, effectively, unit tests produced when various features (e.g. patterns, compounds) were being developed; the remainder are quirky cases we encountered over the years that caused the program to freeze or crash under odd syntactic situations. None of these were GSRs -- in fact a lot of them are completely artificial and are not even grammatically correct English -- and after the first years of the KEDS project (the work that resulted in the 1994 ISQ and AJPS papers), we didn't have sufficient funding to produce GSRs (plus there were the IP issues).

As best I can tell, Clayton started with generating formal unit tests -- that is, tests oriented to very specific functions -- in PETR-2, then transitioned to something closer to the TABARI validation approach (lots of artificial, though now grammatical, sentences clearly designed to test very specific functions), and finally there are a few Gigaword cases with real news articles. But, like everything in PETR-2, this was never completed in any sort of comprehensive fashion: PETR-2 is more of a proof-of-concept than a fully functioning coder.
