Tedlium not distinct corpora names for the 3 partitions #490
While we are at it: I think we should also touch the part of the job that modifies the test/dev set, so that the new job either does not do this at all or does it only via a flag that is disabled by default.
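To make the suggestion concrete, an opt-in flag could look roughly like the sketch below. This is only an illustration under assumptions: `prepare_partition`, the `modify_dev_test` flag, and the whitespace normalization are hypothetical stand-ins, not the actual job code.

```python
from typing import Dict


def prepare_partition(
    segments: Dict[str, str],
    partition: str,
    modify_dev_test: bool = False,  # hypothetical flag, disabled by default
) -> Dict[str, str]:
    """Return the segment transcriptions for one partition.

    dev/test references are only touched when the caller opts in explicitly.
    """
    if partition in ("dev", "test") and not modify_dev_test:
        # Default: leave the evaluation references exactly as they came in.
        return dict(segments)
    # Placeholder normalization (collapse whitespace) standing in for
    # whatever text changes the real job would apply.
    return {name: " ".join(text.split()) for name, text in segments.items()}
```

With the default, `prepare_partition(segs, "dev")` returns the transcriptions unchanged, so existing evaluation references are not altered unless someone asks for it.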
I solved it very similarly to this PR: I added an extend-segment flag. I removed the updates to the transcriptions completely, since the segments changed from rel2 train to rel3 train and the affected ones do not occur anymore. For dev/test I removed them as well.
One could also move much of the duplicated code into a separate function that only exists in the base class; the make_corpus function would then call this one and set the names accordingly.
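A rough sketch of that refactoring, assuming a plain class hierarchy: all class names, `_build_corpus`, `corpus_name_for`, and the `corpus-<partition>` naming scheme are hypothetical and only illustrate the shape, not the real job.

```python
from typing import Dict, List


class CorpusJobBase:
    """Hypothetical base class that owns the shared corpus-building code."""

    partitions = ("train", "dev", "test")

    def _build_corpus(self, partition: str, corpus_name: str) -> Dict[str, object]:
        # The formerly duplicated logic lives only here.
        return {"name": corpus_name, "partition": partition, "recordings": []}

    def corpus_name_for(self, partition: str) -> str:
        raise NotImplementedError

    def make_corpus(self) -> List[Dict[str, object]]:
        # Calls the shared helper and sets the names accordingly.
        return [self._build_corpus(p, self.corpus_name_for(p)) for p in self.partitions]


class SharedNameJob(CorpusJobBase):
    """Old behaviour: every partition gets the same corpus name."""

    def corpus_name_for(self, partition: str) -> str:
        return "corpus"


class DistinctNameJob(CorpusJobBase):
    """New behaviour: one distinct corpus name per partition."""

    def corpus_name_for(self, partition: str) -> str:
        return f"corpus-{partition}"
```

The two job versions then differ only in how they name the corpus, which keeps the diff between them small and easy to review.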
We should probably add a deprecation notice to the old class.
Should we also "move" the old class to deprecated?
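For reference, a deprecation notice on the old class could be as small as the following sketch. It uses Python's standard `warnings` module; `OldCorpusJob` and the message text are hypothetical placeholders for the real class.

```python
import warnings


class OldCorpusJob:
    """Hypothetical stand-in for the old job class."""

    def __init__(self):
        # Emit the notice at construction time; stacklevel=2 points the
        # warning at the caller instead of at this constructor.
        warnings.warn(
            "OldCorpusJob is deprecated; use the job version with distinct corpus names.",
            DeprecationWarning,
            stacklevel=2,
        )
```

Note that Python hides `DeprecationWarning` by default outside of `__main__` and test runners, so users may only see it with `-W default` or an explicit warnings filter.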
@michelwi what happens with the apptek hashes here?
f1b3c09 to fd753e9
@JackTemaki @michelwi @Atticus1806 @christophmluscher I would like to work on the |
LGTM, one thing (which I unfortunately can't suggest manually):
Maybe extend the comment on the splitting in the old job to say that this is dangerous because it modifies the reference? Approval does not depend on it, though.
As discussed offline, the apostrophe merging is now missing
The corpus name is the same for all three partitions (train, dev, and test). This PR suggests a second version of the job that simply modifies the corpus name.