Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tedlium not distinct corpora names for the 3 partitions #490

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

Marvin84
Copy link
Contributor

The corpus name for all train, dev, and test partition is the same. This PR suggests a second version of the job that modifies simply the corpus name.

@Atticus1806
Copy link
Contributor

While we are at it: I think we should also touch the part modifying the test/dev set in the Job, such that the new job either does not do this at all or does it via a flag which is disabled by default.
@christophmluscher, since afaik you are/were working on a TED3 Job how did you solve it there?

@christophmluscher
Copy link
Contributor

christophmluscher commented Mar 28, 2024

I solve it very similar to this PR: c.name = f"TED-LIUM-release3-{corpus_key}"

I added a extend segment flag. I removed the updates to the transcriptions completely. Since the segments chagen from rel2 train to rel3 train change and do not occur anymore. for dev/test I removed.

Copy link
Contributor

@michelwi michelwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could also move much of the duplicated code into a separate function (that only exists in the base class and then the make_corpus function would call this one and set the names accordingly.

We should probably add a deprecation notice to the old class.

Should we also "move" the old class to deprecated?

datasets/tedlium2.py Outdated Show resolved Hide resolved
datasets/tedlium2.py Outdated Show resolved Hide resolved
datasets/tedlium2.py Outdated Show resolved Hide resolved
@Atticus1806
Copy link
Contributor

@michelwi what happens with the apptek hashes here?

datasets/tedlium2.py Outdated Show resolved Hide resolved
@michelwi
Copy link
Contributor

@michelwi what happens with the apptek hashes here?

our recipes depend on #505. If you rebase/merge the current main head the problem should go away.

datasets/tedlium2.py Show resolved Hide resolved
@Marvin84
Copy link
Contributor Author

@JackTemaki @michelwi @Atticus1806 @christophmluscher I would like to work on the i6_experiments changes for using the new version, could I get this merged please?

Copy link
Contributor

@Atticus1806 Atticus1806 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one thing (which I cant unfortunately manually suggest):
Maybe extend the comment in the old job with the splitting, that this is dangerous / modifies the reference? But approval does not depend on it

Copy link
Contributor

@JackTemaki JackTemaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, the apostrophe merging is now missing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants