initial script for automating the creation of a controlled testing en… #2057

Open
wants to merge 8 commits into
base: main

Aya-AlJafari
Contributor

No description provided.

create_oovs.sh Outdated
#!/bin/bash
set -e

stag=1
Collaborator

unused?

create_oovs.sh Outdated
exit 1
fi

step=1
Collaborator

I guess this shouldn't be hardcoded? Or is it meant as a development tool, so you iterate on the parts as you get them working?
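
A minimal sketch of one way to avoid the hardcoded value, assuming the start step is passed as the first positional argument and defaults to 1. The `parse_step` helper is hypothetical and not part of this PR:

```shell
# Hypothetical helper (not part of the PR): validate the requested start step.
parse_step() {
    local requested="${1:-1}"   # default: start from step 1
    case "$requested" in
        *[!0-9]*)
            echo "step must be a non-negative integer, got: $requested" >&2
            return 1
            ;;
    esac
    echo "$requested"
}

# In the script this would be: step=$(parse_step "$1")
step=$(parse_step "2")

# Same guard style as the script: a step runs when the start step is <= its number.
if [ "$step" -le 1 ]; then echo "Step 1: Preparing Data"; fi
if [ "$step" -le 2 ]; then echo "Step 2: Generating Language Model"; fi
```

This keeps the "resume from step N" workflow for development while making the default a full run.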

create_oovs.sh Outdated
echo "Step 1: Preparing Data"
if [ $step -le 1 ]; then

# Extract corpus unique vocabularies
Collaborator

Suggested change
# Extract corpus unique vocabularies
# Extract corpus vocabulary (unique words)

create_oovs.sh Outdated
sed 's/ /\n/g' tmp/data.txt | sort | uniq -c | sort -nr > tmp/vocab.txt
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
Collaborator

Suggested change
# Pick the least frequent 10% vocabularies to represent OOVs
# Pick the least frequent 10% words to build OOV set

create_oovs.sh Outdated
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt

# Pick the least frequent 10% vocabularies to represent OOVs
oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
Collaborator

Use wc -l so the intent (counting lines) is clear up front.

Suggested change
oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}')
oov_count=$(wc -l tmp/vocab.txt | awk '{print int($0*0.1)}')

Collaborator

Size of OOV set should be a parameter (with default).
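
One possible shape for this, as a sketch only: the `oov_count_for` helper and its percentage argument are hypothetical names, and the default of 10 simply mirrors the 10% currently in the script.

```shell
# Hypothetical helper (not part of the PR): number of vocabulary entries
# to hold out, given a vocab file and an optional percentage.
oov_count_for() {
    local vocab_file="$1"
    local percent="${2:-10}"         # default mirrors the 10% in the script
    local total
    total=$(wc -l < "$vocab_file")   # reading from stdin keeps the filename out of the output
    echo $(( total * percent / 100 ))
}

# Small demonstration on a throwaway 20-line vocabulary file.
vocab=$(mktemp)
printf '%s\n' a b c d e f g h i j k l m n o p q r s t > "$vocab"
oov_count=$(oov_count_for "$vocab")
echo "holding out $oov_count words"
rm -f "$vocab"
```

Since tmp/vocab.txt is sorted by descending frequency, the resulting count could presumably feed something like `tail -n "$oov_count"` to select the least frequent entries.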

create_oovs.sh Outdated

# Prepare OOV csv for testing purposes (to assess imporvements on it)
grep -wFf tmp/oov_sents tmp/data.txt > tmp/oov_corpus.txt
grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
Collaborator

This sed command doesn't work on macOS:

sed: 1: "1 i\wav_filename,wav_fi ...": extra characters after \ at the end of i command

Can we make it portable to BSD sed? This fix worked for me:

Suggested change
grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv
echo "wav_filename,wav_filesize,transcript" > tmp/oov_corpus.csv
grep -wFf tmp/oov_sents $data >> tmp/oov_corpus.csv

create_oovs.sh Outdated
fi

# Generate LM
echo "Step 2: Generaing Language Model"
Collaborator

Suggested change
echo "Step 2: Generaing Language Model"
echo "Step 2: Generating Language Model"

create_oovs.sh Outdated
gzip -c tmp/scorer_corpus.txt > tmp/scorer_corpus.txt.gz
grep -vf tmp/oov_sents $data > tmp/scorer_corpus.csv

# Prepare OOV csv for testing purposes (to assess imporvements on it)
Collaborator

Suggested change
# Prepare OOV csv for testing purposes (to assess imporvements on it)
# Prepare OOV CSV for testing purposes (to assess improvements on it)

create_oovs.sh Outdated
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
--checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj
Collaborator

Checkpoint path should be made into a parameter.
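
A sketch of how that could look with `getopts`. The flag names (`-c`, `-o`), the `parse_args` function, and the `out_dir` variable are illustrative assumptions, not something the PR defines:

```shell
# Hypothetical option parsing (flag names are illustrative, not from the PR).
parse_args() {
    checkpoint_dir=""
    out_dir="tmp"               # default keeps all outputs under tmp/
    OPTIND=1                    # reset so the function can be called more than once
    while getopts "c:o:" opt; do
        case "$opt" in
            c) checkpoint_dir="$OPTARG" ;;
            o) out_dir="$OPTARG" ;;
            *) return 1 ;;
        esac
    done
    if [ -z "$checkpoint_dir" ]; then
        echo "usage: create_oovs.sh -c CHECKPOINT_DIR [-o OUT_DIR]" >&2
        return 1
    fi
}

# In the script this would be: parse_args "$@"
parse_args -c /path/to/checkpoint
echo "checkpoint: $checkpoint_dir, scorer: $out_dir/kenlm.scorer"
```

The evaluate call could then use `--checkpoint_dir "$checkpoint_dir"` in place of the absolute home-directory path.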

create_oovs.sh Outdated
if [ $step -le 3 ]; then
echo "Evaluating on OOV testing set."
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \
Collaborator

The native_client/kenlm.scorer path should be kenlm.scorer, according to the command in the step above, right? And that should probably be changed to tmp/kenlm.scorer so that all of the script's outputs stay inside that folder.


2 participants