initial script for automating the creation of a controlled testing en… #2057
base: main
Changes from 2 commits
420f6ac
53b60be
0e2c6fd
33a3c9c
a57add6
8ac9840
de17821
5e6e86e
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,71 @@ | ||||||||
#!/bin/bash | ||||||||
set -e | ||||||||
|
||||||||
stag=1 | ||||||||
|
||||||||
if [ $# -eq 0 ]; then | ||||||||
echo "\n This script prepares a controlled testing environment for OOV handling." | ||||||||
echo -e "\n Usage: \n $0 <data.csv> \n" | ||||||||
exit 1 | ||||||||
fi | ||||||||
|
||||||||
step=1 | ||||||||
I guess this shouldn't be hardcoded? Or is it meant as a development tool, so you iterate on the parts as you get them working? |
||||||||
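One way to address the hardcoded `step`, sketched under assumptions: take the starting step from an optional second argument or an environment variable, and fall back to 1. The helper name `resolve_step` and the `START_STEP` variable are hypothetical and not part of this PR.

```shell
# Hypothetical helper: decide which step the script starts from.
# Precedence: explicit argument, then the START_STEP environment
# variable (an assumed name), then the default of 1.
resolve_step() {
    local requested="${1:-${START_STEP:-1}}"
    case "$requested" in
        1|2|3) echo "$requested" ;;
        *)
            echo "invalid step: $requested (expected 1, 2 or 3)" >&2
            return 1
            ;;
    esac
}

# In the script this would replace the literal assignment; the step
# is assumed to follow <data.csv> on the command line.
step=$(resolve_step "${2:-}")
echo "Starting from step $step"
```

Because every stage is guarded by `[ $step -le N ]`, passing 2 would skip data preparation and resume at LM generation, which keeps the iterate-as-you-go workflow without editing the file.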
data=$1 | ||||||||
nj=$(nproc) | ||||||||
mkdir -p tmp/ | ||||||||
mkdir -p tmp/lm | ||||||||
mkdir -p tmp/results | ||||||||
|
||||||||
# Data preparation: split the vocab into 10% (that'd later represent OOVs) | ||||||||
# and the remaining 90% to compose a corpus for LM generation | ||||||||
echo "Step 1: Preparing Data" | ||||||||
if [ $step -le 1 ]; then | ||||||||
|
||||||||
# Extract corpus unique vocabularies | ||||||||
|
||||||||
xsv select transcript $data > tmp/data.txt | ||||||||
|
||||||||
sed 's/ /\n/g' tmp/data.txt | sort | uniq -c | sort -nr > tmp/vocab.txt | ||||||||
grep -o . tmp/vocab.txt | sort -u > tmp/alphabet.txt | ||||||||
|
||||||||
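A note on the two lines above, with a hedged sketch: `sed 's/ /\n/g'` relies on GNU sed interpreting `\n` in the replacement, and `grep -o . tmp/vocab.txt` also collects the digits that `uniq -c` writes in front of each word. The version below uses `tr` for the split and derives the alphabet from the transcripts instead. The function names `build_vocab` and `build_alphabet` are hypothetical, and whether the alphabet should come from the transcripts is an assumption, not something the PR states.

```shell
# Hypothetical helper: one word per line, counted and sorted by
# descending frequency (same "count word" layout as tmp/vocab.txt).
build_vocab() {
    tr -s ' ' '\n' < "$1" | grep -v '^$' | sort | uniq -c | sort -nr
}

# Hypothetical helper: unique characters of the transcripts themselves,
# so the counts added by `uniq -c` never reach the alphabet.
build_alphabet() {
    grep -o . "$1" | sort -u
}

# Small demonstration on a throwaway file.
demo_dir=$(mktemp -d)
printf 'ab ba\nab c\n' > "$demo_dir/data.txt"
build_vocab "$demo_dir/data.txt" > "$demo_dir/vocab.txt"
build_alphabet "$demo_dir/data.txt" > "$demo_dir/alphabet.txt"
```

In the demonstration the most frequent word is `ab`, and the alphabet holds `a`, `b`, `c` and the space character.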
# Pick the least frequent 10% vocabularies to represent OOVs | ||||||||
|
||||||||
oov_count=$(wc tmp/vocab.txt | awk '{print int($0*0.1)}') | ||||||||
use
Suggested change
Size of OOV set should be a parameter (with default). |
||||||||
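A possible shape for a parameterised OOV size, as a sketch only: the fraction becomes an argument with a default of 0.1, and the line count comes from awk's `NR` instead of parsing the full `wc` output. The helper names `oov_count_for` and `least_frequent_words` are hypothetical.

```shell
# Hypothetical helper: number of vocabulary entries to hold out,
# i.e. the line count times a fraction, rounded down.
oov_count_for() {
    local vocab_file="$1" fraction="${2:-0.1}"
    awk -v frac="$fraction" 'END { print int(NR * frac) }' "$vocab_file"
}

# Hypothetical helper: the last N words of a frequency-sorted
# "count word" file, using the portable `tail -n N` form.
least_frequent_words() {
    local vocab_file="$1" count="$2"
    tail -n "$count" "$vocab_file" | awk '{ print $2 }'
}

# Demonstration on a synthetic 25-entry vocabulary, most frequent first.
demo_dir=$(mktemp -d)
for i in $(seq 25 -1 1); do
    printf '%7d word%02d\n' "$i" "$i"
done > "$demo_dir/vocab.txt"

oov_fraction=0.1   # would come from a script argument in practice
oov_count=$(oov_count_for "$demo_dir/vocab.txt" "$oov_fraction")
least_frequent_words "$demo_dir/vocab.txt" "$oov_count" > "$demo_dir/oov_words"
```

With 25 entries and a fraction of 0.1 the count is 2, so `oov_words` holds the two rarest entries.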
tail -$oov_count tmp/vocab.txt | awk '{print $2}'> tmp/oov_words | ||||||||
grep -wFf tmp/oov_words tmp/data.txt > tmp/oov_sents | ||||||||
The grep manual says empty lines in the pattern file will match every input line, so we should either make sure the input CSV doesn't have any empty transcripts or remove those before we get to this step. |
||||||||
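To illustrate the second option from the comment above, here is a minimal sketch that drops empty and whitespace-only lines from a pattern file before it is handed to `grep -f`. The helper name `strip_blank_lines` and the sample files are hypothetical.

```shell
# Hypothetical helper: print a file without empty or whitespace-only
# lines. `|| true` keeps the exit status at 0 when nothing is left,
# which matters under `set -e`.
strip_blank_lines() {
    grep -v '^[[:space:]]*$' "$1" || true
}

# Demonstration: a pattern file with a blank line in the middle.
demo_dir=$(mktemp -d)
printf 'zebra\n\nquokka\n' > "$demo_dir/oov_words"
printf 'the zebra runs\na quiet day\nquokka smiles\n' > "$demo_dir/data.txt"

strip_blank_lines "$demo_dir/oov_words" > "$demo_dir/oov_words.clean"

# Only sentences containing a listed word as a whole word are selected.
grep -wFf "$demo_dir/oov_words.clean" "$demo_dir/data.txt" > "$demo_dir/oov_sents"
```

The same filter could be applied to `tmp/data.txt` right after extraction, so that empty transcripts never reach any later pattern file.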
|
||||||||
# Exclude OOVs from the text corpus | ||||||||
grep -vf tmp/oov_sents tmp/data.txt > tmp/scorer_corpus.txt | ||||||||
gzip -c tmp/scorer_corpus.txt > tmp/scorer_corpus.txt.gz | ||||||||
FWIW: there is no need to |
||||||||
grep -vf tmp/oov_sents $data > tmp/scorer_corpus.csv | ||||||||
|
||||||||
# Prepare OOV csv for testing purposes (to assess imporvements on it) | ||||||||
|
||||||||
grep -wFf tmp/oov_sents tmp/data.txt > tmp/oov_corpus.txt | ||||||||
grep -wFf tmp/oov_sents $data | sed '1 i\wav_filename,wav_filesize,transcript' > tmp/oov_corpus.csv | ||||||||
This
Can we make it portable to BSD sed? This fix worked for me:
Suggested change
|
||||||||
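The suggested fix itself is not shown above. As an alternative that sidesteps the GNU/BSD difference in sed's `i` command altogether, the header can be written with `printf` before the matched rows. This is a sketch, not the reviewer's suggestion, and the helper name `prepend_header` is hypothetical.

```shell
# Hypothetical helper: write a header line, then copy stdin unchanged.
# No sed involved, so it behaves the same with GNU and BSD userlands.
prepend_header() {
    printf '%s\n' "$1"
    cat
}

# Demonstration with a single CSV row standing in for the grep output.
demo_dir=$(mktemp -d)
printf 'a.wav,1234,hello world\n' \
    | prepend_header 'wav_filename,wav_filesize,transcript' \
    > "$demo_dir/oov_corpus.csv"
```

In the script, the `grep -wFf tmp/oov_sents $data` output would be piped into the helper in place of the `sed '1 i\...'` stage.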
|
||||||||
fi | ||||||||
|
||||||||
# Generate LM | ||||||||
echo "Step 2: Generaing Language Model" | ||||||||
|
||||||||
if [ $step -le 2 ]; then | ||||||||
python3 data/lm/generate_lm.py --input_txt tmp/scorer_corpus.txt.gz \ | ||||||||
--output_dir tmp/lm --top_k 500000 --kenlm_bins kenlm/build/bin \ | ||||||||
--arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \ | ||||||||
--binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback | ||||||||
|
||||||||
./native_client/generate_scorer_package --alphabet tmp/alphabet.txt \ | ||||||||
--lm tmp/lm/lm.binary --vocab tmp/lm/vocab-500000.txt \ | ||||||||
--package kenlm.scorer --default_alpha 0.931289039105002 \ | ||||||||
--default_beta 1.1834137581510284 | ||||||||
fi | ||||||||
|
||||||||
# Evaluate | ||||||||
echo "Step 3: Evaluating using scorer" | ||||||||
if [ $step -le 3 ]; then | ||||||||
echo "Evaluating on OOV testing set." | ||||||||
python -m coqui_stt_training.evaluate --test_files tmp/oov_corpus.csv \ | ||||||||
--test_output_file tmp/results/oov_results.json --scorer_path native_client/kenlm.scorer \ | ||||||||
The |
||||||||
--checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj | ||||||||
Checkpoint path should be made into a parameter. |
||||||||
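A sketch of how the checkpoint path might become a parameter: accept it as an argument or from the environment, and fail early with a clear message when it is missing or not a directory. The helper name `resolve_checkpoint_dir` and the `CHECKPOINT_DIR` variable are hypothetical.

```shell
# Hypothetical helper: resolve the acoustic-model checkpoint directory.
# Precedence: explicit argument, then the CHECKPOINT_DIR environment
# variable (an assumed name). There is deliberately no default path.
resolve_checkpoint_dir() {
    local dir="${1:-${CHECKPOINT_DIR:-}}"
    if [ -z "$dir" ]; then
        echo "checkpoint directory not given (argument or CHECKPOINT_DIR)" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "checkpoint directory does not exist: $dir" >&2
        return 1
    fi
    echo "$dir"
}

# Demonstration with a temporary directory standing in for a checkpoint.
demo_ckpt=$(mktemp -d)
checkpoint_dir=$(resolve_checkpoint_dir "$demo_ckpt")
echo "Using checkpoint: $checkpoint_dir"
```

Both evaluation commands could then pass `--checkpoint_dir "$checkpoint_dir"` instead of the absolute path under `/home/aya`.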
|
||||||||
echo "Evaluating on original testing set." | ||||||||
python -m coqui_stt_training.evaluate --test_files tmp/scorer_corpus.csv \ | ||||||||
--test_output_file tmp/results/samples.json --scorer_path native_client/kenlm.scorer \ | ||||||||
--checkpoint_dir /home/aya/work/tmp/AM/coqui-stt-1.1.0-checkpoint --test_batch_size $nj | ||||||||
fi |
unused?