Replies: 1 comment
-
Ahh, never mind, answering my own question here: it looks like the generate loop breaks when a row of EOS tokens is encountered. I'm assuming this is the reason for the output discrepancy.
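For anyone else hitting this, the early-stopping behavior described above can be sketched like so. This is a minimal illustration of batched autoregressive generation, not the actual audiolm-pytorch code; `EOS_ID`, `step_fn`, and the function names are all hypothetical:

```python
EOS_ID = 2  # hypothetical end-of-sequence token id

def generate(step_fn, batch_size, max_steps):
    """Batched generation that stops early once every sequence in the
    batch emits EOS at the same step (i.e. a whole "row" of EOS tokens).
    step_fn is assumed to return one next-token id per sequence."""
    sequences = [[] for _ in range(batch_size)]
    for _ in range(max_steps):
        next_tokens = step_fn(sequences)
        for seq, tok in zip(sequences, next_tokens):
            seq.append(tok)
        # the entire row is EOS: break out instead of running max_steps,
        # which is why the progress bar appears to finish early
        if all(tok == EOS_ID for tok in next_tokens):
            break
    return sequences
```

Because the loop exits as soon as the EOS row appears, the tqdm bar for that stage never reaches `max_steps`, which matches the "stops early" output in the notebook.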
-
Hi Everyone,
I'm assuming this isn't an issue, but I've trained the entire pipeline a few times and am struggling to get intelligible speech results. I've noticed that during inference, when I call the audiolm pipeline, the semantic transformer seems to stop early according to the output in my Jupyter notebook. I'm assuming this is just a display discrepancy and that inference finishes faster than the progress bar (it looks like tqdm) updates.
```python
generated_wav = audiolm(prime_wave_path="/audio/generated_samples/test_audio_primer.wav")
```
The output:
Training params:
I'm using the LibriSpeech corpus, which is about 1k hours of speech. I download the train and test sets, convert the files from flac to wav using ffmpeg, and sort them into training and validation folders. I'm largely using the defaults from the README for everything and using Encodec. Happy to provide the complete training scripts if needed.
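For reference, the data-prep step looks roughly like this. A minimal sketch only: the validation fraction, seed, and file names are illustrative defaults, not values from my actual scripts, and the flac-to-wav step is just a plain ffmpeg invocation per file:

```python
import random

# each source file is first converted with something like:
#   ffmpeg -i clip.flac clip.wav
# then the resulting .wav paths are partitioned into the two folders

def split_dataset(wav_paths, valid_fraction=0.05, seed=0):
    """Shuffle the converted .wav files and partition them into
    training and validation lists (moving them into folders is up
    to the caller). valid_fraction and seed are illustrative."""
    paths = list(wav_paths)
    random.Random(seed).shuffle(paths)
    n_valid = max(1, int(len(paths) * valid_fraction))
    return paths[n_valid:], paths[:n_valid]  # (train, valid)
```

The deterministic seed keeps the same clips in the validation folder across reruns, so trainer checkpoints stay comparable between pipeline stages.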