XLarge Fastconformer Long FT does not converge with default parameters #11894

Open
TornikeAm opened this issue Jan 20, 2025 · 0 comments
Open

Comments

@TornikeAm
Copy link

Hello, I'm trying to fine-tune the XLarge model with these parameters:

# It contains the default values for training a Fast Conformer-CTC ASR model (here the XLarge variant) with CTC loss and sub-word encoding.
# This version uses Longformer-style attention in order to handle longer audio

# You may find more detail:
# FastConformer here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#fast-conformer
# FastConformer-CTC's architecture config: NeMo/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml

# Differences from baseline config are in
# model.encoder.global_tokens
# model.encoder.global_tokens_spacing
# model.encoder.global_attn_separate

name: "FastConformer-Long-CTC-Xlarge_from_pretrain_default"
## initialized from our pretrained checkpoint.
init_from_ptl_ckpt: checkpoints/fastconf_long_xlarge_pretrain--val_loss=12487.7246-epoch=3-last.ckpt
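# (as far as I understand, this key is picked up by maybe_init_from_pretrained_checkpoint() in the
# launch script below; it copies matching weights only and does not restore optimizer/scheduler state)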
model:
  sample_rate: 16000
  log_prediction: true # enables logging sample predictions in the output during training
  ctc_reduction: 'mean_volume'
  skip_nan_grad: false

  train_ds:
    manifest_filepath: final_ft_data.json
    sample_rate: ${model.sample_rate}
    batch_size: 25 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: 8
    pin_memory: true
    max_duration: 40 # it is set for LibriSpeech, you may need to update it for your dataset
    min_duration: 0.1
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "fully_randomized"
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: ft_eval_data.json
    sample_rate: ${model.sample_rate}
    batch_size: 25 # you may increase batch_size if your memory allows
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  test_ds:
    manifest_filepath: null
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  # recommend vocab size of 128 or 256 when training on ~1k hr datasets and 1k vocab size on 10+k hr datasets
  # you may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
  tokenizer: 
    dir: fine_tuning/tokenizers/tokenizer_spe_unigram_v128   # path to directory which contains either tokenizer.model (bpe) or vocab.txt (wpe)
    type: bpe  # Can be either bpe (SentencePiece tokenizer) or wpe (WordPiece tokenizer)
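  # an example of how the tokenizer above could be built with that script (invocation is indicative,
  # check the script's --help for the exact flags):
  #   python scripts/tokenizers/process_asr_text_tokenizer.py \
  #     --manifest=final_ft_data.json \
  #     --data_root=fine_tuning/tokenizers \
  #     --vocab_size=128 --tokenizer=spe --spe_type=unigram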

  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: ${model.sample_rate}
    normalize: "per_feature"
    window_size: 0.025
    window_stride: 0.01
    window: "hann"
    features: 80
    n_fft: 512
    log: true
    frame_splicing: 1
    dither: 0.00001
    pad_to: 0
    pad_value: 0.0

  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2 # set to zero to disable it
    # you may use lower time_masks for smaller models for faster convergence
    time_masks: 10 # set to zero to disable it
    freq_width: 27
    time_width: 0.05

  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: ${model.preprocessor.features}
    feat_out: -1 # you may set it if you need different output size other than the default d_model
    n_layers: 24
    d_model: 1024
    use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

    # Sub-sampling params
    subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding
    subsampling_factor: 8 # must be power of 2 for striding and vggnet
    subsampling_conv_channels: 256 # -1 sets it to d_model
    causal_downsampling: false

    # Feed forward module's params
    ff_expansion_factor: 4

    self_attention_model: rel_pos_local_attn # longformer-style attention (sliding window + global tokens)
    global_tokens: 1 # number of tokens that attend and are attended to by all tokens (put 0 to disable)
    global_tokens_spacing: 1 # how far apart the global tokens are
    global_attn_separate: false # whether global tokens should use separate q,k,v layers
    n_heads: 8 # may need to be lower for smaller d_models
    # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
    att_context_size: [128,128] # -1 means unlimited context
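    # rough sanity check: with 0.01 s frames and 8x subsampling, [128,128] covers about
    # 128 * 8 * 0.01 ≈ 10.2 s of audio on each side of a frame (~20.5 s total window)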
    att_context_style: regular # regular or chunked_limited
    xscaling: False # scales up the input embeddings by sqrt(d_model)
    untie_biases: true # unties the biases of the TransformerXL layers

    # Convolution module's params
    conv_kernel_size: 9
    conv_norm_type: 'batch_norm' # batch_norm or layer_norm or groupnormN (N specifies the number of groups)
    # conv_context_size can be "causal" or a list of two integers such that conv_context_size[0]+conv_context_size[1]+1==conv_kernel_size
    # null means [(kernel_size-1)//2, (kernel_size-1)//2], and 'causal' means [(kernel_size-1), 0]
    conv_context_size: null

    ### regularization
    dropout: 0.1 # The dropout used in most of the Conformer Modules
    dropout_pre_encoder: 0.1 # The dropout used before the encoder
    dropout_emb: 0.0 # The dropout used for embeddings
    dropout_att: 0.1 # The dropout for multi-headed attention modules

    # set to non-zero to enable stochastic depth
    stochastic_depth_drop_prob: 0.0
    stochastic_depth_mode: linear  # linear or uniform
    stochastic_depth_start_layer: 1

  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: null
    num_classes: -1
    vocabulary: []

  # config for InterCTC loss: https://arxiv.org/abs/2102.03216
  # specify loss weights and which layers to use for InterCTC
  # e.g., to reproduce the paper results, set loss_weights: [0.3]
  # and apply_at_layers: [8] (assuming 18 layers). Note that final
  # layer loss coefficient is automatically adjusted (to 0.7 in above example)
  interctc:
    loss_weights: []
    apply_at_layers: []

  optim:
    name: adamw
    lr: 1e-3
    # optimizer arguments
    betas: [0.9, 0.98]
    # weight_decay is less necessary here since SpecAug already provides strong augmentation
    # you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
    # weight decay of 0.0 with lr of 2.0 also works fine
    weight_decay: 1e-3

    # scheduler setup
    sched:
      name: CosineAnnealing
      # scheduler config override
      warmup_steps: 15000
      warmup_ratio: null
      min_lr: 1e-4
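      # note: warmup_steps and warmup_ratio are alternatives (warmup_ratio is a fraction of the
      # total training steps), so only one of them should be set and the other left as null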

trainer:
  devices: -1 # number of GPUs, -1 would use all available GPUs
  num_nodes: 1
  max_epochs: 1000
  max_steps: -1 # computed at runtime if not set
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  accelerator: auto
  strategy: ddp
  accumulate_grad_batches: 41
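  # effective batch per optimizer step is batch_size * accumulate_grad_batches (* number of GPUs),
  # i.e. 25 * 41 = 1025 samples per GPU per update with this config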
  gradient_clip_val: 0.0
  precision: bf16 # 16, 32, or bf16
  log_every_n_steps: 10  # Interval of logging.
  enable_progress_bar: True
  num_sanity_val_steps: 0 # number of validation steps to run as a sanity check before training starts; set to 0 to disable
  check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
  sync_batchnorm: true
  enable_checkpointing: False  # Provided by exp_manager
  logger: false  # Provided by exp_manager
  benchmark: false # needs to be false for models with variable-length speech input as it slows down training

exp_manager:
  exp_dir: null
  name: ${name}
  create_tensorboard_logger: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    # in case of multiple validation sets, first one is used
    monitor: "val_wer"
    mode: "min"
    save_top_k: 5
    always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints

  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
  # you need to set these two to True to continue the training
  resume_if_exists: false
  resume_ignore_no_checkpoint: false

  # You may use this section to create a W&B logger
  create_wandb_logger: false
  wandb_logger_kwargs:
    name: null
    project: null

But the model does not converge. These are the results:

[screenshots of the training results]

I tried different learning rates with warmup_steps set to null and a ratio of 0.1, but the loss exploded earlier than it did with the default configuration parameters.

This is how I run fine-tuning:

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecCTCModelBPE
from nemo.utils.exp_manager import exp_manager

# Load the YAML config and resolve interpolations such as ${model.sample_rate}.
cfg = OmegaConf.load('fine_tuning/configs/fastconf_xlrage_long.yaml')
cfg = OmegaConf.create(OmegaConf.to_container(cfg, resolve=True))

# Build the trainer and attach the experiment manager (checkpointing, TB logging).
trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager", None))

# Build the model from the config, then load weights from the checkpoint
# referenced by init_from_ptl_ckpt.
asr_model = EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)
asr_model.maybe_init_from_pretrained_checkpoint(cfg)

trainer.fit(asr_model)
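
For the learning-rate experiments mentioned above, I apply the overrides on top of the loaded config instead of editing the YAML. A minimal sketch of that, assuming the same loading steps as in the script; the concrete numbers are placeholders, not the exact values from my runs:

from omegaconf import OmegaConf

# Same loading/resolution steps as above.
cfg = OmegaConf.load('fine_tuning/configs/fastconf_xlrage_long.yaml')
cfg = OmegaConf.create(OmegaConf.to_container(cfg, resolve=True))

# Override the peak LR and switch from step-based to ratio-based warmup
# before the model and trainer are constructed. Values are illustrative only.
cfg.model.optim.lr = 1e-4
cfg.model.optim.sched.warmup_steps = None
cfg.model.optim.sched.warmup_ratio = 0.1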