Cost is nan When adding guided alignment #417

Open
mahmoudaymo opened this issue Sep 14, 2023 · 1 comment
@mahmoudaymo

Bug description

I trained a model for 5 epochs without guided alignment, then continued training for 5 more epochs with guided alignment. The run without guided alignment went fine. Once guided alignment was added (the second 5 epochs), the training cost was nan on every update.

How to reproduce

I ran this script:

`#!/bin/bash

set -e

exp_dir=path_to_experiment_dir

# Stage 1: train the base model for 5 epochs without guided alignment.
exp=$exp_dir/basemodel
config=$exp/config.yml

/marian/build/marian -c "$config" \
    --valid-log "$exp/valid.log" \
    --log "$exp/train.log" \
    --model "$exp/model.npz" \
    --after 5e

# Stage 2: continue to 10 epochs with guided alignment enabled.
exp=$exp_dir/finetuned
config=$exp/config.yml # This config is similar to the above except I unset --all-caps-every and --english-title-case-every params

# Note: pretrained_model_path is not set anywhere in this script as posted.
/marian/build/marian -c "$config" \
    --pretrained-model "$pretrained_model_path" \
    --valid-log "$exp/valid.log" \
    --log "$exp/train.log" \
    --model "$exp/model.npz" \
    --after 10e \
    --guided-alignment /Engines/MAS/ENUSDEDE/alignment/corpus.align \
    --guided-alignment-cost ce`
marian.logs.txt
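
Since the nan cost only appears once `--guided-alignment` is passed, one thing worth ruling out first is the alignment file itself. The sketch below is only a sanity check, not a diagnosis. It assumes the file is in the fast_align-style format (space-separated `i-j` pairs, one line per training sentence pair). Both function names are hypothetical helpers and are not part of Marian.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Marian) for sanity-checking an alignment file.
# Assumption: fast_align-style lines such as "0-0 1-2 2-1".

# Print the number of lines containing at least one token
# that is not of the form <int>-<int>.
count_bad_alignment_lines() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i !~ /^[0-9]+-[0-9]+$/) { bad++; break }
    }
  } END { print bad + 0 }' "$1"
}

# Succeed only if both files have the same number of lines,
# e.g. the alignment file and one side of the training corpus.
same_line_count() {
  [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
}
```

For example, `count_bad_alignment_lines /Engines/MAS/ENUSDEDE/alignment/corpus.align` should print `0` for a well-formed file, and `same_line_count` can be pointed at the alignment file and whichever training file the config lists. Neither check looks at whether the indices fall within the sentence lengths, so a clean result does not rule the file out.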

Context

  • Marian version: 1.12.0
  • CMake command: Type the cmake command you used and attach the output of --build-info all
  • Log file: Attach your training/decoding logs

Add any other information about the problem here.

@TransperfectAI

We are experiencing this issue too, even when training with alignment from the start. Could it be related to the guided-alignment-cost? We used to use mse and switched to ce when mse was no longer supported. The issue started for us after that change.
The switch also means that to restart training in an existing directory you need to edit the cost in model.npz.progress.yml, otherwise Marian throws an error.
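
For anyone hitting the restart error, here is a minimal sketch for locating the entry to edit. The thread does not say which key in model.npz.progress.yml holds the cost, so this only backs up the file and lists the lines that mention the old value; the edit itself is left to be done by hand. `find_cost_entries` is a hypothetical helper, not a Marian tool.

```shell
#!/bin/bash
# Hypothetical helper (not part of Marian): back up a progress file and
# print, with line numbers, every line that mentions the old cost name.
# Returns grep's exit status, so it fails when nothing matches.
find_cost_entries() {
  local file=$1 old_cost=$2
  cp -- "$file" "$file.bak"             # keep a copy before any manual edit
  grep -n -w -- "$old_cost" "$file"     # -w: match the value as a whole word
}
```

Usage would look like `find_cost_entries "$exp/model.npz.progress.yml" mse`, after which the listed lines can be changed to `ce` in an editor.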
