Standard Phone Loop Model
The standard model used by AMDTK for discovering acoustic units is a Bayesian phone-loop model. We abuse notation and refer to phones as "units", although there is no guarantee that the model learns a one-to-one mapping from units to phones. Each unit is represented by a left-to-right HMM, and the units are embedded in a loop structure. The phone-loop model represents a truncated Dirichlet Process (DP) whose atoms are the HMMs representing the units, each drawn from the prior over HMM parameters. Each Gaussian component has a Normal-Gamma prior, and the GMM weights of each state have a Dirichlet prior. The mixture weights over units, sampled from the DP, can be seen as a unigram language model over the units. Henceforth, we refer to this standard unit-loop model as the unigram AUD model, where AUD stands for Acoustic Unit Discovery. For more details about this model, see: Variational Inference for Acoustic Unit Discovery. Note that, unlike in the paper, the model has no prior over the unit-HMM transition probabilities: we found experimentally that learning the transitions is time-consuming and does not improve the accuracy of the model.
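The priors described above can be written out explicitly. The following is a sketch using the standard truncated stick-breaking construction of the DP; the symbols are introduced here for illustration (γ for concentration, T for truncation, α, κ, a, b as in the setup.sh example below, i for the unit, s for the state, k for the Gaussian component, d for the feature dimension). It assumes diagonal covariances and a prior mean equal to the feature mean m, neither of which this page states explicitly.

```latex
\begin{aligned}
\beta_i &\sim \mathrm{Beta}(1, \gamma), \quad i = 1, \dots, T-1, \qquad \beta_T = 1,\\
\pi_i &= \beta_i \prod_{j=1}^{i-1} (1 - \beta_j) \quad \text{(unigram probability of unit } i\text{)},\\
\boldsymbol{w}_{i,s} &\sim \mathrm{Dir}(\alpha, \dots, \alpha) \quad \text{(GMM weights of state } s \text{ of unit } i\text{)},\\
\lambda_{i,s,k,d} &\sim \mathrm{Gamma}(a,\, b\, v_d) \quad \text{(shape } a\text{, rate } b\, v_d\text{)},\\
\mu_{i,s,k,d} \mid \lambda_{i,s,k,d} &\sim \mathcal{N}\!\left(m_d,\, (\kappa\, \lambda_{i,s,k,d})^{-1}\right),\\
\mathbb{E}[\pi_i] &= \frac{1}{1+\gamma} \left(\frac{\gamma}{1+\gamma}\right)^{i-1}, \quad i < T.
\end{aligned}
```

The last line shows how the concentration shapes the unigram prior: with γ = 1 the expected weights are 1/2, 1/4, 1/8, ..., so the prior mass decays geometrically over the units, while a larger γ slows the decay and spreads the mass over more units.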
To create a unigram AUD model, run:
utils/phone_loop_create.sh setup.sh keys_file output_directory
Internally, this script proceeds in two steps. First, it estimates the mean m and variance v of the set of features defined by keys_file. Then, it creates a phone-loop model in which the posterior means of the Gaussians are initialized by randomly sampling points from a Gaussian with mean m and variance v. The other parameters of the posterior distributions are initialized to the same values as their respective priors. The prior values should be defined in the setup.sh file:
sil_ngauss=10
concentration=1
truncation=100
nstates=3
ncomponents=2
alpha=3
kappa=5
a=3
b=3
where:
- sil_ngauss is the number of Gaussians in the silence model. If this parameter is greater than 0, the model assumes that each utterance starts and ends in the silence model. The silence model is a left-to-right HMM with the same number of states as the other units, but all of its states share the same GMM. When set to 0, the model includes no silence model and an utterance can start and end in any unit.
- concentration is the concentration parameter of the DP. A large concentration lets the model use more units, whereas small values constrain it to a small set of units.
- truncation is the number of units used to approximate the infinite mixture defined by the DP. Informally, this parameter sets the maximum number of units that can be discovered during training.
- nstates is the number of states in each unit HMM. When set to 1, the model reduces to a simple GMM.
- ncomponents is the number of Gaussian components per state.
- alpha is the hyperparameter of the symmetric Dirichlet prior over the GMM weights of each state.
- kappa is the scaling coefficient of the precision in the Normal-Gamma prior.
- a is the shape parameter of the Gamma distributions.
- b, multiplied by the variance v, is the rate parameter of the Gamma distributions.
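The initialization and the hyperparameters above can be tied together in a small bash/awk sketch. It is not AMDTK code: it assumes the features are available as a whitespace-separated text matrix (one frame per row, one dimension per column), uses the biased 1/N variance estimate, and assumes the silence GMM is counted separately from the truncation units. The function names feature_stats, prior_stats and n_gaussians are hypothetical. It illustrates that with a = b, as in the example setup.sh, the prior expected precision a/(b·v) of each dimension equals 1/v, the inverse of the data variance.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not part of AMDTK: derive concrete prior quantities
# from setup.sh-style hyperparameters and per-dimension feature statistics.
# ASSUMPTION: features are a whitespace-separated text matrix on stdin.
set -euo pipefail

# Same names and values as the setup.sh example.
sil_ngauss=10
truncation=100
nstates=3
ncomponents=2
a=3
b=3

# Print "d mean variance" for every column d (1-based) of the matrix on stdin.
# One-pass, biased (1/N) variance; the estimator AMDTK uses is not stated.
feature_stats() {
  awk '
    NF > 0 {
      for (d = 1; d <= NF; d++) { s[d] += $d; ss[d] += $d * $d }
      n++
      dim = NF
    }
    END {
      for (d = 1; d <= dim; d++) {
        m = s[d] / n
        printf "%d %.6g %.6g\n", d, m, ss[d] / n - m * m
      }
    }'
}

# Read "d mean variance" lines and print the Gamma rate b*v_d and the expected
# precision a/(b*v_d), i.e. the mean of Gamma(shape=a, rate=b*v_d).
# Assumes every v_d > 0.
prior_stats() {
  awk -v a="$a" -v b="$b" '{
    rate = b * $3
    printf "%d rate=%.6g E[precision]=%.6g\n", $1, rate, a / rate
  }'
}

# Total number of Gaussians: truncation units x nstates x ncomponents, plus
# the single GMM shared by all silence states (ASSUMPTION: silence is not one
# of the truncation units).
n_gaussians() {
  echo $(( truncation * nstates * ncomponents + (sil_ngauss > 0 ? sil_ngauss : 0) ))
}

# Demo on a toy 2-dimensional matrix (made-up numbers, not real features).
printf '1 10\n3 14\n' | feature_stats | prior_stats
n_gaussians
```

On the toy matrix, the second dimension has v = 4, hence a Gamma rate of 12 and an expected precision of 0.25 = 1/4; the example setup.sh gives 100 × 3 × 2 + 10 = 610 Gaussians under the silence assumption.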
The phone-loop model can be trained with the Variational EM algorithm by running:
utils/phone_loop_train.sh setup.sh parallel_opts niter model_in_dir model_out_dir
This runs the Variational EM algorithm for niter iterations.
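For reference, the generic mean-field Variational EM scheme alternates the two updates below, and neither update can decrease the evidence lower bound L[q]. This is the textbook form, not a transcription of phone_loop_train.sh; the exact factorization and update schedule that AMDTK uses are not described on this page. Here X denotes the features, Z the hidden unit, state and component assignments, and Θ all parameters that carry a prior (unit weights, GMM weights, means and precisions); since the transition probabilities are not learned, they are not part of Θ.

```latex
\begin{aligned}
\ln p(X) \;\ge\; \mathcal{L}[q] &= \mathbb{E}_{q}\!\left[\ln p(X, Z, \Theta)\right] - \mathbb{E}_{q}\!\left[\ln q(Z)\right] - \mathbb{E}_{q}\!\left[\ln q(\Theta)\right], \qquad q(Z, \Theta) = q(Z)\, q(\Theta),\\
\text{VB-E step:}\quad q(Z) &\propto \exp\!\left(\mathbb{E}_{q(\Theta)}\!\left[\ln p(X, Z \mid \Theta)\right]\right),\\
\text{VB-M step:}\quad q(\Theta) &\propto p(\Theta)\, \exp\!\left(\mathbb{E}_{q(Z)}\!\left[\ln p(X, Z \mid \Theta)\right]\right).
\end{aligned}
```

Because the Normal-Gamma and Dirichlet priors are conjugate, the VB-M step keeps q(Θ) in the same families as the priors, which is why the posterior parameters can be initialized directly from the prior values in setup.sh.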