Controlled remote vs local execution: cmd.sh
Kaldi is designed to work with SunGrid clusters. It also work with other clusters. We want to run it locally, it can do that too. This can be done by making sure cmd.sh sets the variables as follows:
export train_cmd=run.pl
export decode_cmd=run.pl
rather than making references to queue.pl
.
Training (and testing), will still be split into multiple jobs, each handling different subsets of the data.
Training Recognizer
This section is covered by this section of the kaldi tutorial.
The majority of the steps covered in this page, are triggered by the script run.sh
Training
Done using the script steps/train_mono.sh
, However very similar steps are used in the other training scripts from in steps
(such as steps/train_deltas
).
Usage:
steps/train_mono.sh [options] <training-data-dir> <lang-dir> <exp-dir>"
training-data-dir
is the path to the training data directory prepared earlierlang-dir
is the path the directory containing all the language model files, also prepared earlierexp-dir
is a path for the training to store all of its outputs. It will be created if it does not exist.
Configuration / Options
The train_mono
script takes many configuration options.
They can be set by passing them as flags to script: as so: --<option-name> <value>
.
Or by putting them all into a config bash script, and adding the flag --config <path>
.
They could also be set by editing the defaults in steps/train_mono.sh
, but there is no good reason to do this.
nj
: Number of Jobs to run in parallel. (default4
)cmd
: Job dispatcher script (defaultrun.pl
)scale_opts
: takes a string (wrap it in quotes) to control scaling options (default"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1"
)transition-scale
(default1.0
)acoustic-scale
(default0.1
)self-loop-scale
(default0.1
)
num_iters
Number of iterations of training (default40
)max_iter_inc
maximum amount to increase the number of Gaussians by (default30
)totgauss
Target number of Gaussians (default1000
)-
careful
passed on togmm-align-compiled
. To quote its documentation: “If true, do ‘careful’ alignment, which is better at detecting alignment failure (involves loop to start of decoding graph).” (defaultfalse
) boost_silence
Factor by which to boost silence likelihoods in alignment. (Default1.0
)-
realign_iters
iterations in which to perform realignment (default"1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38"
) power
exponent to determine number of Gaussians from occurrence counts (default0.25
)cmvn_opts
options will be passed on to cmvn – like scale_opts. (default""
)stage
: This is used to allow you to skip some steps, if the program crashed partway though. The stage variable sets the stage to start at. The stages are discussed in the next section (default-4
)
What is the parallelism of Jobs in the Training step
During training, the training set can be (and is in the example) split up (the actual spitting is explained in the data preparation step), and each different process (Job), trains on a different subset of utterances, which each iteration are then merged.
Initialisation Stages
Initialise GMM (Stage -3)
Uses /kaldi-trunk/src/gmmbin/gmm-init-mono
.
Call that with the --help
option for more info
This defines (amongst other things), how many GMMs there are initially.
Compile Training Graphs (Stage -2)
uses /kaldi-trunk/source/bin/compile-training-graphs
.
Call that with the --help
option for more info.
See this section of the documentation.
Align Data Equally (Stage -1)
Creates an equally spaced alignment. As a starting point for further alignment stages.
Uses /kaldi-trunk/source/bin/align-equal-compiled
.
Call that with the --help
option for more info.
Estimate Gaussians (Stage 0)
Do the maximum likelihood estimation of GMM-based acoustic model.
Uses /kaldi-trunk/src/gmmbin/gmm-est
.
Call that with the --help
option for more info.
The script notes:
In the following steps, the
--min-gaussian-occupancy=3
option is important, otherwise we fail to est[imate] “rare” phones and later on, they never align properly.
Training (Stage = Iterations completed)
Every Iteration a number of steps are carried out.
Realign
If this iteration is one of the realign_iters
then:
Boost Silence
Silence is boosted using /kaldi-trunk/src/gmmbin/gmm-boost-silence
,
Call that with the --help
option for more info.
Notably it does not necessarily boost the silence phone (but it does in this training case), it can boost any phone.
It does this by modifying the GMM weights, to make silence more probable.
Align
Features are aligned given the GMM models.
Uses kaldi-kaldi/src/gmmbin/gmm-align-compiled
.
Call that with the --help
option for more info.
Reestimate the GMM model.
First accumulate stats which are used in the next step.
This is done using /kaldi-trunk/src/gmmbin/gmm-acc-states-ali
.
Call that with the --help
option for more info.
Then redo the GMM-based acoustic model.
This is done with /kaldi-trunk/src/gmmbin/gmm-est
, but using very different arguments.
Again call that with the --help
option for more info.
Merge GMMs
All the different GMMs from the partitioned training dataset are then merged,
using gmm-acc-sum
to produce a model (a .mdl
file).
The model can be examined using /kaldi-trunk/src/gmmbin/gmm-info
to get some very basic information about the number of Gaussians etc.
Finally, increase the number of Gaussians (capped by max_iter_inc
), so that by the time all the iterations (num_iters
) are all complete, it will approach the target total number of Gaussians (totgauss
) – assuming max_iter_inc
did not block it.
Making of the Decoding Graph
As showing earlier, the Grammar (G) can be composed with the Lexicon (L), to get a phoneme to word mapping.
To increase the power of the phones, they could be expanded to add context. For example making the ‘ay’ phone in ‘n-ay-n’ (nine) different from the one in ’m-ay-n' (mine). This can be done with a Context dependency FST, which can be scaled with the number of phones to take into account in the context. This is roughly equivelent to making use of n-grams on the phonetic level. Using 3 (i.e. one to each side) context is referred to a triphones.
The Context Dependency can be expressed as a FST, referred to as C.
This blog post presents the details of the creation quiet well. It will be a bit of revision from the data preparation step.
Usage of util/mkgraph.sh
The final graph is created using util/mkgraph.sh
To quote the introduction to that script:
…creates a fully expanded decoding graph (HCLG) that represents all the language-model, pronunciation dictionary (lexicon), context-dependency, and HMM structure in our model. The output is a Finite State Transducer that has word-ids on the output, and pdf-ids on the input (these are indexes that resolve to Gaussian Mixture Models).
It also creates the aforementioned Context Dependency Graph.
Usage:
utils/mkgraph.sh [options] <lang-dir> <model-dir> <graphdir>
lang-dir
is, as before, the path to the directory containing all the language model files, also prepared earliermodel-dir
is theexp-dir
from the previous train-mono step, which now contained the trained model.graph-dir
is the directory to place the final graph in. In the example script this is made as a graph subdirectory under theexp-dir
. If it does not exist, it will be created
Context Options
There are 3 options for defining how many phones are used to create the context.
These are passed as the options to the utils/mkgraph.sh
script
--mono
for monophone i.e. one phone i.e. no context (Used insteps/train_mono.sh
)- no flag (default) for triphone i.e. 3 phones i.e. one phone to each side for context
--quinphone
for quinphone i.e. 5 phones i.e. 2 phones to each side for context
It would not be hard to extend the mkgraph script to create contexts of any length.
The section of the mkgraph script responsible for this,
makes use of /kaldi-trunk/src/fstbin/fstcomposecontext
, take a look at its --help
for more information.