Day16 - Kaldi, the legendary speech recognition toolkit - part 1

Kaldi is an open-source toolkit written in C++, developed by Dan Povey, a legendary figure in the speech recognition field. In practice it is driven mainly through shell scripts, and it is used for speech recognition and speech signal processing. The three main Kaldi-related websites are:

  1. Kaldi official site: installation instructions, technical documentation, and downloads of various pre-trained speech models
  2. Kaldi GitHub: the source code
  3. Kaldi-help: the Kaldi discussion forum, where you can discuss any tricky problems with other developers

To set up the Kaldi environment, first clone the GitHub project, then go into the tools/ directory and run the following commands in order:

git clone https://github.com/kaldi-asr/kaldi.git
cd kaldi/tools
extras/check_dependencies.sh
# if needed, specify the compiler explicitly, e.g.:
# CXX=g++-4.8 extras/check_dependencies.sh
make
# or run with multiple CPUs (4 for example)
make -j 4

Then go into the src/ directory and run the following commands:

./configure --shared
# run in parallel by multiple CPUs (4 for example)
make depend -j 4
make -j 4

Basically, this just follows the installation steps described in INSTALL.
The project contains many directories and files; based on the official documentation's recommendations and my own experience, here are the ones you most need to know about and will use most often:

  1. egs/: this is probably the directory you will use most when actually building speech recognition models. It contains example recipes for various datasets and languages. For Chinese, aidatatang_200zh, aishell, aishell2, and thchs30 use Simplified Chinese datasets, while the formosa recipe uses Traditional Chinese; for more details, see https://sites.google.com/speech.ntut.edu.tw/fsw/home.
  2. tools/: many tools used during training; see here
  3. src/: the source code; see here

Next, we will modify the formosa recipe to see how Kaldi actually works in practice.
First, let's look at the main program, run.sh, which consists of the following steps:

  1. Lexicon preparation (local/prepare_dict.sh)
  2. Data preparation (local/prepare_data.sh)
  3. Language model training (local/train_lms.sh)
    • a 3-gram model by default
  4. Speech feature extraction (steps/make_mfcc_pitch.sh)
    • parameters can be adjusted in conf/mfcc.conf
  5. Monophone model training (steps/train_mono.sh)
  6. Triphone model training (train tri1~tri5 models)
  7. Neural network model training (local/chain/run_tdnn.sh)
    • Kaldi commonly uses TDNN (Time Delay Neural Network) models
  8. Computing the word error rate on the test set with the trained models (decoding)

The run.sh script is as follows:
stage=-2
# number of jobs running in parallel; adjust according to your machine
num_jobs=16

train_dir=<train-data-path>
eval_dir=<eval-data-path>

# shell options
set -eo pipefail

. ./cmd.sh
. ./utils/parse_options.sh
# data preparation
if [ $stage -le -2 ]; then
  # Lexicon Preparation,
  echo "$0: Lexicon Preparation"
  local/prepare_dict.sh || exit 1;

  # Data Preparation
  echo "$0: Data Preparation"
  local/prepare_data.sh --train-dir $train_dir || exit 1;
  # Phone Sets, questions, L compilation
  echo "$0: Phone Sets, questions, L compilation Preparation"
  rm -rf data/lang
  utils/prepare_lang.sh --position-dependent-phones false data/local/dict \
      "<SIL>" data/local/lang data/lang || exit 1;

  # LM training
  echo "$0: LM training"
  rm -rf data/local/lm/3gram-mincount
  local/train_lms.sh || exit 1;

  # G compilation, check LG composition
  echo "$0: G compilation, check LG composition"
  utils/format_lm.sh data/lang data/local/lm/3gram-mincount/lm_unpruned.gz \
      data/local/dict/lexicon.txt data/lang_test || exit 1;

fi

# Now make MFCC plus pitch features.
# mfccdir should be some place with a largish disk where you
# want to store MFCC features.
mfccdir=mfcc

# mfcc
if [ $stage -le -1 ]; then
  echo "$0: making mfccs"
  for x in train test ; do
    steps/make_mfcc_pitch.sh --cmd "$train_cmd" --nj $num_jobs data/$x exp/make_mfcc/$x $mfccdir || exit 1;
    steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x $mfccdir || exit 1;
    utils/fix_data_dir.sh data/$x || exit 1;
  done
fi

# mono
if [ $stage -le 0 ]; then
  echo "$0: train mono model"
  # Make some small data subsets for early system-build stages.
  echo "$0: make training subsets"
  utils/subset_data_dir.sh --shortest data/train 3000 data/train_mono

  # train mono
  steps/train_mono.sh --boost-silence 1.25 --cmd "$train_cmd" --nj $num_jobs \
    data/train_mono data/lang exp/mono || exit 1;

  # Get alignments from monophone system.
  steps/align_si.sh --boost-silence 1.25 --cmd "$train_cmd" --nj $num_jobs \
    data/train data/lang exp/mono exp/mono_ali || exit 1;

  # Monophone decoding
  (
  utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph || exit 1;
  steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj $num_jobs \
    exp/mono/graph data/test exp/mono/decode_test
  )&
fi

# tri1
if [ $stage -le 1 ]; then
  echo "$0: train tri1 model"
  # train tri1 [first triphone pass]
  steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" \
   2500 20000 data/train data/lang exp/mono_ali exp/tri1 || exit 1;

  # align tri1
  steps/align_si.sh --cmd "$train_cmd" --nj $num_jobs \
    data/train data/lang exp/tri1 exp/tri1_ali || exit 1;

  # decode tri1
  (
  utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph || exit 1;
  steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj $num_jobs \
    exp/tri1/graph data/test exp/tri1/decode_test
  )&
fi

# tri2
if [ $stage -le 2 ]; then
  echo "$0: train tri2 model"
  # train tri2 [delta+delta-deltas]
  steps/train_deltas.sh --cmd "$train_cmd" \
   2500 20000 data/train data/lang exp/tri1_ali exp/tri2 || exit 1;

  # align tri2
  steps/align_si.sh --cmd "$train_cmd" --nj $num_jobs \
    data/train data/lang exp/tri2 exp/tri2_ali || exit 1;

  # decode tri2
  (
  utils/mkgraph.sh data/lang_test exp/tri2 exp/tri2/graph
  steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj $num_jobs \
    exp/tri2/graph data/test exp/tri2/decode_test
  )&
fi

# tri3a
if [ $stage -le 3 ]; then
  echo "$0: train tri3 model"
  # Train tri3a, which is LDA+MLLT,
  steps/train_lda_mllt.sh --cmd "$train_cmd" \
   2500 20000 data/train data/lang exp/tri2_ali exp/tri3a || exit 1;

  # decode tri3a
  (
  utils/mkgraph.sh data/lang_test exp/tri3a exp/tri3a/graph || exit 1;
  steps/decode.sh --cmd "$decode_cmd" --nj $num_jobs --config conf/decode.config \
    exp/tri3a/graph data/test exp/tri3a/decode_test
  )&
fi

# tri4
if [ $stage -le 4 ]; then
  echo "$0: train tri4 model"
  # From now, we start building a more serious system (with SAT), and we'll
  # do the alignment with fMLLR.
  steps/align_fmllr.sh --cmd "$train_cmd" --nj $num_jobs \
    data/train data/lang exp/tri3a exp/tri3a_ali || exit 1;

  steps/train_sat.sh --cmd "$train_cmd" \
    2500 20000 data/train data/lang exp/tri3a_ali exp/tri4a || exit 1;

  # align tri4a
  steps/align_fmllr.sh  --cmd "$train_cmd" --nj $num_jobs \
    data/train data/lang exp/tri4a exp/tri4a_ali

  # decode tri4a
  (
  utils/mkgraph.sh data/lang_test exp/tri4a exp/tri4a/graph
  steps/decode_fmllr.sh --cmd "$decode_cmd" --nj $num_jobs --config conf/decode.config \
    exp/tri4a/graph data/test exp/tri4a/decode_test
  )&
fi

# tri5
if [ $stage -le 5 ]; then
  echo "$0: train tri5 model"
  # Building a larger SAT system.
  steps/train_sat.sh --cmd "$train_cmd" \
    3500 100000 data/train data/lang exp/tri4a_ali exp/tri5a || exit 1;

  # align tri5a
  steps/align_fmllr.sh --cmd "$train_cmd" --nj $num_jobs \
    data/train data/lang exp/tri5a exp/tri5a_ali || exit 1;

  # decode tri5
  (
  utils/mkgraph.sh data/lang_test exp/tri5a exp/tri5a/graph || exit 1;
  steps/decode_fmllr.sh --cmd "$decode_cmd" --nj $num_jobs --config conf/decode.config \
     exp/tri5a/graph data/test exp/tri5a/decode_test || exit 1;
  )&
fi

# nnet3 tdnn models
# commented out by default, since the chain model is usually faster and better
#if [ $stage -le 6 ]; then
  # echo "$0: train nnet3 model"
  # local/nnet3/run_tdnn.sh
#fi

# chain model
if [ $stage -le 7 ]; then
   # The iVector-extraction and feature-dumping parts could be skipped by setting "--train_stage 7"
   echo "$0: train chain model"
   local/chain/run_tdnn.sh
fi

# getting results (see RESULTS file)
if [ $stage -le 8 ]; then
  echo "$0: extract the results" |& tee -a RETRAIN_RESULTS
  for test_set in test ; do
  echo "WER: $test_set" |& tee -a RETRAIN_RESULTS
  for x in exp/*/decode_${test_set}*; do [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh; done |& tee -a RETRAIN_RESULTS
  for x in exp/*/*/decode_${test_set}*; do [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh; done |& tee -a RETRAIN_RESULTS
  echo |& tee -a RETRAIN_RESULTS

  echo "CER: $test_set" |& tee -a RETRAIN_RESULTS
  for x in exp/*/decode_${test_set}*; do [ -d $x ] && grep WER $x/cer_* | utils/best_wer.sh; done |& tee -a RETRAIN_RESULTS
  for x in exp/*/*/decode_${test_set}*; do [ -d $x ] && grep WER $x/cer_* | utils/best_wer.sh; done |& tee -a RETRAIN_RESULTS
  echo |& tee -a RETRAIN_RESULTS
  done
fi

# finish
echo "$0: all done"

exit 0;
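One detail worth understanding before modifying the recipe: every block in run.sh is gated on $stage, and utils/parse_options.sh turns script variables into command-line flags, so a failed run can be resumed from any stage by passing --stage N. A minimal, self-contained sketch of the pattern (run_stage is a hypothetical helper for illustration only; in run.sh the checks are written inline as `if [ $stage -le N ]`):

```shell
# Sketch of the stage-gating idiom used throughout run.sh.
stage=3   # run.sh lets you override this, e.g.: ./run.sh --stage 3

run_stage() {
  # run the named step only when its number is >= the requested stage
  local n=$1 name=$2
  if [ "$stage" -le "$n" ]; then
    echo "running: $name"
  fi
}

run_stage 0 mono    # skipped: stage 3 > 0
run_stage 3 tri3a   # runs
run_stage 7 chain   # runs
```

Because `set -eo pipefail` aborts the script on the first failing command, fixing the problem and re-running with a later `--stage` is the usual workflow.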

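Also note the `( ... ) &` wrapper around every decoding block: graph building and decoding run in a backgrounded subshell, so the next training stage can start without waiting for them. A minimal sketch of the idiom, with echo/sleep standing in for the real mkgraph/decode calls:

```shell
# Decoding runs in a backgrounded subshell while "training" continues in the
# foreground; `wait` blocks until all background jobs have finished.
(
  sleep 0.1              # stands in for utils/mkgraph.sh + steps/decode.sh
  echo "decode finished"
) &

echo "next training stage starts"
wait
```

When run, "next training stage starts" prints before "decode finished", because the foreground echo does not wait for the backgrounded subshell.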
Explaining every step in detail would take days, so tomorrow I will focus on three parts: lexicon preparation, data preparation, and neural network model training.

References:

http://kaldi-asr.org/

