THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT

Mirco Ravanelli1, Titouan Parcollet2, Yoshua Bengio1*

1 Mila, Université de Montréal, * CIFAR Fellow
2 LIA, Université d'Avignon

ABSTRACT

The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to its simplicity and flexibility.

The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits, but it also embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly both locally and on HPC clusters.

Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.

Index Terms: speech recognition, deep learning, Kaldi, PyTorch.

1. INTRODUCTION

Over the last years, we have witnessed a progressive improvement and maturation of Automatic Speech Recognition (ASR) technologies [1, 2], which have reached unprecedented performance levels and are nowadays used by millions of users worldwide.

A key role in this technological breakthrough is being played by deep learning [3], which contributed to overcoming previous speech recognizers based on Gaussian Mixture Models (GMMs). Beyond deep learning, other factors have played a role in the progress of the field. A number of speech-related projects such as AMI [4], DICIT [5], and DIRHA [6], as well as speech recognition challenges such as CHiME [7], Babel, and Aspire, have remarkably fostered progress in ASR. The public distribution of large datasets such as LibriSpeech [8] has also played an important role in establishing common evaluation frameworks and tasks.

Among other factors, the development of open-source software such as HTK [9], Julius [10], CMU-Sphinx, RWTH-ASR [11], LIA-ASR [12] and, more recently, the Kaldi toolkit [13] has further helped popularize ASR, making both research and development of novel ASR applications significantly easier.

Kaldi currently represents the most popular ASR toolkit. It relies on finite-state transducers (FSTs) [14] and provides a set of C++ libraries for efficiently implementing state-of-the-art speech recognition systems. Moreover, the toolkit includes a large set of recipes that cover all the most popular speech corpora. In parallel to the development of this ASR-specific software, several general-purpose deep learning frameworks, such as Theano [15], TensorFlow [16], and CNTK [17], have gained popularity in the machine learning community. These toolkits offer huge flexibility in neural network design and can be used for a variety of deep learning applications.

PyTorch [18] is an emerging Python package that implements efficient GPU-based tensor computations and facilitates the design of neural architectures, thanks to proper routines for automatic gradient computation. An interesting feature of PyTorch lies in its modern and flexible design, which naturally supports dynamic neural networks. In fact, the computational graph is dynamically constructed on-the-fly at run time rather than being statically compiled.

The PyTorch-Kaldi project aims to bridge the gap between Kaldi and PyTorch1. Our toolkit implements acoustic models in PyTorch, while feature extraction, label/alignment computation, and decoding are performed with the Kaldi toolkit, making it a perfect fit for developing state-of-the-art DNN-HMM speech recognizers. PyTorch-Kaldi natively supports several DNN, CNN, and RNN models. Combinations of deep learning models, acoustic features, and labels are also supported, enabling the use of complex neural architectures. For instance, users can employ a cascade of CNNs, LSTMs, and DNNs, or run in parallel several models that share some hidden layers. Users can also explore different acoustic features, context durations, neuron activations (e.g., ReLU, leaky ReLU), normalizations (e.g., batch [19] and layer normalization [20]), cost functions, regularization strategies (e.g., L2, dropout [21]), optimization algorithms (e.g., SGD, Adam [22], RMSprop), and many other hyper-parameters of an ASR system through simple edits of a configuration file.

The toolkit is designed to make the integration of user-defined acoustic models as simple as possible. In practice, users can embed their deep learning model and conduct ASR experiments even without being fully familiar with the complex speech recognition pipeline. The toolkit can perform computations on both local machines and HPC clusters, and supports multi-GPU training, recovery strategies, and automatic data chunking.

The experiments, conducted on several datasets and tasks, have shown that PyTorch-Kaldi makes it possible to easily develop competitive state-of-the-art speech recognition systems.

2. THE PYTORCH-KALDI PROJECT

An overview of the architecture adopted in PyTorch-Kaldi is reported in Fig. 1. The main script run_exp.py is written in Python and manages all the phases involved in an ASR system, including feature and label extraction, training, validation, decoding, and scoring. The toolkit is detailed in the following sub-sections.

Fig. 1: An overview of the PyTorch-Kaldi architecture. Kaldi blocks perform feature computation/reading, label computation/reading, posterior processing, and decoding; PyTorch blocks perform feature processing, label processing, and the DNN acoustic model; the input is the speech waveform.

1 The code is available on GitHub (github.com/mravanelli/PyTorch-kaldi/).
2.1. Configuration file

The main script takes as input a configuration file in INI format2, which is composed of several sections. The [Exp] section specifies high-level information such as the folder used for the experiment, the number of training epochs, and the random seed. It also allows users to specify whether the experiments have to be conducted on a CPU, a GPU, or multiple GPUs. The configuration file continues with the [dataset*] sections, which specify information on features and labels, including the paths where they are stored, the characteristics of the context window [23], and the number of chunks into which the speech dataset must be split. The neural models are described in the [architecture*] sections, while the [model] section defines how these neural networks are combined. The latter section exploits a simple meta-language that is automatically interpreted by the run_exp.py script. Finally, the configuration file defines the decoding parameters in the [decoding] section.

2 The configuration file is fully described in the project documentation.
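To make the structure concrete, the following is a minimal, hypothetical sketch of such a configuration file. The section names follow the description above, but the individual option names, values, and the [model] meta-language line are illustrative assumptions and may differ from those used in the released recipes.

    [Exp]
    ; high-level experiment settings (illustrative values)
    out_folder = exp/TIMIT_MLP
    n_epochs = 24
    seed = 1234
    use_cuda = True

    [dataset1]
    ; feature/label locations and context-window characteristics
    fea_lst = data/train/feats.scp
    lab_folder = exp/tri3_ali
    cw_left = 5
    cw_right = 5
    n_chunks = 5

    [architecture1]
    arch_name = MLP_layers
    hidden_dim = 1024
    n_layers = 4

    [model]
    model = out_dnn1 = compute(architecture1, fea1)

    [decoding]
    beam = 13.0
    lm_weight = 1.0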
2.2. Features

Feature extraction is performed with Kaldi, which natively provides efficient C++ tools (e.g., compute-mfcc-feats, compute-fbank-feats, compute-plp-feats) to extract the most popular speech recognition features. The computed coefficients are stored in binary archives (with extension .ark) and are later imported into the Python environment using the kaldi-io utilities inherited from the kaldi-io-for-python project3. The features are then processed by the function load-chunk, which performs context-window composition, shuffling, as well as mean and variance normalization. As outlined before, PyTorch-Kaldi can manage multiple feature streams. For instance, users can define models that exploit combinations of MFCC, FBANK, PLP, and fMLLR [24] coefficients.

3 github.com/vesis84/kaldi-io-for-python
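As an illustration, the following minimal sketch shows how Kaldi feature archives can be read from Python with the kaldi_io package and expanded with a symmetric context window. The archive path and the window size are assumptions, and the snippet is a simplified stand-in for the toolkit's load-chunk function rather than its actual implementation.

    import numpy as np
    import kaldi_io  # from the kaldi-io-for-python project

    def apply_context_window(feats, left=5, right=5):
        """Concatenate each frame with its left/right neighbours (edges are replicated)."""
        padded = np.pad(feats, ((left, right), (0, 0)), mode='edge')
        frames = [padded[i:i + feats.shape[0]] for i in range(left + right + 1)]
        return np.concatenate(frames, axis=1)

    # 'mfcc.ark' is a hypothetical archive produced by compute-mfcc-feats.
    for utt_id, feats in kaldi_io.read_mat_ark('mfcc.ark'):
        feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)  # mean/variance normalization
        feats_cw = apply_context_window(feats, left=5, right=5)
        # feats_cw now has shape (n_frames, feat_dim * 11) and can be fed to the DNN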
2.3. Labels

The main labels used for training the acoustic model derive from a forced alignment procedure between the speech features and the sequence of context-dependent phone states computed by Kaldi with a phonetic decision tree. To enable multi-task learning, PyTorch-Kaldi supports multiple labels. For instance, it is possible to jointly load both context-dependent and context-independent targets and use the latter to perform monophone regularization [25, 26]. It is also possible to employ models based on an ecosystem of neural networks performing different tasks, as done in the context of joint training between speech enhancement and speech recognition [27–29] or in the context of the recently-proposed cooperative networks of deep neural networks [30].

2.4. Chunk and Mini-batch Composition

PyTorch-Kaldi automatically splits the full dataset into a number of chunks, which are composed of labels and features randomly sampled from the full corpus. Each chunk is then stored in GPU or CPU memory and processed by the neural training script run_nn.py. The toolkit dynamically composes different chunks at each epoch, and a set of mini-batches is then derived from them. Mini-batches are composed of a few training examples that are used for gradient computation and parameter optimization.

The way mini-batches are gathered strongly depends on the typology of the neural network. For feed-forward models, the mini-batches are composed of randomly shuffled features and labels sampled from the chunk. For recurrent networks, the mini-batches must be composed of full sentences. Different sentences, however, are likely to have different durations, making zero-padding necessary to form mini-batches of the same size. PyTorch-Kaldi sorts the speech sequences in ascending order according to their lengths (i.e., short sentences are processed first). This approach minimizes the need for zero-padding and turned out to be helpful in avoiding possible biases in the batch normalization statistics. Moreover, it has proven useful to slightly boost performance and to improve the numerical stability of gradients.
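The mini-batch strategy for recurrent models can be sketched as follows. This is a simplified illustration of the length-sorting and zero-padding idea described above, not the toolkit's actual run_nn.py code, and the tensor layout is an assumption.

    import torch

    def make_recurrent_minibatches(sentences, batch_size):
        """sentences: list of (features, labels) pairs; features has shape (n_frames, feat_dim)."""
        # Sort sentences by length so that each mini-batch contains utterances
        # of similar duration, minimizing the amount of zero-padding needed.
        sentences = sorted(sentences, key=lambda s: s[0].shape[0])
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            max_len = max(feat.shape[0] for feat, _ in batch)
            feat_dim = batch[0][0].shape[1]
            feats = torch.zeros(max_len, len(batch), feat_dim)  # zero-padded (time, batch, feat)
            labs = torch.zeros(max_len, len(batch), dtype=torch.long)
            for j, (feat, lab) in enumerate(batch):
                feats[:feat.shape[0], j] = torch.as_tensor(feat)
                labs[:lab.shape[0], j] = torch.as_tensor(lab)
            yield feats, labs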
2.5. DNN acoustic modeling

Each mini-batch is processed by a neural network implemented with PyTorch, which takes the features as input and produces as output a set of posterior probabilities over the context-dependent phone states. The code is designed to easily plug in customized models. As reported in the pseudo-code of Fig. 2, a new model can be defined simply by adding a new class to neural_nets.py. The class must be composed of an initialization method, which specifies the parameters and their initialization, and a forward method, which defines the computations to perform.

Fig. 2: Adding a user model into PyTorch-Kaldi

    import torch.nn as nn

    class my_NN(nn.Module):

        def __init__(self, options):
            super(my_NN, self).__init__()
            # Definition of model parameters
            # Parameter initialization

        def forward(self, minibatch):
            # Definition of model computations
            return [output_prob]

As an alternative, a number of pre-defined state-of-the-art neural models are natively implemented within the toolkit. The current version supports standard MLP, CNN, RNN, LSTM, and GRU models. Moreover, it supports some advanced recurrent architectures, such as the recently-proposed Light GRU [31] and twin-regularized RNNs [32]. The SincNet model [33, 34] is also implemented to perform speech recognition directly from the raw waveform. The hyper-parameters of the model (such as learning rate, number of neurons, number of layers, dropout factor, etc.) can be tuned using a utility that implements the random search algorithm [35].

2.6. Decoding and Scoring

The acoustic posterior probabilities generated by the neural network are normalized by their priors before being fed to an HMM-based decoder. The decoder merges the acoustic scores with the language probabilities derived from an n-gram language model and tries to retrieve the sequence of words uttered in the speech signal using a beam-search algorithm. The final Word Error Rate (WER) score is computed with the NIST SCTK scoring toolkit.
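The prior normalization mentioned above is typically carried out in the log domain. The following sketch shows the idea under the assumption that the state priors are estimated from the occupation counts of the training alignments; the function and variable names are illustrative and do not necessarily match the toolkit's code.

    import numpy as np

    def normalize_posteriors(log_posteriors, state_counts):
        """Convert DNN posteriors p(state|x) into scaled likelihoods for HMM decoding.

        log_posteriors: (n_frames, n_states) log-softmax outputs of the acoustic model.
        state_counts:   (n_states,) occupation counts of each context-dependent state
                        in the training alignments, used to estimate the priors.
        """
        log_priors = np.log(state_counts / state_counts.sum())
        # By Bayes' rule, log p(x|state) = log p(state|x) - log p(state) + const,
        # and the constant log p(x) is irrelevant for decoding.
        return log_posteriors - log_priors

The resulting matrix is then written back to a Kaldi archive and passed, together with the decoding graph, to a Kaldi lattice decoder.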
3. EXPERIMENTAL SETUP

In the following sub-sections, the corpora and the DNN setting adopted for the experimental activity are described.

3.1. Corpora and Tasks

The first set of experiments was performed with the TIMIT corpus, considering the standard phoneme recognition task (aligned with the Kaldi s5 recipe [13]).

To validate our model in a more challenging scenario, experiments were also conducted in distant-talking conditions with the DIRHA-English dataset4 [36, 37]. Training was based on the original WSJ-5k corpus (consisting of 7,138 sentences uttered by 83 speakers) that was contaminated with a set of impulse responses measured in a domestic environment [37]. The test phase was carried out with the real part of the dataset, consisting of 409 WSJ sentences uttered in the aforementioned environment by six native American speakers.

Additional experiments were conducted with the CHiME 4 dataset [7], which is based on speech data recorded in four noisy environments (on a bus, in a cafe, in a pedestrian area, and at a street junction). The training set is composed of 43,690 noisy WSJ sentences recorded by five microphones (arranged on a tablet) and uttered by a total of 87 speakers. The test set ET-real considered in this work is based on 1,320 real sentences uttered by four speakers, while the subset DT-real has been used for hyperparameter tuning. The CHiME experiments were based on the single-channel setting [7].

Finally, experiments were performed with the LibriSpeech [8] dataset. We used the training subset composed of 100 hours and the dev-clean set for the hyperparameter search. Test results are reported on the test-clean part using the fglarge decoding graph inherited from the Kaldi s5 recipe.

4 This dataset is distributed by the Linguistic Data Consortium (LDC).

3.2. DNN setting

The experiments consider different acoustic features, i.e., 39 MFCCs (13 static + Δ + ΔΔ), 40 log-mel filter-bank features (FBANKs), as well as 40 fMLLR features [24] (extracted as reported in the s5 recipe of Kaldi), all computed using windows of 25 ms with an overlap of 10 ms.

The feed-forward models were initialized according to the Glorot scheme [38], while recurrent weights were initialized with orthogonal matrices [39]. Recurrent dropout was used as a regularization technique [40]. Batch normalization was adopted for feed-forward connections only, as proposed in [41, 42]. The optimization was done using the RMSprop algorithm running for 24 epochs. The performance on the development set was monitored after each epoch, and the learning rate was halved when the relative performance improvement fell below 0.1%. The main hyperparameters of the model (i.e., learning rate, number of hidden layers, hidden neurons per layer, dropout factor, as well as the twin regularization term λ) were tuned on the development datasets.
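The learning-rate schedule described above, i.e., halving the learning rate whenever the relative improvement on the development set drops below 0.1%, can be sketched as follows. The 0.1% threshold comes from the paper, while the function and variable names are illustrative.

    def update_learning_rate(lr, prev_dev_err, curr_dev_err, threshold=0.001):
        """Halve the learning rate when the relative error improvement falls below the threshold."""
        relative_improvement = (prev_dev_err - curr_dev_err) / prev_dev_err
        if relative_improvement < threshold:
            lr = lr / 2.0
        return lr

    # Example: dev error goes from 16.0% to 15.99%, so the improvement (0.0006) is
    # below the 0.001 threshold and the learning rate is halved.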
4. BASELINES

In this section, we discuss the baselines obtained with the TIMIT, DIRHA, CHiME, and LibriSpeech datasets. As a showcase to illustrate the main functionalities of the PyTorch-Kaldi toolkit, we first report the experimental validation conducted on TIMIT.

Table 1 shows the performance obtained with several feed-forward and recurrent models using different features. To ensure a more accurate comparison between the architectures, five experiments varying the initialization seeds were conducted for each model and feature. The table thus reports the average phone error rates (PER)5. Results show that, as expected, fMLLR features outperform MFCC and FBANK coefficients, thanks to the speaker adaptation process. Recurrent models significantly outperform the standard MLP, especially when using the LSTM, GRU, and Li-GRU architectures, which effectively address gradient vanishing through multiplicative gates. The best result (PER = 14.2%) is obtained with the Li-GRU model [31], which is based on a single gate and thus saves 33% of the computations over a standard GRU.

Table 1: PER (%) obtained on the test set of TIMIT with various neural architectures.

              MFCC   FBANK   fMLLR
    MLP       18.2   18.7    16.7
    RNN       17.7   17.2    15.9
    LSTM      15.1   14.3    14.5
    GRU       16.0   15.2    14.9
    Li-GRU    15.3   14.6    14.2

5 Standard deviations range between 0.15 and 0.2 for all the experiments.

Table 2 details the impact of some popular techniques implemented in PyTorch-Kaldi for improving the ASR performance. The first row (Baseline) reports the performance achieved with a basic recurrent model, where powerful techniques such as dropout and batch normalization are not adopted. The second row highlights the performance gain achieved when progressively increasing the sequence length during training. In this case, we started the training by truncating the speech sentences at 100 steps (i.e., approximately 1 second of speech) and progressively doubled the maximum sequence duration at every epoch. This simple strategy generally improves the system performance, since it encourages the model to first focus on short-term dependencies and learn longer-term ones only at a later stage. The third row shows the improvement achieved when adding recurrent dropout. Similarly to [40, 42], we applied the same dropout mask for all the time steps to avoid gradient vanishing problems. The fourth row, instead, shows the benefits derived from batch normalization [19]. Finally, the last row shows the performance achieved when also applying monophone regularization [26]. In this case, we employ a multi-task learning strategy by means of two softmax classifiers: the first one estimates context-dependent states, while the second one predicts monophone targets. As observed in [26], our results confirm that this technique can successfully be used as an effective regularizer.

Table 2: PER (%) obtained on TIMIT when progressively applying some techniques implemented within PyTorch-Kaldi.

                              RNN    LSTM   GRU    Li-GRU
    Baseline                  16.5   16.0   16.6   16.3
    + Incr. Seq. length       16.6   15.3   16.1   15.4
    + Recurrent Dropout       16.4   15.1   15.4   14.5
    + Batch Normalization     16.0   14.8   15.3   14.4
    + Monophone Reg.          15.9   14.5   14.9   14.2
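The monophone regularization used in the last row of Table 2 amounts to a weighted multi-task loss over two softmax classifiers that share the same hidden layers. A minimal sketch of this idea is given below; the weighting factor and the function and variable names are assumptions, not the toolkit's exact implementation.

    import torch.nn.functional as F

    def multitask_loss(shared_repr, cd_head, mono_head, cd_targets, mono_targets, lam=0.1):
        """Cross-entropy over context-dependent states plus a weighted monophone term.

        shared_repr: hidden representation shared by the two classifiers.
        cd_head, mono_head: linear layers producing context-dependent and monophone logits.
        lam: weight of the monophone regularization term (illustrative value).
        """
        cd_loss = F.cross_entropy(cd_head(shared_repr), cd_targets)
        mono_loss = F.cross_entropy(mono_head(shared_repr), mono_targets)
        return cd_loss + lam * mono_loss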
The experiments discussed so far are based on single neural models. In Table 3 we compare our best Li-GRU system with a more complex architecture based on a combination of feed-forward and recurrent models fed by a concatenation of features. To the best of our knowledge, the PER = 13.8% achieved by the latter system is the best published performance on the TIMIT test set.

Table 3: PER (%) obtained by combining multiple neural networks and acoustic features.

    Architecture        Features                PER (%)
    Li-GRU              fMLLR                   14.2
    MLP+Li-GRU+MLP      MFCC+FBANK+fMLLR        13.8

The previous achievements were based on standard acoustic features computed with Kaldi. However, within PyTorch-Kaldi users can employ their own features. Table 4 shows the results achieved with convolutional models fed by standard FBANK coefficients or directly by the raw acoustic waveform. The standard CNN based on raw samples performs similarly to the one fed by FBANK features. A slight performance improvement is observed with SincNet [33], whose effectiveness in processing raw waveforms for speech recognition is highlighted here for the first time.

Table 4: PER (%) obtained with standard convolutional and with the SincNet architectures.

    Model      Features         PER (%)
    CNN        FBANK            18.3
    CNN        Raw waveform     18.3
    SincNet    Raw waveform     18.1

We now extend our experimental validation to other datasets. In this regard, Table 5 shows the performance achieved on the DIRHA, CHiME, and LibriSpeech (100h) datasets. The table consistently shows better performance with the Li-GRU model, confirming our previous achievements on TIMIT. The results on DIRHA and CHiME show the effectiveness of the proposed toolkit also in noisy conditions. In particular, DIRHA represents a very challenging task, characterized by the presence of considerable levels of noise and reverberation. The WER = 23.9% obtained on this dataset represents the best performance published so far on the single-microphone task. Finally, the performance obtained on LibriSpeech outperforms the corresponding p-norm Kaldi baseline (WER = 6.5%) on the considered 100-hour subset.

Table 5: WER (%) obtained for the DIRHA, CHiME, and LibriSpeech (100h) datasets with various neural architectures.

              DIRHA   CHiME   LibriSpeech
    MLP       26.1    18.7    6.5
    LSTM      24.8    15.5    6.4
    GRU       24.8    15.2    6.3
    Li-GRU    23.9    14.6    6.2

5. CONCLUSIONS

This paper described the PyTorch-Kaldi project, a new initiative that aims to bridge the gap between Kaldi and PyTorch. The toolkit is designed to make the development of an ASR system simpler and more flexible, allowing users to easily plug in their customized acoustic models. PyTorch-Kaldi also supports combinations of neural architectures, features, and labels, allowing users to employ complex ASR pipelines. The experiments have confirmed that PyTorch-Kaldi can achieve state-of-the-art results on some popular speech recognition tasks and datasets.

The current version of PyTorch-Kaldi is already publicly available along with detailed documentation. The project is still in its initial phase, and we invite all potential contributors to participate in it. We hope to build a community of developers large enough to progressively maintain, improve, and expand the functionalities of our current toolkit. In the future, we plan to increase the number of pre-implemented models coded in our framework, and we would like to extend the current project by integrating neural language model training, as well as proper support for end-to-end ASR systems.

6. ACKNOWLEDGMENT

We would like to thank Maurizio Omologo for his helpful comments. This research was enabled in part by support provided by Calcul Québec and Compute Canada.
7. REFERENCES

[1] D. Yu and L. Deng, Automatic Speech Recognition – A Deep Learning Approach, Springer, 2015.
[2] M. Ravanelli, Deep Learning for Distant Speech Recognition, PhD Thesis, Unitn, 2017.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[4] S. Renals, T. Hain, and H. Bourlard, "Interpretation of Multiparty Meetings: the AMI and Amida Projects," in Proc. of HSCMA, 2008, pp. 115–118.
[5] M. Omologo, "A prototype of distant-talking interface for control of interactive TV," in Proc. of Asilomar Conference on Signals, Systems and Computers, 2010.
[6] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad, M. Hagmueller, and P. Maragos, "The DIRHA simulated corpus," in Proc. of LREC, 2014, pp. 2629–2634.
[7] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME Speech Separation and Recognition Challenge: Dataset, task and baselines," in Proc. of ASRU, 2015, pp. 504–511.
[8] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. of ICASSP, 2015, pp. 5206–5210.
[9] S. Young et al., HTK – Hidden Markov Model Toolkit, 2006.
[10] A. Lee and T. Kawahara, "Recent development of open-source speech recognition engine Julius," in Proc. of APSIPA-ASC, 2008.
[11] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer, Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, "RASR - The RWTH Aachen University Open Source Speech Recognition Toolkit," in Proc. of ASRU, 2011.
[12] G. Linarès, P. Nocera, D. Massonié, and D. Matrouf, "The LIA speech recognition system: From 10xRT to 1xRT," in Text, Speech and Dialogue, V. Matoušek and P. Mautner, Eds., Springer Berlin Heidelberg, 2007, pp. 302–308.
[13] D. Povey et al., "The Kaldi Speech Recognition Toolkit," in Proc. of ASRU, 2011.
[14] M. Mohri, "Finite-state transducers in language and speech processing," Computational Linguistics, vol. 23, no. 2, pp. 269–311, 1997.
[15] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[16] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. of USENIX-OSDI Symposium, 2016, pp. 265–283.
[17] F. Seide and A. Agarwal, "CNTK: Microsoft's Open-Source Deep-Learning Toolkit," in Proc. of ACM SIGKDD, 2016, pp. 2135–2135.
[18] A. Paszke et al., "Automatic differentiation in PyTorch," 2017.
[19] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. of ICML, 2015, pp. 448–456.
[20] L. J. Ba, R. Kiros, and G. E. Hinton, "Layer normalization," CoRR, vol. abs/1607.06450, 2016.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. of ICLR, 2015.
[23] M. Ravanelli and M. Omologo, "Automatic context window composition for distant speech recognition," Speech Communication, vol. 101, pp. 34–44, 2018.
[24] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, no. 4, pp. 75–98, 1998.
[25] M. Ravanelli and M. Omologo, "Contaminated speech training methods for robust DNN-HMM distant speech recognition," in Proc. of Interspeech, 2015, pp. 756–760.
[26] P. Bell, P. Swietojanski, and S. Renals, "Multitask learning of context-dependent targets in deep neural network acoustic models," IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 25, no. 2, pp. 238–247, 2017.
[27] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Batch-normalized joint training for DNN-based distant speech recognition," in Proc. of SLT, 2016, pp. 28–34.
[28] A. Narayanan and D. Wang, "Joint noise adaptive training for robust automatic speech recognition," in Proc. of ICASSP, 2014, pp. 4380–4384.
[29] X. Xiao et al., "Deep beamforming networks for multi-channel speech recognition," in Proc. of ICASSP, 2016.
[30] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "A network of deep neural networks for distant speech recognition," in Proc. of ICASSP, 2017, pp. 4880–4884.
[31] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Light gated recurrent units for speech recognition," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92–102, April 2018.
[32] M. Ravanelli, D. Serdyuk, and Y. Bengio, "Twin regularization for online speech recognition," in Proc. of Interspeech, 2018.
[33] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. of SLT, 2018.
[34] M. Ravanelli and Y. Bengio, "Interpretable convolutional filters with SincNet," in Proc. of NIPS@IRASL, 2018.
[35] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.
[36] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments," in Proc. of ASRU, 2015, pp. 275–282.
[37] M. Ravanelli, P. Svaizer, and M. Omologo, "Realistic multi-microphone data simulation for distant speech recognition," in Proc. of Interspeech, 2016, pp. 2786–2790.
[38] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. of AISTATS, 2010, pp. 249–256.
[39] Q. V. Le, N. Jaitly, and G. E. Hinton, "A simple way to initialize recurrent networks of rectified linear units," arXiv:1504.00941, 2015.
[40] T. Moon, H. Choi, H. Lee, and I. Song, "RNNDROP: A novel dropout for RNNs in ASR," in Proc. of ASRU, 2015.
[41] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, "Batch normalized recurrent neural networks," in Proc. of ICASSP, 2016, pp. 2657–2661.
[42] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Improving speech recognition by revising gated recurrent units," in Proc. of Interspeech, 2017.