COMPARISON OF OPTIMIZATION ALGORITHMS OF CONNECTIONIST TEMPORAL CLASSIFIER FOR SPEECH RECOGNITION SYSTEM

This paper evaluates and compares the performance of three well-known optimization algorithms (Adagrad, Adam, and Momentum) for speeding up the training of the neural network used by the CTC algorithm for speech recognition. The CTC model is built on a recurrent neural network, specifically Long Short-Term Memory (LSTM), an effective and widely used model. The data were downloaded from the VCTK corpus of the University of Edinburgh. The optimization algorithms are evaluated by label error rate (LER) and CTC loss.


Introduction
Many techniques exist for speech recognition and for related tasks such as voice pattern recognition, which identifies a person by his or her voice [12]. Because speech data is difficult to segment [1], it is hard to build a model with a simple structure. The dominant technique for automatic speech recognition (ASR) has long been the HMM model [7], which relies on other pre-trained models such as an acoustic model and a language model. However, recent research has shown that recurrent neural networks [9] make it possible to build a network architecture that needs only speech data (.wav) and transcriptions (.txt) to train the model completely, whereas traditional HMM models [7] require separate data for training the language model and the acoustic model. This more advanced algorithm is called the Connectionist Temporal Classifier [8], and an RNN lies at its heart.
Training is one of the most common and crucial steps in working with a neural network. It is important that the model trains quickly and at the same time neither overfits nor underfits, especially with speech data.
Labelling unsegmented data is a very common and often difficult problem in sequence-to-sequence models. A straightforward way to solve it is to label each segment of a sequence (for example, a wave file) manually. However, given how many words occur in speech, let alone sentences, manual labelling is time-consuming, tedious, and hard to do. To avoid this issue, a traditional ASR system uses a language model as in [4], which predicts the probability of the last word given the preceding words of the sentence, and an acoustic model as in [3], which gives the phoneme representation of the given speech (Fig. 1).

Fig. 1. Traditional ASR system
The connectionist temporal classifier [8] requires only speech data (raw audio) and transcriptions (.txt files) to train a single model, without involving a language model. Instead of a language model, it uses a dynamic programming method called beam search [13]. To train the model, any neural network structure uses an optimizer, which helps reach good accuracy quickly and without overfitting or underfitting.
This paper is organized as follows. Section 2 describes the CTC algorithm, beam search, and the optimization algorithms considered in the experiment. Section 3 describes the experiment itself: the construction of the neural network, the optimization algorithms used, and the dataset. Section 4 presents the outcomes of the experiment, comparing the optimization algorithms (Adagrad, Adam, and Momentum) with each other. Section 5 concludes by choosing the best optimizer for the CTC algorithm.

Encoder and decoder
The RNN encoder-decoder is a neural network model that directly computes the conditional probability of the output sequence given the input sequence without assuming a fixed alignment, i.e. $P(y_1, \ldots, y_O \mid x_1, \ldots, x_T)$, where the lengths of the output and the input, $O$ and $T$ respectively, may differ. For speech recognition, the input is usually a sequence of acoustic feature vectors, while the output is usually a sequence of class indices corresponding to units such as phonemes, letters, HMM states, or words. The idea of the encoder-decoder approach is that for each output $y_o$, the encoder maps the input sequence into a fixed-length hidden representation $c_o$, which is referred to as the context vector. From the previous output symbols and the context vector, the decoder computes
$$P(y_1, \ldots, y_O \mid x_1, \ldots, x_T) = \prod_{o=1}^{O} P(y_o \mid y_1, \ldots, y_{o-1}, c_o).$$
Since the probability is conditioned on the previous outputs as well as the context vector, an RNN can be used to compute this probability, as it implicitly remembers the history using a recurrent layer. Let $\mathbf{y}_o$ be a vector representation of the output symbol $y_o$, obtained from a one-hot vector indicating one of the words in the vocabulary followed by a neural projection layer for dimension reduction. The posterior probability of $y_o$ is computed as
$$P(y_o \mid y_1, \ldots, y_{o-1}, c_o) = g(\mathbf{y}_{o-1}, s_o, c_o), \qquad s_o = f(\mathbf{y}_{o-1}, s_{o-1}, c_o),$$
where $s_o$ denotes the output of a recurrent hidden layer $f(\cdot)$ with inputs $\mathbf{y}_{o-1}$, $s_{o-1}$, and $c_o$, and $g(\cdot)$ is a softmax function with inputs $\mathbf{y}_{o-1}$, $s_o$, and $c_o$. We condition both $f(\cdot)$ and $g(\cdot)$ on the context vector to encourage the decoder to rely heavily on the context from the encoder. The previous output $\mathbf{y}_{o-1}$ is also fed to the softmax function $g(\cdot)$ to capture the bigram dependency between consecutive words [3]. We have also investigated a simpler output function without the dependence on the previous output:
$$P(y_o \mid y_1, \ldots, y_{o-1}, c_o) = g(s_o, c_o).$$

Encoder
As discussed above, the computation of the conditional probability relies on the availability of the context vector $c_o$ for each output $y_o$.
The context vector is obtained from the encoder, which reads the input sequence and generates a continuous-space representation. The context vector $c_o$ is obtained as the weighted average of all the hidden representations of a bidirectional RNN (BiRNN) [8]:
$$c_o = \sum_{t=1}^{T} \alpha_{ot} h_t,$$
where $\alpha_{ot} \in [0, 1]$ and $\sum_t \alpha_{ot} = 1$; $h_t = (\overrightarrow{h}_t, \overleftarrow{h}_t)$, and $\overrightarrow{h}_t$, $\overleftarrow{h}_t$ denote the hidden representations of $x_t$ from the forward and backward RNNs respectively. The context vector can also be global, for instance $c_1 = c_2 = \cdots = c_O$. This means the context vector does not depend on the index $o$, so the whole input sequence is encoded into a fixed vector representation. This approach has produced state-of-the-art results in machine translation when the dimension of the vector is relatively large [14]. When the model size is relatively small, however, the use of a dynamic context vector has been found to be superior, especially for long input sequences.
The weight $\alpha_{ot}$ is computed by a learned alignment model for each $c_o$, which is implemented as a neural network such that
$$\alpha_{ot} = \frac{\exp(e_{ot})}{\sum_{t'} \exp(e_{ot'})}, \qquad e_{ot} = a(s_{o-1}, h_t),$$
where $a(\cdot)$ is a feedforward neural network that computes the relevance of each hidden representation $h_t$ with respect to the previous hidden state of the RNN decoder, $s_{o-1}$. The alignment model is a single-hidden-layer neural network:
$$a(s_{o-1}, h_t) = v^{\top} \tanh(W s_{o-1} + U h_t),$$
where $W$ and $U$ are weight matrices, and $v$ is a vector, so that the output of $a(\cdot)$ is a scalar. More hidden layers can be used in the alignment model.
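As a minimal sketch of the alignment model and context vector described above, the following NumPy code scores each encoder state against the previous decoder state with a single-hidden-layer network, normalizes the scores with a softmax, and returns the weighted average. The function name and the argument shapes are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import numpy as np


def attention_context(h, s_prev, W, U, v):
    """Return (context vector, alignment weights) for one decoder step.

    Assumed shapes (illustrative only):
      h      : (T, d_h)  encoder hidden states h_t
      s_prev : (d_s,)    previous decoder state s_{o-1}
      W      : (d_a, d_s), U : (d_a, d_h), v : (d_a,)
    """
    # Relevance score of every h_t: v^T tanh(W s_{o-1} + U h_t)
    scores = np.tanh(s_prev @ W.T + h @ U.T) @ v      # shape (T,)
    # Softmax over time; subtracting the max keeps exp() numerically stable
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()   # non-negative, sums to 1
    # Context vector: weighted average of the encoder states
    context = weights @ h                             # shape (d_h,)
    return context, weights
```

Calling the function once per output index with the updated decoder state yields a dynamic context vector; reusing a single result for every output index corresponds to the fixed (global) case.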
When a fixed context vector is used, an RNN is needed to map the whole input sequence into the context vector, because this vector must represent all the relevant information in the input sequence.

Connectionist temporal classifier
The CTC algorithm considers the order of the output labels of the RNN while ignoring their alignment, by introducing a blank label, $b$. For the set of target labels, $L$, and its extended set with the additional CTC blank label, $L' = L \cup \{b\}$, a path, $\pi$, is defined as a sequence over $L'$, that is, $\pi \in L'^{T}$, where $T$ is the length of the input sequence, $x$. The output sequence, $z \in L^{\le T}$, is then represented by $z = F(\pi)$ with the sequence-to-sequence mapping function $F$. $F$ maps any path $\pi$ of length $T$ to the shorter label sequence $z$ by first merging consecutive identical labels into one and then removing the blank labels. Therefore, any sequence of raw RNN outputs of length $T$ can be decoded into the shorter labelling sequence $z$, ignoring timing. This enables the RNN to learn the sequence mapping $z = G(x)$, where $x$ is the input sequence and $z$ is the corresponding target labelling, for all $(x, z)$ in the training set $S$. More specifically, the gradient of the loss function $L(x, z) = -\ln p(z \mid x)$ is computed and fed to the RNN through the softmax layer, whose size is $|L'|$.
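The mapping F can be illustrated with a few lines of Python. This is only a sketch: the function name and the use of "-" as the blank symbol are assumptions chosen for readability.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol, chosen only for this example


def collapse(path, blank=BLANK):
    """Map a CTC path to its labelling: merge repeats, then drop blanks."""
    # Step 1: merge runs of consecutive identical labels into one label
    merged = [label for label, _ in groupby(path)]
    # Step 2: remove the blank labels
    return [label for label in merged if label != blank]
```

For example, the path `a a - a b -` collapses to `a a b`: the blank between the second and third `a` is what keeps the repeated label from being merged away.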
The CTC algorithm employs the forward-backward algorithm to compute the gradient of the loss function, $L(x, z)$. Let $z'$ be the sequence over $L'$ of length $2|z| + 1$, where $z'_u = b$ for odd $u$ and $z'_u = z_{u/2}$ for even $u$. Then the forward variable, $\alpha$, and the backward variable, $\beta$, are initialized by
$$\alpha(1, u) = \begin{cases} y^{1}_{b} & u = 1 \\ y^{1}_{z_1} & u = 2 \\ 0 & \text{otherwise} \end{cases} \qquad \beta(T, u) = \begin{cases} 1 & u = |z'| \text{ or } u = |z'| - 1 \\ 0 & \text{otherwise} \end{cases}$$
where $y^{t}_{k}$ denotes the softmax output for label $k$ at time $t$.
Figure 2 illustrates the evolution of beams in beam search decoding: we start with the empty beam, then add all possible characters (only "a" and "b" in this example) to it in the first iteration and keep only the best-scoring ones. The beam width controls the number of surviving beams. This is repeated until the complete NN output is processed.
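To make the forward pass of the forward-backward algorithm concrete, the sketch below computes p(z|x) from a matrix of softmax outputs by dynamic programming over the blank-extended sequence z′. It is a plain-probability illustration under stated assumptions, not the implementation used in the experiment: the function name is hypothetical, the blank is assumed to be label index 0, and indices are 0-based, so blanks sit at even positions of z′ rather than at odd ones.

```python
import numpy as np


def ctc_forward_prob(y, z, blank=0):
    """Return p(z | x) given softmax outputs y of shape (T, K).

    z is the target labelling as a list of label indices without blanks.
    """
    T = y.shape[0]
    # Build z': a blank before, between, and after the labels (length 2|z|+1)
    ext = [blank]
    for label in z:
        ext += [label, blank]
    U = len(ext)

    alpha = np.zeros((T, U))
    # Initialization: a path may start with a blank or with the first label
    alpha[0, 0] = y[0, blank]
    if U > 1:
        alpha[0, 1] = y[0, ext[1]]

    for t in range(1, T):
        for u in range(U):
            total = alpha[t - 1, u]              # stay on the same symbol
            if u >= 1:
                total += alpha[t - 1, u - 1]     # advance by one symbol
            # Skipping a blank is allowed only between two different labels
            if u >= 2 and ext[u] != blank and ext[u] != ext[u - 2]:
                total += alpha[t - 1, u - 2]
            alpha[t, u] = total * y[t, ext[u]]

    # Valid paths end on the last label or on the trailing blank
    prob = alpha[T - 1, U - 1]
    if U > 1:
        prob += alpha[T - 1, U - 2]
    return prob
```

The CTC loss is the negative logarithm of the returned value. Practical implementations usually run the same recursion in log space or with rescaling, because the raw products underflow for long inputs.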

Optimization algorithms
Gradient descent [14] is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art deep learning library contains implementations of various algorithms to optimize gradient descent. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
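Gradient descent is a way to minimize an objective function $J(\theta)$, parameterized by a model's parameters $\theta$, by updating the parameters in the opposite direction of the gradient of the objective function, $\nabla_\theta J(\theta)$.

Stochastic Gradient Descent
SGD [10] updates the model parameters in the negative direction of the gradient $g$, computed on a subset or mini-batch of the data of size $m$:
$$g = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta J(\theta; x^{(i)}, y^{(i)}), \qquad \theta \leftarrow \theta - \eta\, g,$$
where $\eta$ is the learning rate.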
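A single mini-batch SGD step can be sketched as follows. The function name, the `grad_fn` callback (which is assumed to return the gradient of the loss for one example), and the default learning rate are illustrative assumptions.

```python
import numpy as np


def sgd_step(theta, grad_fn, batch, lr=0.01):
    """One SGD update using the average gradient over a mini-batch."""
    # Average the per-example gradients over the m examples in the batch
    g = np.mean([grad_fn(theta, example) for example in batch], axis=0)
    # Move the parameters against the gradient
    return theta - lr * g
```

With the toy loss 0.5 * (theta - x)^2 per example, whose gradient is theta - x, repeated calls move theta towards the mean of the examples in the batch.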

Adagrad
This method allows the learning rate $\eta$ to adapt to the parameters. It makes big updates for infrequent parameters and small updates for frequent parameters. For this reason, it is well suited to sparse data.
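Adagrad [5] uses a different learning rate for every parameter $\theta(i)$ at every time step $t$. We first show Adagrad's per-parameter update, which we then vectorise. Briefly, we set $g(t, i)$ to be the gradient of the loss function w.r.t. the parameter $\theta(i)$ at time step $t$:
$$g(t, i) = \nabla_\theta J(\theta(t, i)).$$
Adagrad then scales the learning rate of each parameter by the past gradients computed for that parameter:
$$\theta(t+1, i) = \theta(t, i) - \frac{\eta}{\sqrt{G(t, ii) + \epsilon}}\, g(t, i),$$
where $G(t)$ is a diagonal matrix whose element $G(t, ii)$ is the sum of the squares of the gradients w.r.t. $\theta(i)$ up to time step $t$, and $\epsilon$ is a small constant that avoids division by zero. In vectorised form, with $\odot$ denoting element-wise multiplication,
$$\theta(t+1) = \theta(t) - \frac{\eta}{\sqrt{G(t) + \epsilon}} \odot g(t).$$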
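The per-parameter scaling that Adagrad performs can be sketched in NumPy as below. The function name and default values are assumptions for illustration; `G` holds the running sum of squared gradients for each parameter and must be carried from one call to the next.

```python
import numpy as np


def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """One Adagrad update; returns the new parameters and accumulator."""
    # Accumulate the squared gradient of every parameter
    G = G + g ** 2
    # Parameters with a large gradient history receive a smaller step
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G
```

Because `G` only grows, the effective learning rate of every parameter shrinks monotonically over training.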

Momentum
SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum.
Momentum [2] is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector:
$$v(t) = \gamma\, v(t-1) + \eta \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta - v(t).$$
The momentum term $\gamma$ is usually set to 0.9 or a similar value. Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity, if there is air resistance, i.e. $\gamma < 1$). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
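A minimal sketch of the momentum update follows. The function name and the default learning rate are assumptions; `v` is the update vector carried over from the previous step and starts at zero.

```python
import numpy as np


def momentum_step(theta, g, v, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update; returns new parameters and velocity."""
    # Keep a fraction gamma of the previous update and add the new gradient step
    v = gamma * v + lr * g
    # Apply the accumulated update
    theta = theta - v
    return theta, v
```

Under a constant gradient g, the velocity approaches lr * g / (1 - gamma), which is the "terminal velocity" of the ball analogy: ten times the plain SGD step for gamma = 0.9.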

Adam
Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter [15]. In addition to storing an exponentially decaying average of past squared gradients, $V(t)$, Adam also keeps an exponentially decaying average of past gradients, $M(t)$, similar to momentum:
$$M(t) = \beta_1 M(t-1) + (1 - \beta_1)\, g(t), \qquad V(t) = \beta_2 V(t-1) + (1 - \beta_2)\, g(t)^2.$$
$M(t)$ and $V(t)$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As $M(t)$ and $V(t)$ are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates:
$$\hat{M}(t) = \frac{M(t)}{1 - \beta_1^{t}}, \qquad \hat{V}(t) = \frac{V(t)}{1 - \beta_2^{t}}.$$
They then use these to update the parameters:
$$\theta(t+1) = \theta(t) - \frac{\eta}{\sqrt{\hat{V}(t)} + \epsilon}\, \hat{M}(t).$$
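The Adam update, including the bias correction, can be sketched as follows. The function name is an assumption; `m` and `v` stand for the decaying averages of past gradients and past squared gradients, `t` is the 1-based step counter, and the defaults are the values suggested by the authors of Adam, not necessarily the settings used in this experiment.

```python
import numpy as np


def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns theta, m, v."""
    # Exponentially decaying averages of the gradient and squared gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction compensates for the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step scaled by the root of the second-moment estimate
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On the very first step the bias-corrected estimates reduce to the gradient and its square, so every parameter moves by approximately `lr` in the direction opposite to the sign of its gradient, whatever the gradient's magnitude.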

Results
After training the model three times with different optimization algorithms, we see the following outcomes (Table 1):

Fig. 3. CTC loss and LER of Adagrad
As Figure 3 shows, the CTC loss initially decreases as expected, but then starts to fluctuate drastically between approximately 500 and 50. At the same time, the LER fluctuates between approximately 1 and 0.9 right from the beginning, so the model does not learn.

Fig. 4. CTC loss and LER of Momentum
Figure 4 shows that Momentum works much better than Adagrad. The LER and CTC loss decrease continuously for both the training and validation sets. Because the learning rate is 0.005, the decrease slows down a little. Other than that, the learning process goes well.

Fig. 5. CTC loss and LER of Adam
As shown in Figure 5, the CTC loss decreases over the first gradient steps and then, as with the Adagrad optimizer, starts to fluctuate between two widely separated values. The LER, on the other hand, decreases to 1 after a few iterations and then stays unchanged for many gradient steps. After about 750 iterations, the LER starts to fluctuate between approximately 1 and 0.5, which is not good performance.

Conclusion
This paper shows a clear benefit of the Momentum optimizer over Adam and Adagrad for the CTC algorithm in speech recognition. In the experiment, the model trained with the Momentum optimizer learned faster, with the CTC loss and LER decreasing after each gradient step, whereas the Adagrad and Adam optimizers performed very poorly, with errors fluctuating between large and small values. In addition, this paper demonstrates the Connectionist Temporal Classifier algorithm for speech recognition in action and describes its clear benefits over the traditional HMM-based method, namely simplicity and effectiveness.