Creating the model

We will replicate the model described in the original DeepSpeech paper. As explained earlier, the model consists of both recurrent and nonrecurrent layers. We will start with the get_layers function in the code:

with tf.name_scope('Lyr1'):
    B1 = tf.get_variable(name='B1', shape=[n_h], 
    initializer=tf.random_normal_initializer(stddev=0.046875))
    H1 = tf.get_variable(name='H1', shape=[n_inp + 2*n_inp*n_ctx, n_h],
    initializer=tf.contrib.layers.xavier_initializer(uniform=False))
    logits1 = tf.add(tf.matmul(X_batch, H1), B1)
    relu1 = tf.nn.relu(logits1)
    clipped_relu1 = tf.minimum(relu1,20.0)
    Lyr1 = tf.nn.dropout(clipped_relu1, 0.5)

with tf.name_scope('Lyr2'):
    B2 = tf.get_variable(name='B2', shape=[n_h], 
    initializer=tf.random_normal_initializer(stddev=0.046875))
    H2 = tf.get_variable(name='H2', shape=[n_h,n_h],
    initializer=tf.random_normal_initializer(stddev=0.046875))
    logits2 = tf.add(tf.matmul(Lyr1, H2), B2)
    relu2 = tf.nn.relu(logits2)
    clipped_relu2 = tf.minimum(relu2,20.0)
    Lyr2 = tf.nn.dropout(clipped_relu2, 0.5)

with tf.name_scope('Lyr3'):
    B3 = tf.get_variable(name='B3', shape=[2*n_h], 
    initializer=tf.random_normal_initializer(stddev=0.046875))
    H3 = tf.get_variable(name='H3', shape=[n_h,2*n_h],
    initializer=tf.random_normal_initializer(stddev=0.046875))
    logits3 = tf.add(tf.matmul(Lyr2, H3), B3)
    relu3 = tf.nn.relu(logits3)
    clipped_relu3 = tf.minimum(relu3,20.0)
    Lyr3 = tf.nn.dropout(clipped_relu3, 0.5)
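
The two pieces of arithmetic in these layers, the width of the first weight matrix and the clipped ReLU, can be checked by hand. The following is a minimal pure-Python sketch, not part of the original code; the helper names input_width and clipped_relu are hypothetical, and the values 26 and 9 are the n_inp and n_ctx settings used in this section.

```python
def input_width(n_inp, n_ctx):
    """Width of one input row: the current frame plus n_ctx frames on each side."""
    return n_inp + 2 * n_inp * n_ctx


def clipped_relu(x, clip=20.0):
    """Per-element equivalent of tf.minimum(tf.nn.relu(x), 20.0)."""
    return min(max(x, 0.0), clip)


print(input_width(26, 9))                            # 494
print([clipped_relu(v) for v in (-3.0, 5.0, 42.0)])  # [0.0, 5.0, 20.0]
```

With these settings, each row fed to the first layer has 494 values, and no activation can exceed 20.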

The first three hidden layers are nonrecurrent. The parameters H1, H2, and H3 are the weights of the layers, and B1, B2, and B3 are their biases. The output of each layer passes through a ReLU clipped at 20 to avoid exploding activations, followed by dropout with a keep probability of 0.5. Note that the first hidden layer weights have shape [n_inp + 2*n_inp*n_ctx, n_h], whose first dimension matches the width of the MFCC input with context. In this code, we have set the number of hidden units n_h to 1024, the number of MFCC features n_inp to 26, and the context n_ctx to 9. Next, we will look at the recurrent layer:

with tf.name_scope('RNN_Lyr'):
    fw_c = tf.contrib.rnn.BasicLSTMCell(n_h, forget_bias=1.0, state_is_tuple=True, 
    reuse=tf.get_variable_scope().reuse)
    fw_c = tf.contrib.rnn.DropoutWrapper(fw_c, input_keep_prob=0.7,
            output_keep_prob=0.7,seed=123)
    bw_c = tf.contrib.rnn.BasicLSTMCell(n_h, forget_bias=1.0, state_is_tuple=True, 
    reuse=tf.get_variable_scope().reuse)
    bw_c = tf.contrib.rnn.DropoutWrapper(bw_c,input_keep_prob=0.7, 
            output_keep_prob=0.7, seed=123)
    Lyr3 = tf.reshape(Lyr3, [-1, X_batch_shape[0], 2*n_h])
    outs, out_states = tf.nn.bidirectional_dynamic_rnn(cell_fw=fw_c,
                cell_bw=bw_c,inputs=Lyr3,dtype=tf.float32,time_major=True,
                sequence_length=seq_len)
    outs = tf.concat(outs, 2)
    outs = tf.reshape(outs, [-1, 2 * n_h])
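
The tf.concat and tf.reshape calls at the end of this block only rearrange shapes. The following pure-Python sketch traces the same bookkeeping on nested lists; it assumes time-major data shaped [time][batch][features], and the helper names and toy sizes are hypothetical.

```python
def concat_directions(fw, bw):
    """Join forward and backward outputs along the feature axis."""
    return [[f + b for f, b in zip(fw_t, bw_t)] for fw_t, bw_t in zip(fw, bw)]


def flatten_time_batch(outs):
    """Collapse [time][batch][features] into [time*batch][features]."""
    return [row for step in outs for row in step]


T, B, n_h = 3, 2, 4  # toy sizes, not the values used for training
fw = [[[1.0] * n_h for _ in range(B)] for _ in range(T)]
bw = [[[-1.0] * n_h for _ in range(B)] for _ in range(T)]

outs = concat_directions(fw, bw)
flat = flatten_time_batch(outs)
print(len(outs), len(outs[0]), len(outs[0][0]))  # 3 2 8
print(len(flat), len(flat[0]))                   # 6 8
```

Each row of the flattened result has 2*n_h features, which is why the weight matrix of the next layer has 2*n_h rows.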

As described previously, the recurrent layer is a bidirectional LSTM with dropout. The concatenated outputs of the forward and backward LSTMs are fed to the next hidden layer. Now, we will look at the last two hidden layers at the output:

with tf.name_scope('Lyr4'):
    B4 = tf.get_variable(name='B4', shape=[n_h], 
    initializer=tf.random_normal_initializer(stddev=0.046875))
    H4 = tf.get_variable(name='H4', shape=[(2 * n_h), n_h],
    initializer=tf.random_normal_initializer(stddev=0.046875))
    logits4 = tf.add(tf.matmul(outs, H4), B4)
    relu4 = tf.nn.relu(logits4)
    clipped_relu4 = tf.minimum(relu4,20.0)
    Lyr4 = tf.nn.dropout(clipped_relu4, 0.5)
 
with tf.name_scope('Lyr5'):
    # The bias must match the n_chars output width of H5
    B5 = tf.get_variable(name='B5', shape=[n_chars], 
    initializer=tf.random_normal_initializer(stddev=0.046875))
    H5 = tf.get_variable(name='H5', shape=[n_h, n_chars],
    initializer=tf.random_normal_initializer(stddev=0.046875))
    Lyr5 = tf.add(tf.matmul(Lyr4, H5), B5)
    Lyr5 = tf.reshape(Lyr5, [-1, X_batch_shape[0], n_chars])
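
The output of Lyr5 is a vector of n_chars unnormalized scores for each time step. A softmax turns such a vector into probabilities over the characters. The sketch below shows this for a single frame in pure Python; the four-symbol score vector is made up for illustration and is not taken from the model.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


frame_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for one time step
probs = softmax(frame_logits)
print(round(sum(probs), 6))     # 1.0
print(probs.index(max(probs)))  # 0
```

The largest score always receives the largest probability, so the ranking of the characters is unchanged by the softmax.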

Like the hidden layers at the input, the final two layers have weights H4 and H5 and biases B4 and B5, respectively. Layer four applies the clipped ReLU activation followed by dropout. Layer five outputs, for each time step, the logits over n_chars classes (the characters of the alphabet plus the blank), one character at a time. Next, we will look at the definition of the loss function and the optimizer:

def get_cost(tgts,logits,len_seq):
    loss_t = ops.ctc_ops.ctc_loss(tgts, logits, len_seq)
    loss_avg = tf.reduce_mean(loss_t)
    return loss_avg

def get_optimizer(logits,len_seq,loss_avg):
    adm_opt = tf.train.AdamOptimizer(learning_rate=plr,beta1=pb1,beta2=pb2,epsilon=peps)
    adm_opt = adm_opt.minimize(loss_avg)
    dec, prob_log = ops.ctc_ops.ctc_beam_search_decoder(logits, len_seq, merge_repeated=False)
    return adm_opt,dec
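
The ctc_loss and ctc_beam_search_decoder calls above rely on Connectionist Temporal Classification (CTC), which maps a per-frame label path to a shorter transcript by merging repeated labels and then dropping blanks. The sketch below illustrates only that collapsing rule in pure Python. It is not the beam search used in get_optimizer, and the blank index of 0 is an arbitrary choice for the example.

```python
def ctc_collapse(path, blank):
    """Merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out


BLANK = 0  # hypothetical blank index for this example
print(ctc_collapse([0, 3, 3, 0, 3, 1, 1, 0], BLANK))  # [3, 3, 1]
```

The blank between the two runs of 3 is what allows a genuinely repeated character to survive the merge.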

We will be using the Connectionist Temporal Classification (CTC) loss function, which is available in TensorFlow as tensorflow.python.ops.ctc_ops.ctc_loss. It takes the target labels, the logits, and the sequence lengths as inputs and computes the loss for each sequence in the batch. The get_cost function averages these losses, and the get_optimizer function minimizes the average loss using the AdamOptimizer. The get_optimizer function also returns the output of a CTC beam search decoder, which is used later to compute error rates and to print predictions. We will now look at the batch data feeder for training our model:

class Batch:
    def __init__(self):
        self.start_idx = 0
        self.batch_size = 10
        self.audio = []
        self.transcript = []
        get_wav_trans("../../speech_dset/timit/",self.audio,self.transcript) 

    def pad_seq(self,seqs):
        # Pad every sequence to the length of the longest one in the batch
        seq_lens = np.asarray([len(st) for st in seqs], dtype=np.int64)
        n_s = len(seqs)
        max_seq_len = np.max(seq_lens)
        # Shape of a single frame, taken from the first non-empty sequence
        s_shape = tuple()
        for s in seqs:
            if len(s) > 0:
                s_shape = np.asarray(s).shape[1:]
                break
        seqs_trc = np.zeros((n_s, max_seq_len) + s_shape, dtype=np.float32)
        for ix, s in enumerate(seqs):
            if len(s) == 0:
                continue
            trc = s[:max_seq_len]
            # Keep the MFCC values as floats; an integer cast would truncate them
            trc = np.asarray(trc, dtype=np.float32)
            if trc.shape[1:] != s_shape:
                raise ValueError("ERROR in truncation shape")
            seqs_trc[ix, :len(trc)] = trc
        return seqs_trc, seq_lens
 
    def get_sp_tuple(self,seqs):
        # Build (indices, values, dense_shape) for a sparse tensor of labels
        ixs = []
        vals = []
        for n, s in enumerate(seqs):
            ixs.extend(zip([n] * len(s), range(len(s))))
            vals.extend(s)
        # Convert to arrays only after all sequences have been collected
        ixs = np.asarray(ixs, dtype=np.int64)
        vals = np.asarray(vals, dtype=np.int32)
        shape = np.asarray([len(seqs), ixs.max(0)[1] + 1], dtype=np.int64)
        return ixs, vals, shape
 
    def get_next_batch(self):
        src = self.audio[self.start_idx:self.start_idx+self.batch_size]
        tgt = self.transcript[self.start_idx:self.start_idx+self.batch_size]
        self.start_idx += self.batch_size
        # Wrap around once the whole dataset has been served
        if self.start_idx >= len(self.audio):
            self.start_idx = 0
        # Pad and convert every batch, not only the one that wraps around
        src, src_len = self.pad_seq(src)
        sp_lbls = self.get_sp_tuple(tgt)
        return src, src_len, sp_lbls
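
The padding and sparse-label steps of the Batch class can be tried on toy data without NumPy or TensorFlow. The following is a simplified pure-Python sketch of the same two ideas, not the class itself; the function names pad_to_longest and to_sparse_tuple and the sample values are hypothetical.

```python
def pad_to_longest(seqs, pad_value=0.0):
    """Right-pad each sequence of frames to the longest length in the batch."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    width = len(next(s for s in seqs if s)[0])  # features per frame
    padded = [
        list(s) + [[pad_value] * width for _ in range(max_len - len(s))]
        for s in seqs
    ]
    return padded, lengths


def to_sparse_tuple(label_seqs):
    """Return (indices, values, dense_shape) for variable-length label lists."""
    indices, values = [], []
    for n, seq in enumerate(label_seqs):
        for t, label in enumerate(seq):
            indices.append((n, t))
            values.append(label)
    shape = (len(label_seqs), max(len(s) for s in label_seqs))
    return indices, values, shape


features = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [[0.7, 0.8]]]
padded, lengths = pad_to_longest(features)
print(lengths)    # [3, 1]
print(padded[1])  # [[0.7, 0.8], [0.0, 0.0], [0.0, 0.0]]

labels = [[8, 9], [3, 1, 20]]
print(to_sparse_tuple(labels))
# ([(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)], [8, 9, 3, 1, 20], (2, 3))
```

The original lengths are returned alongside the padded data so that the padded frames can be ignored later, and the sparse form stores only the labels that actually exist.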

We use get_wav_trans to load all the audio (as MFCC features) and the text transcripts from the .wav and .txt files. The get_next_batch function returns batch_size examples of the source (audio), together with their sequence lengths and the target (transcript). The pad_seq function pads each MFCC sequence to the length of the longest sequence in the batch. Similarly, the get_sp_tuple function builds a sparse representation of the target labels. Now, we will look at the training setup:

def get_model():
    input_t = tf.placeholder(tf.float32, [None, None, n_inp + (2 * n_inp * n_ctx)], name='inp')
    tgts = tf.sparse_placeholder(tf.int32, name='tgts')
    len_seq = tf.placeholder(tf.int32, [None], name='len_seq')
    logits = get_logits(input_t,tf.to_int64(len_seq))
    return input_t, tgts, len_seq, logits

The get_model function creates the input placeholder tensors for the source (MFCC features), target (transcribed text labels), and sequence lengths. Then, it calls the get_logits function, which in turn calls the get_layers described earlier. This function thereby creates the model. Now, we will look at the model training loop:

gr = tf.Graph()
with gr.as_default():
    input_t,tgts,len_seq,logits = get_model()
    loss_avg = get_cost(tgts,logits,len_seq)
    adm_opt, dec = get_optimizer(logits,len_seq,loss_avg)
    error_rate = get_error_rates(dec,tgts)
    sess = tf.Session()
    writer = tf.summary.FileWriter('/tmp/models/', graph=sess.graph)
    loss_summary = tf.summary.scalar("loss_avg", loss_avg)
    sum_op = tf.summary.merge_all()
    init_op = tf.global_variables_initializer()
    # Create the saver once; building it inside the loop adds new ops to the graph on every batch
    saver = tf.train.Saver()
    sess.run(init_op)
    for ep in range(epochs):
        train_cost = 0
        label_err_rate = 0
        batch_feeder = Batch()
        n_batches = np.ceil(len(batch_feeder.audio)/batch_feeder.batch_size)
        n_batches = int(n_batches)
        st = time.time()
        for batch in range(n_batches):
            src,len_src,labels_src = batch_feeder.get_next_batch()
            data_dict = {input_t: src, tgts: labels_src,len_seq:len_src}
            batch_cost, _,summ = sess.run([loss_avg, adm_opt,sum_op], data_dict)
            train_cost += batch_cost * batch_feeder.batch_size
            print("Batch cost: {0}, Train cost: {1}".format(batch_cost,train_cost))
            label_err_rate += sess.run(error_rate, feed_dict=data_dict) * batch_feeder.batch_size
            print('Label error: {}'.format(label_err_rate))
            # Use a global step that keeps increasing across epochs
            writer.add_summary(summ,ep*n_batches+batch)
            saver.save(sess, '/tmp/models/speech2txt.ckpt')
        decoded_val = sess.run(dec[0], feed_dict=data_dict)
        d_decoded_val = tf.sparse_tensor_to_dense(decoded_val,
                        default_value=-1).eval(session=sess)
        d_lbl = decoded_val_to_text(labels_src)
        cnt = 0
        cnt_max = 4
        # Print at most cnt_max examples from the last batch of the epoch
        for actual_val, decoded_val in zip(d_lbl, d_decoded_val):
            if cnt >= cnt_max:
                break
            d_str = array2txt(decoded_val)
            print('Batch {}'.format(batch))
            print('Actual: {}'.format(actual_val))
            print('Predicted: {}'.format(d_str))
            cnt += 1
        time_taken = time.time() - st
        log = 'Epoch {}/{}, training_cost: {:.3f}, error_rate: {:.3f}, time: {:.2f} sec'
        print(log.format(ep,epochs,train_cost/len(batch_feeder.audio),
            (label_err_rate/len(batch_feeder.audio)), time_taken))

We first build the graph by calling get_model, get_cost, get_optimizer, and get_error_rates, which create the model, the loss function, the optimizer, and the error rate computation, respectively. In each epoch, we initialize the batch_feeder to get the training data in batches. The feed dictionary is populated with the source src, the target labels_src, and the source lengths len_src. After each batch, we print the batch cost and the label error rate, and save the model. At the end of each epoch, we also print example predictions alongside the actual transcripts.
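
The per-epoch numbers reported by the loop come from simple bookkeeping: the number of batches is the dataset size divided by the batch size, rounded up, and the epoch cost weights each batch mean by batch_size before dividing by the dataset size. The pure-Python sketch below reproduces that arithmetic; the function name epoch_average and the sample costs are hypothetical.

```python
import math


def epoch_average(batch_costs, batch_size, n_examples):
    """Weight each batch mean by batch_size, then divide by the dataset size."""
    total = sum(cost * batch_size for cost in batch_costs)
    return total / n_examples


n_examples, batch_size = 25, 10  # toy values; the last batch holds only 5 examples
n_batches = math.ceil(n_examples / batch_size)
print(n_batches)                                               # 3
print(epoch_average([4.0, 3.0, 2.0], batch_size, n_examples))  # 3.6
```

Because the final batch can be smaller than batch_size, this weighting slightly overcounts it, so the reported epoch cost is an approximation whenever the dataset size is not a multiple of the batch size.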
