Generating Linux kernel code with a GRU

We will now look at a simple, fun example, to generate Linux kernel code using an RNN. The complete Jupyter Notebook for this example is available in the book's code repository, under Chapter08. For the training data, we will first extract the kernel code from the Linux source. You can download the latest (or an earlier) version of the Linux kernel from the kernel archives at https://www.kernel.org/.

We will extract the tar file and use only the core kernel under the kernel/ directory in the source. Execute the following from the root directory of the kernel tree to extract code from all of the *.c files:

cd kernel
find . -name "*.c" -exec cat {} \; >> /tmp/kernel.txt

This will concatenate all of the *.c files from the core kernel directory and write them to the kernel.txt file under /tmp. You can use any other directory in place of /tmp. First, we will prepare the training data from the raw kernel code file:

import codecs
import re
import numpy as np

with codecs.open('/tmp/kernel.txt', 'r', encoding='utf-8', errors='ignore') as kernel_file:
    raw_text = kernel_file.read()
kernel_words = re.split('(->)|([->+=</&|():*])', raw_text)
kernel_words = [w for w in kernel_words if w is not None]
kernel_words = kernel_words[0:300000]
# Build the vocabulary from the unique words, but keep kernel_words in its
# original order so that y_train really is the next word in the sequence
vocab = set(kernel_words)
kword_to_int = dict((word, i) for i, word in enumerate(vocab))
int_to_kword = dict((i, word) for i, word in enumerate(vocab))
vocab_size = len(kword_to_int)
kword_to_int['<UNK>'] = vocab_size
int_to_kword[vocab_size] = '<UNK>'
vocab_size += 1
X_train = [kword_to_int[word] for word in kernel_words]
y_train = X_train[1:]
y_train.append(kword_to_int['<UNK>'])
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)
X_train = np.expand_dims(X_train, axis=1)
y_train = np.expand_dims(y_train, axis=1)
print(X_train.shape, y_train.shape)

In this code, the regular expression re.split('(->)|([->+=</&|():*])', raw_text) splits the raw text on some of the C language operators, such as the pointer, arithmetic, and logical operators. While this is not necessary for a character-level text generator, we use it here because we are generating text at the word level. We also create dictionaries mapping words to integer IDs (and vice versa) in the kword_to_int and int_to_kword variables, respectively. The numpy arrays X_train and y_train hold the training data and labels, respectively: X_train is the sequence of word IDs taken from kword_to_int, while y_train contains the next word corresponding to each word in X_train. Both arrays are expanded with the numpy.expand_dims function into column vectors of shape (num_words, 1), so that each training example is a single word.
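
To see what this tokenization produces, here is a minimal sketch (not part of the book's Notebook) that applies the same regular expression to a short, made-up C fragment:

import re

sample = "if (ret < 0)\n\tgoto out;"
tokens = re.split('(->)|([->+=</&|():*])', sample)
tokens = [t for t in tokens if t is not None]
# The captured operators become tokens of their own, while the text in
# between (including whitespace) stays grouped, roughly:
# ['if ', '(', 'ret ', '<', ' 0', ')', '\n\tgoto out;']
print(tokens)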

As in the previous chapters, we will use the TensorFlow estimator API for our model training and validation:

  1.  We will look at the code for the model function:
def rnn_model_fn(features, labels, mode):
    # Word-embedding matrix for the input vocabulary
    embedding = tf.Variable(
        tf.truncated_normal([v_size, EMBED_DIMENSION],
                            stddev=1.0 / np.sqrt(EMBED_DIMENSION)),
        name="word_embeddings")
    word_emb = tf.nn.embedding_lookup(embedding, features['word'])
    rnn_cell = tf.nn.rnn_cell.GRUCell(HIDDEN_SIZE)

    outputs, _ = tf.nn.dynamic_rnn(rnn_cell, word_emb, dtype=tf.float32)
    outputs = tf.reshape(outputs, [-1, HIDDEN_SIZE])
    # Project the GRU outputs to logits over the vocabulary
    flayer_op = tf.layers.dense(outputs, v_size, name="linear")
    return estimator_spec_for_generation(flayer_op, labels, mode)

We use a simple network, with an input embedding layer followed by a GRUCell and a dense layer.

Note that EMBED_DIMENSION and HIDDEN_SIZE are defined as 50 and 256, respectively. You can try different values to experiment with the generated output.
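
The model function also refers to v_size, which is simply the vocabulary size computed during data preparation. If you are running the code outside the Notebook, these globals could be defined as follows (a minimal sketch):

EMBED_DIMENSION = 50   # size of each word-embedding vector
HIDDEN_SIZE = 256      # number of units in the GRU cell
v_size = vocab_size    # vocabulary size from the data-preparation step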
  2. The output of the dense layer is then fed to the loss and the optimizer defined in the estimator specification function, which we will explore next:
def estimator_spec_for_generation(flayer_op, lbls, md):
    preds_cls = tf.argmax(flayer_op, 1)
    if md == tf.estimator.ModeKeys.PREDICT:
        # Softmax over the logits of the last position gives the
        # probability distribution of the next word
        prev_op = tf.reshape(flayer_op, [-1, 1, v_size])[:, -1, :]
        preds_op = tf.nn.softmax(prev_op)
        return tf.estimator.EstimatorSpec(
            mode=md,
            predictions={
                'preds_probs': preds_op
            })
    trng_loss = tf.losses.sparse_softmax_cross_entropy(labels=lbls, logits=flayer_op)
    if md == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
        trng_op = optimizer.minimize(trng_loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(md, loss=trng_loss, train_op=trng_op)
    # EVAL mode: report the accuracy of the predicted next words
    ev_met_ops = {'accy': tf.metrics.accuracy(labels=lbls, predictions=preds_cls)}
    return tf.estimator.EstimatorSpec(md, loss=trng_loss, eval_metric_ops=ev_met_ops)

During training, we use AdamOptimizer to minimize the sparse_softmax_cross_entropy loss, with the labels specifying the next words in the sequence. At prediction time, we take the softmax output as the probability of the next word. This softmax output represents a probability distribution over all of the words in the vocabulary, of length v_size.
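
To make this concrete, the following small NumPy sketch (with made-up logits for a tiny vocabulary, not from the Notebook) shows how a softmax turns the dense layer's output into next-word probabilities:

import numpy as np

logits = np.array([2.0, 0.5, -1.0, 0.1])          # hypothetical logits for 4 words
probs = np.exp(logits) / np.sum(np.exp(logits))   # softmax
print(probs, probs.sum())                         # probabilities summing to 1.0
print(np.argmax(probs))                           # index of the most probable next word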

  3. We will create the estimator for training, and set up the training configuration:
run_config = tf.contrib.learn.RunConfig()
run_config = run_config.replace(model_dir='/tmp/models/',
                                save_summary_steps=10,
                                log_step_count_steps=10)
generator = tf.estimator.Estimator(model_fn=rnn_model_fn, config=run_config)

We configure the run to save summaries and log the step count every 10 training steps. We then create the estimator with the model function and this configuration.
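
If you prefer to avoid the tf.contrib namespace, an equivalent configuration can usually be built directly with tf.estimator.RunConfig in TensorFlow 1.x; a sketch under that assumption:

run_config = tf.estimator.RunConfig(model_dir='/tmp/models/',
                                    save_summary_steps=10,
                                    log_step_count_steps=10)
generator = tf.estimator.Estimator(model_fn=rnn_model_fn, config=run_config)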

  4. We will now train the model:
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'word': X_train},
    y=y_train,
    batch_size=1024,
    num_epochs=None,
    shuffle=True)
generator.train(input_fn=train_input_fn, steps=300)
  5. We create a training input function, train_input_fn, with the input training data, X_train and y_train. We set the batch size to 1024 and train the model for 300 steps. When training completes, we will see the loss reported in the output, similar to the following:
INFO:tensorflow:global_step/sec: 0.598131
INFO:tensorflow:Saving checkpoints for 300 into /tmp/models/model.ckpt.
INFO:tensorflow:Loss for final step: 0.0061470587.
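
Because the estimator specification also defines an accuracy metric for evaluation mode, we could check how often the model predicts the next word correctly. There is no held-out set in this example, so this sketch (not part of the original Notebook) simply evaluates on the training sequence itself:

eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'word': X_train},
    y=y_train,
    batch_size=1024,
    num_epochs=1,
    shuffle=False)
print(generator.evaluate(input_fn=eval_input_fn))  # reports the 'accy' metric and the loss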
  6. Finally, we will use the trained model to generate the text, one word at a time. For this, we use the generator estimator to predict the next word, given an initial sequence of words. The predicted word is then used as the next input for predicting the subsequent word, and so on. We will concatenate all of these predicted words as the final generated text:
maxlen = 40
next_x = X_train[0:60]
text = "".join([int_to_kword[word] for word in next_x.flatten()])
for i in range(maxlen):
    test_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={'word': next_x},
        num_epochs=1,
        shuffle=False)
    predictions = generator.predict(input_fn=test_input_fn)
    predictions = list(predictions)
    # Pick the most probable next word from the last prediction
    word = int_to_kword[np.argmax(predictions[-1]['preds_probs'])]
    text = text + word
    # Slide the input window forward by one word
    next_x = np.concatenate((next_x, [[kword_to_int[word]]]))
    next_x = next_x[1:]

We should pick a random sequence of words as the initial text; here, we picked the first 60 words. You can select any other consecutive list of words from the original kernel code. We store the current window of word IDs in the next_x variable, which we slide forward by one word after each prediction. The int_to_kword dictionary created during data preparation transforms the predicted IDs back into words, which are appended to the text output variable. Note that we loop for maxlen iterations, which is set to 40 in the code; you can increase or decrease this to change the number of words in the generated text.
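
Since np.argmax always picks the single most likely word, the generated text can become repetitive. A common variation, not used in the book's Notebook, is to sample the next word from the predicted distribution instead. A sketch of how the word selection inside the loop could change, using the same variables:

probs = predictions[-1]['preds_probs'].astype(np.float64).ravel()
probs = probs / probs.sum()                           # renormalize after the float cast
word_id = int(np.random.choice(len(probs), p=probs))  # sample instead of argmax
word = int_to_kword[word_id]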

We can now look at the final output that is generated:

static int blk_trace_remove_queue Initialize POSIX timer handling for a thread group.
  PAGE_SHIFT;
    if !rb_threads[cpu]s, const struct pci_dev ;
  check_mm the filter_hash does not exist or is empty,
 
    return NULL;

  memsetkexec_image;
struct kimage ;
}

power_attr_rostruct hist_field module_add_modinfo_attrs Fake ip  data;
  struct page 
{
  return arg ? GFP_ATOMIC  PM_SUSPEND_ON
    goto fail_free_buffers;

  ret  sec;
  }

static struct ftrace_ops trace_ops __initdata   into them directly.
   !is_sampling_eventholders_dir, mod MIN_NICE can be offsets in the trace data.
  to the buffer after this will fail and return NULL.
  ring_buffer_record_enable_cpu  {
    area[pos] ;

#ifdef CONFIG_SUSPEND
  if  hist_field_u16;
    break;
  case 1 representing a file path of format and ;

#endif ;
    goto out;
  }

  ftrace_graph_return  Pipe buffer operations for a buffer.  val;
  arch_spin_unlock If we fail, we do not register this tracer.
   
  return ret;
}

While the preceding output closely resembles kernel code, it does not make much sense. Note that the output might differ on each run, and it might also differ from the output shown in the Notebook in the book's code repository.

It does appear that the model has learned some of the classic Linux kernel idioms, such as the usage of goto in the code. It might be possible to generate more realistic kernel code with a character-level model, using deeper networks, bidirectional LSTMs, and so on. The reader can experiment with such approaches to get better results. 
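
As one example of the deeper-network direction, the single GRUCell in rnn_model_fn could be replaced with a small stack of cells using MultiRNNCell. This is a hedged sketch of the change; the layer count is an arbitrary choice, not a value from the book:

# Inside rnn_model_fn, instead of the single GRUCell:
cells = [tf.nn.rnn_cell.GRUCell(HIDDEN_SIZE) for _ in range(2)]
rnn_cell = tf.nn.rnn_cell.MultiRNNCell(cells)
outputs, _ = tf.nn.dynamic_rnn(rnn_cell, word_emb, dtype=tf.float32)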

We will now look at the loss functions and the model graph in TensorBoard. We will look at the model graph first:

loss functions and the model graph in TensorBoard

The graph shows the word-embedding input, followed by the RNN cell and the dense layer. The sparse softmax layer gives the final output probabilities over the words in the input vocabulary, and AdamOptimizer minimizes the loss function using the computed gradients. We will now look at how the loss function varies over the training steps:

Varying loss function in the training

As you can see, the loss starts at around 11 and falls close to 0.1.

Although this example is not practically useful by itself, the goal was to show the reader that such generative models can be used to learn the underlying structure and essence of the input text. Creating such sequence models, therefore, can help us to understand the structure of the text for a given domain. The domain that we used in the example was operating system code. The reader can explore further, with other texts and deeper models.
