How to do it...

So far, we have imported a corpus into the R environment. To build a language model, we need to convert it into a sequence of integers. Let's start doing some data preprocessing:

  1. First, we define our tokenizer. We will use it later to convert text into integer sequences:
tokenizer <- text_tokenizer(num_words = 35, char_level = FALSE)
tokenizer %>% fit_text_tokenizer(data)

Let's look at the number of unique words in our corpus:

cat("Number of unique words", length(tokenizer$word_index))

We have 37 unique words in our corpus. To look at the first few records of the vocabulary, we can use the following command:

head(tokenizer$word_index)

Let's convert our corpus into an integer sequence using the tokenizer we defined previously:

text_seqs <- texts_to_sequences(tokenizer, data)
str(text_seqs)

The following image displays the structure of the returned sequences:

Here, we can see that texts_to_sequences() returns a list. Let's convert it into a vector and print its length:

text_seqs <- text_seqs[[1]]
length(text_seqs)

The length of our corpus is 48:

  2. Now, let's convert our text sequence into input (features) and output (labels) sequences, where the input will be a sequence of two consecutive words and the output will be the next word that appears in the sequence:
input_sequence_length <- 2
feature <- matrix(ncol = input_sequence_length)
label <- matrix(ncol = 1)

for(i in seq(input_sequence_length, length(text_seqs))){
    if(i >= length(text_seqs)){
        break()
    }
    start_idx <- (i - input_sequence_length) + 1
    end_idx <- i + 1
    new_seq <- text_seqs[start_idx:end_idx]
    feature <- rbind(feature, new_seq[1:input_sequence_length])
    label <- rbind(label, new_seq[input_sequence_length + 1])
}
feature <- feature[-1,]
label <- label[-1,]

paste("Feature")
head(feature)

The following screenshot shows the feature sequence that we formulated:

Let's have a look at the label sequences that we created:

paste("label")
head(label)

The following screenshot shows the first few label sequences:
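To make the windowing concrete, here is a minimal sketch (not part of the recipe) that applies the same logic to a made-up toy sequence; the integer values are hypothetical and not taken from our corpus. Each pair of consecutive values becomes a feature row, and the value that follows becomes its label:

toy_seq <- c(5, 2, 9, 7, 3) # hypothetical integer sequence
toy_feature <- matrix(ncol = input_sequence_length)
toy_label <- matrix(ncol = 1)
for(i in seq(input_sequence_length, length(toy_seq) - 1)){
    toy_feature <- rbind(toy_feature, toy_seq[(i - input_sequence_length + 1):i])
    toy_label <- rbind(toy_label, toy_seq[i + 1])
}
toy_feature[-1,] # rows: (5, 2), (2, 9), (9, 7)
toy_label[-1,]   # values: 9, 7, 3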

Let's one-hot encode our label and look at the dimensions of our features and labels:

label <- to_categorical(label, num_classes = tokenizer$num_words)
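If you haven't used to_categorical() before, here is a minimal sketch of what it does to a small hypothetical vector of class indices (unrelated to our actual labels): each integer becomes a row with a 1 in the corresponding column and 0s elsewhere:

to_categorical(c(0, 2, 1), num_classes = 3)
# row 1: 1 0 0
# row 2: 0 0 1
# row 3: 0 1 0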

Here, we can see the dimensions of our feature and label data:

cat("Shape of features",dim(feature),"
")
cat("Shape of label",length(label))

The following screenshot shows the dimensions of our features and label sequences:

  3. Now, we create a model for text generation and print its summary:
model <- keras_model_sequential()
model %>%
    layer_embedding(input_dim = tokenizer$num_words, output_dim = 10, input_length = input_sequence_length) %>%
    layer_lstm(units = 50) %>%
    layer_dense(tokenizer$num_words) %>%
    layer_activation("softmax")

summary(model)

The following screenshot shows the summary of the model:
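As a rough sanity check on the summary, the tensor shapes flow as follows; this is a sketch based on the layer arguments above, assuming num_words = 35 as set when we created the tokenizer:

# input:                 (batch, 2)      two word indices per sample
# layer_embedding:       (batch, 2, 10)  each index mapped to a 10-dimensional vector
# layer_lstm:             (batch, 50)     final hidden state of the 50 LSTM units
# layer_dense + softmax: (batch, 35)     probability distribution over the vocabulary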

Next, we compile the model and train it:

# compile
model %>% compile(
    loss = "categorical_crossentropy", 
    optimizer = optimizer_rmsprop(lr = 0.001),
    metrics = c('accuracy')
)

# train
model %>% fit(
  feature, label,
#   batch_size = 128,
  epochs = 500
)
  4. In the following code block, we implement a function that will generate a sequence from a language model:
generate_sequence <- function(model, tokenizer, input_length, seed_text, predict_next_n_words){
    input_text <- seed_text
    for(i in seq(predict_next_n_words)){
        encoded <- texts_to_sequences(tokenizer,input_text)[[1]]
        encoded <- pad_sequences(sequences = list(encoded),maxlen = input_length,padding = 'pre')
        yhat <- predict_classes(model,encoded, verbose=0)
        next_word <- tokenizer$index_word[[as.character(yhat)]]
        input_text <- paste(input_text, next_word)
    }
    return(input_text)
}

Now, we can use our custom function, generate_sequence(), to generate text from the integer sequences:

seed_1 = "Jack and"
cat("Text generated from seed 1: " ,generate_sequence(model,tokenizer,input_sequence_length,seed_1,11),"
 ")
seed_2 = "Jack fell"
cat("Text generated from seed 2: ",generate_sequence(model,tokenizer,input_sequence_length,seed_2,11))

The following screenshot shows the text that was generated by the model from the input text:

From this, we can see that our model did a good job of predicting sequences.
