The dataset that we'll be using in this recipe comprises thousands of German phrases with their English translations. It is available at http://www.manythings.org/anki/deu-eng.zip. The examples have been taken from the Tatoeba Project.
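As a convenience, the archive can be fetched and extracted from within R. This is a minimal sketch that assumes the zip contains a single deu.txt file and that we want it under the data/ directory used later in the recipe:

```r
url <- "http://www.manythings.org/anki/deu-eng.zip"
zipfile <- file.path(tempdir(), "deu-eng.zip")
if (!dir.exists("data")) dir.create("data")

# Download and extract only if the file is not already present;
# try() keeps the script going if the site is unreachable.
if (!file.exists("data/deu.txt")) {
  try({
    download.file(url, zipfile, mode = "wb")
    unzip(zipfile, files = "deu.txt", exdir = "data")
  })
}
```

Alternatively, you can download and unzip the archive manually and place deu.txt in the data/ directory.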
Let's start by loading the required libraries:
library(keras)
library(stringr)
library(reshape2)
library(purrr)
library(ggplot2)
library(readr)
library(stringi)
The data is in the form of a tab-delimited text file. We will be using the first 10,000 phrases. Let's load the dataset and have a look at the sample data:
lines <- readLines("data/deu.txt", n = 10000)
sentences <- str_split(lines, "\t")
sentences[1:10]
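Splitting each line on the tab character yields a list of character vectors, one per phrase pair. For later processing it can be handy to reshape these into a two-column data frame. This is a sketch using illustrative sample lines (not actual records from the dataset), assuming each line holds a phrase followed by a tab and its translation:

```r
# Illustrative tab-delimited lines in the same layout as deu.txt
sample_lines <- c("Hi.\tHallo!", "Run!\tLauf!", "Help!\tHilfe!")

parts <- strsplit(sample_lines, "\t")

# First field before the tab, second field after it
pairs <- data.frame(
  source = vapply(parts, `[`, character(1), 1),
  target = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
pairs
```

Note that some versions of the manythings.org files carry a third, attribution column; if so, the extra field can simply be ignored when building the data frame.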
The following screenshot shows a few records from the data, with the German phrases and their English translations:
We will use the preceding dataset to build our neural machine translation model.