Data preparation

Read the downloaded input dataset:

import numpy as np
import pandas as pd

df = pd.read_csv('data/songdata.csv')

Let's see what we have in our dataset:

df.head()

The preceding code generates the following output:

Our dataset consists of 57,650 song lyrics:

df.shape[0]

57650

These lyrics come from 643 artists:

len(df['artist'].unique())

643

The ten artists with the most songs, along with their song counts, are as follows:

df['artist'].value_counts()[:10]

Donna Summer        191
Gordon Lightfoot    189
George Strait       188
Bob Dylan           188
Loretta Lynn        187
Cher                187
Alabama             187
Reba Mcentire       187
Chaka Khan          186
Dean Martin         186
Name: artist, dtype: int64

On average, we have about 89 songs from each artist:

df['artist'].value_counts().values.mean()

89
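The summary statistics above can be reproduced on a small scale. The following is a minimal sketch that assumes a hypothetical toy DataFrame (the name toy and its contents are invented for illustration and are not part of songdata.csv); it only shows how the row count, the number of distinct artists, and the per-artist counts relate to each other:

```python
import pandas as pd

# Hypothetical toy data standing in for songdata.csv (illustration only)
toy = pd.DataFrame({
    'artist': ['A', 'A', 'A', 'B', 'B', 'C'],
    'text': ['la la', 'do re mi', 'na na', 'hey', 'ho', 'yeah'],
})

n_songs = toy.shape[0]                           # number of rows (songs)
n_artists = toy['artist'].nunique()              # number of distinct artists
songs_per_artist = toy['artist'].value_counts()  # counts, largest first
mean_songs = songs_per_artist.mean()             # average songs per artist

print(n_songs, n_artists, mean_songs)
```

Note that the mean of the per-artist counts is simply the total number of songs divided by the number of distinct artists.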

The song lyrics are in the text column, so we join all the rows of that column into a single string and store it in a variable called data, as follows:

data = ', '.join(df['text'])

Let's see a few lines of a song:

data[:369]

"Look at her face, it's a wonderful face And it means something special to me Look at the way that she smiles when she sees me How lucky can one fellow be? She's just my kind of girl, she makes me feel fine Who could ever believe that she could be mine? She's just my kind of girl, without her I'm blue And if she ever leaves me what could I do, what co"

Since we are building a character-level RNN, we store all the unique characters in our dataset in a variable called chars; this is our vocabulary:

chars = sorted(list(set(data)))

Store the vocabulary size in a variable called vocab_size:

vocab_size = len(chars)

Since neural networks accept only numerical input, we need to convert every character in the vocabulary to a number.

We map each character in the vocabulary to its index, which serves as its unique number. We define a char_to_ix dictionary, which maps every character to its index. To get the character for a given index, we also define the ix_to_char dictionary, which maps every index back to its character:

char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}
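To see how the two dictionaries work together, here is a minimal sketch on a short, hypothetical string (sample and the _demo names are invented for illustration so that they do not clash with the variables used in this section). It encodes the string as a list of indices and then decodes it back:

```python
# Hypothetical stand-in for the full lyrics string (illustration only)
sample = "hello world"

# Build the vocabulary and both mappings exactly as in the section
chars_demo = sorted(set(sample))
char_to_ix_demo = {ch: i for i, ch in enumerate(chars_demo)}
ix_to_char_demo = {i: ch for i, ch in enumerate(chars_demo)}

# Encode every character as its index, then decode the indices back
encoded = [char_to_ix_demo[ch] for ch in sample]
decoded = ''.join(ix_to_char_demo[i] for i in encoded)

print(chars_demo)
print(encoded)
print(decoded)
```

Because the two dictionaries are inverses of each other, decoding the encoded indices returns the original string.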

As you can see in the following code snippet, the character 's' is mapped to index 68 in the char_to_ix dictionary:

char_to_ix['s']

68

Similarly, if we look up 68 in ix_to_char, we get the corresponding character, which is 's':

ix_to_char[68]

's'

Once we have the character-to-integer mapping, we use one-hot encoding to represent the input and output in vector form. A one-hot encoded vector is a vector of 0s with a single 1 at the position corresponding to the character's index.

For example, suppose that vocabSize is 7 and the character z is at index 4 in the vocabulary (the fifth position, since indexing starts at 0). Then the one-hot encoded representation of z is obtained as follows:

vocabSize = 7
char_index = 4

np.eye(vocabSize)[char_index]

array([0., 0., 0., 0., 1., 0., 0.])

As you can see, we have a 1 at the corresponding index of the character, and the rest of the values are 0s. This is how we convert each character into a one-hot encoded vector.

In the following code, we define a function called one_hot_encoder, which returns the one-hot encoded vector for a given character index:

def one_hot_encoder(index):
    return np.eye(vocab_size)[index]
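Since indexing the identity matrix with a list of indices selects one row per index, the same approach also encodes a whole sequence at once. The following is a minimal sketch, assuming a hypothetical vocabulary size of 5 and made-up indices (the _demo names are invented for illustration):

```python
import numpy as np

vocab_size_demo = 5  # hypothetical vocabulary size (illustration only)

def one_hot_encoder_demo(index):
    # Row i of the identity matrix is the one-hot vector for index i;
    # passing a list of indices returns one row per index
    return np.eye(vocab_size_demo)[index]

indices = [0, 3, 3, 1]                 # made-up character indices
one_hot = one_hot_encoder_demo(indices)

# argmax along each row recovers the original index
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)
print(recovered)
```

Each row of the result contains exactly one 1, so taking the argmax of a row is enough to convert a one-hot vector back into a character index.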