Data preparation

Read the downloaded input dataset:

import pandas as pd
df = pd.read_csv('data/songdata.csv')

Let's see what we have in our dataset:

df.head()

The preceding code displays the first five rows of the DataFrame, which includes the artist and text columns used in the rest of this section.

Our dataset consists of 57,650 song lyrics:

df.shape[0]

57650

We have song lyrics from 643 artists:

len(df['artist'].unique())

643

The number of songs for each of the ten most represented artists is as follows:

df['artist'].value_counts()[:10]

Donna Summer        191
Gordon Lightfoot    189
George Strait       188
Bob Dylan           188
Loretta Lynn        187
Cher                187
Alabama             187
Reba Mcentire       187
Chaka Khan          186
Dean Martin         186
Name: artist, dtype: int64

On average, we have about 90 songs from each artist (57,650 songs across 643 artists):

round(df['artist'].value_counts().mean(), 2)

89.66
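The same summary statistics can be checked on a small hand-made table before running them on the full dataset. The sketch below uses a hypothetical four-row DataFrame (the artist names and lyrics are made up for illustration) and assumes pandas is installed:

```python
import pandas as pd

# Hypothetical toy data standing in for songdata.csv (not from the real dataset).
toy_df = pd.DataFrame({
    'artist': ['A', 'A', 'A', 'B'],
    'text': ['la la', 'do re mi', 'na na na', 'hey hey'],
})

num_songs = toy_df.shape[0]                         # total number of rows (songs)
num_artists = toy_df['artist'].nunique()            # number of distinct artists
songs_per_artist = toy_df['artist'].value_counts()  # songs per artist, most frequent first
mean_songs = songs_per_artist.mean()                # average number of songs per artist

print(num_songs, num_artists, mean_songs)
```

The mean of the per-artist counts is simply the total number of songs divided by the number of distinct artists, which is why the average can also be read off directly from the two totals.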

The song lyrics are in the text column, so we join all the rows of that column into a single string and store it in a variable called data, as follows:

data = ', '.join(df['text'])
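One detail worth noticing is that the ', ' separator becomes part of the resulting string, so a comma and a space appear at every boundary between songs. A minimal sketch, using two made-up lyric strings in place of df['text']:

```python
# Two made-up lyric strings standing in for df['text'] (illustrative only).
toy_texts = ['la la la\n', 'do re mi\n']

# Same join as in the chapter: the separator is inserted between songs.
toy_data = ', '.join(toy_texts)

print(repr(toy_data))
print(len(toy_data))
```

The joined string is two characters longer than the lyrics alone for every boundary between songs.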

Let's look at the first few lines of the first song:

data[:369]

"Look at her face, it's a wonderful face  
And it means something special to me  
Look at the way that she smiles when she sees me  
How lucky can one fellow be?  
  
She's just my kind of girl, she makes me feel fine  
Who could ever believe that she could be mine?  
She's just my kind of girl, without her I'm blue  
And if she ever leaves me what could I do, what co"

Since we are building a character-level RNN, we store all the unique characters in our dataset in a variable called chars; this is our vocabulary:

chars = sorted(set(data))

Store the vocabulary size in a variable called vocab_size:

vocab_size = len(chars)

Since neural networks accept only numeric input, we need to convert every character in the vocabulary to a number.

We map each character in the vocabulary to its index in chars, which gives every character a unique number. We define a char_to_ix dictionary, which maps each character to its index. To look up the character for a given index, we also define the ix_to_char dictionary, which maps each index back to its character:

char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

As you can see in the following code snippet, the character 's' is mapped to index 68 in the char_to_ix dictionary:

print(char_to_ix['s'])

68

Similarly, if we look up 68 in ix_to_char, we get the corresponding character, which is 's':

print(ix_to_char[68])

s
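The two dictionaries are inverses of each other, so any string made of vocabulary characters can be converted to indices and back without loss. The following is a minimal sketch on a made-up sample string; the encode and decode helpers are hypothetical names introduced here for illustration and are not part of the chapter's code:

```python
# Made-up sample text; the real vocabulary is built from the full lyrics string.
sample_text = 'hello world'

# Build the vocabulary and both lookup tables the same way as above.
sample_chars = sorted(set(sample_text))
sample_char_to_ix = {ch: i for i, ch in enumerate(sample_chars)}
sample_ix_to_char = {i: ch for i, ch in enumerate(sample_chars)}

def encode(text):
    """Convert a string into a list of integer indices (hypothetical helper)."""
    return [sample_char_to_ix[ch] for ch in text]

def decode(indices):
    """Convert a list of integer indices back into a string (hypothetical helper)."""
    return ''.join(sample_ix_to_char[i] for i in indices)

encoded = encode('hello')
print(encoded)
print(decode(encoded))
```

Because the vocabulary is sorted, the indices depend on which characters occur in the text, so the same character generally gets a different index in a different dataset.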

Once we have the character-to-integer mapping, we use one-hot encoding to represent the input and output in vector form. A one-hot encoded vector is a vector of 0s with a single 1 at the position corresponding to the character's index.

For example, let's suppose that vocabSize is 7, and the character z is at index 4 in the vocabulary (the fifth position, since indexing starts at 0). Then, the one-hot encoded representation of the character z is obtained as follows:

import numpy as np

vocabSize = 7
char_index = 4

# Row char_index of the identity matrix is the one-hot vector
print(np.eye(vocabSize)[char_index])

[0. 0. 0. 0. 1. 0. 0.]

As you can see, we have a 1 at the corresponding index of the character, and the rest of the values are 0s. This is how we convert each character into a one-hot encoded vector.

In the following code, we define a function called one_hot_encoder, which returns the one-hot encoded vector for a given character index:

def one_hot_encoder(index):
    return np.eye(vocab_size)[index]
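Because np.eye returns an ordinary array, indexing it with a list of indices returns one row per index, so the same function can encode a whole sequence at once. The sketch below illustrates this with an assumed toy vocabulary size of 5 and a renamed copy of the function (toy_one_hot_encoder) so that it runs on its own:

```python
import numpy as np

# Assumed toy vocabulary size; in the chapter it is len(chars).
toy_vocab_size = 5

def toy_one_hot_encoder(index):
    # Row i of the identity matrix is the one-hot vector for index i.
    # Passing a list of indices selects one row per index.
    return np.eye(toy_vocab_size)[index]

indices = [3, 0, 4]
one_hot = toy_one_hot_encoder(indices)

print(one_hot.shape)           # (sequence length, vocabulary size)
print(one_hot.argmax(axis=1))  # recovers the original indices
```

Taking the argmax of each row recovers the original index, which is how a one-hot vector can be mapped back to a character through ix_to_char.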
