The dataset we’ll use in this project is the Movie Review data from Rotten Tomatoes. It contains approximately 10,662 example review sentences, half positive and half negative, with a vocabulary of around 20k words. We will load the dataset from the raw files with scikit-learn's load_files() utility and then use a helper function, separate_dataset(), to clean the text and transform it from its raw form into separate lists of sentences and labels.
# Helper function: split each loaded file into individual sentences,
# clean them, and pair every sentence with its file's label
def separate_dataset(trainset, ratio=0.5):
    datastring = []
    datatarget = []
    for i in range(int(len(trainset.data) * ratio)):
        # each raw file contains one review sentence per line
        data_ = trainset.data[i].split('\n')
        data_ = list(filter(None, data_))
        for n in range(len(data_)):
            data_[n] = clearstring(data_[n])
        datastring += data_
        for n in range(len(data_)):
            datatarget.append(trainset.target[i])
    return datastring, datatarget
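The clearstring() function used above is referenced but never defined; its exact cleaning rules are an assumption here, but a minimal sketch that strips punctuation, collapses whitespace, and lowercases might look like:

```python
import re

def clearstring(string):
    # Keep only letters, digits, and spaces; collapse whitespace; lowercase.
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    string = ' '.join(string.split())
    return string.lower()

print(clearstring("It's a great movie!"))  # its a great movie
```

Any reasonable text normalizer will do, as long as it is applied consistently to both training and test sentences.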
Here, trainset is an object that stores both the text data and the sentiment labels:
trainset = sklearn.datasets.load_files(container_path = './data', encoding = 'UTF-8')
trainset.data, trainset.target = separate_dataset(trainset,1.0)
print (trainset.target_names)
print('No. of texts:', len(trainset.data))
print('No. of labels:', len(trainset.target))
# Output:
['negative', 'positive']
No. of texts: 10662
No. of labels: 10662
Now we will transform the labels into one-hot encoding.
We will use scikit-learn's train_test_split() utility to randomly shuffle the data and split it into two parts: a training set and a test set. Then, with another helper function, build_dataset(), we will create the vocabulary using a word-count-based approach.
ONEHOT = np.zeros((len(trainset.data), len(trainset.target_names)))
ONEHOT[np.arange(len(trainset.data)), trainset.target] = 1.0
train_X, test_X, train_Y, test_Y, train_onehot, test_onehot = train_test_split(
    trainset.data, trainset.target, ONEHOT, test_size=0.2
)
concat = ' '.join(trainset.data).split()
vocabulary_size = len(list(set(concat)))
data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)
print('vocab size: %d' % (vocabulary_size))
print('Most common words', count[4:10])
print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])
# Output:
vocab size: 20465
Most common words [(u'the', 10129), (u'a', 7312), (u'and', 6199), (u'of', 6063), (u'to', 4233), (u'is', 3378)]
Sample data [4, 662, 9, 2543, 8, 22, 4, 3558, 18064, 98]
[u'the', u'rock', u'is', u'destined', u'to', u'be', u'the', u'21st', u'centurys', u'new']
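The build_dataset() helper is also used without being shown. A sketch, under the assumption that it builds a count-ordered vocabulary with four reserved tag ids at the front (which is consistent with count[4:10] holding the most common real words), might be:

```python
import collections

def build_dataset(words, n_words):
    # Reserve the first four ids for the special tags.
    count = [['GO', 0], ['PAD', 0], ['EOS', 0], ['UNK', 0]]
    count.extend(collections.Counter(words).most_common(n_words))
    dictionary = {}
    for word, _ in count:
        dictionary[word] = len(dictionary)
    # Map the corpus to ids; out-of-vocabulary words become UNK (id 3).
    data = [dictionary.get(word, 3) for word in words]
    rev_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, rev_dictionary
```

Because the tags occupy ids 0 through 3, the most frequent real word ('the' in our corpus) lands at id 4, which matches the sample output above.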
A few important things to remember while preparing the dataset for RNN models: we need to explicitly add special tags to the vocabulary to keep track of the start of a sentence, extra padding, the end of a sentence, and unknown words. Hence we have reserved the following positions for special tags in our vocabulary dictionary:
# Tag to mark the beginning of the sentence
'GO'  = 0th position
# Tag to add extra padding in the sentence
'PAD' = 1st position
# Tag to mark the end of the sentence
'EOS' = 2nd position
# Tag to mark an unknown word
'UNK' = 3rd position
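To illustrate how these tags come into play, here is a hypothetical sentence_to_ids() function (the function and the tiny dictionary are illustrative, not part of the project code) that maps a sentence to a fixed-length id sequence, substituting UNK for out-of-vocabulary words and PAD to fill the sequence:

```python
def sentence_to_ids(sentence, dictionary, maxlen=10):
    # UNK (id 3) for out-of-vocabulary words, PAD (id 1) to fill to maxlen.
    ids = [dictionary.get(w, 3) for w in sentence.split()[:maxlen]]
    ids += [1] * (maxlen - len(ids))
    return ids

dictionary = {'GO': 0, 'PAD': 1, 'EOS': 2, 'UNK': 3, 'the': 4, 'rock': 5}
print(sentence_to_ids('the rock is great', dictionary, maxlen=6))
# [4, 5, 3, 3, 1, 1]
```

Padding every sentence to the same length is what lets us batch variable-length reviews into a single tensor for the RNN.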