The dataset we’ll use in this project is the Movie Review data from Rotten Tomatoes. It contains approximately 10,662 example review sentences, half positive and half negative, with a vocabulary of around 20k words. We will load the dataset from the raw files with scikit-learn's load_files() utility and then use a helper function, separate_dataset(), to clean the text and transform it from its raw form into separate lists of sentences and labels.
# Helper function: split each loaded file into individual sentences,
# clean them, and pair every sentence with its file's label
def separate_dataset(trainset, ratio=0.5):
    datastring = []
    datatarget = []
    for i in range(int(len(trainset.data) * ratio)):
        # each raw file contains one review sentence per line
        data_ = trainset.data[i].split('\n')
        data_ = list(filter(None, data_))
        for n in range(len(data_)):
            data_[n] = clearstring(data_[n])
        datastring += data_
        for n in range(len(data_)):
            datatarget.append(trainset.target[i])
    return datastring, datatarget
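The clearstring() function used above is referenced but never defined; its exact cleaning rules are an assumption here, but a minimal sketch that strips punctuation, collapses whitespace, and lowercases might look like:

```python
import re

def clearstring(string):
    # Keep only letters, digits, and spaces; collapse whitespace; lowercase.
    string = re.sub('[^A-Za-z0-9 ]+', '', string)
    string = ' '.join(string.split())
    return string.lower()

print(clearstring("It's a great movie!"))  # its a great movie
```

Any reasonable text normalizer will do, as long as it is applied consistently to both training and test sentences.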
Here, trainset is an object that stores both the text data and the sentiment labels:
trainset = sklearn.datasets.load_files(container_path = './data', encoding = 'UTF-8')
trainset.data, trainset.target = separate_dataset(trainset,1.0)
print (trainset.target_names)
print('No. of texts:', len(trainset.data))
print('No. of labels:', len(trainset.target))
# Output:
['negative', 'positive']
No. of texts: 10662
No. of labels: 10662
Now we will transform the labels into one-hot encoding.
We will use scikit-learn's train_test_split() utility to randomly shuffle the data and split it into two parts: a training set and a test set. Then, with another helper function, build_dataset(), we will create the vocabulary using a word-count-based approach.
ONEHOT = np.zeros((len(trainset.data), len(trainset.target_names)))
ONEHOT[np.arange(len(trainset.data)), trainset.target] = 1.0
train_X, test_X, train_Y, test_Y, train_onehot, test_onehot = train_test_split(
    trainset.data, trainset.target, ONEHOT, test_size=0.2
)
concat = ' '.join(trainset.data).split()
vocabulary_size = len(list(set(concat)))
data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)
print('vocab size: %d' % (vocabulary_size))
print('Most common words', count[4:10])
print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])
# Output:
vocab size: 20465
Most common words [(u'the', 10129), (u'a', 7312), (u'and', 6199), (u'of', 6063), (u'to', 4233), (u'is', 3378)]
Sample data [4, 662, 9, 2543, 8, 22, 4, 3558, 18064, 98]
[u'the', u'rock', u'is', u'destined', u'to', u'be', u'the', u'21st', u'centurys', u'new']
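The build_dataset() helper is also used without being shown. A sketch, under the assumption that it builds a count-ordered vocabulary with four reserved tag ids at the front (which is consistent with count[4:10] holding the most common real words), might be:

```python
import collections

def build_dataset(words, n_words):
    # Reserve the first four ids for the special tags.
    count = [['GO', 0], ['PAD', 0], ['EOS', 0], ['UNK', 0]]
    count.extend(collections.Counter(words).most_common(n_words))
    dictionary = {}
    for word, _ in count:
        dictionary[word] = len(dictionary)
    # Map the corpus to ids; out-of-vocabulary words become UNK (id 3).
    data = [dictionary.get(word, 3) for word in words]
    rev_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, rev_dictionary
```

Because the tags occupy ids 0 through 3, the most frequent real word ('the' in our corpus) lands at id 4, which matches the sample output above.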
A few important things to remember while preparing the dataset for RNN models: we need to explicitly add special tags to the vocabulary to keep track of the start of a sentence, extra padding, the end of a sentence, and unknown words. Hence we have reserved the following positions for special tags in our vocabulary dictionary:
# Tag to mark the beginning of the sentence
'GO'  = 0th position
# Tag to add extra padding in the sentence
'PAD' = 1st position
# Tag to mark the end of the sentence
'EOS' = 2nd position
# Tag to mark an unknown word
'UNK' = 3rd position
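To illustrate how these tags come into play, here is a hypothetical sentence_to_ids() function (the function and the tiny dictionary are illustrative, not part of the project code) that maps a sentence to a fixed-length id sequence, substituting UNK for out-of-vocabulary words and PAD to fill the sequence:

```python
def sentence_to_ids(sentence, dictionary, maxlen=10):
    # UNK (id 3) for out-of-vocabulary words, PAD (id 1) to fill to maxlen.
    ids = [dictionary.get(w, 3) for w in sentence.split()[:maxlen]]
    ids += [1] * (maxlen - len(ids))
    return ids

dictionary = {'GO': 0, 'PAD': 1, 'EOS': 2, 'UNK': 3, 'the': 4, 'rock': 5}
print(sentence_to_ids('the rock is great', dictionary, maxlen=6))
# [4, 5, 3, 3, 1, 1]
```

Padding every sentence to the same length is what lets us batch variable-length reviews into a single tensor for the RNN.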