Preprocessing and preparing the dataset

Define a function for preprocessing the dataset:

def pre_process(text):
    # convert to lowercase
    text = str(text).lower()

    # remove all special characters, keeping only alphanumeric characters, spaces, and periods
    text = re.sub(r'[^A-Za-z0-9\s.]', r'', text)

    # replace new lines with spaces
    text = re.sub(r'\n', r' ', text)

    # remove stop words
    text = " ".join([word for word in text.split() if word not in stopWords])

    return text
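Note that pre_process relies on the re module, a stopWords set, and the data DataFrame, which are assumed to have been set up earlier in the chapter. A minimal sketch of that setup, assuming NLTK's English stop words and a pandas read of the reviews file (the file name here is only a placeholder), might look like this:

import re

import pandas as pd
from nltk.corpus import stopwords

# stop word list (assumed: NLTK's English stop words)
stopWords = set(stopwords.words('english'))

# load the hotel reviews; header=None keeps the review text in column 0
# (the file name is a placeholder -- point this at your own copy of the dataset)
data = pd.read_csv('data/text.csv', header=None)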

We can see what the preprocessed text looks like by running the following code:

pre_process(data[0][50])

We get the following output:

'agree fancy. everything needed. breakfast pool hot tub nice shuttle airport later checkout time. noise issue tough sleep through. awhile forget noisy door nearby noisy guests. complained management later email credit compd us amount requested would return.'

Preprocess the whole dataset:

data[0] = data[0].map(lambda x: pre_process(x))

The gensim library requires input in the form of a list of lists:

text = [ [word1, word2, word3], [word1, word2, word3] ]

We know that each row in our data contains several sentences, so we split each row by '.' to convert it into a list:

data[0][1].split('.')[:5]

The preceding code generates the following output:

['stayed crown plaza april april ',
 ' staff friendly attentive',
 ' elevators tiny ',
 ' food restaurant delicious priced little high side',
 ' course washington dc']

As shown, we now have the data in a list, but we need a list of lists. So we split again, this time by a space ' ': that is, we first split the data by '.' and then split each resulting sentence by ' ' to get our data in a list of lists:

corpus = []
for line in data[0][1].split('.'):
    words = [x for x in line.split()]
    corpus.append(words)

You can see that we have our inputs in the form of a list of lists:

corpus[:2]

[['stayed', 'crown', 'plaza', 'april', 'april'], ['staff', 'friendly', 'attentive']]

Convert the whole text in our dataset to a list of lists:

data = data[0].map(lambda x: x.split('.'))

corpus = []
for i in range(len(data)):
    for line in data[i]:
        words = [x for x in line.split()]
        corpus.append(words)

print(corpus[:2])

As shown, we successfully converted the whole text in our dataset into a list of lists:

[['room', 'kind', 'clean', 'strong', 'smell', 'dogs'],
 ['generally', 'average', 'ok', 'overnight', 'stay', 'youre', 'fussy']]

Now, the problem we have is that our corpus contains only unigrams, so it will not give us a result when we feed in a bigram such as san francisco.

So we use gensim's Phrases function, which collects words that frequently occur together and adds an underscore between them. Now, san francisco becomes san_francisco.

We set the min_count parameter to 25, which means that we ignore all words and bigrams that appear fewer than min_count times:

from gensim.models import Phrases
from gensim.models.phrases import Phraser

phrases = Phrases(sentences=corpus, min_count=25, threshold=50)
bigram = Phraser(phrases)

for index, sentence in enumerate(corpus):
    corpus[index] = bigram[sentence]

As you can see, an underscore has now been added to the bigrams in our corpus:

corpus[111]

[u'connected', u'rivercenter', u'mall', u'downtown', u'san_antonio']

We check one more value from the corpus to see how an underscore is added for bigrams:

corpus[9]

[u'course', u'washington_dc']
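The trained phraser can also be applied to any new tokenized sentence. As a quick check (assuming the washington dc bigram was detected, which the output for corpus[9] indicates), passing the raw tokens through bigram merges them:

# apply the phraser to a new tokenized sentence;
# 'washington' and 'dc' are merged because that bigram was learned from the corpus
print(bigram[['course', 'washington', 'dc']])
# ['course', 'washington_dc']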