We will use the pre-trained NLTK tokenizer (http://www.nltk.org/index.html) and the English stopword list to clean our corpus and extract the relevant unique words from it. We will also write a small helper module that takes the collection of unprocessed sentences and outputs lists of words.
"""**Download NLTK tokenizer models and stopwords (only the first time)**"""
import nltk

nltk.download("punkt")
nltk.download("stopwords")
import re

def sentence_to_wordlist(raw):
    # Keep only letters, replacing everything else with spaces
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    # Lower-case every token; return a list (map() would give a lazy iterator in Python 3)
    return [word.lower() for word in words]
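A quick sanity check of the helper on a sample sentence shows how punctuation and digits are stripped and tokens are lower-cased (the sample text here is illustrative, not from the corpus):

```python
import re

def sentence_to_wordlist(raw):
    # Keep only letters, then split on whitespace and lower-case each token
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return [word.lower() for word in clean.split()]

print(sentence_to_wordlist("Hello, World! It's 2024."))
# → ['hello', 'world', 'it', 's']
```

Note that the apostrophe splits "It's" into two tokens; for word2vec training on a large corpus this coarse normalization is usually acceptable.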
Since we haven't yet captured the text responses for our hypothetical business use case, let's work with a good-quality dataset available on the web; demonstrating our understanding and skills on this corpus will prepare us for the real data. You can also use your own dataset, but it should contain a large number of words so that the word2vec model can generalize well. We will therefore load our data from the Gutenberg.org website.
Then we tokenize the raw corpus into sentences, and each sentence into a list of clean words, as shown below.
import requests

# Article about the Earth from the Gutenberg website
filepath = 'http://www.gutenberg.org/files/33224/33224-0.txt'
corpus_raw = requests.get(filepath).text
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = tokenizer.tokenize(corpus_raw)
# Sentences where each word is tokenized
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))
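Once `sentences` is built, it is worth sanity-checking its size and vocabulary before training. A minimal sketch, using a small stand-in list in place of the real Gutenberg output so it runs without a network call:

```python
# Stand-in for the tokenized corpus; the real list comes from the loop above
sentences = [['the', 'earth', 'rotates'],
             ['the', 'moon', 'orbits', 'the', 'earth']]

# Total token count across all sentences
token_count = sum(len(sentence) for sentence in sentences)
# Vocabulary: the set of unique words in the corpus
unique_words = set(word for sentence in sentences for word in sentence)

print(token_count)        # → 8
print(len(unique_words))  # → 5
```

A large gap between the token count and the vocabulary size is expected in natural text, and a sufficiently large token count is what lets word2vec see each word in enough contexts.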