We will use the pre-trained NLTK tokenizer (http://www.nltk.org/index.html) and the English stopword list to clean our corpus and extract the relevant unique words from it. We will also write a small helper module that takes the collection of unprocessed sentences and outputs lists of words.
"""**Download NLTK tokenizer models and stopwords (only the first time)**"""
import nltk

nltk.download("punkt")
nltk.download("stopwords")
import re

def sentence_to_wordlist(raw):
    # Keep only letters, replacing everything else with spaces
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    # Lower-case every token; return a list (map() would give a lazy iterator in Python 3)
    return [word.lower() for word in words]
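A quick sanity check of the helper on a sample sentence shows how punctuation and digits are stripped and tokens are lower-cased (the sample text here is illustrative, not from the corpus):

```python
import re

def sentence_to_wordlist(raw):
    # Keep only letters, then split on whitespace and lower-case each token
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return [word.lower() for word in clean.split()]

print(sentence_to_wordlist("Hello, World! It's 2024."))
# → ['hello', 'world', 'it', 's']
```

Note that the apostrophe splits "It's" into two tokens; for word2vec training on a large corpus this coarse normalization is usually acceptable.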
Since we haven't yet captured the text responses for our hypothetical business use case, let's work with a good-quality dataset available on the web; demonstrating our understanding and skills on this corpus will prepare us for the real data. You can also use your own dataset, but it should contain a large number of words so that the word2vec model can generalize well. We will therefore load our data from the Gutenberg.org website.
Then we tokenize the raw corpus into sentences, and each sentence into a list of clean words, as shown below.
import requests

# Article about the Earth from the Gutenberg website
filepath = 'http://www.gutenberg.org/files/33224/33224-0.txt'
corpus_raw = requests.get(filepath).text
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = tokenizer.tokenize(corpus_raw)
# Sentences where each word is tokenized
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))
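Once `sentences` is built, it is worth sanity-checking its size and vocabulary before training. A minimal sketch, using a small stand-in list in place of the real Gutenberg output so it runs without a network call:

```python
# Stand-in for the tokenized corpus; the real list comes from the loop above
sentences = [['the', 'earth', 'rotates'],
             ['the', 'moon', 'orbits', 'the', 'earth']]

# Total token count across all sentences
token_count = sum(len(sentence) for sentence in sentences)
# Vocabulary: the set of unique words in the corpus
unique_words = set(word for sentence in sentences for word in sentence)

print(token_count)        # → 8
print(len(unique_words))  # → 5
```

A large gap between the token count and the vocabulary size is expected in natural text, and a sufficiently large token count is what lets word2vec see each word in enough contexts.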