Data analysis and pre-processing

In this section, we are going to define some helper functions that will enable us to build a good Word2Vec model. For this implementation, we are going to use a cleaned version of Wikipedia (http://mattmahoney.net/dc/textdata.html).

So, let's start off by importing the required packages for this implementation:

#importing the required packages for this implementation
import numpy as np
import tensorflow as tf


#Packages for downloading the dataset
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile

#packages for data preprocessing
import re
from collections import Counter
import random

Next up, we are going to define a class that will be used to download the dataset if it has not already been downloaded:

# In this implementation, we will use a cleaned-up version of Wikipedia from Matt Mahoney.
# So we will define a helper class that will help us download the dataset
wiki_dataset_folder_path = 'wikipedia_data'
wiki_dataset_filename = 'text8.zip'
wiki_dataset_name = 'Text8 Dataset'

class DLProgress(tqdm):

    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

# Checking if the file has not already been downloaded
if not isfile(wiki_dataset_filename):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc=wiki_dataset_name) as pbar:
        urlretrieve(
            'http://mattmahoney.net/dc/text8.zip',
            wiki_dataset_filename,
            pbar.hook)

# Checking if the data is already extracted; if not, extract it
if not isdir(wiki_dataset_folder_path):
    with zipfile.ZipFile(wiki_dataset_filename) as zip_ref:
        zip_ref.extractall(wiki_dataset_folder_path)

with open('wikipedia_data/text8') as f:
    cleaned_wikipedia_text = f.read()

Output:

Text8 Dataset: 31.4MB [00:39, 794kB/s]

We can have a look at the first 100 characters of this dataset:

cleaned_wikipedia_text[0:100]

' anarchism originated as a term of abuse first used against early working class radicals including t'

Next up, we are going to preprocess the text. We will define a helper function that replaces special characters, such as punctuation, with known tokens. Also, to reduce the amount of noise in the input text, we will remove words that don't appear frequently in the text:

def preprocess_text(input_text):

    # Replace punctuation with special tokens so we can use them in our model
    input_text = input_text.lower()
    input_text = input_text.replace('.', ' <PERIOD> ')
    input_text = input_text.replace(',', ' <COMMA> ')
    input_text = input_text.replace('"', ' <QUOTATION_MARK> ')
    input_text = input_text.replace(';', ' <SEMICOLON> ')
    input_text = input_text.replace('!', ' <EXCLAMATION_MARK> ')
    input_text = input_text.replace('?', ' <QUESTION_MARK> ')
    input_text = input_text.replace('(', ' <LEFT_PAREN> ')
    input_text = input_text.replace(')', ' <RIGHT_PAREN> ')
    input_text = input_text.replace('--', ' <HYPHENS> ')
    input_text = input_text.replace(':', ' <COLON> ')
    text_words = input_text.split()

    # Discarding all the words that have five occurrences or fewer
    text_word_counts = Counter(text_words)
    trimmed_words = [word for word in text_words if text_word_counts[word] > 5]

    return trimmed_words

Now, let's call this function on the input text and have a look at the output:

preprocessed_words = preprocess_text(cleaned_wikipedia_text)
print(preprocessed_words[:30])
Output:
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst']

Let's see how many words and distinct words we have for the pre-processed version of the text:

print("Total number of words in the text: {}".format(len(preprocessed_words)))
print("Total number of unique words in the text: {}".format(len(set(preprocessed_words))))

Output:

Total number of words in the text: 16680599
Total number of unique words in the text: 63641

Next, we are going to create dictionaries to convert words to integers and back again, that is, integers to words. The integers are assigned in descending frequency order, so the most frequent word (the) is given the integer 0, the next most frequent gets 1, and so on. The words are then converted to integers and stored in the list integer_words.

As mentioned earlier in this section, we need to use the integer indexes of the words to look up their values in the weight matrix, so these two lookup tables let us go from a word to its index and from an index back to the actual word.

So, let's define a function to create this lookup table:

def create_lookuptables(input_words):
    """
    Creating lookup tables for the vocabulary

    Function arguments:
    param input_words: Input list of words
    """
    input_word_counts = Counter(input_words)
    sorted_vocab = sorted(input_word_counts, key=input_word_counts.get, reverse=True)
    integer_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_integer = {word: ii for ii, word in integer_to_vocab.items()}

    # Returning a tuple of dicts
    return vocab_to_integer, integer_to_vocab

Now, let's call the defined function to create the lookup table:

vocab_to_integer, integer_to_vocab = create_lookuptables(preprocessed_words)
integer_words = [vocab_to_integer[word] for word in preprocessed_words]
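
As a quick sanity check of these lookup tables (assuming, as is typically the case for this corpus, that the is the most frequent word), we can inspect a few entries in both directions:

# Inspecting the lookup tables: index 0 corresponds to the most frequent word
print(integer_to_vocab[0])                      # typically 'the'
print(vocab_to_integer[integer_to_vocab[0]])    # maps back to 0
print(integer_words[:10])                       # the first ten words encoded as integers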

To build a more accurate model, we can remove words that don't change the context much, such as of, for, the, and so on. In practice, it has been shown that we can build more accurate models by discarding these kinds of words. The process of removing such context-irrelevant words is called subsampling. In order to define a general mechanism for word discarding, Mikolov introduced a function for calculating the discard probability of a certain word, which is given by:

P(wi) = 1 - sqrt(t / f(wi))

Where:

  • t is a threshold parameter for word discarding
  • f(wi) is the frequency of a specific target word wi in the input dataset
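
For example, with t = 1e-5, a word that accounts for 5% of all tokens gets a discard probability of 1 - sqrt(1e-5 / 0.05) ≈ 0.986, so it is dropped about 98.6% of the time, while a word whose frequency is at or below the threshold is never dropped (these numbers are illustrative and not taken from the text8 corpus).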

So, we are going to implement a helper function that will calculate the discarding probability of each word in the dataset:

# Threshold for discarding context-irrelevant words
word_threshold = 1e-5

word_counts = Counter(integer_words)
total_number_words = len(integer_words)

# Calculating the frequencies of the words
frequencies = {word: count/total_number_words for word, count in word_counts.items()}

#Calculating the discard probability
prob_drop = {word: 1 - np.sqrt(word_threshold/frequencies[word]) for word in word_counts}
training_words = [word for word in integer_words if random.random() < (1 - prob_drop[word])]
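
To get a feel for how aggressive this subsampling is, we can compare the corpus size before and after it; note that the exact count changes from run to run because the discarding is random:

# Comparing corpus size before and after subsampling (the exact count changes per run)
print("Words before subsampling: {}".format(len(integer_words)))
print("Words after subsampling: {}".format(len(training_words)))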

Now, we have a more refined and clean version of the input text.

We mentioned that the skip-gram architecture considers the context of the target word while producing its real-valued representation, so it defines a window around the target word that has size C.

Instead of treating all contextual words equally, we are going to assign less weight to words that are farther from the target word. For example, if we choose the size of the window to be C = 4, then we are going to select a random number L from the range of 1 to C, and then sample L words from the history and the future of the current word. For more details, refer to the Mikolov et al. paper at: https://arxiv.org/pdf/1301.3781.pdf.

So, let's go ahead and define this function:

# Defining a function that returns the words around a specific index within a specific window
def get_target(input_words, ind, context_window_size=5):

    # Selecting a random number to be used for sampling words from the history and future of the current word
    rnd_num = np.random.randint(1, context_window_size+1)
    start_ind = ind - rnd_num if (ind - rnd_num) > 0 else 0
    stop_ind = ind + rnd_num

    target_words = set(input_words[start_ind:ind] + input_words[ind+1:stop_ind+1])

    return list(target_words)
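
To see what this function returns, we can call it on a small toy sequence; because the window size is sampled at random, the exact output changes from call to call:

# Toy example: context words around index 4 of a small sequence
# The result is random because the actual window size is sampled between 1 and context_window_size
toy_sequence = list(range(10))
print(get_target(toy_sequence, ind=4, context_window_size=3))   # e.g. [2, 3, 5, 6] for a sampled window of 2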

Also, let's define a generator function that produces random batches from the training samples and gets the contextual words for each word in that batch:

# Defining a function for generating word batches as a tuple (inputs, targets)
def generate_random_batches(input_words, train_batch_size, context_window_size=5):

    num_batches = len(input_words)//train_batch_size

    # Working on full batches only
    input_words = input_words[:num_batches*train_batch_size]

    for ind in range(0, len(input_words), train_batch_size):
        input_vals, target = [], []
        input_batch = input_words[ind:ind+train_batch_size]

        # Getting the context for each word
        for ii in range(len(input_batch)):
            batch_input_vals = input_batch[ii]
            batch_target = get_target(input_batch, ii, context_window_size)

            target.extend(batch_target)
            input_vals.extend([batch_input_vals]*len(batch_target))
        yield input_vals, target
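
Although the training loop itself comes later, the following minimal sketch shows how these batches would be consumed; the batch size and window size here are illustrative choices, not values fixed by the text:

# Illustrative usage of the batch generator (batch and window sizes are arbitrary here)
batch_generator = generate_random_batches(training_words, train_batch_size=1000, context_window_size=10)
input_batch, target_batch = next(batch_generator)
print(len(input_batch), len(target_batch))   # the two lists always have the same length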