Finding similar documents using doc2vec

Now, we will see how to perform document classification using doc2vec. In this section, we will use the 20 newsgroups dataset. It consists of 20,000 documents spread over 20 different news categories. We will use only four categories: Electronics, Politics, Science, and Sports, so we have 1,000 documents under each of these four categories. We rename the documents with their category as a prefix; for example, all science documents are renamed Science_1, Science_2, and so on. After renaming them, we combine all the documents and place them in a single folder. The combined data, along with the complete code, is available as a Jupyter Notebook on GitHub at http://bit.ly/2KgBWYv.
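
If you want to reproduce this preparation step yourself, a minimal sketch along the following lines would do the renaming and merging. The raw/<category> source layout and folder names here are assumptions for illustration only; the notebook linked above already ships the combined data:

import os
import shutil

categories = ['Electronics', 'Politics', 'Science', 'Sports']
os.makedirs('data/news_dataset', exist_ok=True)

for category in categories:
    # Assumed source layout: one folder per category, for example raw/Science/
    source_dir = os.path.join('raw', category)
    for i, fname in enumerate(sorted(os.listdir(source_dir)), start=1):
        src = os.path.join(source_dir, fname)
        dst = os.path.join('data/news_dataset', '{}_{}.txt'.format(category, i))
        shutil.copy(src, dst)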

Now, we train our doc2vec model to classify and find similarities between these documents.

First, we import all the necessary libraries:

import warnings
warnings.filterwarnings('ignore')

import os
import gensim
from gensim.models.doc2vec import TaggedDocument

from nltk import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'\w+')
stopWords = set(stopwords.words('english'))
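
The tokenizer and stop word set give us a simple way to clean raw text. They are not applied in the code that follows, but a small helper such as the following (a sketch, not part of the original notebook) shows how they could be used to lowercase a document, split it into words, and drop English stop words:

def clean_text(text):
    # Lowercase, tokenize on word characters, and remove English stop words
    tokens = tokenizer.tokenize(text.lower())
    return [token for token in tokens if token not in stopWords]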

Now, we load all our documents and save the document names in the docLabels list and the document content in a list called data:

docLabels = [f for f in os.listdir('data/news_dataset') if f.endswith('.txt')]

data = []
for doc in docLabels:
    data.append(open('data/news_dataset/' + doc).read())

You can see that the docLabels list holds all of our document names:

docLabels[:5]

['Electronics_827.txt', 'Electronics_848.txt', 'Science_829.txt', 'Politics_38.txt', 'Politics_688.txt']

Define a class called DocIterator, which iterates over all the documents and yields each one as a TaggedDocument:

class DocIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])

Create an object called it of the DocIterator class:

it = DocIterator(data, docLabels)

Now, let's build the model. First, let's define some of its important hyperparameters:

  • The size parameter represents our embedding size.
  • The alpha parameter represents our learning rate.
  • The min_alpha parameter implies that our learning rate, alpha, will decay to min_alpha during training.
  • Setting dm=1 means that we use the distributed memory (PV-DM) model, while setting dm=0 means that we use the distributed bag of words (PV-DBOW) model for training.
  • The min_count parameter represents the minimum frequency of words. If a particular word occurs fewer times than min_count, we simply ignore that word.

These hyperparameters are defined as:

size = 100
alpha = 0.025
min_alpha = 0.025
dm = 1
min_count = 1

Now, let's define the model using the gensim.models.Doc2Vec() class:

model = gensim.models.Doc2Vec(size=size, min_count=min_count, alpha=alpha, min_alpha=min_alpha, dm=dm)
model.build_vocab(it)

Train the model:

for epoch in range(100):
    model.train(it, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

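Note that manually decaying the learning rate inside a loop follows the style of older gensim examples. With more recent gensim releases, you can usually let the library manage the learning-rate schedule and train in a single call; the parameter names (vector_size, epochs) depend on your gensim version, so treat the following as a sketch rather than a drop-in replacement:

model = gensim.models.Doc2Vec(vector_size=size, min_count=min_count,
                              alpha=alpha, min_alpha=0.0001, dm=dm, epochs=100)
model.build_vocab(it)
model.train(it, total_examples=model.corpus_count, epochs=model.epochs)
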
After training, we can save the model, using the save function:

model.save('model/doc2vec.model')

We can load the saved model, using the load function:

d2v_model = gensim.models.doc2vec.Doc2Vec.load('model/doc2vec.model')

Now, let's evaluate our model's performance. The following code shows that when we feed in the Sports_1.txt document, the model returns the most similar documents along with their similarity scores:

d2v_model.docvecs.most_similar('Sports_1.txt')

[('Sports_957.txt', 0.719024658203125),
 ('Sports_694.txt', 0.6904895305633545),
 ('Sports_836.txt', 0.6636477708816528),
 ('Sports_869.txt', 0.657712459564209),
 ('Sports_123.txt', 0.6526877880096436),
 ('Sports_4.txt', 0.6499642729759216),
 ('Sports_749.txt', 0.6472041606903076),
 ('Sports_369.txt', 0.6408025026321411),
 ('Sports_167.txt', 0.6392412781715393),
 ('Sports_104.txt', 0.6284008026123047)]
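
The lookup above only works for documents that were part of training. For a new, unseen piece of text, you can first infer a vector and then query the stored document vectors with it. The sample sentence below is made up purely for illustration:

# Infer a vector for an unseen document and find the most similar training documents
new_doc = "the team won the championship game last night".split()
inferred_vector = d2v_model.infer_vector(new_doc)
d2v_model.docvecs.most_similar([inferred_vector], topn=5)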