Now, we will see how to perform document classification using doc2vec. In this section, we will use the 20 Newsgroups dataset. It consists of 20,000 documents spread over 20 different news categories. We will use only four categories: Electronics, Politics, Science, and Sports, so we have 1,000 documents under each of these four categories. We rename the documents with a category_ prefix; for example, all science documents are renamed Science_1, Science_2, and so on. After renaming them, we combine all the documents and place them in a single folder. The combined data, along with the complete code, is available as a Jupyter Notebook on GitHub at http://bit.ly/2KgBWYv.
Now, we train our doc2vec model to classify and find similarities between these documents.
First, we import all the necessary libraries:
import warnings
warnings.filterwarnings('ignore')
import os
import gensim
from gensim.models.doc2vec import TaggedDocument
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
tokenizer = RegexpTokenizer(r'\w+')
stopWords = set(stopwords.words('english'))
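The tokenizer and stop word set defined above can be used to clean each document before training. As a minimal sketch of that preprocessing step, the following uses Python's built-in re module to mirror the `\w+` pattern (so it runs without nltk), with a tiny illustrative stop word set in place of nltk's full English list; the `preprocess` helper name is our own:

```python
import re

# A tiny stop word set for illustration only; nltk's
# stopwords.words('english') returns a much larger list.
stop_words = {'the', 'a', 'is', 'of', 'and', 'in', 'to'}

def preprocess(text):
    # Lowercase, tokenize on \w+ (as RegexpTokenizer(r'\w+') does),
    # and drop stop words.
    tokens = re.findall(r'\w+', text.lower())
    return [t for t in tokens if t not in stop_words]

print(preprocess('The cat sat in the hat.'))  # ['cat', 'sat', 'hat']
```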
Now, we load all our documents and save the document names in the docLabels list and the document content in a list called data:
docLabels = []
docLabels = [f for f in os.listdir('data/news_dataset') if f.endswith('.txt')]
data = []
for doc in docLabels:
    data.append(open('data/news_dataset/'+doc).read())
You can see that the docLabels list contains all our document names:
docLabels[:5]
['Electronics_827.txt', 'Electronics_848.txt', 'Science_829.txt', 'Politics_38.txt', 'Politics_688.txt']
Define a class called DocIterator, which acts as an iterator to run over all the documents:
class DocIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])
Create an object called it of the DocIterator class:
it = DocIterator(data, docLabels)
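To see what the iterator yields, note that TaggedDocument is essentially a named tuple of (words, tags). The following sketch mimics it with collections.namedtuple so it runs without gensim, using two made-up example documents and labels:

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which pairs a token list
# with a list of tags.
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

docs = ['the match went to extra time', 'transistors amplify current']
labels = ['Sports_1.txt', 'Electronics_1.txt']

class DocIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])

first = next(iter(DocIterator(docs, labels)))
print(first.tags)   # ['Sports_1.txt']
print(first.words)  # ['the', 'match', 'went', 'to', 'extra', 'time']
```

Each document is thus tagged with its filename, which is what lets us look up documents by name later with most_similar.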
Now, let's build the model. First, let's define some of its important hyperparameters:
- The size parameter represents our embedding size.
- The alpha parameter represents our learning rate.
- The min_alpha parameter implies that our learning rate, alpha, will decay to min_alpha during training.
- Setting dm=1 implies that we use the distributed memory (PV-DM) model and if we set dm=0, it implies that we use the distributed bag of words (PV-DBOW) model for training.
- The min_count parameter represents the minimum frequency of words. If a particular word occurs fewer than min_count times, we simply ignore that word.
These hyperparameters are defined as:
size = 100
alpha = 0.025
min_alpha = 0.025
dm = 1
min_count = 1
Now, let's define the model using the gensim.models.Doc2Vec() class:
model = gensim.models.Doc2Vec(size=size, min_count=min_count, alpha=alpha, min_alpha=min_alpha, dm=dm)
model.build_vocab(it)
Train the model:
for epoch in range(100):
    model.train(it, total_examples=model.corpus_count, epochs=1)
    # Decay the learning rate gradually; 0.0002 * 100 epochs = 0.02,
    # so alpha stays positive (0.025 - 0.02 = 0.005)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha
After training, we can save the model, using the save function:
model.save('model/doc2vec.model')
We can load the saved model, using the load function:
d2v_model = gensim.models.doc2vec.Doc2Vec.load('model/doc2vec.model')
Now, let's evaluate our model's performance. The following code shows that when we feed the Sports_1.txt document as an input, it returns all the related documents with the corresponding similarity scores:
d2v_model.docvecs.most_similar('Sports_1.txt')
[('Sports_957.txt', 0.719024658203125),
 ('Sports_694.txt', 0.6904895305633545),
 ('Sports_836.txt', 0.6636477708816528),
 ('Sports_869.txt', 0.657712459564209),
 ('Sports_123.txt', 0.6526877880096436),
 ('Sports_4.txt', 0.6499642729759216),
 ('Sports_749.txt', 0.6472041606903076),
 ('Sports_369.txt', 0.6408025026321411),
 ('Sports_167.txt', 0.6392412781715393),
 ('Sports_104.txt', 0.6284008026123047)]
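Since every document name starts with its category, we can turn this similarity list into a classifier by taking a majority vote over the category prefixes of the nearest neighbors. A minimal sketch, where `neighbors` is hypothetical data in the same (name, score) format that most_similar returns and `predict_category` is our own helper:

```python
from collections import Counter

# Hypothetical output in the same format as docvecs.most_similar():
# a list of (document name, similarity score) pairs.
neighbors = [('Sports_957.txt', 0.72),
             ('Sports_694.txt', 0.69),
             ('Politics_38.txt', 0.61)]

def predict_category(neighbors):
    # Majority vote over the category prefix (the part before '_')
    # of each neighboring document's name.
    categories = [name.split('_')[0] for name, score in neighbors]
    return Counter(categories).most_common(1)[0][0]

print(predict_category(neighbors))  # Sports
```

For the Sports_1.txt query above, all ten neighbors are Sports documents, so the vote is unanimous and the document is classified correctly.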