Visualization of document embeddings

As with the earlier visualizations, let us look at how doc2vec embeds documents, and how the final document embeddings can help us understand the underlying topics of those documents:

In the preceding visualization, some clusters are visible in the upper-left corner and at the bottom of the screenshot. This suggests that the topics of the documents can be inferred without any explicit labeling of those topics. Let us see what these clusters contain:

In the preceding cluster, we can see that the doc2vec model has discovered a group of documents that discuss malware attacks. The visualization in the middle shows the cluster, while the list on the right shows the documents and their similarity scores relative to the document titled "Is this government attempt to destroy cryptocurrency". Although this document does not contain the words malware or attack, doc2vec has inferred that it discusses malware attacks from the contexts in which ransomware appears in other documents. Let us explore another cluster present in the embedding projector:
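The ranked list of similarity scores described above can be approximated in a few lines. The following is a minimal sketch, assuming cosine similarity is the measure being used; the function names, the document titles, and the three-dimensional vectors are all hypothetical stand-ins, not values from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_neighbors(query_title, embeddings, top_n=3):
    """Return (title, score) pairs for the documents closest to the query."""
    query_vec = embeddings[query_title]
    scores = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in embeddings.items()
        if title != query_title  # skip the query document itself
    ]
    # Highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy embeddings; real doc2vec vectors have many more dimensions
toy_embeddings = {
    "query document": [0.9, 0.1, 0.0],
    "ransomware report": [0.8, 0.2, 0.1],
    "malware advisory": [0.7, 0.3, 0.0],
    "archived URL page": [0.0, 0.1, 0.9],
}

neighbors = rank_neighbors("query document", toy_embeddings)
for title, score in neighbors:
    print(f"{score:.3f}  {title}")
```

In this toy data, the two security-related documents rank well above the archived URL page, which mirrors how documents in the same cluster appear at the top of the neighbor list.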

This cluster shows a set of documents that are completely separate from the other documents in the corpus. These documents record the URLs from which they were archived: each contains a URL along with the words "archived from". The rest of the documents in this cluster follow the same pattern of a URL plus a short piece of text:

The preceding figure compares doc2vec with the CBOW model of Word2vec. It shows that doc2vec is a direct extension of Word2vec, moving from learning word vectors to learning document vectors.
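To make the comparison concrete, the following minimal sketch contrasts how the input to the prediction step could be built in each model. It assumes the distributed-memory (PV-DM) variant of doc2vec with averaging, which is the variant usually drawn alongside CBOW; the function names and the toy two-dimensional vectors are hypothetical.

```python
def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cbow_input(context_words, word_vectors):
    """CBOW: average only the context word vectors."""
    return average([word_vectors[w] for w in context_words])


def pv_dm_input(doc_id, context_words, word_vectors, doc_vectors):
    """PV-DM (assumed variant): average the document vector with the context word vectors."""
    vectors = [doc_vectors[doc_id]] + [word_vectors[w] for w in context_words]
    return average(vectors)


# Hypothetical toy vectors for illustration only
word_vectors = {"malware": [1.0, 0.0], "attack": [0.0, 1.0]}
doc_vectors = {"doc_0": [2.0, 2.0]}
context = ["malware", "attack"]

cbow_hidden = cbow_input(context, word_vectors)
pv_dm_hidden = pv_dm_input("doc_0", context, word_vectors, doc_vectors)
print(cbow_hidden)
print(pv_dm_hidden)
```

The only structural difference is the extra document vector in the average. During training, that vector is updated together with the word vectors, which is how a per-document embedding comes to be learned.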
