After our analysis, we know that our word2vec model has learned some concepts from the provided corpus, but how do we visualize them? Because the model learns its features in a 300-dimensional space, direct visualization is practically impossible. To make it possible, we will use a dimensionality-reduction algorithm called t-SNE, which is well known for reducing a high-dimensional space to a more humanly understandable 2- or 3-dimensional one.
t-SNE was introduced by Laurens van der Maaten.
To implement this, we will use the sklearn package and set n_components=2, which means we want a 2-dimensional space as the output. Next, we will perform the transformation by feeding the word vectors into the t-SNE object.
After this step, we have a pair of values for each word, which we can use as its x- and y-coordinates to plot it in the 2-D plane. Let's prepare a dataframe that stores every word together with its x and y coordinates, as shown in figure 3.2, and use it to create a scatter plot:
import sklearn.manifold
import pandas as pd
import seaborn as sns

# Reduce the 300-dimensional word vectors to a 2-dimensional embedding.
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)
all_word_vectors_matrix = model2vec.wv.vectors
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)

# Pair every word in the vocabulary with its 2-D coordinates.
# Note: model2vec.wv.vocab is the gensim pre-4.0 API; in gensim 4+,
# iterate model2vec.wv.key_to_index instead.
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[model2vec.wv.vocab[word].index])
            for word in model2vec.wv.vocab
        ]
    ],
    columns=["word", "x", "y"],
)

# Plot every word as a single point in one large scatter plot.
sns.set_context("poster")
ax = points.plot.scatter("x", "y", s=10, figsize=(20, 12))
fig = ax.get_figure()
This is our dataframe, containing each word along with its x and y coordinates.
This is what the entire cluster looks like after plotting all 425,633 tokens in the 2-D plane. Words that the model learned to be related end up positioned close to one another:
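With this many points, the full plot is too dense to read individual words, so it helps to zoom into a rectangular region and inspect the words that cluster there. The helper below is a minimal sketch: the function name words_in_region and the demo dataframe are illustrative, standing in for the real points dataframe built above.

```python
import pandas as pd

def words_in_region(points, x_bounds, y_bounds):
    """Return the rows of `points` whose (x, y) falls inside the
    rectangle given by x_bounds and y_bounds (inclusive)."""
    return points[
        (x_bounds[0] <= points.x) & (points.x <= x_bounds[1])
        & (y_bounds[0] <= points.y) & (points.y <= y_bounds[1])
    ]

# Tiny illustrative dataframe standing in for the real `points`.
demo = pd.DataFrame(
    [("king", 1.0, 2.0), ("queen", 1.2, 2.1), ("sword", 8.0, -3.0)],
    columns=["word", "x", "y"],
)
region = words_in_region(demo, (0, 2), (1, 3))
print(sorted(region.word))  # ['king', 'queen']
```

Plotting the returned slice with region.plot.scatter("x", "y") and annotating each row with ax.text(row.x, row.y, row.word) then gives a labelled close-up of that neighbourhood.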