GloVe – Global Vectors for Word Representation

GloVe is an unsupervised algorithm developed at the Stanford NLP lab that learns vector representations for words from aggregated global word-word co-occurrence statistics (see references). Vectors pretrained on the following web-scale sources are available:

  • Common Crawl: 42B tokens with a 1.9M-token vocabulary, or 840B tokens with a 2.2M-token vocabulary
  • Wikipedia 2014 + Gigaword 5: 6B tokens with a 400K-token vocabulary
  • Twitter: 2B tweets and 27B tokens with a 1.2M-token vocabulary
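The pretrained vectors are distributed as plain text files in GloVe's own format, which differs from the word2vec text format only in that it lacks a header line. The sketch below, using toy 3-dimensional vectors rather than real GloVe values, shows what the conversion amounts to:

```python
# Sketch of the GloVe -> word2vec text conversion: GloVe files contain
# "word v1 v2 ... vd" per line with no header, while the word2vec text
# format starts with a "<vocab_size> <dimensions>" header line.

glove_text = (
    "the 0.418 0.24968 -0.41242\n"   # toy 3-dimensional vectors,
    "cat 0.68047 -0.039263 0.30186\n"  # not real GloVe values
)

def glove_to_word2vec(glove_lines: str) -> str:
    lines = glove_lines.strip().splitlines()
    dims = len(lines[0].split()) - 1  # tokens per line minus the word itself
    header = f"{len(lines)} {dims}"
    return "\n".join([header] + lines) + "\n"

converted = glove_to_word2vec(glove_text)
print(converted.splitlines()[0])  # → "2 3"
```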

We can use gensim to convert the vector text files to word2vec format and load them into a KeyedVectors object:

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# convert the GloVe text file to word2vec format, then load the vectors
glove2word2vec(glove_input_file=glove_file, word2vec_output_file=w2v_file)
model = KeyedVectors.load_word2vec_format(w2v_file, binary=False)

The word2vec authors provide text files containing over 24,000 analogy test questions that gensim uses to evaluate word vectors.
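Each analogy question asks whether vector arithmetic recovers a fourth word from three others, as in king - man + woman ≈ queen. The toy NumPy illustration below uses hand-crafted 2-dimensional vectors (not real GloVe values) to show how such a question is scored by cosine similarity:

```python
import numpy as np

# Hand-crafted toy vectors (not real GloVe values) in which "royalty" and
# "gender" correspond to roughly separate directions, so the analogy
# arithmetic king - man + woman lands near queen.
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.1, 0.8]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "cat":   np.array([0.5, 0.0]),  # distractor word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The three question words are excluded from the candidates, as in the
# standard analogy evaluation.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vectors[w], target))
print(best)  # → "queen"
```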

The word vectors trained on the Wikipedia corpus cover all analogies and achieve an overall accuracy of 75.5% with some variation across categories:

Category                  Samples  Accuracy  | Category               Samples  Accuracy
capital-common-countries  506      94.86%    | comparative            1,332    88.21%
capital-world             8,372    96.46%    | superlative            1,056    74.62%
city-in-state             4,242    60.00%    | present-participle     1,056    69.98%
currency                  752      17.42%    | nationality-adjective  1,640    92.50%
family                    506      88.14%    | past-tense             1,560    61.15%
adjective-to-adverb       992      22.58%    | plural                 1,332    78.08%
opposite                  756      28.57%    | plural-verbs           870      58.51%

The Common Crawl vectors for the 100,000 most common tokens cover about 80% of the analogies and achieve slightly higher accuracy at 78%, whereas the Twitter vectors cover only 25% with 62% accuracy.
