GloVe – Global Vectors for Word Representation

GloVe is an unsupervised algorithm developed at the Stanford NLP lab that learns vector representations for words from aggregated global word-word co-occurrence statistics (see references). Vectors pretrained on the following web-scale sources are available:

  • Common Crawl: 42B tokens with a 1.9M-token vocabulary, or 840B tokens with a 2.2M-token vocabulary
  • Wikipedia 2014 + Gigaword 5: 6B tokens with a 400K-token vocabulary
  • Twitter: 2B tweets and 27B tokens with a 1.2M-token vocabulary
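The pretrained vectors are distributed as plain text files in GloVe's own format, which differs from the word2vec text format only in that it lacks a header line. The sketch below, using toy 3-dimensional vectors rather than real GloVe values, shows what the conversion amounts to:

```python
# Sketch of the GloVe -> word2vec text conversion: GloVe files contain
# "word v1 v2 ... vd" per line with no header, while the word2vec text
# format starts with a "<vocab_size> <dimensions>" header line.

glove_text = (
    "the 0.418 0.24968 -0.41242\n"   # toy 3-dimensional vectors,
    "cat 0.68047 -0.039263 0.30186\n"  # not real GloVe values
)

def glove_to_word2vec(glove_lines: str) -> str:
    lines = glove_lines.strip().splitlines()
    dims = len(lines[0].split()) - 1  # tokens per line minus the word itself
    header = f"{len(lines)} {dims}"
    return "\n".join([header] + lines) + "\n"

converted = glove_to_word2vec(glove_text)
print(converted.splitlines()[0])  # → "2 3"
```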

We can use gensim to convert the vector text files to word2vec format and load them into a KeyedVectors object:

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# convert the GloVe text file to word2vec format, then load the vectors
glove2word2vec(glove_input_file=glove_file, word2vec_output_file=w2v_file)
model = KeyedVectors.load_word2vec_format(w2v_file, binary=False)

The word2vec authors provide text files containing over 24,000 analogy test questions that gensim uses to evaluate word vectors.
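Each analogy question asks whether vector arithmetic recovers a fourth word from three others, as in king - man + woman ≈ queen. The toy NumPy illustration below uses hand-crafted 2-dimensional vectors (not real GloVe values) to show how such a question is scored by cosine similarity:

```python
import numpy as np

# Hand-crafted toy vectors (not real GloVe values) in which "royalty" and
# "gender" correspond to roughly separate directions, so the analogy
# arithmetic king - man + woman lands near queen.
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.1, 0.8]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "cat":   np.array([0.5, 0.0]),  # distractor word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The three question words are excluded from the candidates, as in the
# standard analogy evaluation.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vectors[w], target))
print(best)  # → "queen"
```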

The word vectors trained on the Wikipedia corpus cover all analogies and achieve an overall accuracy of 75.5% with some variation across categories:

Category                  Samples  Accuracy  | Category               Samples  Accuracy
capital-common-countries  506      94.86%    | comparative            1,332    88.21%
capital-world             8,372    96.46%    | superlative            1,056    74.62%
city-in-state             4,242    60.00%    | present-participle     1,056    69.98%
currency                  752      17.42%    | nationality-adjective  1,640    92.50%
family                    506      88.14%    | past-tense             1,560    61.15%
adjective-to-adverb       992      22.58%    | plural                 1,332    78.08%
opposite                  756      28.57%    | plural-verbs           870      58.51%

The Common Crawl vectors for the 100,000 most common tokens cover about 80% of the analogies and achieve slightly higher accuracy at 78%, whereas the Twitter vectors cover only 25% with 62% accuracy.
