How it works...

In step 1, we used DefaultTokenizerFactory() to create the tokenizer factory that tokenizes the words. This is the default tokenizer for Word2Vec and is based on a string tokenizer or stream tokenizer. We also used CommonPreprocessor as the token preprocessor. A preprocessor removes anomalies from the text corpus. CommonPreprocessor is a token preprocessor implementation that removes punctuation marks and converts the text to lowercase. It uses the toLowerCase(String) method, whose behavior depends on the default locale.
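The locale caveat is worth seeing concretely. The following plain-Java snippet (not DL4J code; just an illustration of the String.toLowerCase behavior mentioned above) shows how the same input lowercases differently under a Turkish locale, where 'I' maps to the dotless 'ı' (U+0131):

```java
import java.util.Locale;

public class LowercaseLocaleDemo {
    public static void main(String[] args) {
        String word = "TITLE";
        // Explicit Turkish locale: 'I' lowercases to dotless 'ı' (U+0131).
        System.out.println(word.toLowerCase(new Locale("tr")));  // prints "tıtle"
        // Locale.ROOT gives locale-independent results.
        System.out.println(word.toLowerCase(Locale.ROOT));      // prints "title"
    }
}
```

Because toLowerCase(String) follows the JVM's default locale, two machines with different locale settings can tokenize the same corpus into different vocabularies.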

Here are the configurations that we made in step 2:

  • minWordFrequency(): This is the minimum number of times a word must appear in the text corpus. In our example, if a word appears fewer than five times, it is not learned. Words should occur multiple times in the corpus for the model to learn useful features about them. In very large text corpora, it's reasonable to raise this minimum.
  • layerSize(): This defines the number of features in a word vector. This is equivalent to the number of dimensions in the feature space. Words represented by 100 features become points in a 100-dimensional space.
  • iterate(): This specifies the data source on which the training takes place. We pass in an iterator that supplies the text to be converted into word vectors. In our case, we passed in a sentence iterator.
  • epochs(): This specifies the number of iterations over the training corpus as a whole.
  • windowSize(): This defines the context window size.
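Putting steps 1 and 2 together, a typical configuration might look like the following sketch. It assumes DL4J's Word2Vec builder API; the corpus path "corpus.txt" and the parameter values are illustrative placeholders, not values from the recipe:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

// Step 1: tokenizer factory with CommonPreprocessor (lowercasing, punctuation removal).
TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

// Sentence iterator over a line-per-sentence corpus ("corpus.txt" is a placeholder).
SentenceIterator iterator = new BasicLineIterator("corpus.txt");

// Step 2: the configuration discussed above (values shown are illustrative).
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)    // ignore words seen fewer than 5 times
        .layerSize(100)         // 100-dimensional word vectors
        .iterate(iterator)      // the training data source
        .epochs(1)              // passes over the whole corpus
        .windowSize(5)          // context window size
        .tokenizerFactory(tokenizerFactory)
        .build();
vec.fit();                      // run training
```

This is a configuration sketch rather than a complete program; in practice you would wrap it in a class with a main method and point the iterator at your own corpus.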