There's more...

The following are the other tokenizer factory implementations available in DL4J Word2Vec to generate tokenizers for the given input:

  • NGramTokenizerFactory: This is the tokenizer factory that creates a tokenizer based on the n-gram model. An n-gram is a contiguous sequence of n words or letters taken from the text corpus.
  • PosUimaTokenizerFactory: This creates a tokenizer that filters tokens by their part-of-speech tags.
  • UimaTokenizerFactory: This creates a tokenizer that uses the UIMA analysis engine for tokenization. The analysis engine inspects unstructured information, discovers semantic content in it, and represents that content. Unstructured information includes, but is not restricted to, text documents.
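
To make the n-gram idea concrete, here is a minimal plain-Java sketch that builds word-level n-grams from a sentence. It does not use DL4J at all; the class name NGramSketch and the method wordNGrams are hypothetical names chosen for illustration, and the sketch only shows what kind of tokens an n-gram tokenizer is expected to emit:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: NGramSketch and wordNGrams are hypothetical names,
// not part of DL4J.
public class NGramSketch {

    // Returns every contiguous run of n whitespace-separated words, in order.
    public static List<String> wordNGrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams; // no words, so no n-grams
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [deep learning, learning for, for java]
        System.out.println(wordNGrams("deep learning for java", 2));
    }
}
```

With n = 1 the same method returns the individual words, so unigrams are simply the special case of ordinary word tokenization.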

Here are the built-in token preprocessors (not including CommonPreprocessor) available in DL4J:

  • EndingPreProcessor: This is a preprocessor that gets rid of word endings in the text corpus; for example, it removes s, ed, ., ly, and ing from the text.
  • LowCasePreProcessor: This is a preprocessor that converts text to lowercase format.
  • StemmingPreprocessor: This tokenizer preprocessor implements basic cleaning inherited from CommonPreprocessor and performs English Porter stemming on tokens.
  • CustomStemmingPreprocessor: This stemming preprocessor works with any stemmer defined as a lucene/tartarus SnowballProgram, such as RussianStemmer, DutchStemmer, and FrenchStemmer, which makes it suitable for multilanguage stemming.
  • EmbeddedStemmingPreprocessor: This tokenizer preprocessor uses a given preprocessor and performs English Porter stemming on tokens on top of it.
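
The following plain-Java sketch illustrates the kind of per-token transformation that the lowercasing and ending-removal preprocessors describe. It is a simplified stand-in written for this explanation, not the DL4J source; the class PreprocessSketch and its methods are hypothetical names, and the exact rules applied by the real preprocessors may differ:

```java
import java.util.Locale;

// Illustration only: a simplified stand-in, not DL4J code.
public class PreprocessSketch {

    // Converts a token to lowercase.
    public static String lowerCase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Strips the first matching ending from the token, if any.
    // Endings are checked in the order listed; at most one is removed.
    public static String stripEnding(String token) {
        String[] endings = {"ing", "ed", "ly", "s", "."};
        for (String ending : endings) {
            if (token.length() > ending.length() && token.endsWith(ending)) {
                return token.substring(0, token.length() - ending.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripEnding(lowerCase("Training"))); // prints train
        System.out.println(stripEnding("quickly"));             // prints quick
        System.out.println(stripEnding("running"));             // prints runn
    }
}
```

The last call shows the limit of naive ending removal: "running" becomes "runn" rather than "run". Stemming algorithms such as Porter's apply additional rules to handle cases like this, which is one reason to prefer a stemming preprocessor when clean word stems matter.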

We can also implement our own token preprocessor—for example, a preprocessor to remove all stop words from the tokens.
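
As a rough sketch of such a stop-word preprocessor, the following example declares a local interface with the same single-method shape as DL4J's TokenPreProcess (a preProcess method that takes a token and returns a token) so that it compiles without DL4J on the classpath. The names StopWordSketch, StopWordPreProcessor, and removeStopWords are hypothetical, and the stop-word list is only a tiny sample:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: all names here are hypothetical.
public class StopWordSketch {

    // Local stand-in mirroring the shape of DL4J's TokenPreProcess,
    // declared here so the sketch is self-contained.
    public interface TokenPreProcess {
        String preProcess(String token);
    }

    // Lowercases each token and maps stop words to the empty string.
    public static class StopWordPreProcessor implements TokenPreProcess {
        private final Set<String> stopWords;

        public StopWordPreProcessor(Set<String> stopWords) {
            this.stopWords = stopWords;
        }

        @Override
        public String preProcess(String token) {
            String lower = token.toLowerCase(Locale.ROOT);
            return stopWords.contains(lower) ? "" : lower;
        }
    }

    // Applies the preprocessor to every token and drops the blanked-out ones.
    public static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        TokenPreProcess pre = new StopWordPreProcessor(stopWords);
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String processed = pre.preProcess(token);
            if (!processed.isEmpty()) {
                kept.add(processed);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "a"));
        List<String> tokens = Arrays.asList("The", "model", "is", "trained");
        System.out.println(removeStopWords(tokens, stopWords)); // prints [model, trained]
    }
}
```

A per-token preprocessor can only transform a token, not delete it, so this sketch returns an empty string for stop words and filters those out in a separate step. When plugging a similar class into a real DL4J tokenizer factory, it is worth checking how empty tokens are handled further down the pipeline.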
