The following are the other tokenizer factory implementations available in DL4J Word2Vec to generate tokenizers for the given input:
- NGramTokenizerFactory: This is the tokenizer factory that creates a tokenizer based on the n-gram model. An n-gram is a contiguous sequence of n words or letters drawn from the text corpus.
- PosUimaTokenizerFactory: This creates a UIMA-based tokenizer that filters tokens by their part-of-speech tags.
- UimaTokenizerFactory: This creates a tokenizer that uses the UIMA analysis engine for tokenization. The analysis engine inspects unstructured information, discovers and represents its semantic content. Unstructured information includes, but is not restricted to, text documents.
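The n-gram idea behind NGramTokenizerFactory can be sketched in plain Java. The class and method names below are illustrative, not part of DL4J; the sketch only shows how contiguous word n-grams of lengths minN to maxN are enumerated:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Build all contiguous word n-grams of lengths minN..maxN from a token list.
    static List<String> nGrams(List<String> tokens, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                // Join n consecutive tokens into one n-gram string
                grams.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return grams;
    }
}
```

For example, with minN = maxN = 2, the tokens ["deep", "learning", "for", "java"] yield the bigrams "deep learning", "learning for", and "for java".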
Here are the inbuilt token preprocessors (not including CommonPreprocessor) available in DL4J:
- EndingPreProcessor: This is a preprocessor that gets rid of word endings in the text corpus—for example, it removes s, ed, ., ly, and ing from the text.
- LowCasePreProcessor: This is a preprocessor that converts all tokens to lowercase.
- StemmingPreprocessor: This tokenizer preprocessor implements basic cleaning inherited from CommonPreprocessor and performs English Porter stemming on tokens.
- CustomStemmingPreprocessor: This is the stemming preprocessor that is compatible with the different stemmers defined as Lucene/Tartarus SnowballProgram subclasses, such as RussianStemmer, DutchStemmer, and FrenchStemmer. This means that it is suitable for multilanguage stemming.
- EmbeddedStemmingPreprocessor: This tokenizer preprocessor wraps a given preprocessor and performs English Porter stemming on tokens on top of it.
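To make the suffix-removal idea behind EndingPreProcessor concrete, here is a minimal, self-contained sketch. The exact rules DL4J applies may differ; the class and strip method below are illustrative only:

```java
public class EndingStripSketch {
    // Naive suffix stripping in the spirit of EndingPreProcessor.
    // Assumption: removes a trailing period, then one of the suffixes
    // ing, ed, ly, or s. DL4J's actual rules may differ.
    static String strip(String token) {
        String t = token;
        if (t.endsWith(".")) t = t.substring(0, t.length() - 1);
        if (t.endsWith("ing")) t = t.substring(0, t.length() - 3);
        else if (t.endsWith("ed")) t = t.substring(0, t.length() - 2);
        else if (t.endsWith("ly")) t = t.substring(0, t.length() - 2);
        else if (t.endsWith("s")) t = t.substring(0, t.length() - 1);
        return t;
    }
}
```

With these rules, "jumping" becomes "jump", "quickly" becomes "quick", and "words" becomes "word".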
We can also implement our own token preprocessor—for example, a preprocessor to remove all stop words from the tokens.
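Such a stop-word-removing preprocessor could look like the sketch below. DL4J preprocessors implement the TokenPreProcess interface (a single String preProcess(String token) method); a local stand-in interface is declared here so the example compiles without the library, and the class names are illustrative:

```java
import java.util.Set;

public class StopWordPreProcessorSketch {
    // Local stand-in for DL4J's TokenPreProcess interface, declared here
    // only so this sketch is self-contained.
    interface TokenPreProcess {
        String preProcess(String token);
    }

    // Hypothetical preprocessor: lowercases tokens and maps stop words to
    // an empty string so they can be dropped downstream.
    static class StopWordPreProcessor implements TokenPreProcess {
        private final Set<String> stopWords;

        StopWordPreProcessor(Set<String> stopWords) {
            this.stopWords = stopWords;
        }

        @Override
        public String preProcess(String token) {
            String t = token.toLowerCase();
            return stopWords.contains(t) ? "" : t;
        }
    }
}
```

In a real DL4J pipeline, such a preprocessor would be attached to a tokenizer factory, in the same way the inbuilt preprocessors above are.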