The following are the other tokenizer factory implementations available in DL4J Word2Vec to generate tokenizers for the given input:
- NGramTokenizerFactory: This is the tokenizer factory that creates a tokenizer based on the n-gram model. An n-gram is a contiguous sequence of n words or letters drawn from the text corpus.
- PosUimaTokenizerFactory: This creates a UIMA-based tokenizer that filters tokens by their part-of-speech tags.
- UimaTokenizerFactory: This creates a tokenizer that uses the UIMA analysis engine for tokenization. The analysis engine inspects unstructured information, discovers and represents its semantic content. Unstructured information includes, but is not restricted to, text documents.
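The n-gram idea behind NGramTokenizerFactory can be sketched in plain Java. The class and method names below are illustrative, not part of DL4J; the sketch only shows how contiguous word n-grams of lengths minN to maxN are enumerated:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Build all contiguous word n-grams of lengths minN..maxN from a token list.
    static List<String> nGrams(List<String> tokens, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                // Join n consecutive tokens into one n-gram string
                grams.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return grams;
    }
}
```

For example, with minN = maxN = 2, the tokens ["deep", "learning", "for", "java"] yield the bigrams "deep learning", "learning for", and "for java".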
Here are the inbuilt token preprocessors (not including CommonPreprocessor) available in DL4J:
- EndingPreProcessor: This is a preprocessor that gets rid of word endings in the text corpus—for example, it removes s, ed, ., ly, and ing from the text.
- LowCasePreProcessor: This is a preprocessor that converts all tokens to lowercase.
- StemmingPreprocessor: This tokenizer preprocessor implements basic cleaning inherited from CommonPreprocessor and performs English Porter stemming on tokens.
- CustomStemmingPreprocessor: This is the stemming preprocessor that is compatible with the different stemmers defined as Lucene/Tartarus SnowballProgram subclasses, such as RussianStemmer, DutchStemmer, and FrenchStemmer. This means that it is suitable for multilanguage stemming.
- EmbeddedStemmingPreprocessor: This tokenizer preprocessor wraps a given preprocessor and performs English Porter stemming on tokens on top of it.
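To make the suffix-removal idea behind EndingPreProcessor concrete, here is a minimal, self-contained sketch. The exact rules DL4J applies may differ; the class and strip method below are illustrative only:

```java
public class EndingStripSketch {
    // Naive suffix stripping in the spirit of EndingPreProcessor.
    // Assumption: removes a trailing period, then one of the suffixes
    // ing, ed, ly, or s. DL4J's actual rules may differ.
    static String strip(String token) {
        String t = token;
        if (t.endsWith(".")) t = t.substring(0, t.length() - 1);
        if (t.endsWith("ing")) t = t.substring(0, t.length() - 3);
        else if (t.endsWith("ed")) t = t.substring(0, t.length() - 2);
        else if (t.endsWith("ly")) t = t.substring(0, t.length() - 2);
        else if (t.endsWith("s")) t = t.substring(0, t.length() - 1);
        return t;
    }
}
```

With these rules, "jumping" becomes "jump", "quickly" becomes "quick", and "words" becomes "word".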
We can also implement our own token preprocessor—for example, a preprocessor to remove all stop words from the tokens.
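Such a stop-word-removing preprocessor could look like the sketch below. DL4J preprocessors implement the TokenPreProcess interface (a single String preProcess(String token) method); a local stand-in interface is declared here so the example compiles without the library, and the class names are illustrative:

```java
import java.util.Set;

public class StopWordPreProcessorSketch {
    // Local stand-in for DL4J's TokenPreProcess interface, declared here
    // only so this sketch is self-contained.
    interface TokenPreProcess {
        String preProcess(String token);
    }

    // Hypothetical preprocessor: lowercases tokens and maps stop words to
    // an empty string so they can be dropped downstream.
    static class StopWordPreProcessor implements TokenPreProcess {
        private final Set<String> stopWords;

        StopWordPreProcessor(Set<String> stopWords) {
            this.stopWords = stopWords;
        }

        @Override
        public String preProcess(String token) {
            String t = token.toLowerCase();
            return stopWords.contains(t) ? "" : t;
        }
    }
}
```

In a real DL4J pipeline, such a preprocessor would be attached to a tokenizer factory, in the same way the inbuilt preprocessors above are.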