In step 1, we used BasicLineIterator, a basic single-line sentence iterator that treats each line of the input as one sentence and applies no customization.
In step 2, we used LineSentenceIterator to iterate through multi-line text data. Each line is treated as a sentence, so this iterator suits input that spans many lines of text.
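The line-as-sentence behavior behind steps 1 and 2 can be sketched in plain Java. This is not the DL4J iterator itself, just a minimal illustration of the contract: read a file, emit each non-empty line as one sentence (the class and method names here are illustrative).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of a line-based sentence iterator:
// every non-empty line of the file is treated as one sentence.
public class LineSentenceSketch {
    public static List<String> sentences(Path file) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    result.add(line.trim()); // one line == one sentence
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sentences", ".txt");
        Files.write(tmp, Arrays.asList("This is the first sentence.",
                                       "This is the second."));
        for (String s : sentences(tmp)) {
            System.out.println(s);
        }
        Files.delete(tmp);
    }
}
```

In DL4J itself, the equivalent is constructing the iterator over a file and pulling sentences with `hasNext()`/`nextSentence()`.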
In step 3, CollectionSentenceIterator accepts a list of strings as text input, where each string represents a sentence (document), such as a list of tweets or articles.
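A collection-backed iterator of this kind can be sketched as a thin wrapper over a Java collection. This is a plain-Java illustration of the idea, not the DL4J class; the `hasNext()`/`nextSentence()` names mirror DL4J's sentence-iterator style but the class itself is hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Sketch of a collection-backed sentence iterator: each string in the
// collection (e.g. one tweet or one short article) is one sentence/document.
public class CollectionSentenceSketch {
    private final Iterator<String> inner;

    public CollectionSentenceSketch(Collection<String> sentences) {
        this.inner = sentences.iterator();
    }

    public boolean hasNext()     { return inner.hasNext(); }
    public String nextSentence() { return inner.next(); }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("First tweet.", "Second tweet.");
        CollectionSentenceSketch it = new CollectionSentenceSketch(tweets);
        while (it.hasNext()) {
            System.out.println(it.nextSentence());
        }
    }
}
```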
In step 4, FileSentenceIterator processes the sentences in a file or directory. Sentences are read line by line from each file.
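The directory-level behavior can be sketched as follows: visit every regular file under a directory and emit each non-empty line as one sentence. Again, this is a stdlib illustration of what such an iterator does, not DL4J's implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of a directory-level sentence source: walk the directory tree,
// then read each regular file line by line, one sentence per line.
public class FileSentenceSketch {
    public static List<String> sentences(Path dir) throws IOException {
        List<String> result = new ArrayList<>();
        try (Stream<Path> files = Files.walk(dir)) {
            List<Path> regular = files.filter(Files::isRegularFile)
                                      .sorted()
                                      .collect(Collectors.toList());
            for (Path file : regular) {
                for (String line : Files.readAllLines(file)) {
                    if (!line.trim().isEmpty()) {
                        result.add(line.trim());
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("corpus");
        Files.write(dir.resolve("a.txt"), Arrays.asList("First line.", "Second line."));
        for (String s : sentences(dir)) {
            System.out.println(s);
        }
    }
}
```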
For anything complex, we recommend UimaSentenceIterator, which is a proper machine-learning-grade pipeline. It iterates over a set of files and segments them into sentences. The UimaSentenceIterator pipeline can perform tokenization, lemmatization, and part-of-speech tagging, and its behavior can be customized through the analysis engines that are passed in (an analysis engine is a text-processing pipeline). This iterator is the best fit for complex data, such as data returned from the Twitter API.
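The analysis-engine idea can be sketched as a chain of text-processing stages, where the pipeline's behavior depends entirely on which stages you pass in. The stages below (lowercasing, punctuation stripping) are hypothetical stand-ins, not real UIMA analysis engines; they only illustrate the composition pattern.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Sketch of the analysis-engine pattern: each "engine" is one
// text-processing stage, applied to the sentence in order.
public class AnalysisPipelineSketch {
    public static String process(String sentence, List<UnaryOperator<String>> engines) {
        for (UnaryOperator<String> engine : engines) {
            sentence = engine.apply(sentence);
        }
        return sentence;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> engines = Arrays.asList(
            s -> s.toLowerCase(Locale.ROOT),     // normalization stage
            s -> s.replaceAll("[^a-z0-9 ]", "")  // punctuation-stripping stage
        );
        System.out.println(process("Hello, World!", engines));
    }
}
```

Swapping in a different list of stages changes the pipeline's behavior without touching the iteration logic, which is the same flexibility the analysis engines give UimaSentenceIterator.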
We can normalize the data and remove anomalies by defining a preprocessor on the data iterator, which is why we defined a normalizer (preprocessor) in step 5.
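The preprocessor pattern can be sketched like so: every sentence passes through the normalizer before it reaches the rest of the pipeline. This mirrors attaching a preprocessor to a DL4J sentence iterator, but in plain Java; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Sketch of a sentence preprocessor: a normalizer function applied
// to each sentence before it is handed to the training pipeline.
public class PreprocessorSketch {
    public static List<String> normalizeAll(List<String> sentences,
                                            UnaryOperator<String> preProcessor) {
        return sentences.stream()
                        .map(preProcessor)
                        .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Simple normalizer: lowercase and trim surrounding whitespace.
        UnaryOperator<String> normalizer = s -> s.toLowerCase(Locale.ROOT).trim();
        List<String> out = normalizeAll(
            Arrays.asList("  Hello WORLD  ", "FOO Bar"), normalizer);
        System.out.println(out);
    }
}
```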