How it works...

In step 1, we used BasicLineIterator, which is a basic single-line sentence iterator with no customization involved; each line of the underlying file is treated as one sentence.
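
A minimal sketch, assuming the corpus is a local plain-text file (the rawSentences.txt path is just a placeholder), looks like this:

import java.io.File;

import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

// Each line of the file is returned as one sentence.
SentenceIterator iterator = new BasicLineIterator(new File("rawSentences.txt"));
while (iterator.hasNext()) {
    System.out.println(iterator.nextSentence());
}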

In step 2, we used LineSentenceIterator to iterate through multi-sentence text data, where each line in the file is treated as a sentence. This makes it suitable for text files that contain one sentence per line.
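
As a sketch, with a hypothetical multiLineCorpus.txt file:

import java.io.File;

import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

// One sentence per line, read from a multi-line text file.
SentenceIterator iterator = new LineSentenceIterator(new File("multiLineCorpus.txt"));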

In step 3, CollectionSentenceIterator accepts a list of strings as text input, where each string represents a sentence (document). This could be a list of tweets or articles.
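
For example, a small in-memory collection of tweets (the strings here are placeholders) can be wrapped directly:

import java.util.Arrays;
import java.util.List;

import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

// Each string in the collection is treated as one sentence (document).
List<String> tweets = Arrays.asList(
    "the weather is great today",
    "just finished reading a great book",
    "stuck in traffic on the highway again");
SentenceIterator iterator = new CollectionSentenceIterator(tweets);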

In step 4, FileSentenceIterator processes sentences in a file or directory. Sentences are read line by line from each file.
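
A sketch, assuming the corpus files live in a local directory (the path is a placeholder):

import java.io.File;

import org.deeplearning4j.text.sentenceiterator.FileSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

// Point the iterator at a directory; each file inside is read line by line.
SentenceIterator iterator = new FileSentenceIterator(new File("/path/to/corpusDir"));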

For anything complex, we recommend UimaSentenceIterator, which is a proper machine-learning-grade pipeline built on Apache UIMA. It iterates over a set of files and segments them into sentences. The UimaSentenceIterator pipeline can perform tokenization, lemmatization, and part-of-speech tagging, and its behavior can be customized through the analysis engines that are passed to it (an analysis engine is a UIMA text-processing pipeline). This iterator is the best fit for complex data, such as the data returned from the Twitter API.
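
A sketch, assuming the deeplearning4j-nlp-uima module is on the classpath and that the corpus path (a placeholder here) points to the files to process; note that createWithPath() throws a checked exception:

import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.sentenceiterator.UimaSentenceIterator;

// createWithPath() wires up a default UIMA analysis engine that
// segments the files under the given path into sentences.
SentenceIterator iterator = UimaSentenceIterator.createWithPath("/path/to/corpusDir");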

You need to call the reset() method if you want to traverse the iterator from the beginning again after it has been consumed once.
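
For example, using any of the iterators created earlier:

// First pass over the corpus.
while (iterator.hasNext()) {
    System.out.println(iterator.nextSentence());
}

// Rewind the iterator before traversing it a second time.
iterator.reset();
while (iterator.hasNext()) {
    System.out.println(iterator.nextSentence());
}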

We can normalize the data and remove anomalies by defining a preprocessor on the iterator, which is why we defined a normalizer (preprocessor) in step 5.
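
A sketch of a simple normalizing preprocessor; the exact lowercasing and stripping rules here are just an illustration:

import org.deeplearning4j.text.sentenceiterator.SentencePreProcessor;

// The preprocessor is applied to every sentence as it is read.
iterator.setPreProcessor(new SentencePreProcessor() {
    @Override
    public String preProcess(String sentence) {
        // Lowercase and strip punctuation to normalize the text.
        return sentence.toLowerCase().replaceAll("[^a-z0-9 ]", "");
    }
});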
