Textual data requires careful preprocessing before any feature extraction or feature engineering can be performed. The following are some of the most widely used preprocessing steps for textual data:
- Tokenization
- Lowercasing
- Removal of special characters
- Contraction expansions
- Stopword removal
- Spell corrections
- Stemming and lemmatization
We will cover most of these techniques in detail in the chapters on specific use cases. For further background, readers may refer to Chapter 4 and Chapter 7 of Practical Machine Learning with Python, Sarkar et al., Springer, 2017.
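
As a rough illustration of how several of the steps listed above fit together, the following is a minimal sketch that uses only the Python standard library. The `CONTRACTIONS` and `STOPWORDS` tables, the `naive_stem` helper, and the sample sentence are all hypothetical and deliberately tiny; a real pipeline would typically rely on a library such as NLTK or spaCy. Spell correction and lemmatization are left out because they need external resources.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"a", "an", "the", "is", "it", "i", "do", "not", "to", "of", "and"}


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS))
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def naive_stem(token):
    """Toy suffix stripper; not a real stemming algorithm such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply the preprocessing steps in sequence and return a token list."""
    text = text.lower()                       # lowercasing
    text = expand_contractions(text)          # contraction expansion
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # removal of special characters
    tokens = text.split()                     # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [naive_stem(t) for t in tokens]    # naive stemming


print(preprocess("It's a GREAT movie, I loved watching it!!"))
```

Note that the order of the steps matters in this sketch: contractions are expanded before special characters are removed, because stripping the apostrophe first would leave fragments such as `it` and `s` that no longer match the contraction table. The crude stemmer also reduces `loved` to `lov`, which shows why stemmed tokens are not always dictionary words.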