Feature engineering

Once we have our textual data properly processed via the methods mentioned in the previous section, we can utilize some of the following techniques for feature extraction and transformation into numerical form. Code snippets to better understand feature engineering for textual data are available in the Jupyter Notebook feature_engineering_text_data.ipynb:

Bag-of-words model: This is by far the simplest vectorization technique for textual data. In this technique, each document is represented as a vector on N dimensions, where N indicates all possible words across the preprocessed corpus, and each component of the vector either denotes the presence of the word or its frequency.
TF-IDF model: The bag-of-words model works under very simplistic assumptions and at certain times leads to various issues. One of the most common issues is related to some words overshadowing the rest of the words due to very high frequency, as the bag-of-words model utilizes absolute frequencies to vectorize. The Term Frequency-Inverse Document Frequency (TF-IDF) model mitigates this issue by scaling/normalizing the absolute frequencies. Mathematically, the model is defined as follows:
tfidf (w, D) = tf (W, D) * idf (w, D)

Here, tfidf (w, D) denotes the TF-IDF score of each word w in document D, tf (w, D) is the frequency of word w in document D and idf (w, D) denotes the inverse document frequency, calculated as the log transformation of total documents in corpus C divided by frequency of documents where w occurs.

Apart from bag of words and TF-IDF, there are other transformations, such as bag of N-grams, and word embeddings such as Word2vec, GloVe, and many more. We will cover several of them in detail in subsequent chapters.

Table of Contents for Feature engineering

Create new playlist

Sign In

Sign Up

Table of Contents for
Feature engineering