Preparing the data for model building involves the following steps:
- Tokenization
- Converting text into integers
- Padding and truncation
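As a rough sketch of what these three steps do, the following base R code walks through them on two made-up sentences. The sentences, the helper `fix_length`, and the choice of `maxlen <- 5` are assumptions made purely for illustration. Padding and truncating at the front mirrors a common convention (for example, the default in Keras), but it is not the only option.

```r
# Two short, made-up sentences used only for illustration
texts <- c("The phone is great", "Great phone, great price, great camera")

# Step 1: tokenization (lowercase, drop everything except letters,
# digits, and spaces, then split on whitespace)
clean  <- gsub("[^a-z0-9 ]", "", tolower(texts))
tokens <- strsplit(trimws(clean), "\\s+")

# Step 2: convert text into integers; more frequent words get smaller indices
freq  <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- setNames(seq_along(freq), names(freq))
seqs  <- lapply(tokens, function(tok) unname(vocab[tok]))

# Step 3: padding and truncation to a fixed length
# (hypothetical helper: truncates and pads at the front)
maxlen <- 5
fix_length <- function(s, maxlen) {
  s <- tail(s, maxlen)                # truncate: keep the last maxlen integers
  c(rep(0L, maxlen - length(s)), s)   # pad: add zeros at the front
}
padded <- t(sapply(seqs, fix_length, maxlen = maxlen))

print(vocab)
print(padded)
```

Each row of `padded` has exactly `maxlen` integers: the shorter sentence is left-padded with zeros, while the longer one loses its first word.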
To illustrate the data preparation steps, we will use a very small text dataset of five tweets about the release of Apple's iPhone X in September 2017. Once the steps are clear, we will switch to the larger IMDb dataset to build a deep network classification model. The five tweets are stored in t1 to t5:
t1 <- "I'm not a huge $AAPL fan but $160 stock closes down $0.60 for the day on huge volume isn't really bearish"
t2 <- "$AAPL $BAC not sure what more dissapointing: the new iphones or the presentation for the new iphones?"
t3 <- "IMO, $AAPL animated emojis will be the death of $SNAP."
t4 <- "$AAPL get on board. It's going to 175. I think wall st will have issues as aapl pushes 1 trillion dollar valuation but 175 is in the cards"
t5 <- "In the AR vs. VR battle, $AAPL just put its chips behind AR in a big way."
The preceding tweets contain a mix of lowercase and uppercase text, punctuation, numbers, and special characters.
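To make this observation concrete, the following base R sketch checks which kinds of characters occur in the first tweet and shows one simple way to normalize them. The text of t1 is copied here so that the block runs on its own. The variable names and the normalization rule (lowercase everything, then keep only letters, digits, and spaces) are assumptions for illustration, not necessarily what a particular tokenizer does.

```r
# Text of tweet t1, copied so that this block is self-contained
tweet <- "I'm not a huge $AAPL fan but $160 stock closes down $0.60 for the day on huge volume isn't really bearish"

# Check which kinds of characters the tweet contains
has_upper   <- grepl("[A-Z]", tweet)
has_lower   <- grepl("[a-z]", tweet)
has_digit   <- grepl("[0-9]", tweet)
has_special <- grepl("[^A-Za-z0-9 ]", tweet)  # punctuation and symbols such as $ or '

print(c(upper = has_upper, lower = has_lower,
        digit = has_digit, special = has_special))

# One simple normalization: lowercase, then drop everything
# that is not a letter, a digit, or a space
normalized <- gsub("[^a-z0-9 ]", "", tolower(tweet))
print(normalized)
```

Note that this simple rule also strips the dollar sign and the decimal point, so `$0.60` becomes `060`. Whether that loss matters depends on the modeling task.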