Each word or number in a tweet is a token, and the process of splitting tweets into tokens is called tokenization. The code used to carry out tokenization is as follows:
library(keras)
# t1 through t5 hold the five tweet strings defined earlier
tweets <- c(t1, t2, t3, t4, t5)
# Fit a tokenizer on the tweets; it builds a word index ordered by frequency
token <- text_tokenizer(num_words = 10) %>%
  fit_text_tokenizer(tweets)
# Look up the three most frequent words
token$index_word[1:3]
$`1`
[1] "the"
$`2`
[1] "aapl"
$`3`
[1] "in"
From the preceding code, we can see the following:
- We started by saving the five tweets in a character vector called tweets.
- For the tokenization process, we specified num_words = 10 to indicate that only the most frequent words should be kept and all others ignored.
- Although we asked for 10 frequent words, the tokenizer keeps only the top num_words - 1 = 9 words, so the largest integer index that can appear in the resulting sequences is 9 (demonstrated in the sketch after this list).
- We used fit_text_tokenizer, which by default converts the text to lowercase and strips punctuation from the tweets.
- We observed that the three most frequent words across these five tweets are the, aapl, and in.
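To make the num_words - 1 behavior concrete, the short sketch below converts the same tweets into integer sequences using texts_to_sequences from the same keras package; it assumes the token object fitted above. Any word outside the top nine is simply dropped from the output.
seqs <- texts_to_sequences(token, tweets)
# Indices for the first tweet; words outside the top 9 are dropped
seqs[[1]]
# The largest index that can appear is num_words - 1 = 9
max(unlist(seqs))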
Note that words with a high frequency may or may not be important for text classification; very common words often carry little class-specific signal.
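One way to judge which frequent words carry signal is to inspect the raw counts the tokenizer gathered while fitting. A minimal sketch using the fitted token object from above:
# Raw word frequencies collected during fitting (a named list,
# ordered by first appearance rather than by count)
token$word_counts
# A common word such as "the" is a stop word that often adds little
# class-specific signal, whereas a term like "aapl" may be informative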