How it works...

Steps 1 to 3 load the packages, datasets, and functions required to assess the different text2vec examples. Logistic regression is implemented using the glmnet package with an L1 penalty (Lasso regularization). In step 4, a document-term matrix (DTM) is created using all the vocabulary words present in the training movie reviews, and the test AUC value is 0.918. In step 5, the vocabulary is pruned using stop words and frequency of occurrence, and the train and test DTMs are rebuilt from the pruned vocabulary.
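To make the pruning idea concrete, the following is a minimal, library-free sketch in Python rather than the recipe's actual text2vec code. The function names (`build_vocabulary`, `create_dtm`), the parameters (`min_count`, `max_doc_prop`), and the three toy reviews are all hypothetical and chosen only for illustration; whitespace tokenization is also an assumption.

```python
from collections import Counter


def build_vocabulary(docs, stop_words=frozenset(), min_count=1, max_doc_prop=1.0):
    """Collect terms from docs, then prune by stop words and frequency."""
    term_counts = Counter()  # total occurrences of each term
    doc_counts = Counter()   # number of documents containing each term
    for doc in docs:
        tokens = doc.lower().split()
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    n_docs = len(docs)
    # Keep a term only if it is not a stop word, is frequent enough,
    # and does not appear in too large a share of the documents.
    return sorted(
        term for term in term_counts
        if term not in stop_words
        and term_counts[term] >= min_count
        and doc_counts[term] / n_docs <= max_doc_prop
    )


def create_dtm(docs, vocabulary):
    """Build a dense document-term count matrix (one row per document)."""
    index = {term: j for j, term in enumerate(vocabulary)}
    dtm = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.lower().split():
            j = index.get(token)
            if j is not None:  # tokens outside the vocabulary are dropped
                row[j] += 1
        dtm.append(row)
    return dtm


# Toy data, invented for this sketch.
train = [
    "the movie was great great fun",
    "the plot was dull",
    "great cast and the plot was fun",
]
vocab = build_vocabulary(train, stop_words={"the", "was", "and"}, min_count=2)
dtm = create_dtm(train, vocab)
```

Because the test DTM must share the training columns, the same `vocab` would be passed to `create_dtm` for the test reviews. Test-only words are simply dropped, which is also why a pruned vocabulary yields a narrower matrix.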

The test AUC value after pruning is 0.916, only slightly lower than with the full vocabulary. In step 6, bi-grams are added to the vocabulary alongside single words (uni-grams), and the test AUC value increases to 0.928. Feature hashing is then performed in step 7, and the test AUC value is 0.895. Although the AUC value is lower, hashing is meant to improve run-time performance on larger datasets, because it avoids building and storing a vocabulary. Feature hashing was popularized by Yahoo. Finally, in step 8, we perform a tf-idf transformation, which returns a test AUC value of 0.907.
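The three transformations in steps 6 to 8 can be sketched in a few lines of plain Python. This illustrates the general techniques and is not text2vec's implementation: the function names, the `n_buckets` parameter, the use of CRC32 as the hash function, and the particular tf-idf formula (term frequency normalized by document length, times `log(n_docs / df)`) are all assumptions made for the example, and text2vec's own weighting may differ.

```python
import math
import zlib


def ngrams(tokens, n_max=2):
    """Return all n-grams up to n_max, joining words with an underscore."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append("_".join(tokens[i:i + n]))
    return out


def hash_vectorize(terms, n_buckets=16):
    """Hashing trick: map each term straight to a column, with no vocabulary."""
    row = [0] * n_buckets
    for term in terms:
        # CRC32 is deterministic across runs; distinct terms may collide
        # in the same bucket, which is the accuracy cost of hashing.
        row[zlib.crc32(term.encode("utf-8")) % n_buckets] += 1
    return row


def tfidf(dtm):
    """Reweight a count matrix so terms found in many documents count less."""
    n_docs = len(dtm)
    n_cols = len(dtm[0])
    df = [sum(1 for row in dtm if row[j] > 0) for j in range(n_cols)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    weighted = []
    for row in dtm:
        total = sum(row) or 1  # guard against empty documents
        weighted.append([(count / total) * w for count, w in zip(row, idf)])
    return weighted


terms = ngrams("not a good movie".split())
hashed = hash_vectorize(terms, n_buckets=16)
weights = tfidf([[1, 1], [1, 0]])
```

In this sketch the bi-gram `not_a` and its neighbors preserve some word order that uni-grams lose, which is one plausible reason the AUC rises in step 6. The hashed row has a fixed width regardless of how many distinct terms exist, so collisions between terms are possible and can account for some loss of accuracy.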
