LASSO model

I'm going to provide limited commentary during this section, as we covered this ground in Chapter 4, Advanced Feature Selection in Linear Models. We will create our model using LASSO and check its performance on the test data. Let's specify our x and y for the cv.glmnet() function:

> x <- dtm_train_tfidf

> y <- as.factor(train$party)

The minimum number of folds in cross-validation with glmnet is three, which we will use given the small number of observations:

> set.seed(123)

> lasso <- glmnet::cv.glmnet(
    x,
    y,
    nfolds = 3,
    type.measure = "auc",
    alpha = 1,
    family = "binomial"
  )

> plot(lasso)

The output of the preceding code is as follows:

Wow! Out of all those input features, just a handful are relevant, and the area under the curve (AUC) is around 0.75.
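Before moving to the test data, it's worth seeing which terms survived the penalty. The non-zero coefficients at the one-standard-error lambda can be pulled directly from the fit; this is a minimal sketch, and the relevant object name is just for illustration:

> lasso_coefs <- coef(lasso, s = "lambda.1se")

> relevant <- data.frame(
    term = rownames(lasso_coefs)[as.vector(lasso_coefs != 0)],
    coefficient = lasso_coefs[as.vector(lasso_coefs != 0)]
  )

> relevant

Can that AUC hold during validation?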

> lasso_test <-
    data.frame(predict(lasso, newx = dtm_test_tfidf,
                       type = "response", s = "lambda.1se"))
> testY <- as.numeric(ifelse(test$party == "Republican", 1, 0))

> Metrics::auc(testY, lasso_test$X1)
[1] 0.8958333
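
The AUC tells us how well the predicted probabilities rank the two parties, but nothing about hard classifications. As a supplementary check, here is a sketch of a confusion matrix and accuracy at a 0.5 probability cutoff; the cutoff is an assumption on my part, not something we've tuned:

> pred_class <- ifelse(lasso_test$X1 >= 0.5, 1, 0)

> table(actual = testY, predicted = pred_class)

> mean(pred_class == testY)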

It is a small dataset, observation-wise, but performance is OK. How could we improve this? Well, you may say we could add observations from the 19th century, but the party affiliation and political debate in that era were very different from today's. You could possibly add principal components (see the sketch below), or try ensembles. Those are just a few ideas.
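To expand on the principal components idea: the TF-IDF matrix is sparse, so a truncated decomposition is the practical route. Here is a sketch using the irlba package, assuming dtm_train_tfidf is a standard sparse matrix; a truncated SVD of TF-IDF amounts to uncentered principal components, known in text mining as latent semantic analysis, and the choice of 25 components is illustrative, not tuned:

> library(irlba)

> set.seed(123)

> lsa <- irlba::irlba(dtm_train_tfidf, nv = 25)

> x_lsa <- lsa$u %*% diag(lsa$d)

The scores in x_lsa could then stand in for x in the cv.glmnet() call above, with the test documents projected via dtm_test_tfidf %*% lsa$v. We'll transition now to looking at some other quantitative methods of interest.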
