Summary

In this chapter, we discussed feature extraction and developed an understanding of the basic techniques for transforming arbitrary data into feature representations that machine learning algorithms can use. First, we created features from categorical explanatory variables using one-hot encoding and scikit-learn's DictVectorizer. Then, we discussed the creation of feature vectors for one of the most common types of data used in machine learning problems: text. We worked through several variations of the bag-of-words model, which discards all syntax and encodes only the frequencies of the tokens in a document. We began by creating binary term frequencies with CountVectorizer. We then preprocessed text by filtering stop words and stemming tokens, and we replaced the term counts in our feature vectors with TF-IDF weights, which penalize common words and normalize for documents of different lengths. Next, we created feature vectors for images. We began with an optical character recognition problem in which we represented images of hand-written digits with flattened matrices of pixel intensities; because this approach is computationally costly, we improved our representations by extracting only the images' most interesting points as SURF descriptors. The short sketches below recap these steps.
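As a quick recap, here is a minimal sketch of one-hot encoding a categorical feature with DictVectorizer; the toy city dictionaries are illustrative assumptions rather than the chapter's exact data:

```python
from sklearn.feature_extraction import DictVectorizer

# Each instance is a dict mapping feature names to values; string
# values are one-hot encoded into one binary column per category.
onehot_encoder = DictVectorizer()
X = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
print(onehot_encoder.fit_transform(X).toarray())
```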
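Next, a minimal sketch of the bag-of-words model with CountVectorizer, assuming a two-document toy corpus; setting binary=True produces binary term frequencies, while the default counts token occurrences:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]
# binary=True records only the presence or absence of each token.
vectorizer = CountVectorizer(binary=True)
print(vectorizer.fit_transform(corpus).toarray())
print(vectorizer.vocabulary_)
```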
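Similarly, a sketch of TF-IDF weighting with stop-word filtering, assuming TfidfVectorizer and another toy corpus; stemming would require an external stemmer such as NLTK's and is omitted here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
# stop_words='english' filters common English words; the resulting
# TF-IDF vectors are L2-normalized, adjusting for document length.
vectorizer = TfidfVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).toarray())
```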
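Finally for images, a sketch of representing hand-written digits as flattened matrices of pixel intensities, using scikit-learn's bundled digits dataset as an assumed stand-in for the chapter's OCR example:

```python
from sklearn import datasets

digits = datasets.load_digits()
# Each 8x8 grid of pixel intensities is flattened into a
# 64-dimensional feature vector; larger images make this costly,
# which motivates interest-point descriptors such as SURF.
print(digits.images[0].shape)           # (8, 8)
print(digits.images[0].reshape(-1).shape)  # (64,)
```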

Finally, we standardized our data to ensure that our estimators can learn from all of the explanatory variables and converge as quickly as possible, as the sketch below shows. We will use these feature extraction techniques in the subsequent chapters' examples; in the next chapter, we will combine the bag-of-words representation with a generalization of multiple linear regression to classify documents.
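A minimal sketch of that standardization step, assuming scikit-learn's preprocessing.scale and a small toy matrix:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([
    [0., 0., 5., 13., 9., 1.],
    [0., 0., 13., 15., 10., 15.],
    [0., 3., 15., 2., 0., 11.]
])
# Scale each feature (column) to zero mean and unit variance.
print(preprocessing.scale(X))
```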
