Chapter 1. Premodel Workflow

This chapter will cover the following topics:

Getting sample data from external sources
Creating sample data for toy analysis
Confirming the characteristics of created data
Scaling data to the standard normal
Creating binary features through thresholding
Working with categorical variables
Binarizing label features
Imputing missing values through various strategies
Using Pipelines for multiple preprocessing steps
Reducing dimensionality with PCA
Using factor analytics for decomposition
Kernel PCA for nonlinear dimensionality reduction
Using truncated SVD to reduce dimensionality
Decomposition to classify with DictionaryLearning
Putting it all together with Pipelines
Using Gaussian processes for regression
Defining the Gaussian process object directly
Using stochastic gradient descent for regression

Introduction

This chapter discusses acquiring data, preparing data, and premodel dimensionality reduction. These are not the glamorous parts of machine learning (ML), but they often determine whether a model works at all.

The chapter has three main parts. First, we'll create fake data. This might seem trivial, but creating fake data and fitting models to it is an important step in model testing. It is most useful when we implement an algorithm from scratch, but I'll cover it here for completeness, and if you don't have data of your own, you can simply create some. Second, we'll look at data transformations as a preprocessing step, including data imputation and categorical variable encoding. Third, we'll look at situations where we have a large number of features relative to the number of observations.
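To illustrate why fitting a model to fake data is a useful test, here is a minimal sketch, not taken from the book. It generates regression data whose true coefficients are known and checks that a linear model recovers them. The helper name `fit_on_fake_data` and the parameter values are assumptions chosen for illustration; `make_regression` and `LinearRegression` are standard scikit-learn calls.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression


def fit_on_fake_data(n_samples=200, n_features=5, noise=1.0, seed=0):
    """Generate regression data with known coefficients, then refit them.

    Hypothetical helper for illustration; the defaults are arbitrary.
    """
    # coef=True also returns the coefficients used to generate y,
    # so we know the "right answer" before fitting anything.
    X, y, true_coef = make_regression(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_features,
        noise=noise,
        coef=True,
        random_state=seed,
    )
    model = LinearRegression().fit(X, y)
    return true_coef, model.coef_


true_coef, fitted_coef = fit_on_fake_data()
print(np.round(true_coef, 2))    # coefficients used to build the data
print(np.round(fitted_coef, 2))  # coefficients the model recovered
```

If the two printed arrays are close, the fitting code behaves as expected on data with a known answer. A large gap would point to a bug in the model or the data handling rather than to a quirk of real-world data.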

This chapter, especially the first half, sets the stage for the later chapters. scikit-learn needs data to work with, so the first two sections discuss acquiring data; the rest of the first half discusses preparing that data for use.

Tip

This book is written using scikit-learn 0.15, NumPy 1.9, and pandas 0.13. Other packages are used as well, so refer to the installation instructions included in this book.
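To compare your environment against the versions named in the tip, a small sketch such as the following may help. It uses only the Python standard library (`importlib.metadata`, available from Python 3.8). The helper name `installed_versions` is an assumption for illustration, and the distribution names are the usual PyPI names for these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("scikit-learn", "numpy", "pandas")):
    """Return a dict mapping each package name to its installed version.

    Hypothetical helper; a package that is not installed maps to None.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(name, ver if ver is not None else "not installed")
```

Newer releases than those listed in the tip may rename or remove some functions, so a mismatch here is worth noting before working through the recipes.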
