Dimensionality reduction with principal component analysis

Dimensionality reduction is the process of reducing the number of dimensions, or features, in a dataset. Much real-world data has a very high number of features; datasets with thousands of features are not uncommon. Dimensionality reduction lets us drill down to the features that matter.

Dimensionality reduction serves several purposes, such as:

  • Data compression
  • Visualization

Reducing the number of dimensions shrinks the data's disk and memory footprint and helps algorithms run faster. It also collapses highly correlated dimensions into one.

Humans can only visualize up to three dimensions, but data often has many more. Visualization can help uncover hidden patterns in the data, and dimensionality reduction makes it possible by compacting multiple features into a few.

The most popular algorithm for dimensionality reduction is principal component analysis (PCA).
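As a minimal sketch of PCA in practice, here is what reducing a dataset with scikit-learn might look like (the toy dataset, feature count, and choice of two components are illustrative assumptions, not from the text):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy dataset: 100 samples with 5 features, where the last
# feature is highly correlated with the first (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Reduce the data from 5 features to 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured per component
```

Because the fifth feature is nearly a copy of the first, most of the variance is captured by the first two components, which is exactly the situation where dimensionality reduction pays off.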

Let's look at the following dataset:

Let's say the goal is to reduce this two-dimensional data to one dimension. The way to do this is to find a line onto which we can project the data. Let's find a line that is good for projecting this data onto:

This is the line with the shortest projection distance from the data points. Let's examine this further by dropping the shortest (perpendicular) line segments from each data point to this projection line:

Another way to look at this: find the line to project the data onto such that the sum of the squared distances of the data points from the line is minimized. These distances (the gray line segments) are called projection errors.
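The projection idea above can be sketched directly with NumPy: the direction that minimizes the sum of squared projection errors is the first principal component, obtainable from the SVD of the centered data. The dataset below is a made-up example for illustration:

```python
import numpy as np

# Toy 2-D dataset lying roughly along a line (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=50)])

# Center the data, then take the first right singular vector:
# the direction that minimizes the sum of squared projection
# errors (equivalently, maximizes the projected variance).
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
direction = Vt[0]                 # unit vector along the best-fit line

# Project each point onto the line; the leftover perpendicular
# distances are the projection errors (the gray segments).
scores = X_centered @ direction   # the data reduced to one dimension
projections = np.outer(scores, direction)
errors = np.linalg.norm(X_centered - projections, axis=1)

print(scores.shape)    # (50,) -- one value per original 2-D point
print(errors.mean())   # small, since the data is nearly linear
```

Each original two-dimensional point is now summarized by a single number (its position along the line), and the mean projection error measures how much information the reduction discards.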
