Dataset collection and exploration

In this section, we will describe the very famous MNIST dataset. This dataset will be used throughout this chapter. The MNIST database of handwritten digits (downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. Consequently, this is a very good example dataset for those who are trying to learn techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bi-level) images from NIST were size-normalized to fit in a 20 x 20 pixel box while preserving their aspect ratio.

The MNIST database was constructed from NIST's special database 3 and special database 1, which contain binary images of handwritten digits. A sample of the dataset is given in the following:

Figure 20: A snap of the MNIST dataset

You can see that there are 780 features altogether. Consequently, sometimes, many machine learning algorithms will fail due to the high-dimensional nature of your dataset. Therefore, to address this issue, in the next section, we will show you how to reduce the dimensions without sacrificing the qualities machine learning tasks, such as classification. However, before diving into the problem, let's get some background knowledge on regression analysis first.

Table of Contents for Dataset collection and exploration

Create new playlist

Sign In

Sign Up

Table of Contents for
Dataset collection and exploration