Chapter 15
IN THIS CHAPTER
Considering the use of standard datasets
Accessing a standard dataset
Performing dataset tasks
The reason to have computers in the first place is to manage data. You can easily lose sight of the overriding goal of computers when faced with all the applications that don’t seem to manage anything. However, even these applications manage data. For example, a graphics application, even if it simply displays pictures from last year’s camping trip, is still managing data. When looking at a Facebook page, you see data in myriad forms transferred over an Internet connection. In fact, it would be hard to find a consumer application that doesn’t manage data, and impossible to find a business application that doesn’t manage data in some way. Consequently, data is king on the computer.
Because the sorts of management an application performs differ by the purpose of the application, the number of commonly available standard datasets is quite large. Consequently, finding the right dataset for your needs can be time consuming. Along with defining the need for standardized datasets, this chapter also looks at methods that you can use to locate the right standard dataset for your application.
After you have a dataset loaded, you need to perform various tasks with it. An application can perform a simple analysis, display data content, or perform Create, Read, Update, and Delete (CRUD) tasks as described in the “Considering CRUD” section of Chapter 13. The point is that functional applications, like any other applications, require access to data, and a standardized data source gives you a consistent basis for finding better ways of accomplishing such tasks.
A standard dataset is one that provides a specific number of records using a specific format. It normally appears in the public domain and is used by professionals around the world for various sorts of tests. Professionals categorize these datasets in various ways.
Depending on where you search, you can find all sorts of other information, such as who donated the data and when. In some cases, old data may not reflect current social trends, making any testing you perform suspect. Some languages actually build the datasets into their downloadable source so that you don’t even have to do anything more than load them.
Of course, knowing what a standard dataset is and why you would use it are two different questions. Many developers want to test using their own custom data, which is prudent, but using a standard dataset does provide specific benefits.
Locating the right dataset for testing purposes is essential. Fortunately, you don’t have to look very hard because some online sites provide you with everything needed to make a good decision. The following sections offer insights into locating the right dataset for your needs.
Datasets appear in a number of places online, and you can use many of them for general needs. An example of these sorts of datasets appears on the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets.html, shown in Figure 15-1. As the table shows, the site categorizes the individual datasets so that you can find the dataset you need. More important, the table helps you understand the kinds of tasks that people normally employ the dataset to perform.
If you want to know more about a particular dataset, you click its link and go to a page like the one shown in Figure 15-2. You can determine whether a dataset will help you test certain application features, such as searching for and repairing missing values. The Number of Web Hits field tells you how popular the dataset is, which can affect your ability to find others who have used the dataset for testing purposes. All this information is helpful in ensuring that you get the right dataset for a particular need; the goals include error detection, performance testing, and comparison with other applications of the same type.
Depending on your programming language, you likely need to use a library to work with datasets in any meaningful way. One such library for Python is Scikit-learn (http://scikit-learn.org/stable/). This is one of the more popular libraries because it contains such an extensive set of features and also provides the means for loading both internal and external datasets, as described at http://scikit-learn.org/stable/datasets/index.html. You can obtain various kinds of datasets using Scikit-learn as follows:

- svmlight/libsvm formats: Relies on the svmlight (http://svmlight.joachims.org/) and libsvm (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) implementations, which include datasets that enable you to perform sparse dataset tasks.
- pandas.io: Provides access to common data formats that include CSV, Excel, JSON, and SQL.
- scipy.io: Obtains information from binary formats popular with the scientific community, including .mat and .arff files.
- numpy/routines.io: Loads columnar data into NumPy (http://www.numpy.org/) arrays.
- skimage.io: Loads images and videos into NumPy arrays.
- scipy.io.wavfile.read: Reads .wav file data into NumPy arrays.

The fact that Python provides access to such a large variety of datasets might make you think that a common mechanism exists for loading them. Actually, you need a variety of techniques to load even common datasets. As the datasets become more esoteric, you need additional libraries and other techniques to get the job done. The following sections don’t give you an exhaustive view of dataset loading in Python, but you do get a good overview of the process for commonly used datasets so that you can use these datasets within the functional programming environment. (See the “Finding Haskell support” sidebar in this chapter for reasons that Haskell isn’t included in the sections that follow.)
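As a quick sketch of the pandas.io route, the following code reads CSV-formatted data into a DataFrame. The sample uses an in-memory string so that it runs on its own; in practice, you would pass a filename (such as a hypothetical "mydata.csv") to read_csv() instead, and the column names here are made up for illustration.

```python
import io
import pandas as pd

# A small in-memory CSV sample stands in for a file on disk.
csv_data = io.StringIO("sepal_length,sepal_width\n5.1,3.5\n4.9,3.0\n")

# read_csv() parses the header row into column names and the
# remaining rows into data, just as it would for a real file.
df = pd.read_csv(csv_data)

print(df.shape)          # (2, 2) -- two rows, two columns
print(list(df.columns))  # ['sepal_length', 'sepal_width']
```

The same DataFrame-based workflow applies to the other pandas.io formats, such as read_excel(), read_json(), and read_sql().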
As previously mentioned, a toy dataset is one that contains a small amount of common data that you can use to test basic assumptions, functions, algorithms, and simple code. The toy datasets reside directly in Scikit-learn, so you don’t have to do anything special except call a function to use them. The following list provides a quick overview of the function used to import each of the toy datasets into your Python code:
- load_boston(): Regression analysis with the Boston house-prices dataset
- load_iris(): Classification with the iris dataset
- load_diabetes(): Regression with the diabetes dataset
- load_digits([n_class]): Classification with the digits dataset
- load_linnerud(): Multivariate regression using the linnerud dataset (health data described at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/descr/linnerud.rst)
- load_wine(): Classification with the wine dataset
- load_breast_cancer(): Classification with the Wisconsin breast cancer dataset

The technique for loading each of these datasets is the same across examples. The following example shows how to load the Boston house-prices dataset:
from sklearn.datasets import load_boston
Boston = load_boston()
print(Boston.data.shape)
To see how the code works, click Run Cell. The output from the print() call is (506, 13), indicating that the dataset contains 506 cases with 13 features each. You can see the output shown in Figure 15-3.
The purpose of each of the data generator functions is to create randomly generated datasets that have specific attributes. For example, you can control the number of data points using the n_samples argument and use the centers argument to control how many groups the function creates within the dataset. Each of the calls starts with the word make. The kind of data depends on the function; for example, make_blobs() creates Gaussian blobs for clustering (see http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html for details). The various functions reflect the kind of labeling provided: single label and multilabel. You can also choose bi-clustering, which allows clustering of both matrix rows and columns. Here's an example of creating custom data:
from sklearn.datasets import make_blobs
X, Y = make_blobs(n_samples=120, n_features=2, centers=4)
print(X.shape)
The output will tell you that you have indeed created an X object containing a dataset with 120 cases, each of which has two features. The Y object contains the cluster label for each case, which serves as a color value when plotting. Seeing the data plotted using the following code is more interesting:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(X[:, 0], X[:, 1], s=25, c=Y)
plt.show()
The %matplotlib magic function appears in Table 11-1. In this case, you tell Notebook to present the plot inline. The output is a scatter chart using the x-axis and y-axis values contained in X. The c=Y argument tells scatter() to create the chart using the color values found in Y. Figure 15-4 shows the output of this example. Notice that you can clearly see the four clusters based on their color (even though the colors don't appear in the book).
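The labeling options mentioned earlier work in much the same way. As a sketch (the parameter values here are arbitrary choices, not requirements), make_multilabel_classification() generates cases that can belong to more than one class at a time:

```python
from sklearn.datasets import make_multilabel_classification

# Generate 100 cases with 20 features each. Each case can carry up
# to 5 class labels at once; Y holds one indicator column per class.
X, Y = make_multilabel_classification(n_samples=100, n_features=20,
                                      n_classes=5, random_state=0)

print(X.shape)  # (100, 20)
print(Y.shape)  # (100, 5)
```

A 1 in a given column of Y means that the corresponding case carries that label, so a single row can contain several 1s.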
At some point, you need larger datasets of common data to use for testing. The toy datasets that worked fine when you were testing your functions may not do the job any longer. Python provides access to larger datasets that help you perform more complex testing but won’t require you to rely on network sources. These datasets will still load on your system so that you’re not waiting on network latency during testing. Consequently, they’re between the toy datasets and a real-world dataset in size. More important, because they rely on actual (standardized) data, they reflect real-world complexity. The following list tells you about the common datasets:
- fetch_olivetti_faces(): Olivetti faces dataset from AT&T containing ten images each of 40 different test subjects; each grayscale image is 64 x 64 pixels in size
- fetch_20newsgroups(subset='train'): Data from 18,000 newsgroup posts based on 20 topics, with the dataset split into two subgroups: one for training and one for testing
- fetch_mldata('MNIST original', data_home=custom_data_home): Dataset containing machine learning data in the form of 70,000, 28-x-28-pixel handwritten digits from 0 through 9
- fetch_lfw_people(min_faces_per_person=70, resize=0.4): Labeled Faces in the Wild dataset described at http://vis-www.cs.umass.edu/lfw/, which contains pictures of famous people in JPEG format
- sklearn.datasets.fetch_covtype(): U.S. forestry dataset containing the predominant tree type in each of the patches of forest in the dataset
- sklearn.datasets.fetch_rcv1(): Reuters Corpus Volume I (RCV1), a dataset containing 800,000 manually categorized stories from Reuters, Ltd.

Notice that each of these functions begins with the word fetch. Some of these datasets require a long time to load. For example, the Labeled Faces in the Wild (LFW) dataset is 200MB in size, which means that you wait several minutes just to load it. However, at 200MB, the dataset also begins (in small measure) to reflect the size of real-world datasets. The following code shows how to fetch the Olivetti faces dataset:
from sklearn.datasets import fetch_olivetti_faces
data = fetch_olivetti_faces()
print(data.images.shape)
When you run this code, you see that the shape is 400 images, each of which is 64 x 64 pixels. The resulting data object contains a number of properties, including images. To access a particular image, you use data.images[?], where ? is the number of the image you want to access in the range from 0 to 399. Here is an example of how you can display an individual image from the dataset.
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(data.images[1], cmap="gray")
plt.show()
You're unlikely to find a common dataset used with Python that doesn't provide relatively good documentation. You need to find the documentation online if you want the full story about how the dataset is put together, what purpose it serves, and who originated it, as well as any needed statistics. Fortunately, you can employ a few tricks to interact with a dataset without resorting to major online research. The following sections offer some tips for working with the dataset entries found in this chapter.
The previous sections of this chapter show how to load or fetch existing datasets from specific sources. These datasets generally have specific characteristics that you can discover online at places like http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html for the Boston house-prices dataset. However, you can also use the dir() function to learn about dataset content. When you use dir(Boston) with the previously created Boston house-prices dataset, you discover that it contains DESCR, data, feature_names, and target properties. Here is a short description of each property:

- DESCR: Text that describes the dataset content and some of the information you need to use it effectively
- data: The content of the dataset in the form of values used for analysis purposes
- feature_names: The names of the various attributes in the order in which they appear in data
- target: An array of values used with data to perform various kinds of analysis

The print(Boston.DESCR) function displays a wealth of information about the Boston house-prices dataset, including the names of attributes that you can use to interact with the data. Figure 15-6 shows the results of these queries.
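The same inspection technique works with any of the dataset loaders. Here is a sketch that uses the diabetes dataset (chosen simply because it loads with the identical pattern) to list the properties and peek at the description:

```python
from sklearn.datasets import load_diabetes

ds = load_diabetes()

# dir() reveals the dataset properties, such as DESCR, data,
# feature_names, and target, alongside any Python internals.
print([name for name in dir(ds) if not name.startswith('_')])

# DESCR holds the human-readable documentation; printing a slice
# keeps the output short.
print(ds.DESCR[:200])
```

Once you know the property names, you can access each one directly, such as ds.feature_names for the attribute names.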
The common datasets are in a form that allows various types of analysis, as shown by the examples provided on the sites that describe them. However, you might not want to work with the dataset in that manner; instead, you may want something that looks a bit more like a database table. Fortunately, you can use the pandas (https://pandas.pydata.org/) library to perform the conversion in a manner that makes using the datasets in other ways easy. Using the Boston house-prices dataset as an example, the following code performs the required conversion:
import pandas as pd
BostonTable = pd.DataFrame(Boston.data,
columns=Boston.feature_names)
If you want to include the target values with the DataFrame, you must also execute BostonTable['target'] = Boston.target. However, this chapter doesn't use the target data.
If you were to do a dir() command against a DataFrame, you would find that it provides you with an overwhelming number of functions to try. The documentation at https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.html supplies a good overview of what's possible (which includes all the usual database-specific tasks specified by CRUD). The following example code shows how to perform a query against a pandas DataFrame. In this case, the code selects only those housing areas where the crime rate is below 0.02 per capita.
CRIMTable = BostonTable.query('CRIM < 0.02')
print(CRIMTable.count()['CRIM'])
The output shows that only 17 records match the criteria. The count() function enables the application to count the records in the resulting CRIMTable. The index, ['CRIM'], selects just one of the available attributes (because every column reports the same count).
You can display all these records with all of the attributes, but you may want to see only the number of rooms and the average house age for the affected areas. The following code shows how to display just the attributes you actually need:
print(CRIMTable[['RM', 'AGE']])
Figure 15-7 shows the output from this code. As you can see, the houses vary between 5 and nearly 8 rooms in size. The age varies from almost 14 years to a little over 65 years.
You might find it a bit hard to work with the unsorted data in Figure 15-7. Fortunately, you do have access to the full range of common database features. If you want to sort the values by number of rooms, you use:
print(CRIMTable[['RM', 'AGE']].sort_values('RM'))
As an alternative, you can always choose to sort by average home age:
print(CRIMTable[['RM', 'AGE']].sort_values('AGE'))
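You can also reverse the sort direction by passing ascending=False to sort_values(). The following self-contained sketch uses a few made-up room and age values (standing in for the CRIMTable columns) so that it runs without the Boston data:

```python
import pandas as pd

# Hypothetical values standing in for the RM and AGE columns.
table = pd.DataFrame({'RM': [5.9, 7.2, 6.4],
                      'AGE': [65.2, 13.9, 45.8]})

# ascending=False lists the largest houses first.
print(table[['RM', 'AGE']].sort_values('RM', ascending=False))
```

The same argument works with the CRIMTable examples above, so sort_values('RM', ascending=False) shows the roomiest areas at the top of the listing.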