Figure 1.1  Rules for the contact lens data.  13
Figure 1.2  Decision tree for the contact lens data.  14
Figure 1.3  Decision trees for the labor negotiations data.  18
Figure 1.4  Life cycle of a data mining project.  29
Figure 2.1  A family tree and two ways of expressing the sister-of relation.  48
Figure 2.2  ARFF file for the weather data.  58
Figure 2.3  Multi-instance ARFF file for the weather data.  60
Figure 3.1  A linear regression function for the CPU performance data.  69
Figure 3.2  A linear decision boundary separating Iris setosas from Iris versicolors.  70
Figure 3.3  Constructing a decision tree interactively: (A) creating a rectangular test involving petallength and petalwidth; (B) the resulting (unfinished) decision tree.  73
Figure 3.4  Models for the CPU performance data: (A) linear regression; (B) regression tree; (C) model tree.  74
Figure 3.5  Decision tree for a simple disjunction.  76
Figure 3.6  The exclusive-or problem.  77
Figure 3.7  Decision tree with a replicated subtree.  77
Figure 3.8  Rules for the iris data.  81
Figure 3.9  The shapes problem.  82
Figure 3.10  Different ways of partitioning the instance space.  86
Figure 3.11  Different ways of representing clusters.  88
Figure 4.1  Pseudocode for 1R.  93
Figure 4.2  Tree stumps for the weather data.  106
Figure 4.3  Expanded tree stumps for the weather data.  108
Figure 4.4  Decision tree for the weather data.  109
Figure 4.5  Tree stump for the ID code attribute.  111
Figure 4.6  Covering algorithm: (A) covering the instances; (B) decision tree for the same problem.  113
Figure 4.7  The instance space during operation of a covering algorithm.  115
Figure 4.8  Pseudocode for a basic rule learner.  118
Figure 4.9  (A) Finding all item sets with sufficient coverage; (B) finding all sufficiently accurate association rules for a k-item set.  127
Figure 4.10  Logistic regression: (A) the logit transform; (B) example logistic regression function.  130
Figure 4.11  The perceptron: (A) learning rule; (B) representation as a neural network.  132
Figure 4.12  The Winnow algorithm: (A) unbalanced version; (B) balanced version.  134
Figure 4.13  A kD-tree for four training instances: (A) the tree; (B) instances and splits.  137
Figure 4.14  Using a kD-tree to find the nearest neighbor of the star.  137
Figure 4.15  Ball tree for 16 training instances: (A) instances and balls; (B) the tree.  139
Figure 4.16  Ruling out an entire ball (gray) based on a target point (star) and its current nearest neighbor.  140
Figure 4.17  Iterative distance-based clustering.  143
Figure 4.18  A ball tree: (A) two cluster centers and their dividing line; (B) corresponding tree.  145
Figure 4.19  Hierarchical clustering displays.  149
Figure 4.20  Clustering the weather data.  151
Figure 4.21  Hierarchical clusterings of the iris data.  153
Figure 5.1  A hypothetical lift chart.  185
Figure 5.2  Analyzing the expected benefit of a mailing campaign when the cost of mailing is (A) $0.50 and (B) $0.80.  187
Figure 5.3  A sample ROC curve.  188
Figure 5.4  ROC curves for two learning schemes.  189
Figure 5.5  Effect of varying the probability threshold: (A) error curve; (B) cost curve.  193
Figure 6.1  Example of subtree raising, where node C is “raised” to subsume node B.  214
Figure 6.2  Pruning the labor negotiations decision tree.  216
Figure 6.3  Algorithm for forming rules by incremental reduced-error pruning.  226
Figure 6.4  RIPPER: (A) algorithm for rule learning; (B) meaning of symbols.  228
Figure 6.5  Algorithm for expanding examples into a partial tree.  229
Figure 6.6  Example of building a partial tree.  230
Figure 6.7  Rules with exceptions for the iris data.  232
Figure 6.8  Extended prefix trees for the weather data: (A) the full data; (B) the data conditional on temperature=mild; (C) the data conditional on humidity=normal.  238
Figure 7.1  A boundary between two rectangular classes.  249
Figure 7.2  A maximum margin hyperplane.  253
Figure 7.3  Support vector regression: (A) ε=1; (B) ε=2; (C) ε=0.5.  257
Figure 7.4  Example data sets and corresponding perceptrons.  262
Figure 7.5  Step vs. sigmoid: (A) step function; (B) sigmoid function.  264
Figure 7.6  Gradient descent using the error function w² + 1.  265
Figure 7.7  Multilayer perceptron with a hidden layer (omitting bias inputs).  267
Figure 7.8  Hinge, squared, and 0–1 loss functions.  271
Figure 7.9  Pseudocode for model tree induction.  278
Figure 7.10  Model tree for a data set with nominal attributes.  279
Figure 8.1  Attribute space for the weather dataset.  292
Figure 8.2  Discretizing the temperature attribute using the entropy method.  299
Figure 8.3  The result of discretizing the temperature attribute.  299
Figure 8.4  Class distribution for a two-class, two-attribute problem.  302
Figure 8.5  Principal component transform of a dataset: (A) variance of each component; (B) variance plot.  306
Figure 8.6  Comparing principal component analysis and Fisher’s linear discriminant analysis.  312
Figure 8.7  Number of international phone calls from Belgium, 1950–1973.  318
Figure 8.8  Overoptimistic probability estimation for a two-class problem.  329
Figure 9.1  A simple Bayesian network for the weather data.  341
Figure 9.2  Another Bayesian network for the weather data.  342
Figure 9.3  The Markov blanket for variable x₆ in a 10-variable Bayesian network.  348
Figure 9.4  The weather data: (A) reduced version; (B) corresponding AD tree.  350
Figure 9.5  A two-class mixture model.  354
Figure 9.6  DensiTree showing possible hierarchical clusterings of a given data set.  360
Figure 9.7  Probability contours for three types of model, all based on Gaussians.  362
Figure 9.8  (A) Bayesian network for a mixture model; (B) multiple copies of the Bayesian network, one for each observation; (C) plate notation version of (B).  371
Figure 9.9  (A) Bayesian network for probabilistic PCA; (B) equal-probability contour for a Gaussian distribution along with its covariance matrix’s principal eigenvector.  372
Figure 9.10  The singular value decomposition of a t by d matrix.  377
Figure 9.11  Graphical models for (A) pLSA, (B) LDAb, and (C) smoothed LDAb.  379
Figure 9.12  (A) Bayesian network and (B) corresponding factor graph.  382
Figure 9.13  The Markov blanket for variable x₆ in a 10-variable factor graph.  383
Figure 9.14  (A) and (B) Bayesian network and corresponding factor graph; (C) and (D) Naïve Bayes model and corresponding factor graph.  384
Figure 9.15  (A) Bayesian network representing the joint distribution of y and its parents; (B) factor graph for a logistic regression for the conditional distribution of y given its parents.  384
Figure 9.16  (A) Undirected graph representing a Markov random field structure; (B) corresponding factor graph.  385
Figure 9.17  Message sequence in an example factor graph.  389
Figure 9.18  (A) and (B) First- and second-order Markov models for a sequence of variables; (C) hidden Markov model; (D) Markov random field.  404
Figure 9.19  Mining emails for meeting details.  406
Figure 9.20  (A) Dynamic Bayesian network representation of a hidden Markov model; (B) similarly structured Markov random field; (C) factor graph for (A); and (D) factor graph for a linear chain conditional random field.  407
Figure 10.1  A feedforward neural network.  424
Figure 10.2  Computation graph showing forward propagation in a deep network.  426
Figure 10.3  Backpropagation in a deep network (the forward computation is shown with gray arrows).  429
Figure 10.4  Parameter updates that follow the forward and backward propagation steps (shown with gray arrows).  430
Figure 10.5  Typical learning curves for the training and validation sets.  431
Figure 10.6  Pseudocode for mini-batch based stochastic gradient descent.  435
Figure 10.7  Typical convolutional neural network architecture.  439
Figure 10.8  Original image; filtered with the two Sobel operators; magnitude of the result.  441
Figure 10.9  Examples of what random neurons detect in different layers of a convolutional neural network using the visualization approach of Zeiler and Fergus (2013). Underlying imagery kindly provided by Matthew Zeiler.  442
Figure 10.10  Example of the convolution, pooling, and decimation operations used in convolutional neural networks.  443
Figure 10.11  A simple autoencoder.  445
Figure 10.12  A deep autoencoder with multiple layers of transformation.  447
Figure 10.13  Low-dimensional principal component space (left) compared with one learned by a deep autoencoder (right).  447
Figure 10.14  Boltzmann machines: (A) fully connected; (B) restricted; (C) more general form of (B).  450
Figure 10.15  (A) Deep Boltzmann machine and (B) deep belief network.  453
Figure 10.16  (A) Feedforward network transformed into a recurrent network; (B) hidden Markov model; and (C) recurrent network obtained by unwrapping (A).  456
Figure 10.17  Structure of a “long short-term memory” unit.  459
Figure 10.18  Recurrent neural networks: (A) bidirectional; (B) encoder-decoder.  460
Figure 10.19  A deep encoder-decoder recurrent network.  460
Figure 12.1  Algorithm for bagging.  483
Figure 12.2  Algorithm for boosting.  488
Figure 12.3  Algorithm for additive logistic regression.  493
Figure 12.4  Simple option tree for the weather data.  494
Figure 12.5  Alternating decision tree for the weather data.  495
Figure 13.1  A tangled web.  521