This section is included to assist students in performing the activities in the book. It provides the detailed steps that students should follow to achieve the objectives of each activity.
In the code, a backslash (\) indicates a line break where the code does not fit on a single line. A backslash at the end of a line escapes the newline character, which means that the content on the line following the backslash should be read as if it began where the backslash character stands.
This section explores the combinatorial explosion that arises when two players play randomly. We will use a program, building on the previous results, that generates all possible sequences of moves between a computer player and a human player, and determine the number of different wins, losses, and draws in terms of action sequences. Assume that the human player may make any possible move. Since the computer player is also playing randomly, we will examine the wins, losses, and draws between two randomly playing players:
def all_moves_from_board(board, sign):
    move_list = []
    for i, v in enumerate(board):
        if v == EMPTY_SIGN:
            move_list.append(board[:i] + sign + board[i+1:])
    return move_list
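The call below uses all_moves_from_board_list, whose definition does not appear in this section. Based on how it is called here and inside count_possibilities, a minimal sketch of it could look as follows (this reconstruction is an assumption, not the verbatim code from the book):

# Assumed helper, reconstructed from its usage: applies
# all_moves_from_board to every board in a list of boards
def all_moves_from_board_list(board_list, sign):
    move_list = []
    for board in board_list:
        move_list.extend(all_moves_from_board(board, sign))
    return move_list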
all_moves_from_board_list([EMPTY_SIGN * 9], AI_SIGN)
['X........',
'.X.......',
'..X......',
'...X.....',
'....X....',
'.....X...',
'......X..',
'.......X.',
'........X']
Applying the function again to each of these boards with the opponent's sign produces the boards after the second move:
['XO.......',
'X.O......',
'X..O.....',
'X...O....',
'X....O...',
'X.....O..',
'X......O.',
...
'......OX.',
'.......XO',
'O.......X',
'.O......X',
'..O.....X',
'...O....X',
'....O...X',
'.....O..X',
'......O.X',
'.......OX']
def filter_wins(move_list, ai_wins, opponent_wins):
    # Iterate over a copy, because we remove items from move_list
    for board in move_list[:]:
        won_by = game_won_by(board)
        if won_by == AI_SIGN:
            ai_wins.append(board)
            move_list.remove(board)
        elif won_by == OPPONENT_SIGN:
            opponent_wins.append(board)
            move_list.remove(board)
def count_possibilities():
    board = EMPTY_SIGN * 9
    move_list = [board]
    ai_wins = []
    opponent_wins = []
    for i in range(9):
        print('step ' + str(i) + '. Moves: ' + str(len(move_list)))
        sign = AI_SIGN if i % 2 == 0 else OPPONENT_SIGN
        move_list = all_moves_from_board_list(move_list, sign)
        filter_wins(move_list, ai_wins, opponent_wins)
    print('First player wins: ' + str(len(ai_wins)))
    print('Second player wins: ' + str(len(opponent_wins)))
    print('Draw', str(len(move_list)))
    print('Total', str(len(ai_wins) + len(opponent_wins) + len(move_list)))
count_possibilities()
step 0. Moves: 1
step 1. Moves: 9
step 2. Moves: 72
step 3. Moves: 504
step 4. Moves: 3024
step 5. Moves: 13680
step 6. Moves: 49402
step 7. Moves: 111109
step 8. Moves: 156775
First player wins: 106279
Second player wins: 68644
Draw 91150
Total 266073
As you can see, the tree of board states consists of 266,073 leaves. The count_possibilities function essentially implements a breadth-first search algorithm to traverse all the possible states of the game. Notice that we count some states multiple times, because placing an X in the top-right corner in step 1 and placing an X in the top-left corner in step 3 leads to the same possible states as starting with the top-left corner and then placing an X in the top-right corner. If we implemented detection of duplicate states, we would have to check fewer nodes. However, at this stage, due to the limited depth of the game, we omit this step.
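As a sketch of what such duplicate detection could look like: since the boards are plain strings, collapsing the move list into a set after each step suffices. Note that this is an optional optimization that is not part of this solution, and that it would change the counts from action sequences to distinct board states:

# Optional sketch: collapse duplicate board states after each step
def deduplicate(move_list):
    return list(set(move_list))

# inside count_possibilities, after generating the moves:
# move_list = deduplicate(all_moves_from_board_list(move_list, sign))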
Follow these steps to complete the activity:
def player_can_win(board, sign):
    next_moves = all_moves_from_board(board, sign)
    for next_move in next_moves:
        if game_won_by(next_move) == sign:
            return True
    return False
from random import choice

def ai_move(board):
    new_boards = all_moves_from_board(board, AI_SIGN)
    for new_board in new_boards:
        if game_won_by(new_board) == AI_SIGN:
            return new_board
    safe_moves = []
    for new_board in new_boards:
        if not player_can_win(new_board, OPPONENT_SIGN):
            safe_moves.append(new_board)
    return choice(safe_moves) if len(safe_moves) > 0 else new_boards[0]
def all_moves_from_board(board, sign):
    move_list = []
    for i, v in enumerate(board):
        if v == EMPTY_SIGN:
            new_board = board[:i] + sign + board[i+1:]
            move_list.append(new_board)
            if game_won_by(new_board) == AI_SIGN:
                return [new_board]
    if sign == AI_SIGN:
        safe_moves = []
        for move in move_list:
            if not player_can_win(move, OPPONENT_SIGN):
                safe_moves.append(move)
        return safe_moves if len(safe_moves) > 0 else move_list[0:1]
    else:
        return move_list
count_possibilities()
step 0. Moves: 1
step 1. Moves: 9
step 2. Moves: 72
step 3. Moves: 504
step 4. Moves: 3024
step 5. Moves: 5197
step 6. Moves: 18606
step 7. Moves: 19592
step 8. Moves: 30936
First player wins: 20843
Second player wins: 962
Draw 20243
Total 42048
We are doing better than before. We have not only eliminated almost two-thirds of the possible games again, but most of the time, the AI player either wins or settles for a draw. Despite our efforts to make the AI better, it can still lose in 962 ways. We will eliminate all of these losses in the next activity.
Follow these steps to complete the activity:
def all_moves_from_board(board, sign):
    if sign == AI_SIGN:
        empty_field_count = board.count(EMPTY_SIGN)
        if empty_field_count == 9:
            return [sign + EMPTY_SIGN * 8]
        elif empty_field_count == 7:
            return [
                board[:8] + sign if board[8] == EMPTY_SIGN else
                board[:4] + sign + board[5:]
            ]
    move_list = []
    for i, v in enumerate(board):
        if v == EMPTY_SIGN:
            new_board = board[:i] + sign + board[i+1:]
            move_list.append(new_board)
            if game_won_by(new_board) == AI_SIGN:
                return [new_board]
    if sign == AI_SIGN:
        safe_moves = []
        for move in move_list:
            if not player_can_win(move, OPPONENT_SIGN):
                safe_moves.append(move)
        return safe_moves if len(safe_moves) > 0 else move_list[0:1]
    else:
        return move_list
count_possibilities()
step 0. Moves: 1
step 1. Moves: 1
step 2. Moves: 8
step 3. Moves: 8
step 4. Moves: 48
step 5. Moves: 38
step 6. Moves: 108
step 7. Moves: 76
step 8. Moves: 90
First player wins: 128
Second player wins: 0
Draw 60
Total 188
This section will practice using the EasyAI library and developing a heuristic. We will be using the Connect Four game. The game board is seven cells wide and six cells high. When you make a move, you can only select the column into which you drop your token; gravity then pulls the token down to the lowest empty cell in that column. Your objective is to connect four of your own tokens horizontally, vertically, or diagonally before your opponent does, or before you run out of empty spaces. The rules of the game can be found at https://en.wikipedia.org/wiki/Connect_Four.
from easyAI import TwoPlayersGame
from easyAI.Player import Human_Player

class ConnectFour(TwoPlayersGame):
    def __init__(self, players):
        self.players = players

    def possible_moves(self):
        return []

    def make_move(self, move):
        return

    def unmake_move(self, move):
        # optional method (speeds up the AI)
        return

    def lose(self):
        return False

    def is_over(self):
        return (self.possible_moves() == []) or self.lose()

    def show(self):
        print('board')

    def scoring(self):
        return -100 if self.lose() else 0

if __name__ == "__main__":
    from easyAI import AI_Player, Negamax
    ai_algo = Negamax(6)
The following methods need to be implemented: __init__, possible_moves, make_move, unmake_move (optional), lose, and show.
def __init__(self, players):
    self.players = players
    #  0  1  2  3  4  5  6
    #  7  8  9 10 11 12 13
    # ...
    # 35 36 37 38 39 40 41
    self.board = [0 for i in range(42)]
    self.nplayer = 1  # player 1 starts

    def generate_winning_tuples():
        tuples = []
        # horizontal
        tuples += [
            list(range(row*7+column, row*7+column+4, 1))
            for row in range(6)
            for column in range(4)
        ]
        # vertical
        tuples += [
            list(range(row*7+column, row*7+column+28, 7))
            for row in range(3)
            for column in range(7)
        ]
        # diagonal forward
        tuples += [
            list(range(row*7+column, row*7+column+32, 8))
            for row in range(3)
            for column in range(4)
        ]
        # diagonal backward
        tuples += [
            list(range(row*7+column, row*7+column+24, 6))
            for row in range(3)
            for column in range(3, 7, 1)
        ]
        return tuples

    self.tuples = generate_winning_tuples()
def possible_moves(self):
    return [column + 1
            for column in range(7)
            if any([
                self.board[column + row*7] == 0
                for row in range(6)
            ])]

def make_move(self, move):
    column = int(move) - 1
    for row in range(5, -1, -1):
        index = column + row*7
        if self.board[index] == 0:
            self.board[index] = self.nplayer
            return

def unmake_move(self, move):
    # optional method (speeds up the AI)
    column = int(move) - 1
    for row in range(6):
        index = column + row*7
        if self.board[index] != 0:
            self.board[index] = 0
            return

def lose(self):
    return any([all([(self.board[c] == self.nopponent)
                     for c in line])
                for line in self.tuples])

def is_over(self):
    return (self.possible_moves() == []) or self.lose()

def show(self):
    print('\n' + '\n'.join([
        ' '.join([['.', 'O', 'X'][self.board[7*row + column]]
                  for column in range(7)])
        for row in range(6)]))
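With the class complete, wiring the game together might look like the following sketch, assuming the imports and the Negamax depth from the skeleton above; easyAI's TwoPlayersGame provides the play method used here:

if __name__ == "__main__":
    from easyAI import AI_Player, Negamax
    ai_algo = Negamax(6)
    # A human player against the Negamax-driven AI player
    game = ConnectFour([Human_Player(), AI_Player(ai_algo)])
    game.play()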
Now that all functions are complete, you can try out the example. Feel free to play a round or two against the opponent. You can see that the opponent is not perfect, but it plays reasonably well. If you have a strong computer, you can increase the depth parameter of the Negamax algorithm. I encourage you to come up with a better heuristic.
You are working at the government office of Metropolis, trying to forecast the need for elementary school capacity. Your task is to produce predictions for 2025 and 2030 of the number of children starting elementary school. The past data is as follows:
Plot tendencies on a two-dimensional chart. Use linear regression.
Our features are the years ranging from 2001 to 2018. For simplicity, we can indicate 2001 as year 1, and 2018 as year 18.
import numpy as np

x = np.array(range(1, 19))
y = np.array([
147026,
144272,
140020,
143801,
146233,
144539,
141273,
135389,
142500,
139452,
139722,
135300,
137289,
136511,
132884,
125683,
127255,
124275
])
Use np.polyfit to determine the coefficients of the regression line.
[a, b] = np.polyfit(x, y, 1)
[-1142.0557275541753, 148817.5294117646]
Plot the results using matplotlib.pyplot to determine future tendencies.
import matplotlib.pyplot as plot
plot.scatter(x, y)
plot.plot([0, 30], [b, 30*a + b])
plot.show()
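To produce the requested forecasts, we can evaluate the regression line at the corresponding year indices; assuming the indexing above (2001 is year 1), 2025 is year 25 and 2030 is year 30. A minimal sketch:

# Sketch: evaluate the regression line at the requested years
print(25 * a + b)  # prediction for 2025
print(30 * a + b)  # prediction for 2030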
This section will discuss how to perform linear, polynomial, and support vector regression with scikit-learn, and how to decide which model best fits a given task. We will assume that you are a software engineer at a financial institution and that your employer wants to know whether linear regression or support vector regression is a better fit for predicting stock prices. You will load all available data of the S&P 500 from a data source, then build a regressor using linear regression, cubic polynomial regression, and support vector regression with a polynomial kernel of degree 3, and separate the training and test data. You will then plot the test labels and the prediction results, compare them with the y = x line, and finally compare how well the three models score.
Let's load the S&P 500 index data using Quandl, then prepare the data for prediction. You can read the process in the Predicting the Future section of the topic Linear Regression with Multiple Variables.
import quandl
import numpy as np
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from matplotlib import pyplot as plot
from sklearn import svm
data_frame = quandl.get("YALE/SPCOMP")
data_frame = data_frame[['Long Interest Rate', 'Real Price',
                         'Real Dividend', 'Cyclically Adjusted PE Ratio']]
data_frame.fillna(-100, inplace=True)
# We shift the price data to be predicted 20 years forward
data_frame['Real Price Label'] = data_frame['Real Price'].shift(-240)
# Then exclude the label column from the features
features = np.array(data_frame.drop('Real Price Label', 1))
# We scale before dropping the last 240 rows from the features
scaled_features = preprocessing.scale(features)
# Save the last 240 rows before dropping them
scaled_features_latest240 = scaled_features[-240:]
# Exclude the last 240 rows from the data used for model building
scaled_features = scaled_features[:-240]
# Now we can drop the last 240 rows from the data frame
data_frame.dropna(inplace=True)
# Then build the labels from the remaining data
label = np.array(data_frame['Real Price Label'])
# The rest of the model building stays
(features_train,
 features_test,
 label_train,
 label_test) = model_selection.train_test_split(
    scaled_features,
    label,
    test_size=0.1
)
Let's first use a polynomial of degree 1 for the evaluation of the model and for the prediction. We are still recreating the main example from the second topic.
model = linear_model.LinearRegression()
model.fit(features_train, label_train)
model.score(features_test, label_test)
0.8978136465083912
label_predicted = model.predict(features_test)
plot.plot(
    label_test, label_predicted, 'o',
    [0, 3000], [0, 3000]
)
The closer the dots are to the y = x line, the smaller the model's error.
It is now time to perform multiple linear regression with polynomial features of degree 3. The only change is in how the features are constructed:
poly_regressor = PolynomialFeatures(degree=3)
poly_scaled_features = poly_regressor.fit_transform(scaled_features)
(poly_features_train,
 poly_features_test,
 poly_label_train,
 poly_label_test) = model_selection.train_test_split(
    poly_scaled_features,
    label,
    test_size=0.1)
model = linear_model.LinearRegression()
model.fit(poly_features_train, poly_label_train)
print('Polynomial model score: ', model.score(
    poly_features_test, poly_label_test))
print(' ')
poly_label_predicted = model.predict(poly_features_test)
plot.plot(
    poly_label_test, poly_label_predicted, 'o',
    [0, 3000], [0, 3000]
)
The model performs surprisingly well on the test data. Therefore, we can already suspect that the polynomial model is overfitting the scenarios used in training and testing.
We will now perform a Support Vector regression with a polynomial kernel of degree 3.
model = svm.SVR(kernel='poly')
model.fit(features_train, label_train)
label_predicted = model.predict(features_test)
plot.plot(
    label_test, label_predicted, 'o',
    [0, 3000], [0, 3000]
)
model.score(features_test, label_test)
The output will be 0.06388628722032952.
This section will discuss how to prepare data for a classifier. As an example, we will use german.data from https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/ and prepare the data for training and testing a classifier. Make sure all of your labels are numeric and that the values are prepared for classification. Use 80% of the data points as training data.
CheckingAccountStatus DurationMonths CreditHistory CreditPurpose CreditAmount SavingsAccount EmploymentSince DisposableIncomePercent PersonalStatusSex OtherDebtors PresentResidenceMonths Property Age OtherInstallmentPlans Housing NumberOfExistingCreditsInBank Job LiabilityNumberOfPeople Phone ForeignWorker CreditScore
import pandas
data_frame = pandas.read_csv('german.data', sep=' ')
data_frame.replace('NA', -1000000, inplace=True)
labels = {
'CheckingAccountStatus': ['A11', 'A12', 'A13', 'A14'],
'CreditHistory': ['A30', 'A31', 'A32', 'A33', 'A34'],
'CreditPurpose': ['A40', 'A41', 'A42', 'A43', 'A44', 'A45', 'A46', 'A47', 'A48', 'A49', 'A410'],
'SavingsAccount': ['A61', 'A62', 'A63', 'A64', 'A65'],
'EmploymentSince': ['A71', 'A72', 'A73', 'A74', 'A75'],
'PersonalStatusSex': ['A91', 'A92', 'A93', 'A94', 'A95'],
'OtherDebtors': ['A101', 'A102', 'A103'],
'Property': ['A121', 'A122', 'A123', 'A124'],
'OtherInstallmentPlans': ['A141', 'A142', 'A143'],
'Housing': ['A151', 'A152', 'A153'],
'Job': ['A171', 'A172', 'A173', 'A174'],
'Phone': ['A191', 'A192'],
'ForeignWorker': ['A201', 'A202']
}
from sklearn import preprocessing

label_encoders = {}
data_frame_encoded = pandas.DataFrame()
for column in data_frame:
    if column in labels:
        label_encoders[column] = preprocessing.LabelEncoder()
        label_encoders[column].fit(labels[column])
        data_frame_encoded[column] = label_encoders[column].transform(data_frame[column])
    else:
        data_frame_encoded[column] = data_frame[column]
Let's verify that we did everything correctly:
data_frame_encoded.head()
CheckingAccountStatus DurationMonths CreditHistory CreditPurpose
0 0 6 4 4
1 1 48 2 4
2 3 12 4 7
3 0 42 2 3
4 0 24 3 0
CreditAmount SavingsAccount EmploymentSince DisposableIncomePercent
0 1169 4 4 4
1 5951 0 2 2
2 2096 0 3 2
3 7882 0 3 2
4 4870 0 2 3
PersonalStatusSex OtherDebtors ... Property Age
0 2 0 ... 0 67
1 1 0 ... 0 22
2 2 0 ... 0 49
3 2 2 ... 1 45
4 2 0 ... 3 53
OtherInstallmentPlans Housing NumberOfExistingCreditsInBank Job
0 2 1 2 2
1 2 1 1 2
2 2 1 1 1
3 2 2 1 2
4 2 2 2 2
LiabilityNumberOfPeople Phone ForeignWorker CreditScore
0 1 1 0 1
1 1 0 0 2
2 2 0 0 1
3 2 0 0 1
4 2 0 0 2
[5 rows x 21 columns]
label_encoders
{'CheckingAccountStatus': LabelEncoder(),
'CreditHistory': LabelEncoder(),
'CreditPurpose': LabelEncoder(),
'EmploymentSince': LabelEncoder(),
'ForeignWorker': LabelEncoder(),
'Housing': LabelEncoder(),
'Job': LabelEncoder(),
'OtherDebtors': LabelEncoder(),
'OtherInstallmentPlans': LabelEncoder(),
'PersonalStatusSex': LabelEncoder(),
'Phone': LabelEncoder(),
'Property': LabelEncoder(),
'SavingsAccount': LabelEncoder()}
All 21 columns are available, and the label encoders have been saved in a dictionary, too. Our data is now preprocessed.
You don't need to save these label encoders if you don't wish to decode the encoded values. We just saved them for the sake of completeness.
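If you do want to decode, a label encoder's inverse_transform method maps encoded values back to the original labels. A minimal sketch, assuming the label_encoders dictionary above:

# Sketch: decode an encoded value back to the original label
label_encoders['CheckingAccountStatus'].inverse_transform([0])
# e.g. returns array(['A11'], ...)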
import numpy as np
features = np.array(
    data_frame_encoded.drop(['CreditScore'], 1)
)
label = np.array(data_frame_encoded['CreditScore'])
Our features are not yet scaled. This is a problem, because differences in credit amount can be significantly larger than differences in age, for instance.
We must scale the training and testing data together; therefore, the last point at which we can still perform scaling is just before we split the training data from the testing data.
scaled_features = preprocessing.MinMaxScaler(
    feature_range=(0, 1)).fit_transform(features)

from sklearn import model_selection
(features_train, features_test,
 label_train, label_test) = model_selection.train_test_split(
    scaled_features,
    label,
    test_size=0.2
)
This section will explore how the parametrization of the k-nearest neighbor classifier affects the end result. The accuracy of credit scoring is currently quite low: 66.5%. Find a way to increase it by a few percentage points. You will need to have completed the previous exercises for this to work correctly.
There are many ways to accomplish this exercise. In this solution, I will show you one way to increase the accuracy by changing the parametrization.
You must have completed Exercise 13 to be able to complete this activity.
from sklearn import neighbors

classifier = neighbors.KNeighborsClassifier(n_neighbors=10)
classifier.fit(features_train, label_train)
classifier.score(features_test, label_test)
K=10: accuracy is 71.5%
K=15: accuracy is 70.5%
K=25: accuracy is 72%
K=50: accuracy is 74%
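A minimal sketch for scanning several values of K in one loop, assuming the training and testing splits from the previous exercise are still in scope:

from sklearn import neighbors

# Sketch: compare the test score for several neighbor counts
for k in [10, 15, 25, 50]:
    classifier = neighbors.KNeighborsClassifier(n_neighbors=k)
    classifier.fit(features_train, label_train)
    print(k, classifier.score(features_test, label_test))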
This section will discuss how to use the different parameters of a support vector machine classifier. We will compare and contrast the support vector classifier parameters you have learned about, and find a set of parameters that results in the highest classification accuracy on the training and testing data loaded and prepared in the previous activity. You will need to have completed the previous activities and exercises.
We will try out a few combinations. You may choose different parameters; other combinations may yield an even higher score.
classifier = svm.SVC(kernel="linear")
classifier.fit(features_train, label_train)
classifier.score(features_test, label_test)
classifier = svm.SVC(kernel="poly", C=2, degree=4, gamma=0.05)
classifier.fit(features_train, label_train)
classifier.score(features_test, label_test)
The output is as follows: 0.705.
classifier = svm.SVC(kernel="poly", C=2, degree=4, gamma=0.25)
classifier.fit(features_train, label_train)
classifier.score(features_test, label_test)
The output is as follows: 0.76.
classifier = svm.SVC(kernel="poly", C=2, degree=4, gamma=0.5)
classifier.fit(features_train, label_train)
classifier.score(features_test, label_test)
The output is as follows: 0.72.
classifier = svm.SVC(kernel="sigmoid")
classifier.fit(features_train, label_train)
classifier.score(features_test, label_test)
The output is as follows: 0.71.
classifier = svm.SVC(kernel="rbf", gamma=0.15)
classifier.fit(features_train, label_train)
classifier.score(features_test, label_test)
The output is as follows: 0.76.
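Instead of trying combinations by hand, scikit-learn's GridSearchCV can scan a parameter grid automatically and cross-validate each combination. A minimal sketch, assuming the same training data; the grid values below are illustrative:

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Sketch: scan kernel, C, and gamma combinations automatically
param_grid = {
    'kernel': ['poly', 'rbf', 'sigmoid'],
    'C': [1, 2],
    'gamma': [0.05, 0.15, 0.25, 0.5]
}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(features_train, label_train)
print(search.best_params_, search.best_score_)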
This section will discuss how to build a reliable decision tree model capable of aiding your company in finding cars that clients are likely to buy. We will assume that you are employed by a car rental agency focused on building lasting relationships with its clients. Your task is to build a decision tree model that classifies cars into one of four categories: unacceptable, acceptable, good, and very good.
The data set can be accessed here: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation. Click the Data Folder link to download the data set. Click the Data Set Description link to access the description of the attributes.
Evaluate the utility of your decision tree model.
Buying,Maintenance,Doors,Persons,LuggageBoot,Safety,Class
We simply call the label Class. We named the six features after their descriptions in https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names.
import pandas
data_frame = pandas.read_csv('car.data')
Let's check if the data got loaded correctly:
data_frame.head()
Buying Maintenance Doors Persons LuggageBoot Safety Class
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
labels = {
'Buying': ['vhigh', 'high', 'med', 'low'],
'Maintenance': ['vhigh', 'high', 'med', 'low'],
'Doors': ['2', '3', '4', '5more'],
'Persons': ['2', '4', 'more'],
'LuggageBoot': ['small', 'med', 'big'],
'Safety': ['low', 'med', 'high'],
'Class': ['unacc', 'acc', 'good', 'vgood']
}
from sklearn import preprocessing
label_encoders = {}
data_frame_encoded = pandas.DataFrame()
for column in data_frame:
    if column in labels:
        label_encoders[column] = preprocessing.LabelEncoder()
        label_encoders[column].fit(labels[column])
        data_frame_encoded[column] = label_encoders[column].transform(data_frame[column])
    else:
        data_frame_encoded[column] = data_frame[column]
import numpy as np
features = np.array(data_frame_encoded.drop(['Class'], 1))
label = np.array(data_frame_encoded['Class'])
from sklearn import model_selection
features_train, features_test, label_train, label_test = model_selection.train_test_split(
    features,
    label,
    test_size=0.1
)
Note that starting with scikit-learn 0.20, the train_test_split method is available in the model_selection module and no longer in the cross_validation module. In versions before 0.20, model_selection already contains the train_test_split method as well.
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(features_train, label_train)
The output of the fit method is as follows:
DecisionTreeClassifier(
class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False,
random_state=None,
splitter='best'
)
You can see the parametrization of the decision tree classifier. There are quite a few options we could set to tweak the performance of the classifier model.
decision_tree.score(features_test, label_test)
The output is as follows:
0.9884393063583815
from sklearn.metrics import classification_report

print(
    classification_report(
        label_test,
        decision_tree.predict(features_test)
    )
)
The output is as follows:
precision recall f1-score support
0 0.97 0.97 0.97 36
1 1.00 1.00 1.00 5
2 1.00 0.99 1.00 127
3 0.83 1.00 0.91 5
avg / total 0.99 0.99 0.99 173
The model has proven to be quite accurate. With such a high accuracy score, suspect the possibility of overfitting.
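One way to check for overfitting is cross-validation, which scores the model on several different train-test splits instead of a single one. A minimal sketch, assuming the features and label arrays from the earlier steps:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Sketch: score the model on five different splits
scores = cross_val_score(DecisionTreeClassifier(), features, label, cv=5)
print(scores.mean(), scores.std())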
We can reuse Steps 1 – 5 of Activity 1. The end of Step 5 looks as follows:
from sklearn import model_selection
features_train, features_test, label_train, label_test = model_selection.train_test_split(
    features,
    label,
    test_size=0.1
)
If you are using IPython, your variables may already be accessible in your console.
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

random_forest_classifier = RandomForestClassifier(n_estimators=100, max_depth=6)
random_forest_classifier.fit(features_train, label_train)
extra_trees_classifier = ExtraTreesClassifier(
    n_estimators=100, max_depth=6
)
extra_trees_classifier.fit(features_train, label_train)
from sklearn.metrics import classification_report

print(
    classification_report(
        label_test,
        random_forest_classifier.predict(features_test)
    )
)
The output for model 1 is as follows:
precision recall f1-score support
0 0.78 0.78 0.78 36
1 0.00 0.00 0.00 5
2 0.94 0.98 0.96 127
3 0.75 0.60 0.67 5
avg / total 0.87 0.90 0.89 173
We print the same report for model 2, the extra trees classifier:
print(
    classification_report(
        label_test,
        extra_trees_classifier.predict(features_test)
    )
)
precision recall f1-score support
0 0.72 0.72 0.72 36
1 0.00 0.00 0.00 5
2 0.93 1.00 0.96 127
3 0.00 0.00 0.00 5
avg / total 0.83 0.88 0.86 173
random_forest_classifier.score(features_test, label_test)
The output is as follows:
0.9017341040462428
We compute the score of extra_trees_classifier in the same way:
extra_trees_classifier.score(features_test, label_test)
The output is as follows:
0.884393063583815
We can see that the random forest classifier is performing slightly better than the extra trees classifier.
random_forest_classifier.feature_importances_
The output is as follows:
array([0.12656512, 0.09934031, 0.02073233, 0.35550329, 0.05411809, 0.34374086])
We query the same attribute of extra_trees_classifier:
extra_trees_classifier.feature_importances_
The output is as follows:
array([0.08699494, 0.07557066, 0.01221275, 0.38035005, 0.05879822, 0.38607338])
Both classifiers treat the third and the fifth attributes as quite unimportant. We may not be sure about the fifth attribute, as its importance score is above 5% in both models. However, we are quite certain that the third attribute is the least significant attribute in the decision. Let's look at the feature names once again.
data_frame_encoded.head()
The output is as follows:
Buying Maintenance Doors Persons LuggageBoot Safety Class
0 3 3 0 0 2 1 2
1 3 3 0 0 2 2 2
2 3 3 0 0 2 0 2
3 3 3 0 0 1 1 2
4 3 3 0 0 1 2 2
The least important feature is Doors. This is quite evident in hindsight: the number of doors doesn't have as big an influence on a car's rating as, for instance, its safety rating.
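To avoid reading the importances by position, we can pair each score with its column name. A minimal sketch, assuming the feature column order of the encoded data frame above:

# Sketch: pair each importance score with its feature name
feature_names = ['Buying', 'Maintenance', 'Doors',
                 'Persons', 'LuggageBoot', 'Safety']
for name, score in zip(
        feature_names, random_forest_classifier.feature_importances_):
    print(name, score)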
features2 = np.array(data_frame_encoded.drop(['Class', 'Doors'], 1))
label2 = np.array(data_frame_encoded['Class'])
(features_train2,
 features_test2,
 label_train2,
 label_test2) = model_selection.train_test_split(
    features2,
    label2,
    test_size=0.1
)
random_forest_classifier2 = RandomForestClassifier(
    n_estimators=100, max_depth=6
)
random_forest_classifier2.fit(features_train2, label_train2)
extra_trees_classifier2 = ExtraTreesClassifier(
    n_estimators=100, max_depth=6
)
extra_trees_classifier2.fit(features_train2, label_train2)
print(
    classification_report(
        label_test2,
        random_forest_classifier2.predict(features_test2)
    )
)
The output is as follows:
precision recall f1-score support
0 0.89 0.85 0.87 40
1 0.00 0.00 0.00 3
2 0.95 0.98 0.96 125
3 1.00 1.00 1.00 5
avg / total 0.92 0.93 0.93 173
print(
    classification_report(
        label_test2,
        extra_trees_classifier2.predict(features_test2)
    )
)
The output is as follows:
precision recall f1-score support
0 0.78 0.78 0.78 40
1 0.00 0.00 0.00 3
2 0.93 0.98 0.95 125
3 1.00 0.40 0.57 5
avg / total 0.88 0.90 0.88 173
Although we did improve by a few percentage points, note that a direct comparison is not possible, for the following reasons. First, the train-test split selects different data for training and testing, and a few badly selected data points can easily cause a few percentage points of increase or decrease in the scores. Second, the way we train the classifiers also has random elements, and this randomization may shift the performance of the classifiers a bit. Always use your best judgment when interpreting results, and measure your results multiple times on different train-test splits if needed.
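If you want repeatable comparisons, you can fix the random seeds of both the split and the classifiers. A minimal sketch, assuming the same variables as above; the seed value 42 is arbitrary:

# Sketch: make the split and the classifier deterministic
features_train2, features_test2, label_train2, label_test2 = \
    model_selection.train_test_split(
        features2, label2, test_size=0.1, random_state=42
    )
random_forest_classifier2 = RandomForestClassifier(
    n_estimators=100, max_depth=6, random_state=42
)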
random_forest_classifier2 = RandomForestClassifier(
    n_estimators=150,
    max_depth=8,
    criterion='entropy',
    max_features=5
)
random_forest_classifier2.fit(features_train2, label_train2)
print(
    classification_report(
        label_test2,
        random_forest_classifier2.predict(features_test2)
    )
)
The output is as follows:
precision recall f1-score support
0 0.95 0.95 0.95 40
1 0.50 1.00 0.67 3
2 1.00 0.97 0.98 125
3 0.83 1.00 0.91 5
avg / total 0.97 0.97 0.97 173
extra_trees_classifier2 = ExtraTreesClassifier(
    n_estimators=150,
    max_depth=8,
    criterion='entropy',
    max_features=5
)
extra_trees_classifier2.fit(features_train2, label_train2)
print(
    classification_report(
        label_test2,
        extra_trees_classifier2.predict(features_test2)
    )
)
The output is as follows:
precision recall f1-score support
0 0.92 0.88 0.90 40
1 0.40 0.67 0.50 3
2 0.98 0.97 0.97 125
3 0.83 1.00 0.91 5
avg / total 0.95 0.94 0.94 173
This section will detect products whose sales perform similarly, in order to recognize trends in product sales.
We will be using the Sales Transactions Weekly Dataset from this URL:
https://archive.ics.uci.edu/ml/datasets/Sales_Transactions_Dataset_Weekly Perform clustering on the dataset using the k-means Algorithm. Make sure you prepare your data for clustering based on what you have learned in the previous chapters.
Use the default settings for the k-means algorithm.
import pandas
data_frame = pandas.read_csv('Sales_Transactions_Dataset_Weekly.csv')

import numpy as np
drop_columns = ['Product_Code']
for w in range(0, 52):
    drop_columns.append('W' + str(w))
features = data_frame.drop(drop_columns, 1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features)
from sklearn.cluster import KMeans
k_means_model = KMeans()
k_means_model.fit(scaled_features)
k_means_model.labels_
k_means_model.cluster_centers_
The output will be as follows:
array([5, 5, 4, 5, 5, 3, 4, 5, 5, 5, 5, 5, 4, 5, 0, 0, 0, 0, 0, 4, 4, 4,
4, 0, 0, 5, 0, 0, 5, 0, 4, 4, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 5, 0, 0, 5, 0, 0, 0, 0, 0, 4, 0, 0, 5, 0, 0, 5, 0,
...
1, 7, 3, 2, 6, 7, 6, 2, 2, 6, 2, 7, 2, 7, 2, 6, 1, 3, 2, 2, 6, 6,
7, 7, 7, 1, 1, 2, 1, 2, 7, 7, 6, 2, 7, 6, 6, 6, 1, 6, 1, 6, 7, 7,
1, 1, 3, 5, 3, 3, 3, 5, 7, 2, 2, 2, 3, 2, 2, 7, 7, 3, 3, 3, 3, 2,
2, 6, 3, 3, 5, 3, 2, 2, 6, 7, 5, 2, 2, 2, 6, 2, 7, 6, 1])
How are these labels beneficial?
Suppose that the original data frame also contains the product names. You can then easily recognize that similar types of products sell similarly. There are also products that fluctuate a lot, and products that are seasonal in nature. For instance, products that promote fat loss and getting into shape tend to sell in the first half of the year, before the beach season.
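A minimal sketch for inspecting which products fall into each cluster, assuming the data frame still contains the Product_Code column:

# Sketch: list the first few product codes in each cluster
for cluster in range(len(k_means_model.cluster_centers_)):
    members = data_frame['Product_Code'][k_means_model.labels_ == cluster]
    print(cluster, list(members)[:5])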
This section will explore how images can be clustered. We will assume that you are working for a company that detects human emotions from photos. Your task is to extract the pixels making up a face in an avatar photo.
Create a clustering algorithm with Mean Shift to cluster pixels of images. Examine the results of the Mean Shift algorithm and check if any of the clusters contains a face when used on avatar images.
Then apply the k-means algorithm with a fixed default number of clusters: 8. Compare your results with the Mean Shift clustering algorithm.
from PIL import Image

image = Image.open('destructuring.jpg')
pixels = image.load()
import pandas
data_frame = pandas.DataFrame(
    [[x, y, pixels[x, y][0], pixels[x, y][1], pixels[x, y][2]]
     for x in range(image.size[0])
     for y in range(image.size[1])],
    columns=['x', 'y', 'r', 'g', 'b']
)
from sklearn.cluster import MeanShift
mean_shift_model = MeanShift()
mean_shift_model.fit(data_frame)
for i in range(len(mean_shift_model.cluster_centers_)):
    image = Image.open('destructuring.jpg')
    pixels = image.load()
    for j in range(len(data_frame)):
        if mean_shift_model.labels_[j] != i:
            pixels[int(data_frame['x'][j]),
                   int(data_frame['y'][j])] = (255, 255, 255)
    image.save('cluster' + str(i) + '.jpg')
from sklearn.cluster import KMeans

k_means_model = KMeans(n_clusters=8)
k_means_model.fit(data_frame)
for i in range(len(k_means_model.cluster_centers_)):
    image = Image.open('destructuring.jpg')
    pixels = image.load()
    for j in range(len(data_frame)):
        if k_means_model.labels_[j] != i:
            pixels[int(data_frame['x'][j]), int(data_frame['y'][j])] = (255, 255, 255)
    image.save('kmeanscluster' + str(i) + '.jpg')
The eight output images, one per cluster, are shown in the book at this point; each image keeps only the pixels belonging to the given cluster, with everything else painted white.
As you can see, the fifth cluster captured my face quite well. The clustering algorithm indeed located data points that are close together and contain similar colors.
import tensorflow.keras.datasets.mnist as mnist

(features_train, label_train), (features_test, label_test) = \
    mnist.load_data()
features_train = features_train / 255.0
features_test = features_test / 255.0
def flatten(matrix):
    return [elem for row in matrix for elem in row]

features_train_vector = [
    flatten(image) for image in features_train
]
features_test_vector = [
    flatten(image) for image in features_test
]
import numpy as np

label_train_vector = np.zeros((label_train.size, 10))
for i, label in enumerate(label_train_vector):
    label[label_train[i]] = 1

label_test_vector = np.zeros((label_test.size, 10))
for i, label in enumerate(label_test_vector):
    label[label_test[i]] = 1
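As an aside, the same one-hot encoding can be written in a single vectorized step with NumPy; this is an equivalent alternative, not the code used in the book:

# Sketch: vectorized one-hot encoding using an identity matrix
label_train_vector = np.eye(10)[label_train]
label_test_vector = np.eye(10)[label_test]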
import tensorflow as tf

f = tf.nn.softmax
x = tf.placeholder(tf.float32, [None, 28 * 28])
W = tf.Variable(tf.random_normal([784, 10]))
b = tf.Variable(tf.random_normal([10]))
y = f(tf.add(tf.matmul(x, W), b))
import random

y_true = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(
    logits=y,
    labels=y_true
)
cost = tf.reduce_mean(cross_entropy)
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=0.5
).minimize(cost)
session = tf.Session()
session.run(tf.global_variables_initializer())
iterations = 600
batch_size = 200
sample_size = len(features_train_vector)
for _ in range(iterations):
    indices = random.sample(range(sample_size), batch_size)
    batch_features = [
        features_train_vector[i] for i in indices
    ]
    batch_labels = [
        label_train_vector[i] for i in indices
    ]
    dictionary = {
        x: batch_features,
        y_true: batch_labels
    }
    session.run(optimizer, feed_dict=dictionary)
label_predicted = session.run(y, feed_dict={
    x: features_test_vector
})
label_predicted = [
    np.argmax(label) for label in label_predicted
]
from sklearn.metrics import confusion_matrix, accuracy_score

confusion_matrix(label_test, label_predicted)
The output is as follows:
array([[ 0, 0, 223, 80, 29, 275, 372, 0, 0, 1],
[ 0, 915, 4, 10, 1, 13, 192, 0, 0, 0],
[ 0, 39, 789, 75, 63, 30, 35, 0, 1, 0],
[ 0, 6, 82, 750, 13, 128, 29, 0, 0, 2],
[ 0, 43, 16, 16, 793, 63, 49, 0, 2, 0],
[ 0, 22, 34, 121, 40, 593, 76, 5, 0, 1],
[ 0, 29, 34, 6, 44, 56, 788, 0, 0, 1],
[ 1, 54, 44, 123, 715, 66, 24, 1, 0, 0],
[ 0, 99, 167, 143, 80, 419, 61, 0, 4, 1],
[ 0, 30, 13, 29, 637, 238, 58, 3, 1, 0]], dtype=int64)
accuracy_score(label_test, label_predicted)
The output is as follows:
0.4633
for _ in range(iterations):
    indices = random.sample(range(sample_size), batch_size)
    batch_features = [
        features_train_vector[i] for i in indices
    ]
    batch_labels = [
        label_train_vector[i] for i in indices
    ]
    dictionary = {
        x: batch_features,
        y_true: batch_labels
    }
    session.run(optimizer, feed_dict=dictionary)
Second run: 0.5107
Third run: 0.5276
Fourth run: 0.5683
Fifth run: 0.6002
Sixth run: 0.6803
Seventh run: 0.6989
Eighth run: 0.7074
Ninth run: 0.713
Tenth run: 0.7163
Twentieth run: 0.7308
Thirtieth run: 0.8188
Fortieth run: 0.8256
Fiftieth run: 0.8273
At the end of the fiftieth run, the improved confusion matrix looks as follows:
array([
[946, 0, 6, 3, 0, 1, 15, 2, 7, 0],
[ 0,1097, 3, 7, 1, 0, 4, 0, 23, 0],
[11, 3, 918, 11, 18, 0, 13, 8, 50, 0],
[3, 0, 23, 925, 2, 10, 4, 9, 34, 0],
[2, 2, 6, 1, 929, 0, 14, 2, 26, 0],
[16, 4, 7, 62, 8, 673, 22, 3, 97, 0],
[8, 2, 4, 3, 8, 8, 912, 2, 11, 0],
[5, 9, 33, 6, 9, 1, 0, 949, 16, 0],
[3, 4, 5, 12, 7, 4, 12, 3, 924, 0],
[8, 5, 7, 40, 470, 11, 5, 212, 251, 0]
],
dtype=int64)
Not a bad result. More than 8 out of 10 digits are accurately recognized.
This section will discuss how deep learning improves the performance of your model. We will assume that your boss is not satisfied with the results you presented in the previous activity and has asked you to add two hidden layers to your original model and determine whether the new layers improve the accuracy of the model. You will need some knowledge of deep learning to complete this activity.
x = tf.placeholder(tf.float32, [None, 28 * 28])

f1 = tf.nn.relu
W1 = tf.Variable(tf.random_normal([784, 200]))
b1 = tf.Variable(tf.random_normal([200]))
layer1_out = f1(tf.add(tf.matmul(x, W1), b1))

f2 = tf.nn.softmax
W2 = tf.Variable(tf.random_normal([200, 100]))
b2 = tf.Variable(tf.random_normal([100]))
layer2_out = f2(tf.add(tf.matmul(layer1_out, W2), b2))

f3 = tf.nn.softmax
W3 = tf.Variable(tf.random_normal([100, 10]))
b3 = tf.Variable(tf.random_normal([10]))
y = f3(tf.add(tf.matmul(layer2_out, W3), b3))
y_true = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(
    logits=y,
    labels=y_true
)
cost = tf.reduce_mean(cross_entropy)
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=0.5).minimize(cost)
session = tf.Session()
session.run(tf.global_variables_initializer())
iterations = 600
batch_size = 200
sample_size = len(features_train_vector)
for _ in range(iterations):
    indices = random.sample(range(sample_size), batch_size)
    batch_features = [
        features_train_vector[i] for i in indices
    ]
    batch_labels = [
        label_train_vector[i] for i in indices
    ]
    dictionary = {
        x: batch_features,
        y_true: batch_labels
    }
    session.run(optimizer, feed_dict=dictionary)
label_predicted = session.run(y, feed_dict={
    x: features_test_vector
})
label_predicted = [
    np.argmax(label) for label in label_predicted
]
confusion_matrix(label_test, label_predicted)
The output is as follows:
array([[ 801, 11, 0, 14, 0, 0, 56, 0, 61, 37],
[ 2, 1069, 0, 22, 0, 0, 18, 0, 9, 15],
[ 276, 138, 0, 225, 0, 2, 233, 0, 105, 53],
[ 32, 32, 0, 794, 0, 0, 57, 0, 28, 67],
[ 52, 31, 0, 24, 0, 3, 301, 0, 90, 481],
[ 82, 50, 0, 228, 0, 3, 165, 0, 179, 185],
[ 71, 23, 0, 14, 0, 0, 712, 0, 67, 71],
[ 43, 85, 0, 32, 0, 3, 31, 0, 432, 402],
[ 48, 59, 0, 192, 0, 2, 45, 0, 425, 203],
[ 45, 15, 0, 34, 0, 2, 39, 0, 162, 712]],
dtype=int64)
accuracy_score(label_test, label_predicted)
The output is 0.4516.
The accuracy did not improve.
Let's see if further runs improve the accuracy of the model.
Second run: 0.5216
Third run: 0.5418
Fourth run: 0.5567
Fifth run: 0.564
Sixth run: 0.572
Seventh run: 0.5723
Eighth run: 0.6001
Ninth run: 0.6076
Tenth run: 0.6834
Twentieth run: 0.7439
Thirtieth run: 0.7496
Fortieth run: 0.7518
Fiftieth run: 0.7536
Afterwards, we got the following results: 0.755, 0.7605, 0.7598, 0.7653
The final confusion matrix:
array([[ 954, 0, 2, 1, 0, 6, 8, 0, 5, 4],
[ 0, 1092, 5, 3, 0, 0, 6, 0, 27, 2],
[ 8, 3, 941, 16, 0, 2, 13, 0, 35, 14],
[ 1, 1, 15, 953, 0, 14, 2, 0, 13, 11],
[ 4, 3, 8, 0, 0, 1, 52, 0, 28, 886],
[ 8, 1, 5, 36, 0, 777, 16, 0, 31, 18],
[ 8, 1, 6, 1, 0, 6, 924, 0, 9, 3],
[ 3, 10, 126, 80, 0, 4, 0, 0, 35, 770],
[ 4, 0, 6, 10, 0, 6, 4, 0, 926, 18],
[ 4, 5, 1, 8, 0, 2, 2, 0, 18, 969]],
dtype=int64)
This deep neural network behaves even more chaotically than the single-layer one. It took 600 iterations of 200 samples to move the accuracy from 0.572 to 0.5723, yet not long afterwards, the same number of iterations moved the accuracy from 0.6076 to 0.6834.