Classification using TensorFlow

Neural networks can also be designed to classify data. As with the previous classifier, they output the probability of belonging to a class, so we can pick whatever threshold gives us the precision we require.

This example will be our first real dive into neural networks. Just as in the previous case, we will use placeholders, but instead of explicitly setting variables, we will use standard TensorFlow functions to create them.

Just as before, we will use the same data with all our current features:

X = np.asarray([get_features(aid, ['LinkCount', 'NumCodeLines',
                                   'NumTextTokens', 'AvgSentLen',
                                   'AvgWordLen', 'NumAllCaps',
                                   'NumExclams']) for aid in all_answers])
Y = np.asarray([meta[aid]['Score'] > 0 for aid in all_answers])

Of course, a good exercise here is to replicate the previous results by using fewer features and seeing how well this neural network can still discriminate between good and bad posts.
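For instance, a smaller feature set could be built like this (the particular subset chosen here is only illustrative):

feature_subset = ['LinkCount', 'NumTextTokens']  # illustrative subset
X_small = np.asarray([get_features(aid, feature_subset)
                      for aid in all_answers])
# The input placeholder shape must then match: (None, len(feature_subset))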

Neural networks are not the same as the brain: we explicitly create layers, when in reality no such thing exists (more on this later, but this is required to understand how we create a simple neural network). It is good practice to factor out the layers we want to create, so, for instance, we will create two types of layers: one for dense layers, meaning that they connect all inputs to all outputs, and one for the output layer, which has only one output unit:

import tensorflow as tf

def create_dense(x, n_units, name, alpha=0.2):
    # Hidden layer: fully connected, with a leaky ReLU activation
    h = tf.layers.dense(x, n_units, activation=tf.nn.leaky_relu, name=name)
    return h

def create_output(x):
    # Output layer: a single unit with a sigmoid activation
    h = tf.layers.dense(x, 1, activation=tf.nn.sigmoid, name="Output")
    return h

This output unit is created with a sigmoid activation. This means that the inner tf.matmul, which creates values between -inf and +inf, is fed into a function that maps these to the open interval (0, 1). The values 0 and 1 can never be reached exactly by the output, so when we train our neural network, we have to keep this in mind. As such, we change the target probabilities in our training data to accommodate this impossibility:

Y = Y.astype(np.float32)[:, None]
bce_ceil = 1e-5
Y = Y * (1 - 2 * bce_ceil) + bce_ceil
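To see why the bounds can never be reached, here is the sigmoid written out in plain NumPy (a standalone sketch, not part of the training pipeline):

import numpy as np

def sigmoid(z):
    # Maps any real value into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(-10.), sigmoid(0.), sigmoid(10.))
# ~4.54e-05  0.5  ~0.9999546 -- never exactly 0 or 1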

And now, we can split our data:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8)

Let's start by setting our usual hyperparameters:

n_epochs = 500
batch_size = 1000
steps = 10
layer1_size = 5

If we use all seven features, building our neural network looks like this:

X_tf = tf.placeholder(tf.float32, (None, 7), name="Input")
Y_ref_tf = tf.placeholder(tf.float32, (None, 1), name="Target_output")

h1 = create_dense(X_tf, layer1_size, name="Layer1")
Y_tf = create_output(h1)

loss = tf.reduce_mean(tf.square(Y_ref_tf - Y_tf))

grad_speed = .01
my_opt = tf.train.GradientDescentOptimizer(grad_speed)
train_step = my_opt.minimize(loss)

The gradient step (grad_speed) is now far greater than the one from the regression example. We could use a smaller step, but that would require more iterations to reach a local minimum of our loss function.

We can now train our neural network, very similarly to what we did in Chapter 2, Classifying with Real-world Examples. The only difference is that at the end, we also run the test data inside the neural network:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loss_vec = []
    for epoch in range(n_epochs):
        permut = np.random.permutation(len(X_train))
        for j in range(0, len(X_train), batch_size):
            batch = permut[j:j+batch_size]
            Xs = X_train[batch]
            Ys = Y_train[batch]

            sess.run(train_step, feed_dict={X_tf: Xs, Y_ref_tf: Ys})

        temp_loss = sess.run(loss, feed_dict={X_tf: X_train, Y_ref_tf: Y_train})
        loss_vec.append(temp_loss)
        if epoch % steps == steps - 1:
            print('Epoch #%i loss = %s' % (epoch, temp_loss))

    predict_train = sess.run(Y_tf, feed_dict={X_tf: X_train})
    predict_test = sess.run(Y_tf, feed_dict={X_tf: X_test})

For now, we throw away the neural network we trained, which is why we are also using the test data in the same session. We will see how to save and reuse a model in Chapter 8, Artificial Neural Networks and Deep Learning.
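Chapter 8 will cover this properly, but as a minimal sketch of what saving could look like with tf.train.Saver (the checkpoint path here is hypothetical):

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop as above ...
    saver.save(sess, './post_classifier')  # hypothetical checkpoint path

# A later session could then call saver.restore(sess, './post_classifier')
# instead of retraining from scratch.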

We can, of course, also display how well the optimizer behaved:

plt.plot(loss_vec, 'k-')
plt.title('Loss per Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

Refer to the following graph:

The loss per epoch can be very different from one run to another. grad_speed is the most important parameter that changes this graph: its value is a compromise between convergence speed and stability, and I advise you to try different values to see how the loss behaves over different runs.
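One way to experiment, sketched below with illustrative values, is to rebuild the graph for several learning rates and compare the resulting loss curves:

for grad_speed in (0.001, 0.01, 0.1):  # illustrative sweep
    tf.reset_default_graph()  # fresh graph for each rate
    X_tf = tf.placeholder(tf.float32, (None, 7), name="Input")
    Y_ref_tf = tf.placeholder(tf.float32, (None, 1), name="Target_output")
    h1 = create_dense(X_tf, layer1_size, name="Layer1")
    Y_tf = create_output(h1)
    loss = tf.reduce_mean(tf.square(Y_ref_tf - Y_tf))
    train_step = tf.train.GradientDescentOptimizer(grad_speed).minimize(loss)
    # ... run the training loop from above, then plot loss_vec,
    #     labeling each curve with its grad_speed ...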

If we look at the scores on the training and test data, we can see that the results match the best of the previous classifiers:

from sklearn.metrics import accuracy_score
score = accuracy_score(Y_train > .5, predict_train > .5)
print("Score (on training data): %.2f" % score)
score = accuracy_score(Y_test > .5, predict_test > .5)
print("Score (on testing data): %.2f" % score)

This will output:

Score (on training data): 0.65
Score (on testing data): 0.65

This is a good time to go back to the hyperparameters, especially the size of the intermediate, or hidden, layer, and modify its number of nodes. Does lowering it degrade the classifier's behavior? Does increasing it improve it? What about adding another intermediate layer? What is the impact of its number of neurons?
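As a starting point for that last experiment, a two-hidden-layer variant built from the same helpers could look like this (the size of the second layer is illustrative):

tf.reset_default_graph()  # fresh graph so the layer names do not clash
X_tf = tf.placeholder(tf.float32, (None, 7), name="Input")
Y_ref_tf = tf.placeholder(tf.float32, (None, 1), name="Target_output")

h1 = create_dense(X_tf, layer1_size, name="Layer1")
h2 = create_dense(h1, 3, name="Layer2")  # 3 units is an illustrative choice
Y_tf = create_output(h2)
# The loss, optimizer, and training loop stay exactly as before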

A nice feature of sklearn is the abundance of support functions and tutorials. This is a function from a tutorial on confusion matrices that helps visualize the quality of a classifier:

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

We can now use it with a threshold of .5 to see the behavior of this classifier on the training and test data:

class_names = ["Poor", "Good"]
from sklearn import metrics
print(metrics.classification_report(Y_train > .5, predict_train > .5,
                                    target_names=class_names))
plot_confusion_matrix(metrics.confusion_matrix(Y_train > .5, predict_train > .5),
                      classes=class_names,
                      title='Confusion matrix, without normalization')
plt.show()
print(metrics.classification_report(Y_test > .5, predict_test > .5,
                                    target_names=class_names))
plot_confusion_matrix(metrics.confusion_matrix(Y_test > .5, predict_test > .5),
                      classes=class_names,
                      title='Confusion matrix, without normalization')
plt.show()

This will output:

             precision    recall  f1-score   support

       Poor       0.63      0.73      0.67      8035
       Good       0.67      0.57      0.62      7965

avg / total       0.65      0.65      0.65     16000

Refer to the following graph:

See the following data:

             precision    recall  f1-score   support

       Poor       0.62      0.73      0.67      1965
       Good       0.68      0.57      0.62      2035

avg / total       0.65      0.65      0.65      4000

It is very interesting to see how stable the classifier is when moving from the training data to the test data. In both cases, we can also see that there are still lots of misclassifications, with an emphasis on good posts that were labeled as bad posts (which is probably better than the opposite!).
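Remember from the beginning of this section that the network outputs a probability, so nothing forces us to cut at .5. As an illustrative sketch, raising the threshold trades recall for precision on good posts (0.7 is an arbitrary example value, and metrics and class_names are reused from above):

threshold = 0.7  # illustrative; higher values favor precision over recall
print(metrics.classification_report(Y_test > .5, predict_test > threshold,
                                    target_names=class_names))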
