Keeping up with velocity

Several algorithms support incremental learning, that is, they can be updated one batch at a time instead of requiring all the data at once. For classification, recall the following:

  • sklearn.naive_bayes.MultinomialNB
  • sklearn.naive_bayes.BernoulliNB
  • sklearn.linear_model.Perceptron
  • sklearn.linear_model.SGDClassifier
  • sklearn.linear_model.PassiveAggressiveClassifier

For regression, we will recall the following:

  • sklearn.linear_model.SGDRegressor
  • sklearn.linear_model.PassiveAggressiveRegressor
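
All of the estimators listed above share the partial_fit method, which updates the model with one batch at a time. The following is a minimal sketch on synthetic data, not a benchmark: the toy targets, the batch sizes, and the label set all_classes are assumptions made for illustration. It shows the one difference between the two families: classifiers must be given the complete set of class labels on the first call, whereas regressors take no such argument.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier, SGDRegressor

rng = np.random.RandomState(101)
all_classes = np.array([0, 1])  # assumed label set, known up front

classifier = SGDClassifier(random_state=101)
regressor = SGDRegressor(random_state=101)

# Feed ten synthetic mini-batches of 100 rows and 5 features each
for _ in range(10):
    X = rng.normal(size=(100, 5))
    y_class = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy binary target
    y_reg = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # toy linear target
    # The full label set is required on the first call, since a single
    # batch may not contain every class; repeating it later is harmless
    classifier.partial_fit(X, y_class, classes=all_classes)
    # Regressors have no classes argument
    regressor.partial_fit(X, y_reg)

# Both models can predict at any point in the stream
X_new = rng.normal(size=(3, 5))
class_predictions = classifier.predict(X_new)
value_predictions = regressor.predict(X_new)
```

Passing a fixed label set, rather than the labels found in the current batch, keeps the call valid even when a batch happens to miss one of the classes.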

As for velocity, all of these learners are comparable in speed. You can verify this yourself with the following script:

In: import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier

# Candidate learners, all of which expose partial_fit.
# loss='log_loss' is spelled loss='log' in scikit-learn releases before 1.1
classifiers = {
    'SGDClassifier hinge loss': SGDClassifier(
        loss='hinge', random_state=101, max_iter=10),
    'SGDClassifier log loss': SGDClassifier(
        loss='log_loss', random_state=101, max_iter=10),
    'Perceptron': Perceptron(random_state=101, max_iter=10),
    'BernoulliNB': BernoulliNB(),
    'PassiveAggressiveClassifier': PassiveAggressiveClassifier(
        random_state=101, max_iter=10)
}
large_dataset = 'large_dataset_10__6.csv'
for algorithm in classifiers:
    start = datetime.now()
    # Stream the file from disk in chunks of 100 rows
    streaming = pd.read_csv(large_dataset, header=None, chunksize=100)
    learner = classifiers[algorithm]
    cumulative_accuracy = list()
    for n, chunk in enumerate(streaming):
        y = chunk.iloc[:, 0]   # first column: target
        X = chunk.iloc[:, 1:]  # remaining columns: features
        if n > 50:
            # After a warm-up of 51 chunks, score each chunk
            # before the learner is trained on it
            cumulative_accuracy.append(learner.score(X, y))
        # Assumes every chunk contains all of the class labels
        learner.partial_fit(X, y, classes=np.unique(y))
    elapsed_time = datetime.now() - start
    print(algorithm + ' : mean accuracy %0.3f in %s secs'
          % (np.mean(cumulative_accuracy), elapsed_time.total_seconds()))

Out: BernoulliNB : mean accuracy 0.734 in 41.101 secs
Perceptron : mean accuracy 0.616 in 37.479 secs
SGDClassifier hinge loss : mean accuracy 0.712 in 38.43 secs
SGDClassifier log loss : mean accuracy 0.716 in 39.618 secs
PassiveAggressiveClassifier : mean accuracy 0.625 in 40.622 secs

As a general note, remember that smaller batches are slower: they imply more frequent reads from a database or a file, and I/O is usually the bottleneck.
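
The effect of batch size on reading time can be checked with a rough sketch such as the following. It writes a small synthetic CSV to a temporary file (a stand-in for a real data source; the row count and the chunk sizes are arbitrary assumptions) and times a full pass over it with pandas for each chunk size. On a file this small the differences may be modest, and the timings will vary from machine to machine.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Write a small synthetic CSV to a temporary file
rng = np.random.RandomState(101)
n_rows = 20000
data = rng.normal(size=(n_rows, 6))
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as handle:
    csv_path = handle.name
pd.DataFrame(data).to_csv(csv_path, header=False, index=False)

timings = {}
chunk_counts = {}
for chunksize in (100, 1000, 10000):
    start = time.perf_counter()
    n_chunks = 0
    # Read the whole file once, chunk by chunk, without any learning
    for chunk in pd.read_csv(csv_path, header=None, chunksize=chunksize):
        n_chunks += 1
    timings[chunksize] = time.perf_counter() - start
    chunk_counts[chunksize] = n_chunks
    print('chunksize %5d : %3d chunks read in %0.3f secs'
          % (chunksize, n_chunks, timings[chunksize]))

# Clean up the temporary file
os.remove(csv_path)
```

Because no model is trained here, the measured time isolates the cost of reading and parsing, which is the part that grows as the batches shrink.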