Performance considerations

The three unsupervised learning techniques share the same limitation: high computational complexity.

K-means

K-means has a computational complexity of O(iKnm), where i is the number of iterations (or recursions), K is the number of clusters, n is the number of observations, and m is the number of features. Here are some remedies for the poor performance of the K-means algorithm:

  • Reducing the average number of iterations by seeding the centroids using a technique such as initialization by ranking the variance of the initial cluster, as described at the beginning of this chapter (a sketch of one such seeding heuristic follows this list)
  • Using a parallel implementation of K-means and leveraging a large-scale framework such as Hadoop or Spark
  • Reducing the number of outliers and features by filtering out the noise with a smoothing algorithm such as a discrete Fourier transform or a Kalman filter
  • Decreasing the dimensions of the model by following a two-step process:
    1. Execute a first pass with a smaller number of clusters K and/or a loose exit condition regarding the reassignment of data points. The data points close to each centroid are aggregated into a single observation.
    2. Execute a second pass on the aggregated observations.
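
The following is a minimal sketch, in plain Scala, of one plausible variance-ranking seed: it selects the feature with the highest variance, ranks the observations along that feature, and picks K evenly spaced observations as the initial centroids. The object and method names (KMeansSeeding, seedCentroids) are illustrative, and the heuristic approximates the seeding idea discussed earlier in the chapter rather than reproducing its exact implementation:

// Sketch: seed K-means centroids by ranking observations along the
// feature with the highest variance, then picking K evenly spaced points.
object KMeansSeeding {
  type Obs = Array[Double]

  // Variance of a single feature across all observations
  private def variance(values: Seq[Double]): Double = {
    val mean = values.sum / values.size
    values.map(x => (x - mean) * (x - mean)).sum / values.size
  }

  // Select K initial centroids spread along the most dispersed feature,
  // which tends to reduce the number of iterations before convergence.
  def seedCentroids(obs: Seq[Obs], k: Int): Seq[Obs] = {
    require(obs.nonEmpty && k > 0 && k <= obs.size)
    val numFeatures = obs.head.length
    // Index of the feature with the largest variance
    val mostDispersed = (0 until numFeatures).maxBy(f => variance(obs.map(_(f))))
    // Rank the observations along that feature and pick K evenly spaced ones
    val ranked = obs.sortBy(_(mostDispersed))
    val step = ranked.size / k
    (0 until k).map(i => ranked(i * step + step / 2))
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Array(0.2, 10.0), Array(0.3, 9.5), Array(0.1, 50.0),
      Array(0.25, 48.0), Array(0.15, 95.0), Array(0.35, 102.0)
    )
    seedCentroids(data, 3).foreach(c => println(c.mkString("[", ", ", "]")))
  }
}

Seeding along the most dispersed feature spreads the initial centroids across the region of the data with the widest spread, which typically lowers the average number of reassignment iterations compared to purely random initialization.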

EM

The computational complexity of the expectation-maximization algorithm for each iteration (E + M steps) is O(m²n), where m is the number of hidden or latent variables and n is the number of observations.

A partial list of suggested performance improvements includes:

  • Filtering of raw data to remove noise and outliers
  • Using a sparse matrix on a large feature set to reduce the complexity of the covariance matrix, if possible
  • Applying the Gaussian mixture model wherever possible; the assumption of a Gaussian distribution simplifies the computation of the log-likelihood (see the sketch after this list)
  • Using a parallel data processing framework such as Apache Hadoop or Spark, as discussed in the Apache Spark section in Chapter 12, Scalable Frameworks
  • Using a kernel function to reduce the estimate of covariance in the E-step
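
To illustrate the Gaussian mixture point above, the following minimal Scala sketch computes the log-likelihood of a univariate Gaussian mixture; the closed-form Gaussian density is what keeps this quantity cheap to evaluate and monitor across EM iterations. The Component case class and the logLikelihood method are illustrative names, not part of any library:

// Sketch: log-likelihood of a univariate Gaussian mixture
object GmmLogLikelihood {
  // One mixture component: weight, mean, and variance
  case class Component(weight: Double, mean: Double, variance: Double) {
    // Closed-form Gaussian probability density
    def pdf(x: Double): Double = {
      val d = x - mean
      math.exp(-0.5 * d * d / variance) / math.sqrt(2.0 * math.Pi * variance)
    }
  }

  // Log-likelihood of the observations under the mixture:
  // sum over x of log( sum over components of weight * N(x | mean, variance) )
  def logLikelihood(obs: Seq[Double], mixture: Seq[Component]): Double =
    obs.map(x => math.log(mixture.map(c => c.weight * c.pdf(x)).sum)).sum

  def main(args: Array[String]): Unit = {
    val mixture = Seq(Component(0.6, 0.0, 1.0), Component(0.4, 5.0, 2.0))
    val obs = Seq(-0.3, 0.1, 0.4, 4.8, 5.2, 5.5)
    println(f"log-likelihood = ${logLikelihood(obs, mixture)}%.4f")
  }
}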

PCA

The computational complexity of the extraction of the principal components is O(m²n + m³), where m is the number of features and n is the number of observations. The first term represents the computational complexity of computing the covariance matrix. The second term reflects the computational complexity of the eigenvalue decomposition of the m × m covariance matrix.
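
As an illustration of where the two terms come from, here is a minimal Scala sketch of PCA built on Apache Commons Math (an assumed third-party dependency, not necessarily the implementation used elsewhere in this chapter): the covariance matrix computation accounts for the O(m²n) term, and the eigenvalue decomposition of that m × m matrix for the cubic term.

// Sketch: principal components via covariance matrix + eigenvalue decomposition
import org.apache.commons.math3.linear.EigenDecomposition
import org.apache.commons.math3.stat.correlation.Covariance

object PcaSketch {
  // Returns the eigenvalues of the covariance matrix in decreasing order;
  // the leading eigenvalues identify the principal components.
  def principalVariances(data: Array[Array[Double]]): Array[Double] = {
    val cov = new Covariance(data).getCovarianceMatrix // O(m^2 n)
    val eig = new EigenDecomposition(cov)               // cubic in the matrix size
    eig.getRealEigenvalues.sorted(Ordering[Double].reverse)
  }

  def main(args: Array[String]): Unit = {
    val data = Array(
      Array(2.5, 2.4, 0.5), Array(0.5, 0.7, 1.1), Array(2.2, 2.9, 0.4),
      Array(1.9, 2.2, 0.6), Array(3.1, 3.0, 0.3), Array(2.3, 2.7, 0.5)
    )
    principalVariances(data).foreach(v => println(f"$v%.4f"))
  }
}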

The list of potential performance improvements or alternative solutions for PCA includes the following:

  • Assuming that the variance is Gaussian
  • Using a sparse matrix to compute eigenvalues for problems with large feature sets and missing data
  • Investigating alternatives to PCA, such as the discrete Fourier transform (DFT) or singular value decomposition (SVD), to reduce the dimension of the model [4:17]
  • Using PCA in conjunction with EM (at the research stage)
  • Deploying a dataset on a parallel data processing framework such as Apache Hadoop or Spark, as described in the Apache Spark section in Chapter 12, Scalable Frameworks