In machine learning, unsupervised learning is used to find hidden structure in unlabeled datasets. Since the data carry no labels, there is no error signal against which potential solutions can be evaluated.
Unsupervised machine learning includes several algorithms; we will consider the popular clustering algorithms here.
Clustering is the task of grouping a set of objects in such a way that objects with similar characteristics end up in the same group, while dissimilar objects end up in other groups. In clustering, the input datasets are not labeled; the groups are instead derived from the similarity structure of the data itself.
Where supervised classification maps data to known categories with the help of labeled training datasets, unsupervised learning must derive the categories itself. The corresponding procedure is known as clustering (or cluster analysis), and involves grouping data into categories based on some measure of inherent similarity; for example, the distance between data points.
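To make the distance notion concrete, here is a minimal R sketch with illustrative values (the points are hypothetical, not from the dataset used later):

```r
# Two points in 2-D feature space
a <- c(1, 2)
b <- c(4, 6)

# Euclidean distance between them; smaller distances
# mean the points are better candidates for the same cluster
d <- sqrt(sum((a - b)^2))
d
# 5
```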
From the following figure, we can identify clustering as grouping objects based on their similarity:
There are several clustering techniques available within R libraries, such as k-means, k-medoids, hierarchical, and density-based clustering. Among them, k-means is the most widely used clustering algorithm in data science. This algorithm requires the number of clusters as an input parameter from the user.
Applications of clustering are as follows:
We are considering the k-means method here for implementing the clustering model over the iris input dataset, which can be loaded by simply calling the built-in R dataset iris (for more information, visit http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html). Here we will see how k-means clustering can be performed with R.
# Loading iris flower dataset
data("iris")

# generating clusters for iris dataset
kmeans <- kmeans(iris[, -5], 3, iter.max = 1000)

# comparing iris Species with generated cluster points
Comp <- table(iris[, 5], kmeans$cluster)
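After the preceding script runs, the fitted object exposes components such as cluster sizes and centers. A quick inspection sketch follows (it stores the result in km rather than kmeans, to avoid shadowing the kmeans() function):

```r
data("iris")

# Fit 3 clusters on the four numeric columns (dropping Species)
km <- kmeans(iris[, -5], 3, iter.max = 1000)

km$size     # number of points assigned to each cluster
km$centers  # one row of column means (a 4-D centroid) per cluster

# Cross-tabulate true species against assigned clusters
table(iris[, 5], km$cluster)
```

Because the species labels were never shown to the algorithm, the cross-tabulation is a useful sanity check of how well the discovered clusters line up with the true classes.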
Deriving clusters for small datasets is quite simple, but deriving them for huge datasets requires Hadoop to provide the computation power.
Since the k-means clustering algorithm is already implemented in RHadoop, we are going to use and understand it. You can change its Mapper and Reducer to suit your input dataset format. As we are dealing with Hadoop, the Mapper and Reducer need to be developed so that they run on the nodes in parallel.
The outline of the clustering algorithm is as follows:
Now we will run kmeans.mr (the k-means MapReduce job) by providing the required parameters. Let's understand its components one by one.
dist.fun: First, we will look at the dist.fun function, which calculates the squared Euclidean distance between a matrix of centers C and a matrix of points P. When tested, it could process 10^6 points and 10^2 centers in five dimensions in approximately 16 seconds.

# distance calculation function
dist.fun =
  function(C, P) {
    apply(
      C,
      1,
      function(x)
        colSums((t(P) - x)^2))
  }
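To see dist.fun in action on a small example, you can check that it returns one squared distance per (point, center) pair; the data below are illustrative, not part of the original listing:

```r
# distance calculation function (as defined above)
dist.fun = function(C, P) {
  apply(C, 1, function(x) colSums((t(P) - x)^2))
}

P <- matrix(rnorm(50), ncol = 5)   # 10 points in 5 dimensions
C <- matrix(rnorm(15), ncol = 5)   # 3 candidate centers

# One row per point, one column per center
D <- dist.fun(C, P)
dim(D)   # 10 x 3 matrix of squared distances
```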
kmeans.map: The Mapper of the k-means MapReduce algorithm computes the distance between the points and all the centers, and returns the closest center for each point. This Mapper runs in iterations based on the following code. In the first iteration, the cluster centers are assigned randomly; in each subsequent iteration, they are calculated based on the minimum distance from all the points of the cluster.

# k-means Mapper
kmeans.map =
  function(., P) {
    nearest = {
      # First iteration - assign random cluster centers
      if(is.null(C))
        sample(
          1:num.clusters,
          nrow(P),
          replace = TRUE)
      # Rest of the iterations, where the clusters are assigned
      # based on the minimum distance from points
      else {
        D = dist.fun(C, P)
        nearest = max.col(-D)
      }
    }
    if(!(combine || in.memory.combine))
      keyval(nearest, P)
    else
      keyval(nearest, cbind(1, P))
  }
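Outside Hadoop, the Mapper's core step of picking the closest center for each point via max.col(-D) can be sketched locally as follows (hypothetical data; dist.fun as defined earlier):

```r
dist.fun = function(C, P) {
  apply(C, 1, function(x) colSums((t(P) - x)^2))
}

P <- rbind(c(0, 0), c(0.1, 0), c(5, 5))  # two points near the origin, one far away
C <- rbind(c(0, 0), c(5, 5))             # two centers

D <- dist.fun(C, P)       # 3 x 2 matrix of squared distances
# Negating D turns "minimum distance" into "maximum", which
# max.col finds: the index of the closest center per point
nearest <- max.col(-D)
nearest
# 1 1 2
```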
kmeans.reduce: The Reducer of the k-means MapReduce algorithm computes, for each key (cluster), the column averages of the matrix of points assigned to it.

# k-means Reducer
kmeans.reduce = {
  # calculating the column average for both of the
  # conditions
  if (!(combine || in.memory.combine))
    function(., P)
      t(as.matrix(apply(P, 2, mean)))
  else
    function(k, P)
      keyval(
        k,
        t(as.matrix(apply(P, 2, sum))))
}
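The Reducer's column-average step can be checked locally; this sketch (with illustrative data) shows how the points of one cluster collapse to a single centroid row:

```r
# Points currently assigned to one cluster (one row per point)
P <- rbind(c(1, 2), c(3, 4), c(5, 6))

# Column means, shaped as a single-row matrix (the new center)
center <- t(as.matrix(apply(P, 2, mean)))
center
# a 1 x 2 matrix: 3 4
```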
kmeans.mr: Defining the k-means MapReduce function involves specifying several input parameters, which are as follows:

P: This denotes the input data points
num.clusters: This is the total number of clusters
num.iter: This is the total number of iterations to be processed with datasets
combine: This will decide whether the Combiner should be enabled or disabled (TRUE or FALSE)
in.memory.combine: This will decide whether the combining is performed in memory (TRUE or FALSE)

# k-means MapReduce
kmeans.mr =
  function(
    P,
    num.clusters,
    num.iter,
    combine,
    in.memory.combine) {
    C = NULL
    for(i in 1:num.iter) {
      C =
        values(
          # Loading hdfs dataset
          from.dfs(
            # MapReduce job, with specification of input dataset,
            # Mapper and Reducer
            mapreduce(
              P,
              map = kmeans.map,
              reduce = kmeans.reduce)))
      if(combine || in.memory.combine)
        C = C[, -1]/C[, 1]
      if(nrow(C) < num.clusters) {
        C =
          rbind(
            C,
            matrix(
              rnorm(
                (num.clusters - nrow(C)) * nrow(C)),
              ncol = nrow(C)) %*% C)
      }
    }
    C
  }
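For readers without a Hadoop cluster, the same iterate of map (assign points) and reduce (average each group) can be sketched in plain R. This is a simplified in-memory analogue of what kmeans.mr does, not the RHadoop implementation; kmeans.local and its initialization scheme are illustrative inventions:

```r
dist.fun = function(C, P) {
  apply(C, 1, function(x) colSums((t(P) - x)^2))
}

# In-memory analogue of one kmeans.mr run (hypothetical helper)
kmeans.local <- function(P, num.clusters, num.iter) {
  # Initialize centers from randomly chosen data points
  C <- P[sample(nrow(P), num.clusters), , drop = FALSE]
  for (i in 1:num.iter) {
    # "map" step: closest center per point
    nearest <- max.col(-dist.fun(C, P))
    # "reduce" step: column average of each group of points
    C <- do.call(rbind, lapply(
      split.data.frame(P, nearest),
      colMeans))
  }
  C
}

set.seed(1)
P <- matrix(rnorm(200), ncol = 2)
centers <- kmeans.local(P, num.clusters = 3, num.iter = 5)
dim(centers)   # up to 3 rows (a cluster can go empty), 2 columns
```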
# Input data points
P =
  do.call(
    rbind,
    rep(
      list(
        # Generating Matrix of
        matrix(
          # Generate random normalized data with sd = 10
          rnorm(10, sd = 10),
          ncol = 2)),
      20)) +
  matrix(rnorm(200), ncol = 2)
Now, run kmeans.mr (the k-means MapReduce job) by providing it with the required parameters.

# Running the kmeans.mr Hadoop MapReduce algorithm with
# the required input parameters
kmeans.mr(
  to.dfs(P),
  num.clusters = 12,
  num.iter = 5,
  combine = FALSE,
  in.memory.combine = FALSE)