Unsupervised machine learning algorithms

In machine learning, unsupervised learning is used to find hidden structure in unlabeled datasets. Since the datasets are not labeled, there is no error signal against which potential solutions can be evaluated.

Unsupervised machine learning includes several algorithms, some of which are as follows:

  • Clustering
  • Artificial neural networks
  • Vector quantization

We will consider popular clustering algorithms here.

Clustering

Clustering is the task of grouping a set of objects in such a way that objects with similar characteristics are placed in the same category, while dissimilar objects are placed in other categories. In clustering, the input datasets are not labeled; the data points are grouped based on the similarity of their structure.

In supervised machine learning, the classification technique performs a similar mapping of data to categories, but it relies on a provided set of labeled training data. The unsupervised counterpart is known as clustering (or cluster analysis), and involves grouping data into categories based on some measure of inherent similarity; for example, the distance between data points.
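Since the measure of similarity here is simply a distance, it can be illustrated directly in R. The following is a minimal sketch with three made-up two-dimensional points; points that lie close together would end up in the same cluster:

# Three sample data points in two dimensions; the first two are
# close together and the third is far away
points <- matrix(c(1.0, 1.0,
                   1.2, 0.9,
                   8.0, 9.0),
                 ncol = 2, byrow = TRUE)

# Pairwise Euclidean distances between the points; a clustering
# algorithm would group the first two points together
dist(points)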

From the following figure, we can identify clustering as grouping objects based on their similarity:

(Figure: objects grouped into clusters based on their similarity)

There are several clustering techniques available within R libraries, such as k-means, k-medoids, hierarchical, and density-based clustering. Among them, k-means is the most widely used clustering algorithm in data science. This algorithm requires the number of clusters as an input parameter from the user.
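The following sketch shows how each of these techniques is typically invoked; it is illustrative only, and assumes the add-on packages cluster (for k-medoids) and dbscan (for density-based clustering) are installed, while kmeans and hclust ship with base R:

library(cluster)  # provides pam() for k-medoids
library(dbscan)   # provides dbscan() for density-based clustering

x <- as.matrix(iris[, -5])  # any numeric matrix will do

kmeans(x, centers = 3)            # k-means clustering
pam(x, k = 3)                     # k-medoids clustering
hclust(dist(x))                   # hierarchical clustering
dbscan(x, eps = 0.5, minPts = 5)  # density-based clustering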

Applications of clustering are as follows:

  • Market segmentation
  • Social network analysis
  • Organizing computer networks
  • Astronomical data analysis

Clustering with R

We are considering the k-means method here for implementing a clustering model over the iris dataset, which ships with R as a built-in dataset (for more information, visit http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html). Here we will see how k-means clustering can be performed with R.

# Loading the iris flower dataset
data("iris")

# Generating three clusters from the four numeric columns of iris
km <- kmeans(iris[, -5], 3, iter.max = 1000)

# Comparing the iris Species labels with the generated cluster assignments
Comp <- table(iris[, 5], km$cluster)
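To judge the result, you can inspect the comparison table and the fitted cluster centers; the following lines are a small follow-up sketch using the km and Comp objects created in the preceding code:

# Viewing how the three generated clusters map onto the species labels
print(Comp)

# Viewing the centers of the generated clusters
print(km$centers)

# Plotting two of the measurements, colored by the assigned cluster
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster)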

Deriving clusters for small datasets is quite simple, but deriving them for huge datasets requires the use of Hadoop to provide the necessary computation power.

Performing clustering with R and Hadoop

Since the k-means clustering algorithm has already been developed for RHadoop, we are going to use and understand it. You can make changes in its Mapper and Reducer as per the input dataset format. As we are dealing with Hadoop, the Mapper and the Reducer need to be developed so that they can run on the nodes in a parallel manner.
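All of the following building blocks rely on the MapReduce primitives of the rmr2 package from RHadoop, so it must be loaded first. This is a minimal setup sketch, assuming rmr2 and its underlying Hadoop environment are already installed and configured:

# rmr2 provides mapreduce(), keyval(), to.dfs(), from.dfs(), and
# values(), which are used throughout the following code
library(rmr2)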

The outline of the clustering algorithm is as follows:

  • Defining the dist.fun distance function
  • Defining the kmeans.map k-means Mapper function
  • Defining the kmeans.reduce k-means Reducer function
  • Defining the kmeans.mr k-means MapReduce function
  • Defining input data points to be provided to the clustering algorithms

Now we will run kmeans.mr (the k-means MapReduce job) by providing the required parameters.

Let's understand them one by one.

  • dist.fun: First, we will look at the dist.fun function, which calculates the distance between a matrix of centers, C, and a matrix of points, P. When tested, it computed the distances for 10^6 points and 10^2 centers in five dimensions in approximately 16 seconds.
    # distance calculation function
    dist.fun = 
          function(C, P) {
            apply(
              C,
              1, 
              function(x) 
                colSums((t(P) - x)^2))}
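    # A quick illustrative check (not part of the original listing):
    # with two centers and three points in two dimensions, dist.fun
    # returns a 3 x 2 matrix of squared distances, one row per point
    # and one column per center
    # C.test = matrix(c(0, 0, 1, 1), ncol = 2, byrow = TRUE)
    # P.test = matrix(c(0, 1, 1, 0, 2, 2), ncol = 2, byrow = TRUE)
    # dist.fun(C.test, P.test)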
  • kmeans.map: The Mapper of the k-means MapReduce algorithm computes the distance between each point and all the centers and returns the closest center for each point. The Mapper runs iteratively based on the following code: in the first iteration, the points are assigned to random cluster centers, and from the next iteration onward, each point is assigned to the center at the minimum distance.
    # k-Means Mapper
      kmeans.map = 
          function(., P) {
            nearest = {

    # First iteration - assign random cluster centers
              if(is.null(C)) 
                sample(
                  1:num.clusters, 
                  nrow(P), 
                  replace = T)

    # Rest of the iterations, where the clusters are assigned
    # based on the minimum distance from the points
              else {
                D = dist.fun(C, P)
                nearest = max.col(-D)}}

            if(!(combine || in.memory.combine))
              keyval(nearest, P) 
            else 
              keyval(nearest, cbind(1, P))}
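    # Note on scope (an observation, not part of the original
    # listing): C, num.clusters, combine, and in.memory.combine are
    # free variables here, so they must be visible in the environment
    # in which kmeans.map is evaluated; for instance, by defining the
    # Mapper inside kmeans.mr so that its arguments are in scope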
  • kmeans.reduce: The Reducer of the k-means MapReduce algorithm computes the column averages of all the points that share the same key, that is, of all the points assigned to the same cluster center.
    # k-Means Reducer
    kmeans.reduce = {

    # Calculating the column average for both of the conditions
          if (!(combine || in.memory.combine)) 
            function(., P) 
              t(as.matrix(apply(P, 2, mean)))
          else 
            function(k, P) 
              keyval(
                k, 
                t(as.matrix(apply(P, 2, sum))))}
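    # Note (not part of the original listing): the Reducer is chosen
    # once, when this block is evaluated. Without a Combiner it
    # averages the points of each cluster directly; with one it only
    # sums them, together with the count column the Mapper prepended
    # via cbind(1, P), and kmeans.mr divides the sums by the counts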
  • kmeans.mr: Defining the k-means MapReduce function involves specifying several input parameters, which are as follows:
    • P: This denotes the input data points
    • num.clusters: This is the total number of clusters
    • num.iter: This is the total number of iterations to be processed with datasets
    • combine: This decides whether a Combiner should be enabled or disabled (TRUE or FALSE)
    • in.memory.combine: This decides whether the combining should be performed in memory on the Mapper side (TRUE or FALSE)
      # k-Means MapReduce function
      kmeans.mr = 
        function(
          P, 
          num.clusters, 
          num.iter, 
          combine, 
          in.memory.combine) {
          C = NULL
          for(i in 1:num.iter) {
            C = 
              values(

      # Loading the dataset from HDFS
                from.dfs(

      # MapReduce job, with specification of the input dataset,
      # Mapper, and Reducer
                  mapreduce(
                    P,
                    map = kmeans.map,
                    reduce = kmeans.reduce)))
            if(combine || in.memory.combine)
              C = C[, -1]/C[, 1]
            if(nrow(C) < num.clusters) {
              C = 
                rbind(
                  C,
                  matrix(
                    rnorm(
                      (num.clusters - 
                         nrow(C)) * nrow(C)), 
                    ncol = nrow(C)) %*% C) }}
          C}
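      # Note on the re-seeding step above (not part of the original
      # listing): if some cluster receives no points in an iteration,
      # nrow(C) falls below num.clusters, and the missing centers are
      # regenerated as random linear combinations (rnorm weights) of
      # the surviving centers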
  • Defining the input data points to be provided to the clustering algorithms:
    # Input data points: five random cluster centers, each repeated
    # 20 times, plus unit-variance noise
    P = do.call(
          rbind, 
          rep(
            list(

    # Generating a 5 x 2 matrix of random centers with sd = 10
              matrix(
                rnorm(10, sd = 10), 
                ncol = 2)), 
            20)) + 
        matrix(rnorm(200), ncol = 2)
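    # Optional local sanity check (an assumption, not part of the
    # original listing): the 100 generated points form five blobs of
    # 20 points each around the random centers, which can be verified
    # with a scatter plot before submitting the Hadoop job
    # plot(P)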
  • Running kmeans.mr (the k-means MapReduce job) by providing it with the required parameters.
    # Running the kmeans.mr Hadoop MapReduce job with the
    # required input parameters
    
    kmeans.mr(
          to.dfs(P),
          num.clusters = 12, 
          num.iter = 5,
          combine = FALSE,
          in.memory.combine = FALSE)
  • The output of the preceding command is the matrix of final cluster centers returned by kmeans.mr.