In machine learning, unsupervised learning is used to find hidden structure in unlabeled datasets. Since the data carry no labels, there is no error signal against which potential solutions can be evaluated.
Unsupervised machine learning includes several algorithms; we will consider the popular clustering algorithms here.
Clustering is the task of grouping a set of objects in such a way that objects with similar characteristics end up in the same group, while dissimilar objects end up in other groups. In clustering, the input datasets are not labeled; the groups are instead derived from the similarity structure of the data itself.
Where supervised classification maps data to known categories with the help of labeled training datasets, unsupervised learning must derive the categories itself. The corresponding procedure is known as clustering (or cluster analysis), and involves grouping data into categories based on some measure of inherent similarity; for example, the distance between data points.
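To make the distance notion concrete, here is a minimal R sketch with illustrative values (the points are hypothetical, not from the dataset used later):

```r
# Two points in 2-D feature space
a <- c(1, 2)
b <- c(4, 6)

# Euclidean distance between them; smaller distances
# mean the points are better candidates for the same cluster
d <- sqrt(sum((a - b)^2))
d
# 5
```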
From the following figure, we can identify clustering as grouping objects based on their similarity:
There are several clustering techniques available within R libraries, such as k-means, k-medoids, hierarchical, and density-based clustering. Among them, k-means is the most widely used clustering algorithm in data science. This algorithm requires the number of clusters as an input parameter from the user.
Applications of clustering are as follows:
We are considering the k-means method here for implementing the clustering model over the iris input dataset, which can be loaded by simply calling the built-in R dataset iris (for more information, visit http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html). Here we will see how k-means clustering can be performed with R.
# Loading iris flower dataset
data("iris")

# generating clusters for iris dataset
kmeans <- kmeans(iris[, -5], 3, iter.max = 1000)

# comparing iris Species with generated cluster points
Comp <- table(iris[, 5], kmeans$cluster)
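After the preceding script runs, the fitted object exposes components such as cluster sizes and centers. A quick inspection sketch follows (it stores the result in km rather than kmeans, to avoid shadowing the kmeans() function):

```r
data("iris")

# Fit 3 clusters on the four numeric columns (dropping Species)
km <- kmeans(iris[, -5], 3, iter.max = 1000)

km$size     # number of points assigned to each cluster
km$centers  # one row of column means (a 4-D centroid) per cluster

# Cross-tabulate true species against assigned clusters
table(iris[, 5], km$cluster)
```

Because the species labels were never shown to the algorithm, the cross-tabulation is a useful sanity check of how well the discovered clusters line up with the true classes.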
Deriving clusters for small datasets is quite simple, but deriving them for huge datasets requires Hadoop to provide the computation power.
Since the k-means clustering algorithm is already implemented in RHadoop, we are going to use and understand it. You can change its Mapper and Reducer to suit your input dataset format. As we are dealing with Hadoop, the Mapper and Reducer need to be developed so that they run on the nodes in parallel.
The outline of the clustering algorithm is as follows:
Now we will run kmeans.mr (the k-means MapReduce job) by providing the required parameters. Let's understand its components one by one.
dist.fun: First, we will look at the dist.fun function, which calculates the squared Euclidean distance between a matrix of centers C and a matrix of points P. When tested, it could process 10^6 points and 10^2 centers in five dimensions in approximately 16 seconds.

# distance calculation function
dist.fun =
  function(C, P) {
    apply(
      C,
      1,
      function(x)
        colSums((t(P) - x)^2))
  }
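To see dist.fun in action on a small example, you can check that it returns one squared distance per (point, center) pair; the data below are illustrative, not part of the original listing:

```r
# distance calculation function (as defined above)
dist.fun = function(C, P) {
  apply(C, 1, function(x) colSums((t(P) - x)^2))
}

P <- matrix(rnorm(50), ncol = 5)   # 10 points in 5 dimensions
C <- matrix(rnorm(15), ncol = 5)   # 3 candidate centers

# One row per point, one column per center
D <- dist.fun(C, P)
dim(D)   # 10 x 3 matrix of squared distances
```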
kmeans.map: The Mapper of the k-means MapReduce algorithm computes the distance between the points and all the centers, and returns the closest center for each point. This Mapper runs in iterations based on the following code. In the first iteration, the cluster centers are assigned randomly; in each subsequent iteration, they are calculated based on the minimum distance from all the points of the cluster.

# k-means Mapper
kmeans.map =
  function(., P) {
    nearest = {
      # First iteration - assign random cluster centers
      if(is.null(C))
        sample(
          1:num.clusters,
          nrow(P),
          replace = TRUE)
      # Rest of the iterations, where the clusters are assigned
      # based on the minimum distance from points
      else {
        D = dist.fun(C, P)
        nearest = max.col(-D)
      }
    }
    if(!(combine || in.memory.combine))
      keyval(nearest, P)
    else
      keyval(nearest, cbind(1, P))
  }
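Outside Hadoop, the Mapper's core step of picking the closest center for each point via max.col(-D) can be sketched locally as follows (hypothetical data; dist.fun as defined earlier):

```r
dist.fun = function(C, P) {
  apply(C, 1, function(x) colSums((t(P) - x)^2))
}

P <- rbind(c(0, 0), c(0.1, 0), c(5, 5))  # two points near the origin, one far away
C <- rbind(c(0, 0), c(5, 5))             # two centers

D <- dist.fun(C, P)       # 3 x 2 matrix of squared distances
# Negating D turns "minimum distance" into "maximum", which
# max.col finds: the index of the closest center per point
nearest <- max.col(-D)
nearest
# 1 1 2
```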
kmeans.reduce: The Reducer of the k-means MapReduce algorithm computes, for each key (cluster), the column averages of the matrix of points assigned to it.

# k-means Reducer
kmeans.reduce = {
  # calculating the column average for both of the
  # conditions
  if (!(combine || in.memory.combine))
    function(., P)
      t(as.matrix(apply(P, 2, mean)))
  else
    function(k, P)
      keyval(
        k,
        t(as.matrix(apply(P, 2, sum))))
}
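The Reducer's column-average step can be checked locally; this sketch (with illustrative data) shows how the points of one cluster collapse to a single centroid row:

```r
# Points currently assigned to one cluster (one row per point)
P <- rbind(c(1, 2), c(3, 4), c(5, 6))

# Column means, shaped as a single-row matrix (the new center)
center <- t(as.matrix(apply(P, 2, mean)))
center
# a 1 x 2 matrix: 3 4
```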
kmeans.mr: Defining the k-means MapReduce function involves specifying several input parameters, which are as follows:

P: This denotes the input data points
num.clusters: This is the total number of clusters
num.iter: This is the total number of iterations to be processed with datasets
combine: This will decide whether the Combiner should be enabled or disabled (TRUE or FALSE)
in.memory.combine: This will decide whether the combining is performed in memory (TRUE or FALSE)

# k-means MapReduce
kmeans.mr =
  function(
    P,
    num.clusters,
    num.iter,
    combine,
    in.memory.combine) {
    C = NULL
    for(i in 1:num.iter) {
      C =
        values(
          # Loading hdfs dataset
          from.dfs(
            # MapReduce job, with specification of input dataset,
            # Mapper and Reducer
            mapreduce(
              P,
              map = kmeans.map,
              reduce = kmeans.reduce)))
      if(combine || in.memory.combine)
        C = C[, -1]/C[, 1]
      if(nrow(C) < num.clusters) {
        C =
          rbind(
            C,
            matrix(
              rnorm(
                (num.clusters - nrow(C)) * nrow(C)),
              ncol = nrow(C)) %*% C)
      }
    }
    C
  }
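For readers without a Hadoop cluster, the same iterate of map (assign points) and reduce (average each group) can be sketched in plain R. This is a simplified in-memory analogue of what kmeans.mr does, not the RHadoop implementation; kmeans.local and its initialization scheme are illustrative inventions:

```r
dist.fun = function(C, P) {
  apply(C, 1, function(x) colSums((t(P) - x)^2))
}

# In-memory analogue of one kmeans.mr run (hypothetical helper)
kmeans.local <- function(P, num.clusters, num.iter) {
  # Initialize centers from randomly chosen data points
  C <- P[sample(nrow(P), num.clusters), , drop = FALSE]
  for (i in 1:num.iter) {
    # "map" step: closest center per point
    nearest <- max.col(-dist.fun(C, P))
    # "reduce" step: column average of each group of points
    C <- do.call(rbind, lapply(
      split.data.frame(P, nearest),
      colMeans))
  }
  C
}

set.seed(1)
P <- matrix(rnorm(200), ncol = 2)
centers <- kmeans.local(P, num.clusters = 3, num.iter = 5)
dim(centers)   # up to 3 rows (a cluster can go empty), 2 columns
```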
# Input data points
P =
  do.call(
    rbind,
    rep(
      list(
        # Generating Matrix of
        matrix(
          # Generate random normalized data with sd = 10
          rnorm(10, sd = 10),
          ncol = 2)),
      20)) +
  matrix(rnorm(200), ncol = 2)
Now, run kmeans.mr (the k-means MapReduce job) by providing it with the required parameters.

# Running the kmeans.mr Hadoop MapReduce algorithm with
# the required input parameters
kmeans.mr(
  to.dfs(P),
  num.clusters = 12,
  num.iter = 5,
  combine = FALSE,
  in.memory.combine = FALSE)