Introducing stream clustering

Clustering is the task of partitioning a set of observations (tuples) into groups (clusters) so that records within a cluster are similar and records in different clusters are dissimilar. Several approaches exist for clustering data at rest. Streaming data, by contrast, keeps arriving at some rate, so we don't have the luxury of accessing the data randomly or making multiple passes over it. Many data stream clustering algorithms therefore use a two-phase scheme: an online component that processes the incoming stream points and produces summary statistics, and an offline component that uses the summary data to generate the clusters.

This online/offline two-stage processing is the framework most commonly adopted by stream clustering algorithms.

Before we go on to explain the online/offline two-stage process, let us quickly look at micro-clusters.

Micro-clusters are created in a single pass over the data. As each data point arrives, it is assigned to the closest micro-cluster. You can think of a micro-cluster as a summary of similar data points. The summary is typically stored as a cluster center and the local density around it, and may include further statistics, such as variance. If an incoming data point cannot be allocated to any of the existing micro-clusters, a new micro-cluster is formed with that data point.

The online step deals with micro-cluster formation. As the data arrives, either new micro-clusters are created or points are assigned to existing micro-clusters.
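
To make the online step concrete, here is a minimal Python sketch of micro-cluster formation. It is an illustration under stated assumptions, not the implementation of any particular algorithm or library: the names MicroCluster and online_step are hypothetical, and the rule used to decide that a point cannot be allocated (its distance to the nearest center exceeds a fixed radius) is one simple choice among several. Each micro-cluster keeps a count, a per-dimension sum, and a per-dimension sum of squares, from which the center and variance can be derived; the count serves as a crude stand-in for local density.

```python
import math


class MicroCluster:
    """Hypothetical summary of nearby points: count, sum, and sum of squares."""

    def __init__(self, point):
        self.n = 1                                  # number of points absorbed
        self.linear_sum = list(point)               # per-dimension sum
        self.square_sum = [x * x for x in point]    # per-dimension sum of squares

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        # E[x^2] - E[x]^2 per dimension, clamped at 0 against rounding error
        return [
            max(sq / self.n - (s / self.n) ** 2, 0.0)
            for s, sq in zip(self.linear_sum, self.square_sum)
        ]

    def absorb(self, point):
        # Update the summary in place; the raw point is not stored
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x


def online_step(stream, radius):
    """Single pass over the stream. The radius threshold is an assumption."""
    micro_clusters = []
    for point in stream:
        # Find the closest existing micro-cluster, if any
        nearest, nearest_dist = None, math.inf
        for mc in micro_clusters:
            d = math.dist(point, mc.center())
            if d < nearest_dist:
                nearest, nearest_dist = mc, d
        if nearest is not None and nearest_dist <= radius:
            nearest.absorb(point)                        # assign to existing
        else:
            micro_clusters.append(MicroCluster(point))   # start a new one
    return micro_clusters
```

Calling online_step(points, radius=1.0) on any iterable of numeric tuples returns the list of summaries. Because each point is read once and then discarded, memory grows with the number of micro-clusters rather than with the length of the stream.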
