Time series analysis using the Cloudera Spark-TS package

As discussed in Chapter 9, Advanced Machine Learning with Streaming and Graph Data, this section shows how to configure the Spark-TS package developed by Cloudera. We will focus mainly on the TimeSeriesRDD abstraction.

Time series data

Time series data consists of sequences of measurements, each occurring at a point in time. A variety of terms are used to describe such data, and many of them refer to conflicting or overlapping concepts. In the interest of clarity, Spark-TS sticks to a particular vocabulary built around three concepts: time series, instant, and observation:

  • A time series is a sequence of real (that is, floating-point) values, each linked to a specific timestamp. In Spark-TS, the term always refers to a univariate time series. In Scala, a time series is usually represented by a Breeze vector (see https://github.com/scalanlp/breeze), and in Python by a 1-D NumPy array (refer to http://www.numpy.org/ for more); in both cases it is accompanied by a DateTimeIndex, as shown at https://github.com/sryza/spark-timeseries/blob/master/src/main/scala/com/cloudera/sparkts/DateTimeIndex.scala.
  • On the other hand, an instant is the vector of values in a collection of time series corresponding to a single point in time. In the Spark-TS library, each time series is typically labeled with a key that enables it to be identified among a collection of time series.
  • Finally, an observation is a tuple of (timestamp, key, value), that is, a single value in a time series or instant.

However, not all data with timestamps is time series data. Logs, for example, don't fit directly into time series because they consist of discrete events rather than scalar measurements taken at regular intervals. A count of log messages per hour, however, would constitute a time series.
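The three concepts above can be illustrated with a minimal sketch in plain Python and NumPy. Note that this is not the Spark-TS API; the keys, dates, and values are purely illustrative, and only the vocabulary (shared index, keyed univariate series, instants, and observation tuples) mirrors the library's.

```python
# A minimal sketch (plain Python + NumPy, not the Spark-TS API) of the three
# concepts: time series, instant, and observation. All data is illustrative.
import numpy as np

# A shared date-time index, analogous to the single DateTimeIndex in Spark-TS.
index = ["2016-01-01", "2016-01-02", "2016-01-03"]

# Two univariate time series, each labeled with a key and stored as a
# 1-D NumPy array conforming to the shared index.
series = {
    "GOOG": np.array([741.8, 742.6, 743.6]),
    "AAPL": np.array([105.3, 102.7, 100.7]),
}

# An instant: the vector of values across all series at one point in time.
instant = {key: float(values[1]) for key, values in series.items()}

# Observations: the same data flattened into (timestamp, key, value) tuples.
observations = [
    (index[i], key, float(values[i]))
    for key, values in series.items()
    for i in range(len(values))
]

print(instant)
print(observations[:3])
```

Here the dictionary of keyed arrays plays the role of a collection of time series, `instant` is the cross-section at one timestamp, and each element of `observations` is a single (timestamp, key, value) triple.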

Configuring Spark-TS

The most straightforward way to access Spark-TS from Scala is to depend on it in a Maven project. Do this by including the following repo in pom.xml:

<repositories> 
    <repository> 
      <id>cloudera-repos</id> 
      <name>Cloudera Repos</name> 
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url> 
    </repository> 
</repositories> 

Then, include the following dependency in the same pom.xml:

<dependency> 
      <groupId>com.cloudera.sparkts</groupId> 
      <artifactId>sparkts</artifactId> 
      <version>0.1.0</version> 
</dependency> 

To get the raw pom.xml file, interested readers should go to the following URL:

https://github.com/sryza/spark-timeseries/blob/master/pom.xml

Alternatively, to use Spark-TS in spark-shell, download the JAR from https://repository.cloudera.com/cloudera/libs-release-local/com/cloudera/sparkts/sparkts/0.1.0/sparkts-0.1.0-jar-with-dependencies.jar, and then launch the shell with the following command, as discussed in the Using external libraries with Spark Core section earlier in this chapter:

spark-shell 
      --jars sparkts-0.1.0-jar-with-dependencies.jar 
      --driver-class-path sparkts-0.1.0-jar-with-dependencies.jar

TimeSeriesRDD

According to the Spark-TS engineering blog post on the Cloudera website (http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/), TimeSeriesRDD is central to Spark-TS: each object in the RDD stores a full univariate series. This layout makes operations that apply to whole series much more efficient. For example, to generate a set of lagged time series from an original collection, each lagged series can be computed by looking at just a single record in the input RDD.

Similarly, when imputing missing values based on surrounding values, or fitting time series models to each series, all of the required data is present in a single array. The central abstraction of the Spark-TS library is therefore TimeSeriesRDD, which is simply a collection of time series on which you can operate in a distributed fashion. This approach avoids storing a timestamp for every value; instead, a single DateTimeIndex is stored, to which all the series vectors conform. TimeSeriesRDD[K] extends RDD[(K, Vector[Double])], where K is the key type (usually a string) and the second element of the tuple is a Breeze vector representing the time series.
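The efficiency argument can be made concrete with a small sketch. In the TimeSeriesRDD layout, one record already holds a key and the full series array, so a lag-k copy requires only that one array, with no data movement between records. The `lag` helper below is hypothetical (written in plain NumPy, not the Spark-TS API) and simply pads the head with NaN:

```python
# A sketch (plain NumPy, hypothetical helper, not the Spark-TS API) of why
# lagging is cheap when each record holds a full univariate series: the
# lag-k version is computed from that single array alone.
import numpy as np

def lag(series: np.ndarray, k: int) -> np.ndarray:
    """Shift a series forward by k steps, padding the head with NaN."""
    return np.concatenate([np.full(k, np.nan), series[:-k]])

# One (key, vector) record, mirroring the RDD[(K, Vector[Double])] shape.
key, values = "GOOG", np.array([1.0, 2.0, 3.0, 4.0])
lagged_key, lagged = key + "_lag1", lag(values, 1)
print(lagged_key, lagged)
```

Because the operation is local to a single record, a distributed map over the collection of series parallelizes it trivially, which is exactly the property the TimeSeriesRDD layout is designed to exploit.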

A more technical discussion can be found in the GitHub repository at https://github.com/sryza/spark-timeseries. Since Spark-TS is a third-party package, a more detailed discussion is beyond the scope of this book.
