Chapter 3. Data Preprocessing

Real-world data is usually noisy, inconsistent, and missing observations. No classification, regression, or clustering model can reliably extract relevant information from such unprocessed data.

Data preprocessing consists of cleaning, filtering, transforming, and normalizing raw observations using statistics in order to correlate features or groups of features, identify trends and models, and filter out noise. The purpose of cleansing raw data is twofold:

  • Extract some basic knowledge from raw datasets
  • Evaluate the quality of data and generate clean datasets for unsupervised or supervised learning

You should not underestimate the power of traditional statistical analysis methods to infer and classify information from textual or unstructured data.

In this chapter, you will learn how to:

  • Apply commonly used moving average techniques to detect long-term trends in a time series
  • Identify market and sector cycles using discrete Fourier series
  • Leverage the Kalman filter to extract the state of a dynamic system from incomplete and noisy observations

Time series

The overwhelming majority of examples used to illustrate the different machine learning algorithms in this book process time series or other sequential data, whether ordered or unordered.

Each library has its own container type to manipulate datasets. The challenge is to define all possible conversions between types from the different libraries needed to implement a large variety of machine learning models. Such a strategy may result in a combinatorial explosion of implicit conversions. A solution consists of creating a generic class to manage conversion from and to any type used by a third-party library.

Note

scala.collection.JavaConversions._

Scala provides a standard package to convert collection types from Scala to Java and vice versa.
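
As a minimal sketch of such a conversion, the following example uses scala.jdk.CollectionConverters, the explicit converter API available from Scala 2.13 onward (JavaConversions was deprecated in Scala 2.12 and removed in 2.13). The object and value names are hypothetical and serve only as illustration.

```scala
import scala.jdk.CollectionConverters._
import java.util.{ArrayList => JArrayList, List => JList}
import scala.collection.mutable

// Hypothetical example object; the names are not part of the book's library.
object CollectionConversionSketch {
  // Scala -> Java: asJava wraps the Scala sequence in a java.util.List view
  val prices: Seq[Double] = Seq(10.5, 11.0, 10.75)
  val javaPrices: JList[Double] = prices.asJava

  // Java -> Scala: asScala wraps the Java list in a mutable Scala Buffer view
  private val javaSymbols = new JArrayList[String]()
  javaSymbols.add("IBM")
  javaSymbols.add("GOOG")
  val scalaSymbols: mutable.Buffer[String] = javaSymbols.asScala
}
```

Both conversions create wrappers rather than copies, which is why the explicit asJava and asScala calls are generally preferred over conversions applied silently by the compiler.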

The generic data transformation, DT, can be used to transform any XTSeries time series:

// Declared abstract: a concrete transformation supplies the body of |>
abstract class DT[T, U] extends PipeOperator[XTSeries[T], XTSeries[U]] {
  override def |> : PartialFunction[XTSeries[T], XTSeries[U]]
}
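
To make the shape of such a transformation concrete, here is a minimal, self-contained sketch. It assumes that PipeOperator declares a single |> method returning a partial function, as the DT signature suggests. The Series and ElementWise types are hypothetical stand-ins and are not the book's XTSeries or DT implementation.

```scala
object PipeSketch {
  // Assumed shape of the PipeOperator trait; not copied from the book's library
  trait PipeOperator[-T, +U] {
    def |> : PartialFunction[T, U]
  }

  // Hypothetical, simplified stand-in for XTSeries
  final case class Series[T](label: String, values: Vector[T])

  // Transforms a Series[T] into a Series[U] by applying f to each element.
  // The partial function is undefined for an empty series.
  class ElementWise[T, U](f: T => U)
      extends PipeOperator[Series[T], Series[U]] {
    override def |> : PartialFunction[Series[T], Series[U]] = {
      case s if s.values.nonEmpty => Series(s.label, s.values.map(f))
    }
  }

  val doubling = new ElementWise[Double, Double](x => x * 2.0)
  val input = Series("price", Vector(1.0, 2.0))
  val output: Series[Double] = doubling |> input
}
```

Returning a PartialFunction lets the caller check with isDefinedAt whether the input is acceptable before applying the transformation, instead of relying on exceptions.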

Let's consider the simple case of two Java libraries, the Apache Commons Math framework and JFreeChart (for visualization), and define a parameterized time series class, XTSeries[T]. The |> data transformation converts a time series of values of type T, XTSeries[T], into a time series of values of type U, XTSeries[U]. The following diagram provides an overview of type conversion in data transformation:

[Figure: type conversion in data transformation for time series]

Let's create the XTSeries class. As a container, the class should implement Scala higher-order collection functions such as map, foreach, or zip. The class should support at least conversion to the DblVector and DblMatrix types introduced in the first chapter.

Here is a partial implementation of the XTSeries class. Comments, exceptions, argument validations, and debugging code are omitted in the code:

import scala.annotation.implicitNotFound
import scala.reflect.ClassTag

class XTSeries[@specialized(Double) T](label: String, arr: Array[T]) { // 1
  def apply(n: Int): T = arr.apply(n)

  @implicitNotFound("Undefined conversion to DblVector") // 2
  def toDblVector(implicit f: T => Double): DblVector = arr.map(f(_))

  @implicitNotFound("Undefined conversion to DblMatrix") // 2
  def toDblMatrix(implicit fv: T => DblVector): DblMatrix = arr.map(fv(_))

  def + (n: Int, t: T)(implicit f: (T, T) => T): T = f(arr(n), t)

  def head: T = arr.head // 3
  def drop(n: Int): XTSeries[T] = XTSeries(label, arr.drop(n))
  def map[U: ClassTag](f: T => U): XTSeries[U] = XTSeries[U](label, arr.map(x => f(x)))
  def foreach(f: T => Unit) = arr.foreach(f) // 3
  def sortWith(lt: (T, T) => Boolean): XTSeries[T] = XTSeries[T](label, arr.sortWith(lt))
  def max(implicit cmp: Ordering[T]): T = arr.max // 4
  def min(implicit cmp: Ordering[T]): T = arr.min
  …
}

The class takes an optional label and an invariant array of the parameterized type T. The annotation @specialized (line 1) instructs the compiler to generate two versions of the class:

  • A generic XTSeries[T] class that exploits all the implicit conversions required to perform operations on time series of a generic type
  • An optimized XTSeries[Double] class that bypasses the conversion and offers client code a faster implementation

The conversion to DblVector (resp. DblMatrix) relies on the implicit conversion of elements to type Double (resp. DblVector) (line 2). The @implicitNotFound annotation instructs the compiler to emit the specified error message if no implicit conversion is detected. The conversion methods are used to implement the implicit conversion introduced in the previous section. These methods are defined in the singleton org.scalaml.core.Types.CommonMath. The following code shows the implementation of the conversion methods:

object Types {
  object CommonMath {
    implicit def series2DblVector[T](xt: XTSeries[T])(implicit f: T => Double): DblVector =
      xt.toDblVector(f)
    implicit def series2DblMatrix[T](xt: XTSeries[T])(implicit f: T => DblVector): DblMatrix =
      xt.toDblMatrix(f)
  }
}
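
The two levels of conversion, element to Double and series to DblVector, can be illustrated with a minimal, self-contained sketch. The Series class, the Quote type, and the mean function are hypothetical stand-ins introduced only for this example. DblVector is assumed to be an alias for Array[Double].

```scala
import scala.language.implicitConversions

object ImplicitConversionSketch {
  type DblVector = Array[Double] // assumed alias, as in the first chapter

  // Hypothetical element type and simplified stand-in for XTSeries
  final case class Quote(close: Double)

  class Series[T](val label: String, arr: Array[T]) {
    def toDblVector(implicit f: T => Double): DblVector = arr.map(f(_))
  }

  // Element-level conversion supplied by the client code
  implicit val quoteToDouble: Quote => Double = _.close

  // Series-level conversion, mirroring series2DblVector
  implicit def series2DblVector[T](xt: Series[T])(implicit f: T => Double): DblVector =
    xt.toDblVector(f)

  // A function that only understands DblVector
  def mean(xs: DblVector): Double = xs.sum / xs.length

  val quotes = new Series[Quote]("close", Array(Quote(10.0), Quote(11.0), Quote(12.0)))
  val explicitVector: DblVector = quotes.toDblVector // element conversion resolved implicitly
  val average: Double = mean(quotes)                 // series converted implicitly
}
```

If no conversion from the element type to Double is in scope, the call to toDblVector fails at compile time instead of at run time.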

The XTSeries class exposes a subset of the Scala higher-order collection methods (line 3) applied to the time series. The computation of the minimum and maximum values in the time series requires that the cmp ordering be defined for elements of type T (line 4).
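
The role of the implicit ordering can be sketched as follows, assuming a hypothetical Quote element type and a simplified Series container in place of XTSeries. The ordering is built with Ordering.fromLessThan from the Scala standard library.

```scala
object OrderingSketch {
  // Hypothetical element type with no natural ordering
  final case class Quote(symbol: String, close: Double)

  // Client code decides how quotes are compared: here, by closing price
  implicit val byClose: Ordering[Quote] =
    Ordering.fromLessThan[Quote]((a, b) => a.close < b.close)

  // Simplified stand-in for XTSeries exposing only max and min
  class Series[T](arr: Array[T]) {
    def max(implicit cmp: Ordering[T]): T = arr.max
    def min(implicit cmp: Ordering[T]): T = arr.min
  }

  val quotes = new Series(Array(Quote("A", 12.5), Quote("B", 9.0), Quote("C", 15.25)))
  val highest: Quote = quotes.max // uses byClose
  val lowest: Quote = quotes.min  // uses byClose
}
```

Without an Ordering[Quote] in implicit scope, the calls to max and min do not compile, which is the behavior described for line 4.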

Let's put our versatile XTSeries class to use in creating a basic preprocessing data transformation starting with the ubiquitous moving average techniques.
