Chapter 2. Hello World!

In the first chapter, you were acquainted with some rudimentary concepts regarding data processing, clustering, and classification. This chapter is dedicated to the creation and maintenance of a flexible end-to-end workflow to train and classify data. The first section of the chapter introduces a data-centric (functional) approach to creating number-crunching applications.

You will learn how to:

  • Apply the concept of monadic design to create dynamic workflows
  • Leverage some of Scala's advanced functional features, such as dependency injection, to build portable computational workflows
  • Take into account the bias-variance trade-off in selecting a model
  • Overcome overfitting in modeling
  • Break down data into training, test, and validation sets
  • Implement model validation in Scala using precision, recall, and F score

Modeling

Data is the lifeline of any scientist, and the selection of data providers is critical in developing or evaluating any statistical inference or machine learning algorithm.

A model by any other name

We briefly introduced the concept of a model in the Model categorization section in Chapter 1, Getting Started.

What constitutes a model? Wikipedia provides a reasonably good definition of a model as understood by scientists [2:1]:

A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.

Models that are rendered in software allow scientists to leverage computational power to simulate, visualize, manipulate, and gain intuition about the entity, phenomenon, or process being represented.

In statistics and probability theory, a model describes the data one might observe from a system, capturing the uncertainty and noise in those observations. A model allows us to infer rules, make predictions, and learn from data.

A model is composed of features, also known as attributes or variables, and a set of relations between those features. For instance, the model represented by the function f(x, y) = x.sin(2y) has two features, x and y, and a relation, f. These two features are assumed to be independent. If the model is subject to a constraint such as f(x, y) < 20, then this independence no longer holds.
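As a rough Scala sketch (the object and method names here are purely illustrative, not part of any library), this model can be expressed as a function over its two features, with the constraint applied as a filter:

object ModelSketch {
  // The relation f between the two features, x and y
  val f: (Double, Double) => Double = (x, y) => x * math.sin(2 * y)

  // A constraint such as f(x, y) < 20 couples the two features:
  // only values that satisfy the constraint are accepted.
  def constrained(x: Double, y: Double): Option[Double] = {
    val value = f(x, y)
    if (value < 20.0) Some(value) else None
  }
}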

An astute Scala programmer would associate a model with a monoid for which the set is a group of observations and the operator is the function implementing the model. If it walks like a monoid and quacks like a monoid, then it is a monoid.
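For reference, a minimal monoid can be sketched in Scala as follows (the trait and instance names are illustrative only):

// An illustrative minimal monoid: an associative binary operator with an identity element.
trait Monoid[T] {
  def zero: T                 // identity element
  def op(a: T, b: T): T       // associative binary operator
}

object ObservationMonoid {
  // A monoid over observations (vectors of features), combined element-wise.
  // The empty vector acts as the identity so that op(zero, v) == v.
  val instance: Monoid[Vector[Double]] = new Monoid[Vector[Double]] {
    def zero: Vector[Double] = Vector.empty
    def op(a: Vector[Double], b: Vector[Double]): Vector[Double] =
      if (a.isEmpty) b
      else if (b.isEmpty) a
      else a.zip(b).map { case (u, v) => u + v }
  }
}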

Models come in a variety of shapes and forms:

  • Parametric: This consists of functions and equations (for example, y = sin(2t + w))
  • Differential: This consists of ordinary and partial differential equations (for example, dy = 2x.dx)
  • Probabilistic: This consists of probability distributions (for example, p(x|c) = exp(k.log x – x)/x!)
  • Graphical: This consists of graphs that abstract out the conditional independence between variables (for example, p(x,y|c) = p(x|c).p(y|c))
  • Directed graphs: This consists of temporal and spatial relationships (for example, a scheduler)
  • Numerical methods: This consists of finite elements and iterative methods such as Newton-Raphson
  • Chemistry: This consists of formulae and components (for example, H2O, 2Fe + 3Cl2 = 2FeCl3, and so on)
  • Taxonomy: This consists of a semantic definition and relationship of concepts (for example, APG/Eudicots/Rosids/Huaceae/Malvales)
  • Grammar and lexicon: This consists of a syntactic representation of documents (for example, the Scala programming language)
  • Inference logic: This consists of rules such as IF (stock vol > 1.5 * average) AND rsi > 80 THEN… (a minimal sketch of such a rule follows this list)
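To make the last category concrete, such an inference rule can be sketched in Scala as a simple predicate over market data; the StockData class, its fields, and the thresholds below are hypothetical, introduced only for illustration:

// Hypothetical observation: a daily stock record with a trading volume
// and a relative strength index (RSI).
case class StockData(volume: Double, rsi: Double)

object InferenceRule {
  // IF (stock volume > 1.5 * average volume) AND rsi > 80 THEN flag the stock
  def isFlagged(stock: StockData, averageVolume: Double): Boolean =
    stock.volume > 1.5 * averageVolume && stock.rsi > 80.0
}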

Model versus design

Confusion between a model and a design is quite common in computer science because the two terms mean different things to different people, depending on the subject. The following metaphors should help you understand the two concepts:

  • Modeling: This is describing something you know. A model makes an assumption, which becomes an assertion if proven correct (for example, the US population, p, increases by 1.2 percent a year, p(t+1) = 1.012 * p(t)).
  • Designing: This is manipulating the representation of things you don't know. Designing can be seen as the exploration phase of modeling (for example, which features contribute to the growth of the US population? Birth rate? Immigration? Economic conditions? Social policies?).

Selecting a model's features

The selection of a model's features is the process of discovering and documenting the minimum set of variables required to build the model. Scientists make the assumption that data contains many redundant or irrelevant features: redundant features do not provide information beyond that already conveyed by the selected features, and irrelevant features provide no useful information at all.

Selecting features consists of two consecutive steps:

  1. Searching for new feature subsets.
  2. Evaluating these feature subsets using a scoring mechanism.

The process of evaluating each possible subset of features to find the one that maximizes the objective function or minimizes the error rate is computationally intractable for large datasets. A model with n features requires 2^n - 1 evaluations.
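A naive, exhaustive version of this search is easy to write but illustrates why the problem does not scale. The following sketch (the names are illustrative, and the scoring function is left to the caller) enumerates every non-empty subset of feature indices and keeps the one with the highest score:

object FeatureSubsetSearch {
  // Exhaustive search over the 2^n - 1 non-empty subsets of n feature indices.
  // 'score' is a user-supplied objective function to be maximized,
  // for example a cross-validated accuracy.
  def bestSubset(n: Int, score: Set[Int] => Double): Set[Int] = {
    val allSubsets = (1 to n).flatMap(k => (0 until n).combinations(k)).map(_.toSet)
    allSubsets.maxBy(score)
  }
}

Even a modest model with 20 features would force this loop through 2^20 - 1 = 1,048,575 candidate subsets, which is why practical implementations rely on greedy or heuristic search rather than exhaustive enumeration.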

Extracting features

An observation is a set of indirect measurements of hidden, also known as latent, variables, which may be noisy or exhibit a high degree of correlation and redundancy. Using raw observations in a classification task would very likely produce inaccurate classes. Using all the features from the observations also incurs a high computation cost.

The purpose of extracting features is to reduce the number of variables or dimensions of the model by eliminating redundant or irrelevant features. The features are extracted by transforming the original set of observations into a smaller set at the risk of losing some vital information embedded in the original set.
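Viewed from the code, feature extraction is just a transformation from the original observations to a lower-dimensional representation. The following sketch (the trait, class, and method names are illustrative, not taken from any library) captures this idea with a simple linear projection:

object FeatureExtraction {
  type Observation = Vector[Double]

  // A feature extractor transforms an observation of dimension n into a
  // smaller vector of dimension m < n, at the risk of losing information.
  trait FeatureExtractor {
    def extract(obs: Observation): Observation
  }

  // Example: a fixed linear projection in which each extracted feature is a
  // weighted sum of the original features, as in PCA-style dimension reduction.
  final class LinearProjection(weights: Vector[Vector[Double]]) extends FeatureExtractor {
    def extract(obs: Observation): Observation =
      weights.map(row => row.zip(obs).map { case (w, x) => w * x }.sum)
  }
}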
