Scala: Applied Machine Learning
by Alex Kozlov, Patrick R. Nicolas, Pascal Bugnion
References
Wikipedia gives an overview of semantic URLs at https://en.wikipedia.org/wiki/Semantic_URL; see also http://apiux.com/2013/04/03/url-design-restful-web-services/.
For a much more in-depth discussion of the Play framework, I suggest Play Framework Essentials by Julien Richard-Foy.
REST in Practice: Hypermedia and Systems Architecture, by Jim Webber, Savas Parastatidis, and Ian Robinson, describes how to architect REST APIs.