Index
A
- ARFF (attribute-relation file format), streamDM Introduction
- actions, The First Wave: Functional APIs, Spark Components
- advanced stateful operations
- advanced streaming techniques
- aggregations
- Akka, Where to Find More Sources
- algorithms (see also real-time machine learning)
- approximation algorithms, Approximation Algorithms
- Count-Min Sketches (CMS), Counting Element Frequency: Count Min Sketches-Computing Frequencies with a Count-Min Sketch
- exact versus approximation, Streaming Approximation and Sampling Algorithms
- exactness, real-time, and big data triangle, Exactness, Real Time, and Big Data-Big Data and Real Time
- hashing and sketching, Hashing and Sketching: An Introduction
- HyperLogLog (HLL), Counting Distinct Elements: HyperLogLog-Practical HyperLogLog in Spark
- LogLog, Role-Playing Exercise: If We Were a System Administrator
- sampling algorithms, Reducing the Number of Elements: Sampling
- streaming versus batch, Streaming Versus Batch Algorithms-Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms, Batch Analytics-Exploring the Data
- T-Digest, Ranks and Quantiles: T-Digest-T-Digest in Spark
- Amazon Kinesis, Amazon Kinesis on AWS
- Apache Bahir library, Where to Find More Sources
- Apache Beam, Apache Beam/Google Cloud Dataflow
- Apache CouchDB/Cloudant, Where to Find More Sources
- Apache Flink, Apache Flink
- Apache Kafka, Structured Streaming in Action, The Receiverless or Direct Model
- Apache Spark (see also Spark Streaming API; Structured Streaming API)
- as a stream-processing engine, Apache Spark as a Stream-Processing Engine-Fast Implementation of Data Analysis
- benefits of, Introducing Apache Spark
- community support, Stay Plugged In
- components, Spark Components
- DataFrames and Datasets, The Second Wave: SQL
- distributed processing model, Spark’s Distributed Processing Model-The Disappearance of the Batch Interval
- installing, Installing Spark
- memory model of, The First Wave: Functional APIs, Spark’s Memory Usage
- resilience model of, Spark’s Resilience Model-Summary
- resources for learning, To Learn More About Spark
- unified programming model of, A Unified Engine
- version used, Installing Spark
- Apache Spark Project, Contributing to the Apache Spark Project
- Apache Storm, Apache Storm-Compared to Spark
- append mode, outputMode-Understanding the append semantic
- application monitoring (Structured Streaming)
- approximation algorithms, Approximation Algorithms
- arbitrary stateful processing, Advanced Stateful Operations
- arbitrary stateful streaming computation
- at-least-once versus at-most-once processing, Data Delivery Semantics, Understanding Sinks
- Azure Streaming Analytics, Microsoft Azure Stream Analytics
B
- backpressure signaling, Backpressure
- batch intervals
- batch processing
- best-effort execution, Microbatch in Structured Streaming
- billing modernization, Some Examples of Stream Processing
- bin-packing problem, Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
- block intervals, The Bulk-Synchronous Architecture
- Bloom filters, Introducing Bloom Filters
- bounded data, What Is Stream Processing?
- broadcast joins, Join Optimizations
- bulk-synchronous processing (BSP), Examples of Cluster Managers, Microbatching and One-Element-at-a-Time, The Bulk-Synchronous Architecture-The Bulk-Synchronous Architecture
C
- cache hints, Cache Hints
- caching, Caching
- car fleet management solution, Example: Car Fleet Management
- checkpointing
- alternatives to tuning, Checkpoint Tuning
- basics of, Understanding the Use of Checkpoints-Understanding the Use of Checkpoints
- benefits of, Checkpointing
- checkpoint tuning, Checkpoint Tuning-Checkpoint Tuning
- cost of, The Cost of Checkpointing
- DStreams, Checkpointing DStreams
- influence on processing time, Checkpoint Influence in Processing Time
- mandatory on stateful streams, updateStateByKey
- purpose of, Checkpointing
- recovery from checkpoints, Recovery from a Checkpoint
- clock synchronization techniques, Understanding Event Time in Structured Streaming
- cluster deployment mode, Cluster-mode deployment
- cluster managers, Running Apache Spark with a Cluster Manager-Spark’s Own Cluster Manager, Cluster Manager Support for Fault Tolerance
- code examples, obtaining and using, Using Code Examples
- collision resistance, Hashing and Sketching: An Introduction
- comments and questions, How to Contact Us
- complete mode, outputMode
- component reuse, Fast Implementation of Data Analysis
- Console sink
- ConstantInputDStream, A Simpler Alternative to the Queue Source: The ConstantInputDStream-ConstantInputDStream as a random data generator
- consuming the stream, Sources and Sinks
- containers, Spark’s Own Cluster Manager
- continuous processing, Understanding Continuous Processing-Limitations
- countByValueAndWindow, countByValueAndWindow
- countByWindow, countByWindow
- counting transformations, Counting
- Count-Min Sketches (CMS), Counting Element Frequency: Count Min Sketches-Computing Frequencies with a Count-Min Sketch
- cryptographically secure hash functions, Hashing and Sketching: An Introduction
- CSV File sink format, The CSV Format of the File Sink
- CSV File source format, Specifying a File Format, CSV File Source Format
- cumulative distribution function (CDF), Ranks and Quantiles: T-Digest
D
- data
- at rest, The Notion of Time in Stream Processing, Dealing with Data at Rest-Using Join to Enrich the Input Stream
- bounded data, What Is Stream Processing?
- exactly once data delivery, Understanding Sinks
- in motion, The Notion of Time in Stream Processing
- limiting data ingress, Limiting the Data Ingress with Fixed-Rate Throttling
- outputting to screens, Workarounds
- outputting to sinks, Sinks: Output the Resulting Data-start()
- streams of data, What Is Stream Processing?
- structured data, Introducing Structured Streaming
- unbounded data, What Is Stream Processing?, Introducing Structured Streaming
- data lakes, The File Sink
- data platform components, Components of a Data Platform
- data sources, Sources and Sinks
- data streams, Introducing Structured Streaming
- DataFrame API, Introducing Structured Streaming, Transforming Streaming Data-Workarounds
- Dataset API, The Second Wave: SQL, Introducing Structured Streaming, Transforming Streaming Data-Workarounds
- date and time formats, Date and time formats, Common Time and Date Formatting (CSV, JSON)
- decision trees, Introducing Decision Trees-Hoeffding Trees in Spark, in Practice
- delivery guarantees, Examples of Cluster Managers
- Directed Acyclic Graph (DAG), Resilient Distributed Datasets in Spark
- distributed commit log, The Receiverless or Direct Model
- distributed in-memory cache, External Factors that Influence the Job’s Performance
- distributed processing model
- distributed stream-processing
- domain specific language (DSL), Spark SQL
- double-answering problem, Data Delivery Semantics
- driver failure recovery, Driver Failure Recovery
- driver/executor architecture, Data Delivery Semantics
- DStream (Discretized Stream)
- duplicates, removing, Stream deduplication, Understanding Sinks, Record Deduplication
- dynamic batch interval, Dynamic Batch Interval, The Disappearance of the Batch Interval
- dynamic throttling
F
- failure recovery, Failure Recovery
- fault recovery, Fault Recovery
- fault tolerance, The Lesson Learned: Scalability and Fault Tolerance, Fast Implementation of Data Analysis, Understanding Resilience and Fault Tolerance in a Distributed System-Cluster Manager Support for Fault Tolerance, Spark’s Fault-Tolerance Guarantees-Checkpointing
- File sink
- configuration options, Common Configuration Options Across All Supported File Formats
- CSV File sink format, The CSV Format of the File Sink
- JSON File sink format, The JSON File Sink Format
- Parquet File sink format, The Parquet File Sink Format
- purpose and types of, The File Sink
- reliability of, Reliable Sinks
- text File sink format, The Text File Sink Format
- time and date formatting, Common Time and Date Formatting (CSV, JSON)
- using triggers with, Using Triggers with the File Sink-Using Triggers with the File Sink
- File source
- common options, Common Options
- common text parsing options, Common Text Parsing Options (CSV, JSON)
- CSV File source format, CSV File Source Format
- data reliability, How It Works
- JSON File source format, JSON File Source Format-JSON parsing options
- operation of, How It Works
- Parquet File source format, Parquet File Source Format
- specifying file formats, Specifying a File Format
- StreamingContext methods, The File Source
- target filesystems, The File Source
- text File source format, Text File Source Format
- uses for, Basic Sources
- file-based streaming sources, Available Sources, Available Sources-text and textFile
- fileNameOnly option, Common Options
- first fit decreasing strategy, Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
- fixed-rate throttling, Limiting the Data Ingress with Fixed-Rate Throttling
- flatMapGroupsWithState, Advanced Stateful Operations, Using FlatMapGroupsWithState-When a timeout actually times out
- fleet management, Some Examples of Stream Processing
- Foreach sink
- foreachBatch sink, format
- foreachRDD output operation, foreachRDD-Using foreachRDD as a Programmable Sink
- forgetfulness, Online Data and K-Means
- format method, Sources: Acquiring Streaming Data, Specifying a File Format
- format option, format, Consuming a Streaming Source
- function calls, Spark Components
- function-passing style, Microbatching: An Application of Bulk-Synchronous Processing
I
- idempotence, Data Delivery Semantics, Understanding Sinks
- Internal Event Bus
- internal state flow, Internal State Flow
- intervals
- batch intervals, Understanding Continuous Processing, DStreams as an Execution Model, The Bulk-Synchronous Architecture, Using Windows Versus Longer Batch Intervals, The Relationship Between Batch Interval and Processing Delay, Tweaking the Batch Interval
- block intervals, The Bulk-Synchronous Architecture
- dynamic batch intervals, Dynamic Batch Interval, The Disappearance of the Batch Interval
- window intervals, Understanding How Intervals Are Computed, Interval offset
- window versus batch, Window Length Versus Batch Interval
- inverse reduce function, Invertible Window Aggregations-Invertible Window Aggregations
- IoT (Internet of Things)-inspired streaming program
K
- K-means clustering
- Kafka, Structured Streaming in Action, The Receiverless or Direct Model
- Kafka sink, Reliable Sinks, The Kafka Sink-Choosing an encoding
- Kafka source, Available Sources, The Kafka Source-Banned configuration options, The Kafka Source-How It Works
- Kafka Streams, Kafka Streams
- Kappa architecture, Architectural Models, The Kappa Architecture
- knapsack problem, Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
M
- machine learning, Machine Learning-Online Training (see also real-time machine learning)
- map-side joins, Join Optimizations
- mapGroupsWithState, Advanced Stateful Operations, Using MapGroupsWithState-Using MapGroupsWithState
- MapReduce, MapReduce-The Lesson Learned: Scalability and Fault Tolerance, The Tale of Two APIs
- mapWithState, Introducing Stateful Computation with mapwithState-Event-Time Stream Computation Using mapWithState
- maxFileAge option, Common Options
- maxFilesPerTrigger option, Common Options
- media recommendations, Some Examples of Stream Processing
- Memory sink, format, Sinks for Experimentation, The Memory Sink
- metrics
- Metrics Subsystem
- microbatching, Understanding Latency, Microbatching and One-Element-at-a-Time-Bringing Microbatch and One-Record-at-a-Time Closer Together, Understanding Continuous Processing
- model serving, The challenge of model serving
- monitoring
- Monitoring REST API
- movie review classifiers, Training a Movie Review Classifier
- MQTT, Where to Find More Sources
- Multinomial Naive Bayes, Streaming Classification with Naive Bayes
- multitenancy, Running Apache Spark with a Cluster Manager
O
- occupancy information, Example: Estimating Room Occupancy by Using Ambient Sensors-Online Training
- offset-based processing, Understanding Sources, Interval offset
- one-at-a-time record processing, One-Record-at-a-Time Processing-Bringing Microbatch and One-Record-at-a-Time Closer Together
- online training/online learning, Online Training
- option method, option
- options method, options
- outer joins, Join Optimizations
- output modes, specifying, Start the Stream Processing
- output operations, Defining Output Operations, Understanding DStream Transformations, Output Operations-Third-Party Output Operations
- outputMode, outputMode
P
- Parquet
- parsing errors, Handling parsing errors
- performance ratio, Streaming Algorithms Are Sometimes Completely Different in Nature
- performance tuning
- pipelining, One-Record-at-a-Time Processing
- print output operation, print
- processing delay
- processing time, The Notion of Time in Stream Processing, Computing on Timestamped Events-Computing with a Watermark, Processing Time, Time-Based Stream Processing
- programming model (Spark Streaming API)
- programming model (Structured Streaming API)
- Proportional-Integral-Derivative (PID) controllers, Dynamic Throttling
- provisioning, Running Apache Spark with a Cluster Manager
- Publish/Subscribe (pub/sub) systems, The Kafka Source-Banned configuration options, The Kafka Sink-Choosing an encoding, The Kafka Source-How It Works, Monitoring Spark Streaming
R
- random data generators, ConstantInputDStream as a random data generator
- random sampling, Random Sampling
- Rate source, Available Sources, The Rate Source
- RDD-centric DStream transformations, RDD-Centric DStream Transformations
- RDDs (Resilient Distributed Datasets), The First Wave: Functional APIs, Resilient Distributed Datasets in Spark, RDDs as the Underlying Abstraction for DStreams-RDDs as the Underlying Abstraction for DStreams
- readStream method, Connecting to a Stream, Sources: Acquiring Streaming Data
- real-time machine learning
- real-time stream processing systems
- Amazon Kinesis, Amazon Kinesis on AWS
- Apache Beam, Apache Beam/Google Cloud Dataflow
- Apache Flink, Apache Flink
- Apache Storm, Apache Storm-Compared to Spark
- Azure Streaming Analytics, Microsoft Azure Stream Analytics
- concepts constraining, Exactness, Real Time, and Big Data-Big Data and Real Time
- Google Cloud Dataflow, Apache Beam/Google Cloud Dataflow
- Kafka Streams, Kafka Streams
- selecting, Other Distributed Real-Time Stream Processing Systems
- receiver model
- record deduplication, Stream deduplication, Record Deduplication
- reduceByKeyAndWindow, reduceByKeyAndWindow
- reduceByWindow, reduceByWindow
- reference datasets
- referential streaming architectures, Referential Streaming Architectures-The Kappa Architecture
- replayability, Reliable Sources Must Be Replayable
- resilience, Understanding Resilience and Fault Tolerance in a Distributed System-Cluster Manager Support for Fault Tolerance, The Internal Data Resilience
- resilience model
- Resilient Distributed Datasets (RDDs), The First Wave: Functional APIs, Resilient Distributed Datasets in Spark, RDDs as the Underlying Abstraction for DStreams-RDDs as the Underlying Abstraction for DStreams
- restarts, Task Failure Recovery
S
- sampling algorithms
- saveAs output operations, saveAsxyz
- Scala, Learning Scala
- scalability, The Lesson Learned: Scalability and Fault Tolerance
- scheduling delay, Going Deeper: Scheduling Delay and Processing Delay
- schema inference, Schema inference
- schemas, Sources: Acquiring Streaming Data, Sources Must Provide a Schema-Defining schemas
- serialization issues, Troubleshooting ForeachWriter Serialization Issues
- show operation, Workarounds
- shuffle service, Stage Failure Recovery
- Simple Storage Service (Amazon S3), The File Source
- sink API, The Sink API
- sinks (Spark Streaming API)
- sinks (Structured Streaming API)
- available sinks, format, Available Sinks
- characteristics of, Understanding Sinks
- Console sink, The Console Sink
- creating custom, The Sink API
- creating programmatically, The Sink API
- File sink, The File Sink-Options
- Foreach sink, The Foreach Sink-Troubleshooting ForeachWriter Serialization Issues
- Kafka sink, The Kafka Sink-Choosing an encoding
- Memory sink, The Memory Sink
- outputting data to, Sinks: Output the Resulting Data-start()
- purpose and types of, Introducing Structured Streaming
- reliability of, Reliable Sinks-Sinks for Experimentation
- specifying, Start the Stream Processing
- sinks, definition of, Sources and Sinks
- slicing streams, Slicing Streams
- sliding windows, Sliding Windows, Sliding windows, Sliding Windows
- socket connections, Connecting to a Stream
- Socket source, Available Sources, The Socket Source
- sources (Spark Streaming API)
- sources (Structured Streaming API)
- acquiring streaming data, Sources: Acquiring Streaming Data-Sources: Acquiring Streaming Data
- available sources, Available Sources, Available Sources
- characteristics of, Understanding Sources-Defining schemas
- File source
- Kafka source, The Kafka Source-Banned configuration options
- Rate source, The Rate Source
- reliability of, Understanding Sources, Available Sources
- Socket source, The Socket Source-Operations
- TextSocketSource implementation, Connecting to a Stream
- sources, definition of, Sources and Sinks, Spark Streaming Sources
- Spark metrics subsystem, The Spark Metrics Subsystem
- Spark MLLib, Learning Versus Exploiting
- Spark Notebook, Using Code Examples
- Spark shell, Initializing Spark
- Spark SQL
- Spark Streaming API
- algorithms for streaming approximation and sampling, Streaming Approximation and Sampling Algorithms-Stratified Sampling
- application structure and characteristics, The Structure of a Spark Streaming Application-Stopping the Streaming Process
- arbitrary stateful streaming computation, Arbitrary Stateful Streaming Computation-Event-Time Stream Computation Using mapWithState
- checkpointing, Checkpointing-Checkpoint Tuning
- DStream (Discretized Stream), The DStream Abstraction-DStreams as an Execution Model
- execution model, The Spark Streaming Execution Model-Summary
- main task of, Introducing Spark Streaming
- monitoring Spark Streaming applications, Monitoring Spark Streaming-Summary
- overview of, Spark Streaming, The Tale of Two APIs
- performance tuning, Performance Tuning-Speculative Execution
- programming model, The Spark Streaming Programming Model-Summary
- real-time machine learning, Real-Time Machine Learning-Streaming K-Means with Spark Streaming
- sinks, Spark Streaming Sinks-Third-Party Output Operations
- sources, Spark Streaming Sources-Where to Find More Sources
- Spark SQL, Working with Spark SQL-Summary
- stability of, Looking Ahead
- time-based stream processing, Time-Based Stream Processing-Summary
- Spark Streaming Context, The Structure of a Spark Streaming Application
- Spark Streaming scheduler, Using foreachRDD as a Programmable Sink
- Spark-Cassandra Connector, Third-Party Output Operations
- spark-kafka-writer, Third-Party Output Operations
- SparkPackages, Third-Party Output Operations
- SparkSession, Initializing Spark
- speculative execution, Speculative Execution
- stage failure recovery, Stage Failure Recovery
- stages, Spark Components
- start() method, start()
- stateful stream processing
- StateSpec function, Introducing Stateful Computation with mapwithState
- StateSpec object, Using mapWithState
- stratified sampling, Stratified Sampling
- stream processing fundamentals
- stream processing model
- immutable streams defined from one another, Immutable Streams Defined from One Another
- local stateful computation in Scala, An Example: Local Stateful Computation in Scala-A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
- sliding windows, Sliding Windows
- sources and sinks, Sources and Sinks
- stateful streams, Stateless and Stateful Processing
- stateless versus stateful streaming, Stateless or Stateful Streaming
- time-based stream processing, The Effect of Time-Computing with a Watermark
- transformations and aggregations, Transformations and Aggregations
- tumbling windows, Tumbling Windows
- window aggregations, Window Aggregations
- streamDM library, streamDM Introduction
- streaming architectures
- streaming DataFrames, TCP Writer Sink: A Practical ForeachWriter Example
- streaming sinks, Sources and Sinks
- Streaming UI
- batch details, Batch Details
- benefits of, Monitoring Spark Streaming
- configuring, The Streaming UI
- elements comprising the UI, The Streaming UI
- Input Rate chart, Input Rate Chart
- overview of, Monitoring Spark Streaming
- processing time chart, Processing Time Chart
- Scheduling Delay chart, Scheduling Delay Chart
- Total Delay chart, Total Delay Chart
- understanding job performance using, Understanding Job Performance Using the Streaming UI
- StreamingListener interface, The StreamingListener interface, StreamingListener registration
- StreamingQuery instance, The StreamingQuery Instance-Getting Metrics with StreamingQueryProgress
- StreamingQueryListener interface, The StreamingQueryListener Interface-Implementing a StreamingQueryListener
- streams of data, What Is Stream Processing?
- structure-changing transformations, Structure-Changing Transformations
- structured data, Introducing Structured Streaming
- Structured Streaming API
- advanced stateful operations, Advanced Stateful Operations-Summary
- continuous processing, Understanding Continuous Processing-Limitations
- event time-based stream processing, Event Time–Based Stream Processing-Summary
- first steps into Structured Streaming, Streaming Analytics-Exploring the Data
- introduction to, Introducing Structured Streaming-Summary
- IoT-inspired example program, Structured Streaming in Action-Summary
- machine learning, Machine Learning-Online Training
- maturity of, Looking Ahead
- monitoring, Monitoring Structured Streaming Applications-Implementing a StreamingQueryListener
- overview of, Structured Streaming, The Tale of Two APIs, Structured Streaming Processing Model, Introducing Structured Streaming
- programming model, The Structured Streaming Programming Model-Summary
- sinks, Structured Streaming Sinks-Troubleshooting ForeachWriter Serialization Issues
- sources, Structured Streaming Sources-Options
T
- T-Digest, Ranks and Quantiles: T-Digest-T-Digest in Spark
- task failure recovery, Task Failure Recovery
- task schedulers, Spark’s Own Cluster Manager
- tasks, Spark Components
- text File sink format, The Text File Sink Format
- text File source format, Specifying a File Format, Text File Source Format
- text parsing options, Common Text Parsing Options (CSV, JSON)
- TextSocketSource implementation, Connecting to a Stream
- throttling
- throughput-oriented processing, Throughput-Oriented Processing
- time-based stream processing
- timeout processing, When a timeout actually times out
- Timestamp type, Using Event Time
- timestampFormat, Date and time formats
- transformations, The First Wave: Functional APIs, Transformations and Aggregations, Spark Components, Transforming Streaming Data-Workarounds, Understanding DStream Transformations
- Transmission Control Protocol (TCP), The Socket Source
- trigger method, trigger, Using Triggers with the File Sink-Using Triggers with the File Sink, Using Continuous Processing
- tumbling windows, Tumbling Windows, Tumbling and Sliding Windows, Tumbling Windows
- Twitter, Where to Find More Sources
U
- unbounded data, What Is Stream Processing?, Introducing Structured Streaming
- unit testing, Using a Queue Source for Unit Testing
- update mode, outputMode
- updateStateByKey, updateStateByKey-Memory Usage, Introducing Stateful Computation with mapwithState
- use cases
- billing modernization, Some Examples of Stream Processing
- car fleet management solution, Example: Car Fleet Management
- device monitoring, Some Examples of Stream Processing
- estimating room occupancy, Example: Estimating Room Occupancy by Using Ambient Sensors-Online Training
- faster loans, Some Examples of Stream Processing
- fault monitoring, Some Examples of Stream Processing
- fleet management, Some Examples of Stream Processing
- IoT-inspired example program, Structured Streaming in Action-Summary
- media recommendations, Some Examples of Stream Processing
- movie review classifiers, Training a Movie Review Classifier
- unit testing, Using a Queue Source for Unit Testing
- vote counting, Stateful Stream Processing in a Distributed System
- web log analytics, First Steps with Structured Streaming-Exploring the Data
W
- watermarks, Computing with a Watermark, Introducing Structured Streaming, Watermarks, Event-Time Stream Computation Using mapWithState
- web log analytics, First Steps with Structured Streaming-Exploring the Data
- window aggregations, Window Aggregations, Time-Based Window Aggregations-Interval offset, Using MapGroupsWithState, Window Aggregations
- window reductions, Window Reductions-countByValueAndWindow
- word count application, Resilient Distributed Datasets in Spark
- Write-Ahead Log (WAL), Achieving Zero Data Loss with the Write-Ahead Log, Receiver-Based Sources, External Factors that Influence the Job’s Performance
- writeStream method, Sinks: Output the Resulting Data