Index
A
- ARFF (attribute-relation file format), streamDM Introduction
- actions, The First Wave: Functional APIs, Spark Components
- advanced stateful operations
- advanced streaming techniques
- aggregations
- Akka, Where to Find More Sources
- algorithms (see also real-time machine learning)
- approximation algorithms, Approximation Algorithms
- Count-Min Sketches (CMS), Counting Element Frequency: Count Min Sketches-Computing Frequencies with a Count-Min Sketch
- exact versus approximation, Streaming Approximation and Sampling Algorithms
- exactness, real-time, and big data triangle, Exactness, Real Time, and Big Data-Big Data and Real Time
- hashing and sketching, Hashing and Sketching: An Introduction
- HyperLogLog (HLL), Counting Distinct Elements: HyperLogLog-Practical HyperLogLog in Spark
- LogLog, Role-Playing Exercise: If We Were a System Administrator
- sampling algorithms, Reducing the Number of Elements: Sampling
- streaming versus batch, Streaming Versus Batch Algorithms-Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms, Batch Analytics-Exploring the Data
- T-Digest, Ranks and Quantiles: T-Digest-T-Digest in Spark
- Amazon Kinesis, Amazon Kinesis on AWS
- Apache Bahir library, Where to Find More Sources
- Apache Beam, Apache Beam/Google Cloud Dataflow
- Apache CouchDB/Cloudant, Where to Find More Sources
- Apache Flink, Apache Flink
- Apache Kafka, Structured Streaming in Action, The Receiverless or Direct Model
- Apache Spark (see also Spark Streaming API; Structured Streaming API)
- as a stream-processing engine, Apache Spark as a Stream-Processing Engine-Fast Implementation of Data Analysis
- benefits of, Introducing Apache Spark
- community support, Stay Plugged In
- components, Spark Components
- DataFrames and Datasets, The Second Wave: SQL
- distributed processing model, Spark’s Distributed Processing Model-The Disappearance of the Batch Interval
- installing, Installing Spark
- memory model of, The First Wave: Functional APIs, Spark’s Memory Usage
- resilience model of, Spark’s Resilience Model-Summary
- resources for learning, To Learn More About Spark
- unified programming model of, A Unified Engine
- version used, Installing Spark
- Apache Spark Project, Contributing to the Apache Spark Project
- Apache Storm, Apache Storm-Compared to Spark
- append mode, outputMode-Understanding the append semantic
- application monitoring (Structured Streaming)
- approximation algorithms, Approximation Algorithms
- arbitrary stateful processing, Advanced Stateful Operations
- arbitrary stateful streaming computation
- at-least-once versus at-most-once processing, Data Delivery Semantics, Understanding Sinks
- Azure Streaming Analytics, Microsoft Azure Stream Analytics
B
- backpressure signaling, Backpressure
- batch intervals
- batch processing
- best-effort execution, Microbatch in Structured Streaming
- billing modernization, Some Examples of Stream Processing
- bin-packing problem, Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
- block intervals, The Bulk-Synchronous Architecture
- Bloom filters, Introducing Bloom Filters
- bounded data, What Is Stream Processing?
- broadcast joins, Join Optimizations
- bulk-synchronous processing (BSP), Examples of Cluster Managers, Microbatching and One-Element-at-a-Time, The Bulk-Synchronous Architecture-The Bulk-Synchronous Architecture
C
- cache hints, Cache Hints
- caching, Caching
- car fleet management solution, Example: Car Fleet Management
- checkpointing
- alternatives to tuning, Checkpoint Tuning
- basics of, Understanding the Use of Checkpoints-Understanding the Use of Checkpoints
- benefits of, Checkpointing
- checkpoint tuning, Checkpoint Tuning-Checkpoint Tuning
- cost of, The Cost of Checkpointing
- DStreams, Checkpointing DStreams
- influence on processing time, Checkpoint Influence in Processing Time
- mandatory on stateful streams, updateStateByKey
- purpose of, Checkpointing
- recovery from checkpoints, Recovery from a Checkpoint
- clock synchronization techniques, Understanding Event Time in Structured Streaming
- cluster deployment mode, Cluster-mode deployment
- cluster managers, Running Apache Spark with a Cluster Manager-Spark’s Own Cluster Manager, Cluster Manager Support for Fault Tolerance
- code examples, obtaining and using, Using Code Examples
- collision resistance, Hashing and Sketching: An Introduction
- comments and questions, How to Contact Us
- complete mode, outputMode
- component reuse, Fast Implementation of Data Analysis
- Console sink
- ConstantInputDStream, A Simpler Alternative to the Queue Source: The ConstantInputDStream-ConstantInputDStream as a random data generator
- consuming the stream, Sources and Sinks
- containers, Spark’s Own Cluster Manager
- continuous processing, Understanding Continuous Processing-Limitations
- countByValueAndWindow, countByValueAndWindow
- countByWindow, countByWindow
- counting transformations, Counting
- Count-Min Sketches (CMS), Counting Element Frequency: Count Min Sketches-Computing Frequencies with a Count-Min Sketch
- cryptographically secure hash functions, Hashing and Sketching: An Introduction
- CSV File sink format, The CSV Format of the File Sink
- CSV File source format, Specifying a File Format, CSV File Source Format
- cumulative distribution function (CDF), Ranks and Quantiles: T-Digest
D
- data
- at rest, The Notion of Time in Stream Processing, Dealing with Data at Rest-Using Join to Enrich the Input Stream
- bounded data, What Is Stream Processing?
- exactly once data delivery, Understanding Sinks
- in motion, The Notion of Time in Stream Processing
- limiting data ingress, Limiting the Data Ingress with Fixed-Rate Throttling
- outputting to screens, Workarounds
- outputting to sinks, Sinks: Output the Resulting Data-start()
- streams of data, What Is Stream Processing?
- structured data, Introducing Structured Streaming
- unbounded data, What Is Stream Processing?, Introducing Structured Streaming
- data lakes, The File Sink
- data platform components, Components of a Data Platform
- data sources, Sources and Sinks
- data streams, Introducing Structured Streaming
- DataFrame API, Introducing Structured Streaming, Transforming Streaming Data-Workarounds
- Dataset API, The Second Wave: SQL, Introducing Structured Streaming, Transforming Streaming Data-Workarounds
- date and time formats, Date and time formats, Common Time and Date Formatting (CSV, JSON)
- decision trees, Introducing Decision Trees-Hoeffding Trees in Spark, in Practice
- delivery guarantees, Examples of Cluster Managers
- Directed Acyclic Graph (DAG), Resilient Distributed Datasets in Spark
- distributed commit log, The Receiverless or Direct Model
- distributed in-memory cache, External Factors that Influence the Job’s Performance
- distributed processing model
- distributed stream-processing
- domain specific language (DSL), Spark SQL
- double-answering problem, Data Delivery Semantics
- driver failure recovery, Driver Failure Recovery
- driver/executor architecture, Data Delivery Semantics
- DStream (Discretized Stream)
- duplicates, removing, Stream deduplication, Understanding Sinks, Record Deduplication
- dynamic batch interval, Dynamic Batch Interval, The Disappearance of the Batch Interval
- dynamic throttling
F
- failure recovery, Failure Recovery
- fault recovery, Fault Recovery
- fault tolerance, The Lesson Learned: Scalability and Fault Tolerance, Fast Implementation of Data Analysis, Understanding Resilience and Fault Tolerance in a Distributed System-Cluster Manager Support for Fault Tolerance, Spark’s Fault-Tolerance Guarantees-Checkpointing
- File sink
- configuration options, Common Configuration Options Across All Supported File Formats
- CSV File sink format, The CSV Format of the File Sink
- JSON File sink format, The JSON File Sink Format
- Parquet File sink format, The Parquet File Sink Format
- purpose and types of, The File Sink
- reliability of, Reliable Sinks
- text File sink format, The Text File Sink Format
- time and date formatting, Common Time and Date Formatting (CSV, JSON)
- using triggers with, Using Triggers with the File Sink-Using Triggers with the File Sink
- File source
- common options, Common Options
- common text parsing options, Common Text Parsing Options (CSV, JSON)
- CSV File source format, CSV File Source Format
- data reliability, How It Works
- JSON File source format, JSON File Source Format-JSON parsing options
- operation of, How It Works
- Parquet File source format, Parquet File Source Format
- specifying file formats, Specifying a File Format
- StreamingContext methods, The File Source
- target filesystems, The File Source
- text File source format, Text File Source Format
- uses for, Basic Sources
- file-based streaming sources, Available Sources, Available Sources-text and textFile
- fileNameOnly option, Common Options
- first fit decreasing strategy, Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
- fixed-rate throttling, Limiting the Data Ingress with Fixed-Rate Throttling
- flatMapGroupsWithState, Advanced Stateful Operations, Using FlatMapGroupsWithState-When a timeout actually times out
- fleet management, Some Examples of Stream Processing
- Foreach sink
- foreachBatch sink, format
- foreachRDD output operation, foreachRDD-Using foreachRDD as a Programmable Sink
- forgetfulness, Online Data and K-Means
- format method, Sources: Acquiring Streaming Data, Specifying a File Format
- format option, format, Consuming a Streaming Source
- function calls, Spark Components
- function-passing style, Microbatching: An Application of Bulk-Synchronous Processing
I
- idempotence, Data Delivery Semantics, Understanding Sinks
- Internal Event Bus
- internal state flow, Internal State Flow
- intervals
- batch intervals, Understanding Continuous Processing, DStreams as an Execution Model, The Bulk-Synchronous Architecture, Using Windows Versus Longer Batch Intervals, The Relationship Between Batch Interval and Processing Delay, Tweaking the Batch Interval
- block intervals, The Bulk-Synchronous Architecture
- dynamic batch intervals, Dynamic Batch Interval, The Disappearance of the Batch Interval
- window intervals, Understanding How Intervals Are Computed, Interval offset
- window versus batch, Window Length Versus Batch Interval
- inverse reduce function, Invertible Window Aggregations-Invertible Window Aggregations
- IoT (Internet of Things)-inspired streaming program
K
- K-means clustering
- Kafka, Structured Streaming in Action, The Receiverless or Direct Model
- Kafka sink, Reliable Sinks, The Kafka Sink-Choosing an encoding
- Kafka source, Available Sources, The Kafka Source-Banned configuration options, The Kafka Source-How It Works
- Kafka Streams, Kafka Streams
- Kappa architecture, Architectural Models, The Kappa Architecture
- knapsack problem, Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
M
- machine learning, Machine Learning-Online Training (see also real-time machine learning)
- map-side joins, Join Optimizations
- mapGroupsWithState, Advanced Stateful Operations, Using MapGroupsWithState-Using MapGroupsWithState
- MapReduce, MapReduce-The Lesson Learned: Scalability and Fault Tolerance, The Tale of Two APIs
- mapWithState, Introducing Stateful Computation with mapwithState-Event-Time Stream Computation Using mapWithState
- maxFileAge option, Common Options
- maxFilesPerTrigger option, Common Options
- media recommendations, Some Examples of Stream Processing
- Memory sink, format, Sinks for Experimentation, The Memory Sink
- metrics
- Metrics Subsystem
- microbatching, Understanding Latency, Microbatching and One-Element-at-a-Time-Bringing Microbatch and One-Record-at-a-Time Closer Together, Understanding Continuous Processing
- model serving, The challenge of model serving
- monitoring
- Monitoring REST API
- movie review classifiers, Training a Movie Review Classifier
- MQTT, Where to Find More Sources
- Multinomial Naive Bayes, Streaming Classification with Naive Bayes
- multitenancy, Running Apache Spark with a Cluster Manager
O
- occupancy information, Example: Estimating Room Occupancy by Using Ambient Sensors-Online Training
- offset-based processing, Understanding Sources, Interval offset
- one-at-a-time record processing, One-Record-at-a-Time Processing-Bringing Microbatch and One-Record-at-a-Time Closer Together
- online training/online learning, Online Training
- option method, option
- options method, options
- outer joins, Join Optimizations
- output modes, specifying, Start the Stream Processing
- output operations, Defining Output Operations, Understanding DStream Transformations, Output Operations-Third-Party Output Operations
- outputMode, outputMode
P
- Parquet
- parsing errors, Handling parsing errors
- performance ratio, Streaming Algorithms Are Sometimes Completely Different in Nature
- performance tuning
- pipelining, One-Record-at-a-Time Processing
- print output operation, print
- processing delay
- processing time, The Notion of Time in Stream Processing, Computing on Timestamped Events-Computing with a Watermark, Processing Time, Time-Based Stream Processing
- programming model (Spark Streaming API)
- programming model (Structured Streaming API)
- Proportional-Integral-Derivative (PID) controllers, Dynamic Throttling
- provisioning, Running Apache Spark with a Cluster Manager
- Publish/Subscribe (pub/sub) systems, The Kafka Source-Banned configuration options, The Kafka Sink-Choosing an encoding, The Kafka Source-How It Works, Monitoring Spark Streaming
R
- random data generators, ConstantInputDStream as a random data generator
- random sampling, Random Sampling
- Rate source, Available Sources, The Rate Source
- RDD-centric DStream transformations, RDD-Centric DStream Transformations
- RDDs (Resilient Distributed Datasets), The First Wave: Functional APIs, Resilient Distributed Datasets in Spark, RDDs as the Underlying Abstraction for DStreams-RDDs as the Underlying Abstraction for DStreams
- readStream method, Connecting to a Stream, Sources: Acquiring Streaming Data
- real-time machine learning
- real-time stream processing systems
- Amazon Kinesis, Amazon Kinesis on AWS
- Apache Beam, Apache Beam/Google Cloud Dataflow
- Apache Flink, Apache Flink
- Apache Storm, Apache Storm-Compared to Spark
- Azure Streaming Analytics, Microsoft Azure Stream Analytics
- concepts constraining, Exactness, Real Time, and Big Data-Big Data and Real Time
- Google Cloud Dataflow, Apache Beam/Google Cloud Dataflow
- Kafka Streams, Kafka Streams
- selecting, Other Distributed Real-Time Stream Processing Systems
- receiver model
- record deduplication, Stream deduplication, Record Deduplication
- reduceByKeyAndWindow, reduceByKeyAndWindow
- reduceByWindow, reduceByWindow
- reference datasets
- referential streaming architectures, Referential Streaming Architectures-The Kappa Architecture
- replayability, Reliable Sources Must Be Replayable
- resilience, Understanding Resilience and Fault Tolerance in a Distributed System-Cluster Manager Support for Fault Tolerance, The Internal Data Resilience
- resilience model
- Resilient Distributed Datasets (RDDs), The First Wave: Functional APIs, Resilient Distributed Datasets in Spark, RDDs as the Underlying Abstraction for DStreams-RDDs as the Underlying Abstraction for DStreams
- restarts, Task Failure Recovery
S
- sampling algorithms
- saveAs output operations, saveAsxyz
- Scala, Learning Scala
- scalability, The Lesson Learned: Scalability and Fault Tolerance
- scheduling delay, Going Deeper: Scheduling Delay and Processing Delay
- schema inference, Schema inference
- schemas, Sources: Acquiring Streaming Data, Sources Must Provide a Schema-Defining schemas
- serialization issues, Troubleshooting ForeachWriter Serialization Issues
- show operation, Workarounds
- shuffle service, Stage Failure Recovery
- Simple Storage Service (Amazon S3), The File Source
- sink API, The Sink API
- sinks (Spark Streaming API)
- sinks (Structured Streaming API)
- available sinks, format, Available Sinks
- characteristics of, Understanding Sinks
- Console sink, The Console Sink
- creating custom, The Sink API
- creating programmatically, The Sink API
- File sink, The File Sink-Options
- Foreach sink, The Foreach Sink-Troubleshooting ForeachWriter Serialization Issues
- Kafka sink, The Kafka Sink-Choosing an encoding
- Memory sink, The Memory Sink
- outputting data to, Sinks: Output the Resulting Data-start()
- purpose and types of, Introducing Structured Streaming
- reliability of, Reliable Sinks-Sinks for Experimentation
- specifying, Start the Stream Processing
- sinks, definition of, Sources and Sinks
- slicing streams, Slicing Streams
- sliding windows, Sliding Windows, Sliding windows, Sliding Windows
- socket connections, Connecting to a Stream
- Socket source, Available Sources, The Socket Source
- sources (Spark Streaming API)
- sources (Structured Streaming API)
- acquiring streaming data, Sources: Acquiring Streaming Data-Sources: Acquiring Streaming Data
- available sources, Available Sources, Available Sources
- characteristics of, Understanding Sources-Defining schemas
- File source
- Kafka source, The Kafka Source-Banned configuration options
- Rate source, The Rate Source
- reliability of, Understanding Sources, Available Sources
- Socket source, The Socket Source-Operations
- TextSocketSource implementation, Connecting to a Stream
- sources, definition of, Sources and Sinks, Spark Streaming Sources
- Spark metrics subsystem, The Spark Metrics Subsystem
- Spark MLLib, Learning Versus Exploiting
- Spark Notebook, Using Code Examples
- Spark shell, Initializing Spark
- Spark SQL
- Spark Streaming API
- algorithms for streaming approximation and sampling, Streaming Approximation and Sampling Algorithms-Stratified Sampling
- application structure and characteristics, The Structure of a Spark Streaming Application-Stopping the Streaming Process
- arbitrary stateful streaming computation, Arbitrary Stateful Streaming Computation-Event-Time Stream Computation Using mapWithState
- checkpointing, Checkpointing-Checkpoint Tuning
- DStream (Discretized Stream), The DStream Abstraction-DStreams as an Execution Model
- execution model, The Spark Streaming Execution Model-Summary
- main task of, Introducing Spark Streaming
- monitoring Spark Streaming applications, Monitoring Spark Streaming-Summary
- overview of, Spark Streaming, The Tale of Two APIs
- performance tuning, Performance Tuning-Speculative Execution
- programming model, The Spark Streaming Programming Model-Summary
- real-time machine learning, Real-Time Machine Learning-Streaming K-Means with Spark Streaming
- sinks, Spark Streaming Sinks-Third-Party Output Operations
- sources, Spark Streaming Sources-Where to Find More Sources
- Spark SQL, Working with Spark SQL-Summary
- stability of, Looking Ahead
- time-based stream processing, Time-Based Stream Processing-Summary
- Spark Streaming Context, The Structure of a Spark Streaming Application
- Spark Streaming scheduler, Using foreachRDD as a Programmable Sink
- Spark-Cassandra Connector, Third-Party Output Operations
- spark-kafka-writer, Third-Party Output Operations
- SparkPackages, Third-Party Output Operations
- SparkSession, Initializing Spark
- speculative execution, Speculative Execution
- stage failure recovery, Stage Failure Recovery
- stages, Spark Components
- start() method, start()
- stateful stream processing
- StateSpec function, Introducing Stateful Computation with mapwithState
- StateSpec object, Using mapWithState
- stratified sampling, Stratified Sampling
- stream processing fundamentals
- stream processing model
- immutable streams defined from one another, Immutable Streams Defined from One Another
- local stateful computation in Scala, An Example: Local Stateful Computation in Scala-A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
- sliding windows, Sliding Windows
- sources and sinks, Sources and Sinks
- stateful streams, Stateless and Stateful Processing
- stateless versus stateful streaming, Stateless or Stateful Streaming
- time-based stream processing, The Effect of Time-Computing with a Watermark
- transformations and aggregations, Transformations and Aggregations
- tumbling windows, Tumbling Windows
- window aggregations, Window Aggregations
- streamDM library, streamDM Introduction
- streaming architectures
- streaming DataFrames, TCP Writer Sink: A Practical ForeachWriter Example
- streaming sinks, Sources and Sinks
- Streaming UI
- batch details, Batch Details
- benefits of, Monitoring Spark Streaming
- configuring, The Streaming UI
- elements comprising the UI, The Streaming UI
- Input Rate chart, Input Rate Chart
- overview of, Monitoring Spark Streaming
- processing time chart, Processing Time Chart
- Scheduling Delay chart, Scheduling Delay Chart
- Total Delay chart, Total Delay Chart
- understanding job performance using, Understanding Job Performance Using the Streaming UI
- StreamingListener interface, The StreamingListener interface, StreamingListener registration
- StreamingQuery instance, The StreamingQuery Instance-Getting Metrics with StreamingQueryProgress
- StreamingQueryListener interface, The StreamingQueryListener Interface-Implementing a StreamingQueryListener
- streams of data, What Is Stream Processing?
- structure-changing transformations, Structure-Changing Transformations
- structured data, Introducing Structured Streaming
- Structured Streaming API
- advanced stateful operations, Advanced Stateful Operations-Summary
- continuous processing, Understanding Continuous Processing-Limitations
- event time-based stream processing, Event Time–Based Stream Processing-Summary
- first steps into Structured Streaming, Streaming Analytics-Exploring the Data
- introduction to, Introducing Structured Streaming-Summary
- IoT-inspired example program, Structured Streaming in Action-Summary
- machine learning, Machine Learning-Online Training
- maturity of, Looking Ahead
- monitoring, Monitoring Structured Streaming Applications-Implementing a StreamingQueryListener
- overview of, Structured Streaming, The Tale of Two APIs, Structured Streaming Processing Model, Introducing Structured Streaming
- programming model, The Structured Streaming Programming Model-Summary
- sinks, Structured Streaming Sinks-Troubleshooting ForeachWriter Serialization Issues
- sources, Structured Streaming Sources-Options
T
- T-Digest, Ranks and Quantiles: T-Digest-T-Digest in Spark
- task failure recovery, Task Failure Recovery
- task schedulers, Spark’s Own Cluster Manager
- tasks, Spark Components
- text File sink format, The Text File Sink Format
- text File source format, Specifying a File Format, Text File Source Format
- text parsing options, Common Text Parsing Options (CSV, JSON)
- TextSocketSource implementation, Connecting to a Stream
- throttling
- throughput-oriented processing, Throughput-Oriented Processing
- time-based stream processing
- timeout processing, When a timeout actually times out
- Timestamp type, Using Event Time
- timestampFormat, Date and time formats
- transformations, The First Wave: Functional APIs, Transformations and Aggregations, Spark Components, Transforming Streaming Data-Workarounds, Understanding DStream Transformations
- Transmission Control Protocol (TCP), The Socket Source
- trigger method, trigger, Using Triggers with the File Sink-Using Triggers with the File Sink, Using Continuous Processing
- tumbling windows, Tumbling Windows, Tumbling and Sliding Windows, Tumbling Windows
- Twitter, Where to Find More Sources
U
- unbounded data, What Is Stream Processing?, Introducing Structured Streaming
- unit testing, Using a Queue Source for Unit Testing
- update mode, outputMode
- updateStateByKey, updateStateByKey-Memory Usage, Introducing Stateful Computation with mapwithState
- use cases
- billing modernization, Some Examples of Stream Processing
- car fleet management solution, Example: Car Fleet Management
- device monitoring, Some Examples of Stream Processing
- estimating room occupancy, Example: Estimating Room Occupancy by Using Ambient Sensors-Online Training
- faster loans, Some Examples of Stream Processing
- fault monitoring, Some Examples of Stream Processing
- fleet management, Some Examples of Stream Processing
- IoT-inspired example program, Structured Streaming in Action-Summary
- media recommendations, Some Examples of Stream Processing
- movie review classifiers, Training a Movie Review Classifier
- unit testing, Using a Queue Source for Unit Testing
- vote counting, Stateful Stream Processing in a Distributed System
- web log analytics, First Steps with Structured Streaming-Exploring the Data
W
- watermarks, Computing with a Watermark, Introducing Structured Streaming, Watermarks, Event-Time Stream Computation Using mapWithState
- web log analytics, First Steps with Structured Streaming-Exploring the Data
- window aggregations, Window Aggregations, Time-Based Window Aggregations-Interval offset, Using MapGroupsWithState, Window Aggregations
- window reductions, Window Reductions-countByValueAndWindow
- word count application, Resilient Distributed Datasets in Spark
- Write-Ahead Log (WAL), Achieving Zero Data Loss with the Write-Ahead Log, Receiver-Based Sources, External Factors that Influence the Job’s Performance
- writeStream method, Sinks: Output the Resulting Data