Index
A
B
C
- caching / Persistence and caching
- catalog
- Catalyst / Architecture of Spark SQL
- checkpointing
- Classic MapReduce
- classification
- Cloudera Distribution for Hadoop (CDH)
- clustering algorithms
- cluster resource managers
- collaborative filtering
- column pruning / Working with ORC
- common Dataset/DataFrame operations
- components, Spark SQL
- SQL / Introducing SQL, Datasources, DataFrame, and Dataset APIs
- Data Sources API / Introducing SQL, Datasources, DataFrame, and Dataset APIs
- DataFrame API / Introducing SQL, Datasources, DataFrame, and Dataset APIs
- Dataset API / Introducing SQL, Datasources, DataFrame, and Dataset APIs
- compression formats
- configuration parameters, for submitting applications
- connected components
- content-based filtering
- continuous bag of words
- CSV
- custom sources
D
- DAG (Directed Acyclic Graph) / Lineage Graph
- data
- dataflows
- DataFrame
- DataFrame API
- DataFrames
- about / Spark's stack
- evolution / Evolution of DataFrames and Datasets
- using, scenarios / When to use RDDs, Datasets, and DataFrames?
- creating / Creating DataFrames
- creating, from structured data files / Creating DataFrames from structured data files
- creating, from RDDs / Creating DataFrames from RDDs
- creating, from Hive tables / Creating DataFrames from tables in Hive
- creating, from external databases / Creating DataFrames from external databases
- converting, to RDDs / Converting DataFrames to RDDs
- converting, to Datasets / Converting a DataFrame to a Dataset
- creating, for recommendation system with MLlib / Exploring the data with DataFrames
- DataFrames, benefits
- data locality / Data locality
- Dataset API
- Datasets
- Data Sources API
- data types, Spark MLlib
- Decision Trees
- dense vector
- Dimensionality Reduction
- direct approach, Kafka
- Directed Acyclic Graph (DAG) / RDD Transformations versus Dataset and DataFrames Transformations, Optimization
- Discretized Stream
- distributed matrix
- Domain Specific Language (DSL) / History of Spark SQL, Common Dataset/DataFrame operations
- Domain Specific Language (DSL) functions
- driver failures
- DStream
E
F
- fault-tolerance, Spark Streaming
- feature extraction and transformation
- file formats
- flight data
G
- Gradient-boosted Trees
- graph
- graph databases
- GraphFrames
- graph processing
- graph processing systems
- graph transformation
- GraphX
- GraphX, algorithms
- GraphX operations
- groupEdges operator
H
- H2O
- H2O Flow
- Hadoop
- Hadoop Distributed File System (HDFS)
- Hadoop Distributed File System (HDFS), features
- Hadoop file formats
- Hadoop plus Spark clusters
- Hadoop User Experience (Hue)
- HBase
- Hive
- Hivemall
- Hivemall for Spark
- Hive on Spark project / Hive on Spark
- Hive query language (HiveQL)
- Hive tables
- Hortonworks DataFlow (HDF)
- Hortonworks Data Platform (HDP)
- Hortonworks Data Platform (HDP) Sandbox
- Hue Notebooks
I
- Idempotent updates
- implicit feedback
- input sources
- integrated development environment (IDE)
- interactive session
- Internet of Things (IoT)
- interpreter binding
- Inverse Document Frequency (IDF)
- IPython kernel
- item-based collaborative filtering
J
- Java Management Extensions (JMX)
- Java serialization / Serialization
- JDBC
- join operation
- JSON
- Jupyter
K
- k-means model
- Kafka
- Kerberos Security Enabled Spark Cluster
- Kinesis Client Library (KCL)
- Kryo serialization / Serialization
L
- Latent Dirichlet Allocation (LDA)
- lazy evaluation / Lazy evaluation
- Lineage Graph / Lineage Graph
- Livy REST job server
- local DataFrame
- logistic regression
M
- machine learning
- machine learning algorithms
- Machine Learning Library (MLlib)
- machine learning pipelines
- Mahout
- Mahout shell
- MapR
- MapR Control System (MCS) / Working with HDP, MapR, and Spark pre-built packages
- MapReduce (MR)
- MapReduce (MR), features
- MapReduce v1
- MapR Sandbox
- mapWithState operation
- Markdown
- mask operator
- Mesos
- Message Passing Interface (MPI)
- metadata
- MLlib
- modes, for running Spark
- motif finding algorithm
- MR job
N
O
- Online Analytical Processing (OLAP) / Tools and techniques
- optimization algorithms
- Optimized Row Columnar (ORC)
- output operations
- output stores
P
R
- R
- Random Forests
- RDD actions
- RDD operations
- RDDs
- RDD transformations
- Read-Eval-Print Loop (REPL)
- real-life use cases
- real-time processing
- receiver-based approach, Kafka
- receivers
- recommendation system, with MLlib
- building / A recommendation system with MLlib
- environment, preparing / Preparing the environment
- RDDs, creating / Creating RDDs
- data, exploring with DataFrames / Exploring the data with DataFrames
- testing dataset, creating / Creating training and testing datasets
- training dataset, creating / Creating training and testing datasets
- model, creating / Creating a model
- predictions, creating / Making predictions
- model, evaluating with testing data / Evaluating the model with testing data
- model accuracy, checking / Checking the accuracy of the model
- explicit feedback, versus implicit feedback / Explicit versus implicit feedback
- recommendation systems
- recommender systems
- Record Columnar File (RCFile)
- regression
- Relational Database Management System (RDBMS) / Evolution of DataFrames and Datasets, Big Data analytics and the role of Hadoop and Spark
- reliable receiver
- REPL (read-eval-print loop) / Spark Shell
- Resilient Distributed Dataset (RDD) / MapReduce issues, Learning Spark core concepts
- ResourceManager
- REST API
- reverse operator
- R project
- RStudio
S
- Samsara
- scheduling modes, Mesos
- Schema-on-Read (SOR) approach / Big Data analytics and the role of Hadoop and Spark
- Schema-on-Write approach / Big Data analytics and the role of Hadoop and Spark
- SchemaRDD
- search tool
- sequence file
- serialization / Serialization
- shared variables / Shared variables
- Shark
- Singular Value Decomposition (SVD)
- skip-gram
- spam detection
- Spark
- Spark-on-HBase connector / DataFrame based Spark-on-HBase connector
- spark-sql CLI
- spark.mllib package
- spark.ml package
- Spark applications
- SparkConf
- Spark configuration
- Spark context
- Spark Core
- Spark daemons
- Sparkling Water
- Sparkling Water project
- SparkMagic
- Spark MLlib
- Spark packages
- Spark pre-built package
- Spark program
- SparkR
- Spark resource managers
- SparkR shell
- Spark Scala shell
- SparkSession
- Spark shell
- Spark SQL
- Spark SQL Thrift Server
- Spark Streaming
- SparkSubmit
- sparse vector
- SQL
- standalone mode, Spark cluster resource managers / Standalone
- standalone resource manager
- standard compression formats
- stateful stream processing
- stateless stream processing
- storage levels, Spark
- storage options, Apache Hadoop
- Streaming DataFrames
- Streaming Datasets
- StreamingListener API
- structured data files
- Structured Streaming
- subgraph operator
- supervised learning
T
- Tachyon
- term frequency (TF)
- terminologies, Spark
- test data
- text files
- Thrift
- training data
- Transactional updates
- transformations
- transformations, Spark Streaming
- transform operation
- triangle counting
- Tungsten
U
- union operation
- universal recommendation system
- unreliable receiver
- unsupervised learning
- updateStateByKey operation
- user-based collaborative filtering
- User Defined Functions (UDFs)
- User Defined Table Functions (UDTFs)
V
- VertexRDD operations
- virtual machine (VM)
W
- web-based notebooks
- window operations
- write-ahead logs (WAL)
- Write Once, Read Many (WORM)
X
Y
- YARN
- YARN settings
- Yet Another Resource Negotiator (YARN)
Z
- Zeppelin
- ZeppelinHub Viewer
- Zeppelin notebooks
- ZooKeeper