by Saurabh Chhajed, Marek Rogoziński, Rafał Kuć, Bharvi Dixit
Elasticsearch: A Complete Guide
Table of Contents
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
I. Module 1
1. Getting Started with Elasticsearch
Introducing Elasticsearch
The primary features of Elasticsearch
Understanding REST and JSON
What is REST?
What is JSON?
Elasticsearch common terms
Understanding Elasticsearch structure with respect to relational databases
Installing and configuring Elasticsearch
Installing Elasticsearch on Ubuntu through Debian package
Installing Elasticsearch on CentOS through the RPM package
Understanding the Elasticsearch installation directory layout
Configuring basic parameters
Adding another node to the cluster
Installing Elasticsearch plugins
Checking for installed plugins
Installing the Head plugin for Elasticsearch
Installing Sense for Elasticsearch
Basic operations with Elasticsearch
Creating an Index
Indexing a document in Elasticsearch
Fetching documents
Get a complete document
Getting part of a document
Updating documents
Updating a whole document
Updating documents partially
Deleting documents
Checking documents' existence
Summary
2. Understanding Document Analysis and Creating Mappings
Text search
TF-IDF
Inverted indexes
Document analysis
Introducing Lucene analyzers
Creating custom analyzers
Changing a default analyzer
Putting custom analyzers into action
Elasticsearch mapping
Document metadata fields
Data types and index analysis options
Configuring data types
String
Number
Date
Boolean
Arrays
Objects
Indexing the same field in different ways
Putting mappings in an index
Viewing mappings
Updating mappings
Summary
3. Putting Elasticsearch into Action
CRUD operations using elasticsearch-py
Setting up the environment
Installing Pip
Installing virtualenv
Installing elasticsearch-py
Performing CRUD operations
Request timeouts
Creating indexes with settings and mappings
Indexing documents
Retrieving documents
Updating documents
Replacing the value of a field completely
Appending a value in an array
Updates using doc
Checking document existence
Deleting a document
CRUD operations using Java
Connecting with Elasticsearch
Indexing a document
Fetching a document
Updating a document
Updating a document using doc
Updating a document using script
Deleting documents
Creating a search database
Elasticsearch Query-DSL
Understanding Query-DSL parameters
Query types
Full-text search queries
match_all
match query
Phrase search
multi match
query_string
Term-based search queries
Term query
Terms query
Range queries
Exists queries
Missing queries
Compound queries
Bool queries
Not queries
Search requests using Python
Search requests using Java
Parsing search responses
Sorting your data
Sorting documents by field values
Sorting on more than one field
Sorting multivalued fields
Sorting on string fields
Document routing
Summary
4. Aggregations for Analytics
Introducing the aggregation framework
Aggregation syntax
Extracting values
Returning only aggregation results
Metric aggregations
Computing basic stats
Combined stats
Computing stats separately
Computing extended stats
Finding distinct counts
Bucket aggregations
Terms aggregation
Range aggregation
Date range aggregation
Histogram aggregation
Date histogram aggregation
Filter-based aggregation
Combining search, buckets, and metrics
Memory pressure and implications
Summary
5. Data Looks Better on Maps: Master Geo-Spatiality
Introducing geo-spatial data
Working with geo-point data
Mapping geo-point fields
Indexing geo-point data
Querying geo-point data
Geo distance query
Geo distance range query
Geo bounding box query
Understanding bounding boxes
Sorting by distance
Geo-aggregations
Geo distance aggregation
Using bounding boxes with geo distance aggregation
Geo-shapes
Point
Linestring
Circles
Polygons
Envelopes
Mapping geo-shape fields
Indexing geo-shape data
Querying geo-shape data
Summary
6. Document Relationships in NoSQL World
Relational data in the document-oriented NoSQL world
Managing relational data in Elasticsearch
Working with nested objects
Creating nested mappings
Indexing nested data
Querying nested type data
Nested aggregations
Nested aggregation
Understanding nested aggregation syntax
Reverse nested aggregation
Parent-child relationships
Creating parent-child mappings
Indexing parent-child documents
Querying parent-child documents
has_child query
has_parent query
Considerations for using document relationships
Summary
7. Different Methods of Search and Bulk Operations
Introducing search types in Elasticsearch
Cheaper bulk operations
Bulk create
Bulk indexing
Bulk updating
Bulk deleting
Multi get and multi search APIs
Multi get
Multi searches
Data pagination
Pagination with scoring
Pagination without scoring
Scrolling and re-indexing documents using scan-scroll
Practical considerations for bulk processing
Summary
8. Controlling Relevancy
Introducing relevant searches
The Elasticsearch out-of-the-box tools
An example: why defaults are not enough
Controlling relevancy with custom scoring
The function_score query
weight
field_value_factor
script_score
Decay functions - linear, exp, and gauss
Summary
9. Cluster Scaling in Production Deployments
Node types in Elasticsearch
Client node
Data node
Master node
Introducing Zen-Discovery
Multicasting discovery
Unicasting discovery
Configuring unicasting discovery
Minimum number of master nodes: preventing split-brain
An initial list of hosts to ping
Ping timeout
Node upgrades without downtime
Upgrading Elasticsearch version
Best Elasticsearch practices in production
Creating a cluster
Scaling your clusters
When to scale
Metrics to watch
CPU utilization
Memory utilization
Disk I/O utilization
Disk low watermark
How to scale
Summary
10. Backups and Security
Introducing backup and restore mechanisms
Backup using snapshot API
Creating an NFS drive
Configuring the NFS host server
Configuring client machines
Creating a snapshot
Registering the repository path
Registering the shared file system repository in Elasticsearch
Create your first snapshot
Getting snapshot information
Deleting snapshots
Restoring snapshots
Restoring multiple indices
Renaming indices
Partial restore
Changing index settings during restore
Restoring to a different cluster
Manual backups
Manual restoration
Securing Elasticsearch
Setting up basic HTTP authentication
Setting up Nginx
Securing critical access
Restricting DELETE requests
Restricting endpoints
Load balancing using Nginx
Summary
II. Module 2
1. Introduction to Elasticsearch
Introducing Apache Lucene
Getting familiar with Lucene
Overall architecture
Getting deeper into Lucene index
Norms
Term vectors
Posting formats
Doc values
Analyzing your data
Indexing and querying
Lucene query language
Understanding the basics
Querying fields
Term modifiers
Handling special characters
Introducing Elasticsearch
Basic concepts
Index
Document
Type
Mapping
Node
Cluster
Shard
Replica
Key concepts behind Elasticsearch architecture
Workings of Elasticsearch
The startup process
Failure detection
Communicating with Elasticsearch
Indexing data
Querying data
The story
Summary
2. Power User Query DSL
Default Apache Lucene scoring explained
When a document is matched
TF/IDF scoring formula
Lucene conceptual scoring formula
Lucene practical scoring formula
Elasticsearch point of view
An example
Query rewrite explained
Prefix query as an example
Getting back to Apache Lucene
Query rewrite properties
Query templates
Introducing query templates
Templates as strings
The Mustache template engine
Conditional expressions
Loops
Default values
Storing templates in files
Handling filters and why it matters
Filters and query relevance
How filters work
Bool or and/or/not filters
Performance considerations
Post filtering and filtered query
Choosing the right filtering method
Choosing the right query for the job
Query categorization
Basic queries
Compound queries
Not analyzed queries
Full text search queries
Pattern queries
Similarity supporting queries
Score altering queries
Position aware queries
Structure aware queries
The use cases
Example data
Basic queries use cases
Searching for values in range
Simplified query for multiple terms
Compound queries use cases
Boosting some of the matched documents
Ignoring lower scoring partial queries
Not analyzed queries use cases
Limiting results to given tags
Efficient query time stopwords handling
Full text search queries use cases
Using Lucene query syntax in queries
Handling user queries without errors
Pattern queries use cases
Autocomplete using prefixes
Pattern matching
Similarity supporting queries use cases
Finding terms similar to a given one
Finding documents with similar field values
Score altering queries use cases
Favoring newer books
Decreasing importance of books with certain value
Position aware queries use cases
Matching phrases
Spans, spans everywhere
Structure aware queries use cases
Returning parent documents having a certain nested document
Affecting parent document score with the score of nested documents
Summary
3. Not Only Full Text Search
Query rescoring
What is query rescoring?
An example query
Structure of the rescore query
Rescore parameters
Choosing the scoring mode
To sum up
Controlling multimatching
Multimatch types
Best fields matching
Cross fields matching
Most fields matching
Phrase matching
Phrase with prefixes matching
Significant terms aggregation
An example
Choosing significant terms
Multiple values analysis
Significant terms aggregation and full text search fields
Additional configuration options
Controlling the number of returned buckets
Background set filtering
Minimum document count
Execution hint
More options
There are limits
Memory consumption
Shouldn't be used as top-level aggregation
Counts are approximated
Floating point fields are not allowed
Documents grouping
Top hits aggregation
An example
Additional parameters
Relations between documents
The object type
The nested documents
Parent–child relationship
Parent–child relationship in the cluster
A few words about alternatives
Scripting changes between Elasticsearch versions
Scripting changes
Security issues
Groovy – the new default scripting language
Removal of MVEL language
Short Groovy introduction
Using Groovy as your scripting language
Variable definition in scripts
Conditionals
Loops
An example
There is more
Scripting in full text context
Field-related information
Shard level information
Term level information
More advanced term information
Lucene expressions explained
The basics
An example
There is more
Summary
4. Improving the User Search Experience
Correcting user spelling mistakes
Testing data
Getting into technical details
Suggesters
Using the _suggest REST endpoint
Understanding the REST endpoint suggester response
Including suggestion requests in query
The term suggester
Configuration
Common term suggester options
Additional term suggester options
The phrase suggester
Usage example
Configuration
Basic configuration
Configuring smoothing models
Configuring candidate generators
Configuring direct generators
The completion suggester
The logic behind the completion suggester
Using the completion suggester
Indexing data
Querying data
Custom weights
Additional parameters
Improving the query relevance
Data
The quest for relevance improvement
The standard query
The multi match query
Phrases come into play
Let's throw the garbage away
Now, we boost
Performing a misspelling-proof search
Drill downs with faceting
Summary
5. The Index Distribution Architecture
Choosing the right amount of shards and replicas
Sharding and overallocation
A positive example of overallocation
Multiple shards versus multiple indices
Replicas
Routing explained
Shards and data
Let's test routing
Indexing with routing
Routing in practice
Querying
Aliases
Multiple routing values
Altering the default shard allocation behavior
Allocation awareness
Forcing allocation awareness
Filtering
What include, exclude, and require mean
Runtime allocation updating
Index level updates
Cluster level updates
Defining total shards allowed per node
Defining total shards allowed per physical server
Inclusion
Requirement
Exclusion
Disk-based allocation
Query execution preference
Introducing the preference parameter
Summary
6. Low-level Index Control
Altering Apache Lucene scoring
Available similarity models
Setting a per-field similarity
Similarity model configuration
Choosing the default similarity model
Configuring the chosen similarity model
Configuring the TF/IDF similarity
Configuring the Okapi BM25 similarity
Configuring the DFR similarity
Configuring the IB similarity
Configuring the LM Dirichlet similarity
Configuring the LM Jelinek Mercer similarity
Choosing the right directory implementation – the store module
The store type
The simple filesystem store
The new I/O filesystem store
The MMap filesystem store
The hybrid filesystem store
The memory store
Additional properties
The default store type
The default store type for Elasticsearch 1.3.0 and higher
The default store type for Elasticsearch versions older than 1.3.0
NRT, flush, refresh, and transaction log
Updating the index and committing changes
Changing the default refresh time
The transaction log
The transaction log configuration
Near real-time GET
Segment merging under control
Choosing the right merge policy
The tiered merge policy
The log byte size merge policy
The log doc merge policy
Merge policies' configuration
The tiered merge policy
The log byte size merge policy
The log doc merge policy
Scheduling
The concurrent merge scheduler
The serial merge scheduler
Setting the desired merge scheduler
When it is too much for I/O – throttling explained
Controlling I/O throttling
Configuration
The throttling type
Maximum throughput per second
Node throttling defaults
Performance considerations
The configuration example
Understanding Elasticsearch caching
The filter cache
Filter cache types
Node-level filter cache configuration
Index-level filter cache configuration
The field data cache
Field data or doc values
Node-level field data cache configuration
Index-level field data cache configuration
The field data cache filtering
Adding field data filtering information
Filtering by term frequency
Filtering by regex
Filtering by regex and term frequency
The filtering example
Field data formats
String-based fields
Numeric fields
Geographical-based fields
Field data loading
The shard query cache
Setting up the shard query cache
Using circuit breakers
The field data circuit breaker
The request circuit breaker
The total circuit breaker
Clearing the caches
Index, indices, and all caches clearing
Clearing specific caches
Summary
7. Elasticsearch Administration
Discovery and recovery modules
Discovery configuration
Zen discovery
Multicast Zen discovery configuration
The unicast Zen discovery configuration
Master node
Configuring master and data nodes
Configuring data-only nodes
Configuring master-only nodes
Configuring the query processing-only nodes
The master election configuration
Zen discovery fault detection and configuration
The Amazon EC2 discovery
The EC2 plugin installation
The EC2 plugin's generic configuration
Optional EC2 discovery configuration options
The EC2 nodes scanning configuration
Other discovery implementations
The gateway and recovery configuration
The gateway recovery process
Configuration properties
Expectations on nodes
The local gateway
Low-level recovery configuration
Cluster-level recovery configuration
Index-level recovery settings
The indices recovery API
The human-friendly status API – using the Cat API
The basics
Using the Cat API
Common arguments
The examples
Getting information about the master node
Getting information about the nodes
Backing up
Saving backups in the cloud
The S3 repository
The HDFS repository
The Azure repository
Federated search
The test clusters
Creating the tribe node
Using the unicast discovery for tribes
Reading data with the tribe node
Master-level read operations
Writing data with the tribe node
Master-level write operations
Handling indices conflicts
Blocking write operations
Summary
8. Improving Performance
Using doc values to optimize your queries
The problem with field data cache
The example of doc values usage
Knowing about garbage collector
Java memory
The life cycle of Java objects and garbage collections
Dealing with garbage collection problems
Turning on logging of garbage collection work
Using JStat
Creating memory dumps
More information on the garbage collector work
Adjusting the garbage collector work in Elasticsearch
Using a standard start up script
Service wrapper
Avoid swapping on Unix-like systems
Benchmarking queries
Preparing your cluster configuration for benchmarking
Running benchmarks
Controlling currently run benchmarks
Very hot threads
Usage clarification for the Hot Threads API
The Hot Threads API response
Scaling Elasticsearch
Vertical scaling
Horizontal scaling
Automatically creating replicas
Redundancy and high availability
Cost and performance flexibility
Continuous upgrades
Multiple Elasticsearch instances on a single physical machine
Preventing the shard and its replicas from being on the same node
Designated nodes' roles for larger clusters
Query aggregator nodes
Data nodes
Master eligible nodes
Using Elasticsearch for high load scenarios
General Elasticsearch-tuning advice
Choosing the right store
The index refresh rate
Thread pools tuning
Adjusting the merge process
Data distribution
Advice for high query rate scenarios
Filter caches and shard query caches
Think about the queries
Using routing
Parallelize your queries
Field data cache and breaking the circuit
Keeping size and shard_size under control
High indexing throughput scenarios and Elasticsearch
Bulk indexing
Doc values versus indexing speed
Keep your document fields under control
The index architecture and replication
Tuning write-ahead log
Think about storage
RAM buffer for indexing
Summary
9. Developing Elasticsearch Plugins
Creating the Apache Maven project structure
Understanding the basics
The structure of the Maven Java project
The idea of POM
Running the build process
Introducing the assembly Maven plugin
Creating custom REST action
The assumptions
Implementation details
Using the REST action class
The constructor
Handling requests
Writing response
The plugin class
Informing Elasticsearch about our REST action
Time for testing
Building the REST action plugin
Installing the REST action plugin
Checking whether the REST action plugin works
Creating the custom analysis plugin
Implementation details
Implementing TokenFilter
Implementing the TokenFilter factory
Implementing the class custom analyzer
Implementing the analyzer provider
Implementing the analysis binder
Implementing the analyzer indices component
Implementing the analyzer module
Implementing the analyzer plugin
Informing Elasticsearch about our custom analyzer
Testing our custom analysis plugin
Building our custom analysis plugin
Installing the custom analysis plugin
Checking whether our analysis plugin works
Summary
III. Module 3
1. Introduction to ELK Stack
The need for log analysis
Issue debugging
Performance analysis
Security analysis
Predictive analysis
Internet of things and logging
Challenges in log analysis
Non-consistent log format
Tomcat logs
Apache access logs – combined log format
IIS logs
Variety of time formats
Decentralized logs
Expert knowledge requirement
The ELK Stack
Elasticsearch
Logstash
Kibana
ELK data pipeline
ELK Stack installation
Installing Elasticsearch
Running Elasticsearch
Elasticsearch configuration
Network Address
Paths
The cluster name
The node name
Elasticsearch plugins
Installing Logstash
Running Logstash
Logstash with file input
Logstash with Elasticsearch output
Configuring Logstash
Installing Logstash forwarder
Logstash plugins
Input plugin
Filters plugin
Output plugin
Installing Kibana
Configuring Kibana
Running Kibana
Kibana interface
Discover
Visualize
Dashboard
Settings
Summary
2. Building Your First Data Pipeline with ELK
Input dataset
Data format for input dataset
Configuring Logstash input
Filtering and processing input
Putting data to Elasticsearch
Visualizing with Kibana
Running Kibana
Kibana visualizations
Building a line chart
Building a bar chart
Building a Metric
Building a data table
Summary
3. Collect, Parse and Transform Data with Logstash
Configuring Logstash
Logstash plugins
Listing all plugins in Logstash
Data types for plugin properties
Array
Boolean
Codec
Hash
String
Comments
Field references
Logstash conditionals
Types of Logstash plugins
Input plugins
file
Configuration options
add_field
codec
delimiter
exclude
path
sincedb_path
sincedb_write_interval
start_position
tags
type
stdin
Configuration options
add_field
codec
tags
type
twitter
Configuration options
add_field
codec
consumer_key
consumer_secret
full_tweet
keywords
oauth_token
oauth_token_secret
tags
type
lumberjack
Configuration options
add_field
codec
host
port
ssl_certificate
ssl_key
ssl_key_passphrase
tags
type
redis
Configuration options
add_field
codec
data_type
host
key
password
port
Output plugins
csv
Configuration options
codec
csv_options
fields
gzip
path
file
Configuration options
email
Configuration options
attachments
body
cc
from
to
htmlbody
replyto
subject
elasticsearch
Configuration options
ganglia
Configuration options
metric
unit
value
jira
Configuration options
kafka
Configuration options
topic_id
lumberjack
Configuration options
hosts
port
ssl_certificate
redis
Configuration options
rabbitmq
stdout
mongodb
Configuration options
collection
database
uri
Filter plugins
csv
Configuration options
date
Configuration options
drop
Configuration options
geoip
Configuration options
source
grok
Custom grok patterns
mutate
Configuration options
sleep
Codec plugins
json
line
multiline
plain
rubydebug
Summary
4. Creating Custom Logstash Plugins
Logstash plugin management
Plugin lifecycle management
Installing a plugin
Updating a plugin
Uninstalling a plugin
Structure of a Logstash plugin
Required dependencies
Class declaration
Configuration name
Configuration options setting
Plugin methods
Input plugin
Filter plugin
Output plugin
Codec plugin
Writing a Logstash filter plugin
Building the plugin
Summary
5. Why Do We Need Elasticsearch in ELK?
Why Elasticsearch?
Elasticsearch basic concepts
Index
Document
Field
Type
Mapping
Shard
Primary shard and replica shard
Cluster
Node
Exploring the Elasticsearch API
Listing all available indices
Listing all nodes in a cluster
Checking the health of the cluster
Health status of the cluster
Creating an index
Retrieving the document
Deleting documents
Deleting an index
Elasticsearch Query DSL
Elasticsearch plugins
Bigdesk plugin
Elastic-Hammer plugin
Head plugin
Summary
6. Finding Insights with Kibana
Kibana 4 features
Search highlights
Elasticsearch aggregations
Scripted fields
Dynamic dashboards
Kibana interface
Discover page
Time filter
Quick time filter
Relative time filter
Absolute time filter
Kibana Auto-refresh setting
Querying and searching data
Freetext search
AND
OR
NOT
Groupings
Wildcard searches
Field searches
Range searches
Special characters escaping
New search
Saving the search
Loading a search
Field searches using field list
Summary
7. Kibana – Visualization and Dashboard
Visualize page
Creating a visualization
Visualization types
Metrics and buckets aggregations
Buckets
Date Histogram
Histogram
Range
Date Range
Terms
Metrics
Count
Average, Sum, Min, and Max
Unique Count
Advanced options
Visualizations
Area chart
Data table
Line chart
Markdown widget
Metric
Pie chart
Tile map
Vertical bar chart
Dashboard page
Building a new dashboard
Saving and loading a dashboard
Sharing a dashboard
Summary
8. Putting It All Together
Input dataset
Configuring Logstash input
Grok pattern for access logs
Visualizing with Kibana
Running Kibana
Searching on the Discover page
Visualizations – charts
Building a Line chart
Building an Area chart
Building a Bar chart
Building a Markdown
Dashboard page
Summary
9. ELK Stack in Production
Prevention of data loss
Data protection
System scalability
Data retention
ELK Stack implementations
ELK Stack at LinkedIn
Problem statement
Criteria for solution
Solution
Kafka at LinkedIn
Operational challenges
Logging using Kafka at LinkedIn
ELK at SCA
How is ELK used in SCA?
How is it helping in analytics?
ELK for monitoring at SCA
ELK at Cliffhanger Solutions
Kibana demo – Packetbeat dashboard
Summary
10. Expanding Horizons with ELK
Elasticsearch plugins and utilities
Curator for index management
Curator commands
Curator installation
Shield for security
Shield installation
Adding users and roles
Using Kibana4 on shield protected Elasticsearch
Marvel to monitor
Marvel installation
Marvel dashboards
ELK roadmap
Elasticsearch roadmap
Logstash roadmap
Event persistence capability
End-to-end message acknowledgement
Logstash monitoring and management API
Kibana roadmap
Summary
A. Bibliography
Index