Welcome to Getting Started with Elastic Stack 8.0. The Elastic Stack has exploded in popularity over the last couple of years, becoming the de facto standard for centralized logging and "big data"-related use cases. The stack is leveraged by organizations, both big and small, across the world to solve a range of data-related problems. Hunting for adversaries in your network, looking for fraudulent transactions, real-time monitoring and alerting in systems, and searching for relevant products in catalogs are some of the real-world applications of the Elastic Stack.
The Elastic Stack is a bundle of multiple core products that integrate with each other. We will look at each product briefly in this chapter, and then dive into each one in later chapters in this book. The Elastic Stack attracts a great deal of interest from developers and architects that are working for organizations of all sizes. This book aims to serve as the go-to guide for those looking to get started with the best practices when it comes to building real-world search, security, and observability platforms using this technology.
In this chapter, you will learn a little bit about each component that makes up the Elastic Stack, and how they can be leveraged for your use cases. Those of you who are beginners or intermediary learners of this subject will benefit from this content to gain useful context for Chapter 3, Indexing and Searching for Data, to Chapter 13, Architecting Workloads on the Elastic Stack, of this book.
Specifically, we will cover the following topics:
The Elastic Stack is made up of four core products:
This diagram shows how the core components of the Elastic Stack work together to ingest, store, and search on data:
Each core component solves a single, common data-related problem. This genericity makes the stack flexible and domain-agnostic, allowing it to be adopted in multiple solution areas. Most users start with a simple logging use case where data is collected, parsed, and stored in Elasticsearch to create dashboards and alerts. Others might create more sophisticated capabilities, such as a workplace search to make information across a range of data sources accessible to your team; leveraging SIEM and machine learning to look for anomalous user/machine behavior and hunt for adversaries on your company network; understanding performance bottlenecks in applications; and monitoring infrastructure logs/metrics to respond to issues on critical systems.
Multiple independent projects have evolved over the years to create the present-day version of the Elastic Stack. Knowing how these components evolved indicates some of the functional gaps that existed in the big data space and how the Elastic Stack components come together to solve these challenges. Let's take a look:
The core components of the stack are open source software projects, licensed under a mix of the Apache 2, Elastic License version 2 (ELv2), and Server Side Public License (SSPL) licensing agreements. The LICENSE.txt file in the root of each product's GitHub repository should explain how the code is licensed.
A paid license is not required to learn about and explore the Elastic Stack features covered in this book. A trial license can be activated for full access to all the features for a limited period upon installing the software.
To focus on learning about the features and technical aspects of the product, there will be no notes on licensing implications after this section. Please refer to the Elastic Subscriptions page to understand what kind of license you might need for a production deployment of the technology:
https://www.elastic.co/subscriptions
Elasticsearch is often described as a distributed search engine that can be used to search through and aggregate enormous amounts of data. Some describe Elasticsearch as an analytics engine, while others have used the term document store or NoSQL database. The reason for the wide-ranging definitions for Elasticsearch is that it is quite a flexible product. It can be used to store JSON documents, with or without a predefined schema (allowing for unstructured data); it can be used to compute aggregations on document values (to calculate metrics or group data into buckets), and it can be used to implement relevant, free text search functionality across a large corpus.
Elasticsearch builds on top of Apache Lucene, a popular and fast full-text search library for Java applications. Lucene is not distributed in any way and does not manage resources/handle requests natively. At its core, Elasticsearch abstracts away the complexities and intricacies of working directly with a library such as Lucene by providing user-friendly APIs to help index, search for, and aggregate data. It also introduces concepts such as the following:
Elasticsearch exposes RESTful APIs using JSON format, allowing for interoperability between different programming languages and technology stacks.
Elasticsearch today is a feature-rich and complex piece of software. Do not worry if you do not fully understand or appreciate some of the terms used to explain Elasticsearch. We will dive into these, as well as the features on offer, in Chapter 3, Indexing and Searching for Data.
Selecting the right tool for the job is an important aspect of any project. This section describes some scenarios where Elasticsearch may be suited for use.
Elasticsearch is a horizontally scalable data store where additional nodes can easily be added to a cluster to increase the available resources. Each node can store multiple primary shards on data, and each shard can be replicated (as replica shards) on other nodes. Primary shards handle read and write requests, while replica shards only handle read requests.
The following diagram shows how primary and replica shards are distributed across Elasticsearch nodes to achieve scalable and redundant reading and writing of data:
The preceding diagram shows the following:
High volume ingest can mean either of the following things:
Elasticsearch can also store large volumes of data for search and aggregation. To retain data costs efficiently over long retention periods, clusters can be architected with hot, warm, and cold tiers of data. During its life cycle, data can be moved across nodes with different disk or Input/Output Operations Per Second (IOPS) specifications to take advantage of slower disk drives and their associated lower costs. We will look at these sorts of architectures in Chapter 3, Indexing and Searching for Data and Chapter 13, Architecting Workloads on the Elastic Stack.
Some examples of where you need to handle large volumes of data include the following:
As we discussed previously, Elasticsearch leverages Lucene for indexing and searching operations. As documents are ingested into Elasticsearch, unstructured textual components from the document are analyzed to extract some structure in the form of terms. Terms are maintained in an inverted index data structure. In simple terms, an index (such as the table of contents in a book) is a list of topics (or documents) and the corresponding page numbers for each topic. An index is great for retrieving page content, given you already know what the chapter is called. An inverted index, however, is a collection of words (or terms) in topics and a corresponding list of pages that contain them. Therefore, an inverted index can make it easier to find all the relevant pages, given the search term you are interested in.
The following table visualizes an inverted index for a collection of documents containing recipes:
A search string containing multiple terms goes through a similar process of analysis to extract terms, to then look up all the matching terms and their occurrences in the inverted index. A score is calculated for each field match based on the similarity module. By default, the BM25 ranking function (based on term frequency/inverse document frequency) is used to estimate the relevance of a document for a search query. Elasticsearch then returns a union of the results if an OR operator is used (by default) or an intersection of the results if an AND operator is used. The results are sorted by score, with the highest score appearing first.
Elasticsearch can aggregate large volumes of data with speed thanks to its distributed nature. There are primarily two types of aggregations:
Tools such as Kibana heavily use aggregations to visualize the data on Elasticsearch. We will dive deeper into aggregations in Chapter 4, Leveraging Insights and Managing Data on Elasticsearch.
One of the benefits of quickly ingesting and retrieving data is the ability to respond to the latest information quickly. Imagine a scenario where uptime information for business-critical services is ingested into Elasticsearch. Alerting would work by continually querying Elasticsearch (at a predefined interval) to return any documents that indicate degrading service performance or downtime. If the query returns any results, actions can be configured to alert a Site Reliability Engineer (SRE) or trigger automated remediation processes as appropriate.
Watcher and Kibana alerting are two ways in which this can be achieved; we will look at this in detail in Chapter 4, Leveraging Insights and Managing Data on Elasticsearch, and Chapter 8, Interacting with Your Data on Kibana.
Elasticsearch does not require predefined schemas for documents you want to work with. Schemas on indices can be preconfigured if they're known to control storage/memory consumption and know how the data will be used later on. Schemas (also known as index mappings) can be dynamically or strictly configured, depending on your flexibility and the maturity of your document's structure.
By default, Elasticsearch will dynamically update these index mappings based on the documents that have been ingested. Where no mapping exists for a field, Elasticsearch will guess the data type based on its value. This flexibility makes it extremely easy for developers to get up and running, while also making it suitable for use in environments where document schemas may evolve over time.
We'll look at index mappings in Chapter 3, Indexing and Searching for Data.
Elasticsearch can be configured to work as a distributed system where groups of nodes (Elasticsearch instances) work together to form a cluster. Clusters can be set up for the various architectural characteristics when deployed in mission-critical environments. We will take a look at some of these in this section.
As we mentioned previously, Elasticsearch is a horizontally scalable system. Read/write throughput, as well as storage capacity, can be increased almost linearly by adding additional nodes to the Elasticsearch cluster. Adding nodes to a cluster is relatively effortless and can be done without any downtime. The cluster can automatically redistribute shards evenly across nodes (subject to shard allocation filtering rules) as the number of nodes available changes to optimize performance and improve resource utilization across nodes.
A primary shard in Elasticsearch can handle both read and write operations, while a replica shard is a read-only copy of a primary shard. By default, Elasticsearch will allocate one replica for every primary shard on different nodes in the cluster, making Elasticsearch a highly available system where requests can still be completed when one or more nodes experience failures.
If a node holding a primary shard is lost, a replica shard will be selected and promoted to become a primary shard, and a replica shard will be allocated to another node in the cluster.
If a node holding a replica shard is lost, the replica shard will simply be allocated to another node in the cluster.
Indexing and search requests can be handled seamlessly while shards are being allocated, with clients experiencing little to no downtime. Even if a search request fails, subsequent search requests will likely succeed because of this architecture.
Shard allocation on Elasticsearch can also consider node attributes to help us make more informed allocation decisions. For example, a cluster deployed in a cloud region with three availability zones can be configured so that replicas are always allocated on a different availability zone (or even a server rack in an on-premises data center) to the primary shard to protect against failures at the zone level.
Elasticsearch allows us to persistently store or snapshot data, making it a recoverable system in the event of a disaster. Snapshots can be configured to write data to a traditional filesystem or an object store such as AWS S3. Snapshots are a point-in-time copy of the data and must be taken at regular intervals, depending on your Recovery Point Objective (RPO), for an effective disaster recovery plan to be created.
Elasticsearch can search for and replicate data across remote clusters to enable more sophisticated architectural patterns.
Cross-Cluster Search (CCS) is a feature that allows you to search data that resides on an external or remote Elasticsearch cluster. A single search request can be run on the local cluster, as well as one or more remote clusters. Each cluster will run the search independently on its own data before returning a response to the coordinator node (the node handling the search request). The coordinator nodes then combine the results from the different clusters into a single response for the client. The local node does not join remote clusters, allowing for higher network latencies for inter-cluster communication, compared to intracluster communication. This is useful in scenarios where multiple independent clusters in different geographic regions need to search on each other to have a unified search capability.
The following diagram shows how Elasticsearch clusters can search across multiple clusters and combine results into a single search response for the user:
Cross-cluster replication (CCR) allows an index to be replicated in a local cluster to a remote cluster. CCR implements a leader/follower model, where all the changes that have been made to a leader index are replicated on the follower index. This feature allows for fast searching on the same dataset in different geographical regions by replicating data closer to where it will be consumed. CCR is also sometimes used for cross-region redundancy requirements:
CCS and CCR enable more complex use cases where multiple regional clusters can be used to independently store and search for data, while also allowing unified search and geographical redundancy.
Elasticsearch offers security features to help authenticate and authorize user requests, as well as encrypting network communications to and within the cluster:
It is also important to understand the limitations of Elasticsearch. This section describes some scenarios where Elasticsearch alone may not be the best tool for the job.
Elasticsearch, unlike databases such as MySQL, was not designed to handle relational data. Elasticsearch allows you to have simple relationships in your data, such as parent-child and nested relationships, at a performance cost (at search time and indexing time, respectively). Data on Elasticsearch must be de-normalized (duplicating or adding redundant fields to documents, to avoid having to join data) to improve search and indexing/update performance.
If you need to have the database manage relationships and enforce rules of consistency across different types of linked data, as well as maintaining normalized records of data, Elasticsearch may not be the right tool for the job.
Individual requests in Elasticsearch support ACID properties. However, Elasticsearch does not have the concept of transactions, so it does not offer ACID transactions.
At the individual request level, ACID properties can be achieved as follows:
If ACID transactions are important to your use case, Elasticsearch may not be suitable for you.
Important Note
In the case of relational data or ACID transaction requirements, Elasticsearch is often used alongside a traditional RDBMS solution such as MySQL. In such architectures, the RDBMS would act as the source of truth and handle writes/updates from the application. These updates can then be replicated to Elasticsearch using tools such as Logstash for fast/relevant searches and visualization/analytics use cases.
With that, we have explored some of the core concepts of Elasticsearch and the role it plays in ingesting and storing our data. Now, let's look at how we can interact with the data on Elasticsearch using Kibana.
Kibana was created primarily as a visualization tool for data residing on Elasticsearch and is bundled together as part of the stack. Since its inception, Kibana has evolved to cater to use cases such as alerting, reporting, and monitoring Elastic Stack components, as well as administrating and managing the Elasticsearch cluster in use.
More importantly, Kibana provides the interface and functionality for the solutions that Elastic Stack offers, in addition to administration and management options for the core components. Functionality in Kibana is organized using applications (or apps, for short).
Apps on Kibana can be solution-specific or part of the general stack. The SIEM app, for example, powers the security solution, enabling security analysts and threat hunters to defend their organization from attacks. The APM app is another solution-specific app that, in this case, allows developers and SREs to observe their applications to look for issues or performance bottlenecks.
On the other hand, general Kibana apps such as Discover, Visualize, and Dashboard can be used to explore, interrogate, and visualize data, regardless of the solution the data enables. Ingest Manager is another example of an app that allows you to configure Elastic Agent to collect any kind of data from across an environment, agnostic of the solution the data may be used in.
Solution-specific apps on Kibana provide a great out-of-the-box user experience, as well as targeted features and functionality for the solution in question. General or stack-based applications provide powerful but unified capabilities that can be used across all solutions, even custom solutions that you might build on the Elastic Stack. General Kibana apps such as Discover and Dashboard are useful for all use cases, while solution-specific apps such as Observability and Security provide curated out-of-the-box experiences for the solution area. Kibana is usually considered a core component of the Elastic Stack and is often installed, even if the cluster is not used for data analysis.
We will dive deeper into Kibana's features in Chapter 8, Interacting with Your Data on Kibana. Now, let's look at how data can be collected and ingested into Elasticsearch using Logstash and Beats.
So far, we have looked at Elasticsearch, a scalable search and analytics engine for all kinds of data. We have also got Kibana to interface with Elasticsearch to help us explore and use our data effectively. The final capability to make it all work together is ingestion.
The Elastic Stack provides two products for ingestion, depending on your use cases.
Useful data is generated all over the place in present-day environments, often from varying technology stacks, as well as legacy and new systems. As such, it makes sense to collect data directly from, or closer to, the source system and ship it into your centralized logging or analytics platform. This is where Beats come in; Beats are lightweight applications (also referred to as agents) that can collect and ship several types of data to destinations such as Elasticsearch, Logstash, or Kafka.
Elastic offers a few types of Beats today for various use cases:
Beats use an open source library called libbeat that provides generic APIs for configuring inputs and destinations for data output. Beats implement the data collection functionality that's specific to the type of data (such as logs and metrics) that they collect. A range of community-developed Beats are available, in addition to the officially produced Beats agents.
The modules that are available in Beats allow you to collect consistent datasets and the distribution of out-of-the-box dashboards, machine learning jobs, and alerts for users to leverage in their use cases.
One of the most important aspects of ingesting data into a centralized logging platform is paying attention to the data format in use. A Unified Data Model (UDM) is an especially useful tool to have, ensuring data can be easily consumed by end users once ingested into a logging platform. Enterprises typically follow a mixture of two approaches to ensure the log data complies with their unified data model:
This approach is often considerably costly to implement, maintain, and scale. Changes in the log schema at the source can also have unintended downstream implications in other applications consuming the data. It is common to see UDMs evolve quite rapidly as the nature and the content of the logs that have been collected change. The use of different technology stacks or frameworks in an organization can also make it challenging to log with consistency and uniformity across the environment.
(a) Parsers need to be maintained and constantly updated to make sure the logs are extracted and stored correctly.
(b) Most of the parsing work usually needs to be done (or overlooked) by a central function in the organization (because of the centralized nature of the transformation), rather than by following a self-service or DevOps-style operating model.
The Elastic Common Schema (ECS) is a unified data model set by Elastic. The following ECS specifications have a few advantages over a custom or internal UDM:
Beats modules can automatically convert logs and metrics from various supported data sources into an ECS-compliant schema. Beats modules also ship with out-of-the-box dashboards, machine learning jobs, and alerts. This makes it incredibly easy to onboard a new data source onto Elasticsearch using a Beat, and immediately being able to consume this data as part of a value-adding use case in your organization. There is a growing list of supported Filebeat and Metricbeat modules available on the Elastic integration catalog.
Managing a range of Beats agents can come with significant administrative overhead, especially in large and complex environments. Onboarding new data sources would require updating configuration files, which then need to be deployed on the right host machine.
Elastic Agent is a single, unified agent that can be used to collect logs, metrics, and uptime data from your environment. Elastic Agent orchestrates the core Beats agents under the hood but simplifies the deployment and configuration process, given teams now need to manage just one agent.
Elastic Agent can also work with a component called Fleet Server to simplify the ongoing management of the agents and the data they collect. Fleet can be used to centrally push policies to control data collection and manage agent version upgrades, without any additional administrative effort. We take look at Elastic Agent in more detail in Chapter 9, Managing Data Onboarding with Elastic Agent.
While Beats make it very convenient to onboard a new data source, they are designed to be lightweight in terms of performance footprint. As such, Beats do not provide a great deal of heavy processing, transformation, and enrichment capabilities. This is where Logstash comes in to help your ingestion architecture.
Logstash is a general-purpose ETL tool designed to input data from any number of source systems/communication protocols. The data is then processed through a set of filters, where you can mutate, add, enrich, or remove fields as required. Finally, events can be sent to several destination systems. This configuration is defined as a Logstash parser. We will dive deeper into Logstash and how it can be used for various ETL use cases in Chapter 7, Using Logstash to Extract, Transform, and Load Data.
Beats and Logstash are designed to serve specific requirements when collecting and ingesting data. Users are often confused when deciding between Beats or Logstash when onboarding a new data source. The following list aims to make this clearer.
Beats should be used when the following points apply to your use case:
(a) Collecting web logs from a dynamic group of hundreds of web servers
(b) Collecting logs from a large number of microservices running on a container orchestration platform such as Kubernetes
(c) Collecting metrics from a group of MySQL instances in cloud/on-premises locations
Logstash should be used when you have the following requirements:
It is worth noting that Beats agents are continually updated and enhanced with every release. The gap between the capabilities of Logstash and Beats has closed considerably over the last few releases.
It is quite common for organizations to get the best of both worlds by leveraging Beats and Logstash together. This allows data to be collected from a range of sources while enabling centralized processing and transformation of events.
Now that we understand how we can ingest data into the Elastic Stack, let's look at the options that are available when running the stack.
The Elastic Stack can be orchestrated and deployed in several different ways, depending on the following:
The following diagram illustrates the different orchestration options available and what level of management is provided:
Standalone deployments run on your own infrastructure and must be managed by you. In this book, we will run our code and examples on a standalone deployment of the Elastic Stack. These will run the same on the Elastic Cloud options.
As standalone deployments do not come with out-of-the-box orchestration capabilities, they can be difficult to manage and upgrade, especially at scale. Teams will often need to invest in custom tooling or automation to manage such deployments.
You might consider using Elastic Cloud for large production deployments that will benefit from orchestration and management capabilities.
Elastic Cloud is an offering that provides orchestration, administration, and management functionality for Elastic Stack components. There are three products available under the broader Elastic Cloud offering:
Now that we know how we can orchestrate an Elastic Stack deployment, let's look at what kind of solutions we can run on the stack.
As described previously, the Elastic Stack provides access to core components for search, visualization, and ingestion. These components can be used to solve a wide array of data-related problems, and it was up to the users of the products to create these solutions in the past. Over the last couple of releases, Elastic has moved to provide more out-of-the-box solutions built on top of the core components of the stack. Elastic provides solutions for Enterprise Search-, Security-, and Observability-related use cases at the time of writing. Users are free to build on top of the solutions on offer or build their own solutions for other use cases.
The Enterprise Search offering by Elastic provides access to an array of features to make it as easy and seamless as possible to add search functionality for your application, website, or workplace. Let's take a look:
The Security offering from Elastic builds on top of the solid foundation of the big data search, exploration, and visualization capabilities Kibana has offered for a long time. Elastic Security offers features for implementing security analytics capabilities using a SIEM solution, as well as EDR capabilities:
The Observability solution provides features that help developers and engineers gain better visibility and insight into the performance and operation of their applications and infrastructure.
The observability solution is comprised of the Logs, Metrics, Uptime, and APM apps on Kibana:
In this chapter, we gained a high-level overview of the Elastic Stack and its core components. We understood the role each component plays and how they can be used together to solve complex data-related problems. We also looked at what each of these components is inherently good at doing, and where they might not best serve the problem at hand. Finally, we looked at the out-of-the-box solutions in the areas of Enterprise Search, Security, and Observability that are offered by the stack.
In the next chapter, we will look at a range of options for installing and running components of the Elastic Stack to help you get started with your use cases.