In the previous chapter, we looked at how large volumes of data can be managed and leveraged for analytical insight, and at how changes in data can be detected and responded to using rules (also called alerts). This chapter explores the use of machine learning techniques to find unknowns in data and understand trends that cannot be captured using a rule-based approach.
Machine learning is a dense subject with a wide range of theoretical and practical concepts to cover. In this chapter, we will focus on some of the more important aspects of running machine learning jobs on Elasticsearch. Specifically, we will cover the following:
To use machine learning features, ensure that the Elasticsearch cluster contains at least one node with the role ml. This enables the running of machine learning jobs on the cluster:
node.roles: [data, ml]
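To check which roles the nodes in your cluster currently have, you can query the cat nodes API; the node.role column lists abbreviated roles, with l indicating a machine learning node:
GET _cat/nodes?v&h=name,node.role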
Elasticsearch is a powerful tool when it comes to storing, searching, and aggregating large volumes of data. Dashboards and visualizations help with user-driven interrogation and exploration of data, while tools such as Watcher and Kibana alerting allow users to take automatic action when data changes in a predefined or expected manner.
However, many data sources exhibit trends or insights that are hard to capture as a predefined rule or query. Consider the following example:
This sort of failure can be very hard to detect using standard alerting logic but can be extremely easy to spot using anomaly detection. Machine learning models can build a baseline for log volumes per data source over a period of time (or a season), learning the normal changes and variations in volumes. If a failure were to occur, the new values observed would fall outside the expected range of the model, resulting in an anomaly.
Consider a slightly different example (and one that we will use in subsequent sections of this chapter).
A web server logs all requests made to an internet-facing web application. The application is used by various employees who are working from home (and therefore outside the company network) and administrators who look after the system. The logs look as follows:
Navigate to Chapter5/dataset in the code repository and ingest the web application logs into your Elasticsearch cluster, as follows:
./load.sh
logstash-8.0.0/bin/logstash -f webapp-logstash.conf < webapp.csv
GET webapp/_search
Here is the output:
Data views on Kibana map to one or more indices on Elasticsearch and provide Kibana with information on fields and mappings to enable visualization features. We will further explore Kibana concepts and functionality in Chapter 8, Interacting with Your Data on Kibana.
Open up the Discover app from the navigation menu and click on the calendar icon to select a time range for your data. This dataset in particular contains events from March 6, 2021 to March 11, 2021. Click on the Update button once selected to view your data.
The next section looks at preparing data for use in machine learning jobs.
In order for machine learning jobs to analyze document field values when building baselines and identifying anomalies, it is important to ensure the index mappings are accurately defined. Furthermore, it is useful to parse out complex fields (using ETL tools or ingest pipelines) into their own subfields to use in machine learning jobs.
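A quick way to confirm this before configuring any jobs is to inspect the index mappings directly; fields such as http.request.bytes and http.response.bytes should be mapped as numeric types so that metric functions can aggregate them:
GET webapp/_mapping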
The machine learning application provides useful functionality to visualize the index you're looking to run jobs on, and ensure mappings and values are as expected. The UI lists all fields, data types, and some sample values where appropriate.
Navigate to the machine learning app on Kibana and perform the following steps:
Now that we've prepared our dataset, let's look at some core machine learning concepts that make up the capabilities of the Elastic Stack.
The machine learning technique or approach you use depends on the data and use case you're looking to solve; it comes down to the question you're asking of your data. Broadly, the approaches can be broken down into the following.
Unsupervised learning is a machine learning approach that learns patterns and trends in data without any external labeling or tagging. The approach can be used to extract otherwise hard-to-find behaviors in the data without human intervention.
At a high level, the technique works by analyzing functions of field values (over time, or across a series of documents) to build a behavioral baseline (a norm). New field values are compared to the baseline, allowing for a margin of error; data points that fall outside the expected range are classified as anomalies. Assuming the model has analyzed enough data to capture the seasonality of the dataset, it can also be used to forecast field values for future time ranges.
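As a preview of this capability, a forecast can be requested from a running anomaly detection job through the forecast API. The job ID below is hypothetical (we create similar jobs later in this chapter):
POST _ml/anomaly_detectors/webapp-event-rate/_forecast
{
  "duration": "3d"
}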
Some use cases for unsupervised learning include the following:
The Elastic Stack implements the aforementioned unsupervised learning use cases as part of the time series anomaly detection and outlier detection features.
Supervised learning is an approach where a machine learning model is trained with labeled training data that tells the algorithm about an outcome or observed value, given the input data. The approach is useful when you know the input and output to a given problem, but not necessarily how to solve it.
Supervised learning works by analyzing key features and the corresponding output value in a dataset and producing a mapping for them. This mapping can then be used to predict output values given new and unseen inputs.
Examples of use cases that can leverage supervised learning include the following:
The Data Frame Analytics tab on the machine learning app offers classification (to predict or classify data into categories) and regression (predicting values for fields based on their relationships) as supervised learning features.
Next, we will look at how machine learning can be used to detect anomalies in time series datasets.
Given the logs in the webapp index, there is concern that potentially undesired activity has been taking place on the application. This could be completely benign or could have malicious consequences. This section will look at how a series of machine learning jobs can be implemented to better understand and analyze the activity in the logs.
We will use a single-metric machine learning job to build a baseline for the number of log events generated by the application during normal operation.
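The steps below use the Kibana wizard, but the same job can also be expressed through the machine learning APIs. Here is a minimal sketch, assuming a count detector, a hypothetical job ID, and a 15-minute bucket span:
PUT _ml/anomaly_detectors/webapp-event-rate
{
  "description": "Baseline for web application event rate",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "count",
        "detector_description": "Count of web app events"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}

PUT _ml/datafeeds/datafeed-webapp-event-rate
{
  "job_id": "webapp-event-rate",
  "indices": ["webapp"]
}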
Follow these steps to configure the job:
The chart displays the following information:
Given the information from this job, we can come to the following conclusions:
We know that the previous job indicated an unexpected and anomalous spike in the number of requests received by the web app (event rate). One option to further understand the implications of this activity is to analyze the amount of data sent and received by the application. We will use a multi-metric anomaly detection job to achieve this outcome.
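As with the previous job, the wizard steps follow, but an equivalent API sketch is shown here, assuming sum detectors over the request/response byte fields, url.path as the split (partition) field, and source.ip as an influencer; the job ID and bucket span are assumptions:
PUT _ml/anomaly_detectors/webapp-data-transfer
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "sum",
        "field_name": "http.request.bytes",
        "partition_field_name": "url.path"
      },
      {
        "function": "sum",
        "field_name": "http.response.bytes",
        "partition_field_name": "url.path"
      }
    ],
    "influencers": ["source.ip", "url.path"]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}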
Follow these instructions to configure the job:
From the results produced by the completed machine learning job, we can make the following observations and conclusions:
Note
The following graphs can be viewed by clicking on a cell in the anomaly timeline heatmap on the Kibana interface.
We've detected anomalies in overall event rates as well as data transfer volumes for different URL paths on the web app. From the second job, it was evident that a lot of the anomalies originated from a single IP address. This leads to the question of how anomalous the activity originating from this IP address was compared to the rest of the source IP addresses in the dataset (the population of source IPs). We will use a population job to analyze this activity.
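A population job differs from the previous jobs in that each detector sets over_field_name, which compares every entity against the modeled behavior of the whole population. A sketch under the same assumptions as before (hypothetical job ID, 15-minute buckets) follows:
PUT _ml/anomaly_detectors/webapp-source-ip-population
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "count",
        "over_field_name": "source.ip"
      },
      {
        "function": "sum",
        "field_name": "http.response.bytes",
        "over_field_name": "source.ip"
      }
    ],
    "influencers": ["source.ip", "user_agent.name"]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}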
Follow these instructions to configure this job:
As expected, there is a single source IP that stands out with anomalous activity, both in terms of event rate and data transfer volumes, compared to the rest of the population.
You can also view the anomalies by the user_agent.name value and see that most of the anomalous requests came from curl and a version of Firefox.
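If you prefer the API to the heatmap, the underlying anomaly records can also be retrieved directly. This request assumes the hypothetical population job ID sketched earlier, and the score threshold is arbitrary:
GET _ml/anomaly_detectors/webapp-source-ip-population/results/records
{
  "record_score": 75,
  "sort": "record_score",
  "desc": true
}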
From looking at the anomalies detected, we can conclude the following:
Now that we have gathered enough data, we can use the Discover tab to explore the raw events. Type the following query into the search bar to filter for successful HTTP response codes (200 and 201) on authentication requests and remove the noise from any failed authentication attempts:
source.ip: "210.19.10.10" AND (http.response.status_code : "200" OR http.response.status_code:"201")
We can now see the potentially malicious IP address successfully authenticating with the application and exfiltrating customer data over the course of 3 days.
Now that we know how various types of unsupervised anomaly detection jobs can be used in analyzing logs, the next section will focus on using supervised machine learning to train models and classify your data.
Unsupervised anomaly detection is useful when looking for abnormal or unexpected behavior in a dataset to guide investigation and analysis. It can unearth silent faults, unexpected usage patterns, resource abuse, or malicious user activity. This is just one class of use cases enabled by machine learning.
It is common to have historical data where, with post analysis, it is rather easy to label or tag this data with a meaningful value. For example, if you have access to service usage data for your subscription-based online application along with a record of canceled subscriptions, you could tag snapshots of the usage activity with a label indicating whether the customer churned.
Consider a different example where an IT team has access to web application logs. With post analysis, they can label events that indicate malicious activity, such as password spraying attempts, as the request payloads differ from normal requests originating from the application.
Training a machine learning model with input data and a labeled outcome (did the customer churn, or was the request maliciously crafted?) is a useful tool for taking proactive and timely data-driven action. Users likely to churn can be offered a discount on the service or access to tutorials so they can leverage the service better. IP addresses responsible for maliciously crafted requests can be blocked proactively without manual intervention, and the security team can be alerted to take broader action (such as enforcing two-factor authentication or asking users to change their passwords).
Data frame analytics in Elastic machine learning provides two main features to help with such use cases:
We will implement a classification job to analyze features in web application requests to predict whether a request is malicious in nature. Looking at the nature of the requests, as well as some of the findings from the anomaly detection jobs, we know that request/response sizes for malicious requests differ from those of standard, user-generated requests.
Before configuring the job, follow these steps to ingest a tagged dataset where an additional malicious Boolean column is introduced to the CSV file:
logstash-8.0.0/bin/logstash -f webapp-tagged-logstash.conf < webapp-tagged.csv
GET webapp-tagged/_search
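The wizard steps below create the job in Kibana. For reference, the equivalent data frame analytics API request is sketched here, with assumed values for the job ID, destination index, training percentage, and analyzed fields:
PUT _ml/data_frame/analytics/classification-request-payloads
{
  "source": {
    "index": "webapp-tagged"
  },
  "dest": {
    "index": "webapp-tagged-predictions"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "malicious",
      "training_percent": 80,
      "num_top_classes": 2
    }
  },
  "analyzed_fields": {
    "includes": [
      "http.request.bytes",
      "http.response.bytes",
      "malicious"
    ]
  }
}

POST _ml/data_frame/analytics/classification-request-payloads/_start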
Follow these instructions to configure the classification job:
The Model evaluation pane displays a confusion matrix. This visualization displays the following:
In our example, the model classified all malicious events as malicious and non-malicious events as not malicious, producing the following matrix:
The results pane provides fine-grained detail for each document, including its prediction and a probability score. Navigate to the Data Frame Analytics page and click on Models to see the trained model.
Now that we have a trained model, the next section will look at using the model to infer classification for new incoming data.
As we learned in Chapter 4, Leveraging Insights and Managing Data on Elasticsearch, ingest pipelines can be used to transform, process, and enrich incoming documents before indexing. Ingest pipelines provide an inference processor to run new documents through a trained machine learning model to infer classification or regression results.
Follow these instructions to create and test an ingest pipeline to run inference using the trained machine learning model:
PUT _ingest/pipeline/ml-malicious-request
{
"processors": [
{
"inference": {
"model_id": "classification-request-payloads-1615680927179",
"inference_config": {
"classification": {
"num_top_classes": 2,
"results_field": "prediction",
"top_classes_results_field": "probabilities"
}
}
}
}
]
}
The inference_config setting can be used to configure the behavior of the inference processor in the ingest pipeline. Detailed configuration settings for the inference processor can be found in the Elasticsearch guide:
https://www.elastic.co/guide/en/elasticsearch/reference/8.0/inference-processor.html.
Pre-prepared simulate requests can be found in the simulate-malicious-docs.json and simulate-not-malicious-docs.json files:
POST _ingest/pipeline/ml-malicious-request/_simulate
{
"docs": [
{
...
"_source": {
"user.name": "u110191",
"source": {
"geo": {
"country_name": "United States",
...
},
"ip": "64.213.79.243"
},
"url": {
...
"full": "https://host-systems.net/user/manage.php"
},
"http.request.bytes": "132",
"event.action": "view",
"@timestamp": "2021-03-15T10:20:00.000Z",
"http.request.method": "POST",
"http.response.bytes": "1417",
"event.kind": "request",
"http.response.status_code": "200",
"event.dataset": "weblogs",
...
}
}
]
}
The API should return output as shown in the following screenshot. The ml object contains inference results, including the predicted label, a prediction score (or confidence), and the probabilities for each class:
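In outline, that object resembles the following; the exact layout depends on the processor configuration (we set results_field and top_classes_results_field earlier), and these values are purely illustrative:
"ml": {
  "prediction": "true",
  "probabilities": [
    { "class_name": "true", "class_probability": 0.98 },
    { "class_name": "false", "class_probability": 0.02 }
  ],
  "model_id": "classification-request-payloads-1615680927179"
}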
PUT _ingest/pipeline/webapp-pipeline
{
"processors": [
...
{
"dissect": { ... }
},
{
"remove": { ... }
},
{
"inference": {
"model_id": "classification-request-payloads-1615680927179",
"inference_config": {
"classification": {
"num_top_classes": 2,
"results_field": "prediction",
"top_classes_results_field": "probabilities"
}
}
}
}
]
}
New documents indexed using the pipeline should now contain inference results from the trained model.
logstash-8.0.0/bin/logstash -f webapp-logstash.conf < webapp-new-data.csv
logstash-8.0.0/bin/logstash -f webapp-logstash.conf < webapp-password-spraying-attemps.csv
GET webapp/_search?size=1000
{
"query": {
"exists": {
"field": "ml"
}
}
}
The preceding events indicate password spraying attempts, with the machine learning model correctly labeling the events as malicious. Using this information, administrators or analysts can set up alerts or automatic actions (using tools such as Kibana alerting or Watcher) to respond to potential system abuse in the future. We will look at how alerts can be set up in Kibana in Chapter 8, Interacting with Your Data on Kibana.
In this chapter, we looked at applying supervised and unsupervised machine learning techniques on data in Elasticsearch for various use cases.
First, we explored the use of unsupervised learning to look for anomalous behavior in time series data. We used single-metric, multi-metric, and population jobs to analyze a dataset of web application logs to look for potentially malicious activity.
Next, we looked at the use of supervised learning to train a machine learning model to classify requests to the web application as malicious, using features in the request (primarily the HTTP request/response size values).
Finally, we looked at how the inference processor in ingest pipelines can be used to run continuous inference using a trained model for new data.
In the next chapter, we will move our focus to Beats and their role in the data pipeline. We will look at how different types of events can be collected by Beats agents and sent to Elasticsearch or Logstash for processing.