Chapter 3

Analytics and Incident Response

Abstract

This chapter focuses on the analysis of data, so that you have the skillset to analyze the large volumes of data that most organizations already collect. We cover the use of open-source tools, such as Hadoop, MapReduce, R, and Mahout, so that you have a solid foundation for understanding the analytical process on any platform. Specifically, we demonstrate the power of these tools by applying them to a typical security scenario: analyzing very large collections of server log files. The tools and methods can be applied to the analysis of data from other system sources as well.

Keywords

Hadoop; Mahout; MapReduce; R; Security analytics; Server logs
Information in This Chapter
▪ Scenarios and Challenges in Intrusions and Incident Identification
▪ Use of Text Mining and Outlier Detection
▪ Case Study: A step-by-step guide to using statistical programming tools to find intrusions and incidents (the case study is a server log investigation using Hadoop and R)
▪ Other Applicable Security Areas and Scenarios

Introduction

Server security is a top concern, as widely publicized data breaches are increasingly in the news. After a data breach occurs, forensic analysis of server logs is necessary to identify vulnerabilities, perform damage assessments, prescribe mitigation measures, and collect evidence. However, the increasing amount of Internet traffic, accompanied by growing numbers of Web servers in data centers, often produces massive collections of server log data, which are difficult to analyze with traditional, nonparallelized methods.
By using the Hadoop, MapReduce, and Hive software stack, you have the ability to analyze very large collections of server logs simultaneously. Hadoop and MapReduce together provide a distributed file system and a parallel processing framework, while Hive provides the ability to query and analyze data with an SQL-like syntax. R gives you the essential analytical tools, which can be used on moderate-sized data sets, or on big data that have been aggregated or filtered down to a more manageable size using Hadoop and MapReduce.
There are commercial tools available to help you with querying log-file data. Some, such as Splunk, are capable of handling big data as well. However, we will be focusing the examples in this chapter on open-source and freely available tools and analytical platforms. By writing your own scripts, you can fully customize your analysis to your situation and can build repeatable processes as well. Open-source tools, such as R, offer thousands of analytical packages to choose from, including very sophisticated and cutting-edge methods that may not be available in commercial toolsets.
Commercial tools can be quite expensive, and not all organizations and departments have the budget for them. However, if you have access to commercial tools, by all means, use them to their fullest advantage. Commercial tools allow for very rapid exploration of your data, utilizing graphical user interfaces, which can make them well worth the cost. Even though scripts are great for reproducibility, which can be a huge advantage when you need to retrace your steps, or rerun your analysis on new data, they do take some time and effort to write. Therefore, it is difficult to beat a good graphical interface for rapid initial searches through your data.
Given that commercial tools and open-source tools each have their advantages, they should be viewed as complementary rather than competing technologies. If you can afford it, why not utilize both? Once you learn how to perform analysis using open-source tools, such as Hadoop, MapReduce, R, and Mahout, you will have a solid foundation for understanding the analytical process on any platform. This will help you in learning other tools, including commercial tools.
In this chapter, we will explore scenarios and examples for using analytical methods to uncover potential security breaches. The methods covered in this section are not intended to be an exhaustive catalog of the possibilities. Rather, we hope they will help you to develop some creative ideas of your own.

Scenarios and Challenges in Intrusions and Incident Identification

Perhaps the greatest challenge in identifying intrusion attempts is that “we do not know what we do not know.” It is difficult to uncover the unknown-unknowns: new attack patterns that we cannot foresee and which can circumvent existing defenses. Software programs used for preventing intrusions in real time are essential, but they have a significant shortcoming. In general, they only detect known patterns of attack—or, known attack vectors, in security parlance. Real-time intrusion detection and prevention tends to focus on the known-unknowns, rather than the unknown-unknowns.
While deploying real-time intrusion detection and prevention defenses is essential, it is not enough. Analysts need to make creative efforts to uncover new attacks that successfully circumvent existing defenses. This involves analyzing data gathered from systems, such as the log files from servers and network appliances, and the drives from personal computing devices.
In this chapter, we will focus on the analysis of data, rather than the gathering of data. There are many good texts and online resources available for ideas on how to gather data. Since most systems already gather a great deal of data about network and server traffic, the greater challenge is in knowing what to do with the data, rather than in collecting it. Whether the data source consists of server logs, network data from software such as Wireshark, or some other source, the analytical methods are generally the same. For example, regardless of the data source, outlier detection methods are very likely to be useful.

Analyzing a Collection of Server Logs with Big Data

In this section, we will examine how to analyze multiple server logs simultaneously, using big data technologies.

Conducting Analysis with Big Data Tools

We will focus on a variety of queries in the Hive Query Language (HiveQL) to assist in performing forensic analysis of Apache server log files. We will also include some analysis with other software tools, such as R and Mahout. Since HiveQL is very similar to basic ANSI SQL, it should be easily grasped by those who are already familiar with querying relational databases.
In fact, most of the queries here can be run with little or only minor modifications on log-file data that have already been parsed and stored in a relational database. If you have a small enough collection of log files, a relational database may be all that you need. However, for large collections of logs, the parallel processing afforded by Hive running on top of Hadoop could turn an otherwise impossible analysis into one that is doable.
The log files used in the examples that follow are in the popular Apache combined format. This code can be easily adapted to other formats as well.

Analysis of Log Files

Although there is no single standard for server log formats, a few formats are relatively common. Examples include log-file formats for Windows Event Logs, IIS logs, firewall logs, VPN access logs, and various UNIX logs for authentication, FTP, SSH, and so on. Open-source server software from the Apache Foundation is very common and produces log files in a couple of formats: the common log format and the combined log format. Although these formats can be modified by the user, it is fairly common to see the two formats used without modification. The combined format is the same as the common format, except that two fields are added: the referrer and user agent fields. The referrer field indicates the site that the client was referred or linked from. The user agent field shows identifying information on the client browser. The server examples in this book will generally use the combined format, although the methods we will examine can be adapted to any format.

Common Log File Fields

▪ Remote hostname or IP address of the user
▪ User’s remote logname
▪ The authenticated username
▪ The date and time that the request was made
▪ The URL request string sent from the client
▪ The http status code that was returned by the server to the client
▪ The size in number of bytes of the document that was transferred from the server to the client

Combined Log File Fields

▪ Remote hostname or IP address of the user
▪ User’s remote logname
▪ The authenticated username
▪ The date and time that the request was made
▪ The URL request string sent from the client
▪ The http status code that was returned by the server to the client
▪ The size in number of bytes of the document that was transferred from the server to the client
▪ URL of the site that the client was referred from
▪ Identifying information on the client browser or user agent
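To make these fields concrete, the entry below is a hypothetical example of a combined-format log line (all values are invented for illustration). Reading left to right, it shows the client IP address, the remote logname and authenticated username (both shown as "-" when unavailable), the timestamp with time zone offset, the request string, the status code, the response size in bytes, the referrer, and the user agent.
192.0.2.10 - frank [10/Oct/2009:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://www.example.com/start.html" "Mozilla/5.0 (Windows NT 5.1)"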

Methods

Methods of analysis include the following:
▪ Perform fuzzy searches for keywords and terms related to known attack vectors, using LIKE operators. These vectors can include injection attacks, directory and path traversal intrusions, cache poisoning, file inclusion or execution, and denial of service attacks.
▪ Produce time aggregations of Web log variables for trending, such as host activity, requests, status codes, file sizes, and agents.
▪ Sort, filter, and combine data to identify potential problem sources.
▪ Create analytical data sets suitable for further analysis with R and Mahout.

Additional Data and Software Needed to Run these Examples

We have included all of the data required to run this analysis in the Web site of supplementary materials for this book. The data consist of Apache combined-format server log files.
Files one through six are from Amazon's sample collection. However, as there are no known or readily apparent security breaches within these files, an additional file has been added to the collection, containing examples of log entries that are typical of known security breach incidents. This additional file is called "access_log_7." Some of these incidents were found by searching the Internet for sources. Others were derived from a small collection of examples from actual Web forensic efforts (Talabis, 2013). For security and privacy reasons, no confidential or personal identifiers remain in these log entries.

SQL-like Analysis with Hive

Hive was used for most of the examples, due to the large number of analysts who are familiar with SQL-like syntax and the flexibility of Hive's built-in functions and operators. Also, since there are so many potential attack vectors, with new ones being created continually, security analysts need tools that enable ad hoc, customized analysis. SQL-style tools such as Hive and HiveQL fill this need very nicely.

Loading the Data

The first few steps are basic and involve the following: starting Hive, setting up the data, creating the main table, and loading it. It is the analysis part that will become a little more interesting. However, we must first deal with the data setup.
Place the log files in a folder that is shared with the virtual machine you are running. Alternatively, you could place them in a local directory in Amazon's AWS environment. All of the following examples are demonstrated in a Cloudera virtual machine on our own computer. The files should be loaded into a folder called "ApacheLogData." We next navigate to the parent folder, which we called "Project1," and start Hive from there by typing "hive" at the Bash shell command line.
[cloudera@localhost Project1]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/
lib/hive-common-0.10.0-cdh4.2.0.jar!/hive-log4j.properties
Hive history file=/tmp/cloudera/hive_job_log_cloudera_201305061902_843121553.txt
hive>
We need to reference the jar file needed for the deserializer, which is used for parsing the server logs. We do this by adding it on the hive command line, as follows:
hive> add jar /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.2.0.jar;
Added /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.2.0.jar to class path
Added resource: /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.2.0.jar
hive>
Next, we add a setting that enables column headers to be seen in the output. This provides a reference that makes the query results a little easier to read, especially for those who may not be very familiar with the Apache log-file format.
hive> set hive.cli.print.header=true;
In the next step, we create our base table structure and load it with data.
hive> CREATE TABLE apachelog (
 > host STRING,
 > identity STRING,
 > user STRING,
 > time STRING,
 > request STRING,
 > status STRING,
 > size STRING,
 > referer STRING,
 > agent STRING)
 > ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
 > WITH SERDEPROPERTIES (“input.regex” = “([^ ]∗) ([^ ]∗) ([^ ]∗) (-
 > |\[[^\]]∗\]) ([^ ”]∗|”[^”]∗”) (-|[0-9]∗) (-|[0-9]∗)(?: ([^
 > ”]∗|”[^”]∗”) ([^ ”]∗|”[^”]∗”))?”, “output.format.string” =
 > “%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s”)
 > STORED AS TEXTFILE;
OK
Time taken: 0.029 seconds
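Because the listing above is wrapped and typeset for print, the same statement is repeated below in a copy-pasteable form. This is a sketch based on the commonly published RegexSerDe example for Apache logs; the quotes must be plain ASCII quotes, and the backslashes in the regular expression are doubled on the assumption that your Hive version interprets backslash escapes inside string literals. If the table loads but every column comes back NULL, the escaping is the first thing to adjust.
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
STORED AS TEXTFILE;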
Now we have an empty table and are ready to load all seven of the log files. As previously mentioned, the seventh file is one that we created, containing examples of security breaches. The other six were extracted from Amazon’s examples.
hive> LOAD DATA LOCAL INPATH “ApacheLogData/access∗” INTO TABLE apachelog;
Copying data from file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access∗
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_1
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_2
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_3
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_4
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_5
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_6
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_7
Loading data to table default.apachelog
Table default.apachelog stats: [num_partitions: 0, num_files: 7, num_rows: 0, total_size: 53239106, raw_data_size: 0]
OK
Time taken: 0.614 seconds
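Before moving on, it can be worth running a quick sanity check to confirm that the rows actually parsed. The following sketch launches a MapReduce job to count the loaded log entries and then prints a few sample rows; if the count is zero, or if every column shows NULL, the load or the SerDe regular expression needs another look.
SELECT COUNT(*) FROM apachelog;
SELECT * FROM apachelog LIMIT 5;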
For anyone who may be interested in extracting the Amazon example log files directly, we have provided the following instructions. There is more than one way to do this, but here is the way that we did it. Within Amazon's Elastic MapReduce environment, we made a new directory called "temp." We then copied all of the sample log files that are stored in Amazon's examples S3 bucket to our new temp directory. The examples are kept in a bucket located at "s3n://elasticmapreduce/samples/pig-apache/input/."
hadoop@domU-12-31-39-00-88-72:∼$ hadoop dfs -mkdir temp
hadoop@domU-12-31-39-00-88-72:∼$ hadoop dfs -cp ‘s3n:// elasticmapreduce/samples/pig-apache/input/∗’ temp
13/04/22 19:17:43 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_1’ for reading
13/04/22 19:17:46 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_2’ for reading
13/04/22 19:17:48 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_3’ for reading
13/04/22 19:17:49 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_4’ for reading
13/04/22 19:17:50 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_5’ for reading
13/04/22 19:17:52 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_6’ for reading
We then moved the files from the temp directory to our own S3 bucket for easier access, in the event that we want to retrieve or modify them in the future. Our S3 bucket was named, “Project1E185.”
hadoop@domU-12-31-39-00-88-72:∼$ hadoop dfs -cp temp/∗ ‘s3n://Project1E185/’
13/04/22 19:19:36 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_1’ in S3
13/04/22 19:19:40 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_1 8754118
13/04/22 19:19:40 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_2’ in S3
13/04/22 19:19:42 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_2 8902171
13/04/22 19:19:42 INFO s3native.NativeS3FileSystem: Creating new
file ‘s3n://Project1E185/access_log_3’ in S3
13/04/22 19:19:44 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_3 8896201
13/04/22 19:19:44 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_4’ in S3
13/04/22 19:19:46 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_4 8886636
13/04/22 19:19:46 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_5’ in S3
13/04/22 19:19:48 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_5 8902365
13/04/22 19:19:48 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_6’ in S3
13/04/22 19:19:50 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_6 8892828
We then downloaded these files from our S3 bucket to the shared folder on our computer called, “ApacheLogData.” This is so that we can access them from our Cloudera Hadoop installation.

Discovery Process for Specific Attack Vectors

The most direct means of discovering attack attempts within server logs is to find patterns within the "request" field. The request field shows the URL information for the resource or Web page requested by the user of a client browser, or other agent. Many attacks leave revealing fingerprints or signatures behind within this field, which can be found through the use of a LIKE operator in HiveQL. Or, if more fine-grained control is required for the search, either the REGEXP or RLIKE operator may be used with a regular expression.
You can do these searches with Perl, Java, or any other tool that can handle regular expressions. However, these tools do not scale up to handling large and numerous log files as well as the combined forces of the Hive, MapReduce, and Hadoop software stack can. Following are some examples of direct searches and the attacks they are intended to find.

SQL Injection Attack

In an SQL injection attempt, the attacker tries to insert SQL code within a resource request. When this happens, multiple attempts may be made, with error messages occasionally providing clues to available fields within the database. For example, including a nonexistent variable within a “SELECT” statement on certain databases will produce an error stating that the variable does not exist, followed by a listing of available variables. Through trial and error, the attacker may be able to get into the database to retrieve valuable information or to do damage to the system.
The LIKE operator in Hive uses essentially the same syntax that you would be familiar with from most SQL-based, relational databases, with one notable difference: Hive's LIKE operator is case sensitive. The code below therefore wraps the request field in the Hive function LOWER(), so that the keywords are matched whether or not they are capitalized. In this case, we want to search the request URL string for terms that would be essential to any SQL query: select, from, where, case, if, having, when, and like.
SELECT * FROM apachelog
WHERE LOWER(request) LIKE '% like %'
OR LOWER(request) LIKE '%select %'
OR LOWER(request) LIKE '% from %'
OR LOWER(request) LIKE '% where %'
OR LOWER(request) LIKE '% if %'
OR LOWER(request) LIKE '% having %'
OR LOWER(request) LIKE '% case %'
OR LOWER(request) LIKE '% when %';
It is important to note that these are just a few possibilities, given for example's sake. There are many other possibilities, and attack vectors are always changing. You can use an Internet search engine to search on keywords, such as sqli or "sql injection examples," to find updated attack information and adjust your queries accordingly. Also, you should investigate how information is stored in your server logs. Unlike the logs in this example, you may find that your log files do not show any spaces between the keywords. URLs cannot contain spaces, so spaces may appear encoded in your log file as either %20 or as a plus sign ("+"). To catch these cases, you can simply repeat the above lines in the "WHERE" clause, but eliminate the spaces, as in the sketch below. For example, "% select %" becomes "%select%."
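A minimal sketch of that variation follows. Keep in mind that dropping the surrounding spaces makes the patterns far less selective (for example, '%from%' will match any URL containing the letters "from"), so expect more false positives to review.
SELECT * FROM apachelog
WHERE LOWER(request) LIKE '%select%'
OR LOWER(request) LIKE '%from%'
OR LOWER(request) LIKE '%where%'
OR LOWER(request) LIKE '%having%';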
Running the original query produces the output below.
hive> SELECT ∗ FROM apachelog
 > WHERE LOWER(request) LIKE ‘% like %’
 > OR LOWER(request) LIKE ‘%select %’
 > OR LOWER(request) LIKE ‘% from %’
 > OR LOWER(request) LIKE ‘% where %’
 > OR LOWER(request) LIKE ‘% if %’
 > OR LOWER(request) LIKE ‘% having %’
 > OR LOWER(request) LIKE ‘% case %’
 > OR LOWER(request) LIKE ‘% when %’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator

Starting Job = job_201305061901_0002, Tracking URL = 
http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0002
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 20:16:31,416 Stage-1 map = 0%,reduce = 0%
2013-05-06 20:16:39,459 Stage-1 map = 100%,reduce = 0%, Cumulative CPU 4.82 sec
2013-05-06 20:16:40,471 Stage-1 map = 100%,reduce = 100%, Cumulative CPU 4.82 sec
MapReduce Total cumulative CPU time: 4 seconds 820 msec
Ended Job = job_201305061901_0002
MapReduce Jobs Launched:
Job 0: Map: 1Cumulative CPU: 4.82 sec HDFS Read: 53239663 HDFS Write: 218 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 820 msec
OK
host identity user time request status size referer agent
216.185.64.79 - - [18/Sep/2009:00:00:55 -0800] “GET /SELECT ∗ FROM users WHERE username = ‘’ having 1=1-- HTTP/1.1” 200 3164 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)”
Time taken: 11.572 seconds
hive>
We can see that there was one entry with a commonly used method of SQL injection within the request field. The injection code appears simply as "GET /SELECT * FROM users WHERE username = '' having 1=1--". Generally, the first element in the string would be a reference to a Web page, but the principle is the same. We simplified this example for clarity.
In this case, the main advantage of using Hive and similar searches is that our query found the single known incident within all of these server logs in very little time. Otherwise, it could have been a seemingly endless, needle-in-a-haystack effort.

Directory Traversal and File Inclusion

Attackers may also attempt to add additional elements at the end of a URL query line to traverse the server’s file system. Once the attackers locate the key folders and files, valuable information such as passwords may be retrieved, executable files could be added to the system, or the system could be vandalized.
We use the below-listed query to search within the request field for keywords related to directories at the root level of the file system. One part of the query also searches for the ubiquitous double-dot characters (parent-directory references), which are often used in these kinds of attacks. Although we mainly focus on terms and characters related to a Linux operating system, we also include some Windows-based terms and characters such as "c:," ".exe," and ".ini."
SELECT * FROM apachelog
WHERE LOWER(request) LIKE '%usr/%'
OR LOWER(request) LIKE '%~/%'
OR LOWER(request) LIKE '%.exe%'
OR LOWER(request) LIKE '%.ini%'
OR LOWER(request) LIKE '%usr/%'
OR LOWER(request) LIKE '%etc/%'
--OR LOWER(request) LIKE '%home/%'
--OR LOWER(request) LIKE '%bin/%'
OR LOWER(request) LIKE '%dev/%'
OR LOWER(request) LIKE '%opt/%'
OR LOWER(request) LIKE '%root/%'
OR LOWER(request) LIKE '%sys/%'
OR LOWER(request) LIKE '%boot/%'
OR LOWER(request) LIKE '%mnt/%'
OR LOWER(request) LIKE '%proc/%'
OR LOWER(request) LIKE '%sbin/%'
OR LOWER(request) LIKE '%srv/%'
OR LOWER(request) LIKE '%var/%'
OR LOWER(request) LIKE '%c:\%'
OR LOWER(request) LIKE '%..%';
Note that the conditions for "home/" and "bin/" are commented out in this query; patterns that broad are likely to match legitimate requests (for example, anything under "/cgi-bin/"), so enable them only if they make sense for your environment. The results from running the query are shown below.
hive> SELECT ∗ FROM apachelog
 > WHERE LOWER(request) LIKE ‘%usr/%’
 > OR LOWER(request) LIKE ‘%∼/%’
 > OR LOWER(request) LIKE ‘%.exe%’
 > OR LOWER(request) LIKE ‘%.ini%’
 > OR LOWER(request) LIKE ‘%usr/%’
 > OR LOWER(request) LIKE ‘%etc/%’
 > --OR LOWER(request) LIKE ‘%home/%’
 > --OR LOWER(request) LIKE ‘%bin/%’
 > OR LOWER(request) LIKE ‘%dev/%’
 > OR LOWER(request) LIKE ‘%opt/%’
 > OR LOWER(request) LIKE ‘%root/%’
 > OR LOWER(request) LIKE ‘%sys/%’
 > OR LOWER(request) LIKE ‘%boot/%’
 > OR LOWER(request) LIKE ‘%mnt/%’
 > OR LOWER(request) LIKE ‘%proc/%’
 > OR LOWER(request) LIKE ‘%sbin/%’
 > OR LOWER(request) LIKE ‘%srv/%’
 > OR LOWER(request) LIKE ‘%var/%’
 > OR LOWER(request) LIKE ‘%c:\%’
 > OR LOWER(request) LIKE ‘%..%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0003>
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 20:32:02,894 Stage-1 map = 0%, reduce = 0%
2013-05-06 20:32:10,931 Stage-1 map = 83%, reduce = 0%
2013-05-06 20:32:11,937 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.58 sec
2013-05-06 20:32:12,944 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.58 sec
2013-05-06 20:32:13,956 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.58 sec
MapReduce Total cumulative CPU time: 7 seconds 580 msec
Ended Job = job_201305061901_0003
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 7.58 sec HDFS Read: 53239663 HDFS Write: 855 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 580 msec
OK
host identity user time request status size referer agent
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /cgi-bin/ powerup/r.cgi?FILE=../../../../../../../../../../etc/passwd HTTP/1.1” 404 539 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003175)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /cgi-bin/ r.cgi?FILE=../../../../../../../../../../etc/passwd HTTP/1.1” 404 531 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003176)”
216.185.64.79 - - [18/Sep/2009:00:00:55 -0800] “GET /example.com/ doc/..%5c../Windows/System32/cmd.exe?/c+dir+c: HTTP/1.1” 200 3164 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+”>http://www. google.com/feedfetcher.html)”
216.185.64.79 - - [18/Sep/2009:00:00:55 -0800] “GET /example.com/ example.asp?display=../../../../../Windows/system.ini HTTP/1.1” 200 3164 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+”>http:// www.google.com/feedfetcher.html)”
Time taken: 13.626 seconds
hive>
This query found several examples of this attack attempt. Note that a couple of the attempts failed, as seen by the 404 status codes. However, you can see that two of the attempts succeeded, as evidenced by the status code 200. Notice also the keywords "etc," "exe," and "ini," and the use of the double-dot navigation. Additionally, the file traversal to the "etc" directory appears to be related to an attempt to get at the "passwd" file, a commonly attempted attack vector.

Cross-site Request Forgery

The keywords found in this attack pertain to the browser’s JavaScript alert notice.
SELECT * FROM apachelog
WHERE LOWER(request) LIKE '%>alert%'
OR LOWER(request) LIKE '%vulnerable%';
hive> SELECT ∗ FROM apachelog
 > WHERE LOWER(request) LIKE ‘%>alert%’
 > OR LOWER(request) LIKE ‘%vulnerable%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305071923_0007, Tracking URL = http:// localhost.localdomain:50030/jobdetails.jsp?jobid=job_201305071923_0007
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305071923_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-07 22:37:45,751 Stage-1 map = 0%, reduce = 0%
2013-05-07 22:37:53,784 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.01 sec
2013-05-07 22:37:54,796 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.01 sec
MapReduce Total cumulative CPU time: 7 seconds 10 msec
Ended Job = job_201305071923_0007
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 7.01 sec HDFS Read: 159723693 HDFS Write: 2428 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 10 msec
OK
host identity user time request status size referer agent
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /options. php?optpage=<script>alert(‘Vulnerable!’)</script> HTTP/1.1” 404 529 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003171)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /search.php?ma ilbox=INBOX&what=x&where=<script>alert(‘Vulnerable!’)</script>& submit=Search HTTP/1.1” 404 528 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003172)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /help. php?chapter=<script>alert(‘Vulnerable’)</script> HTTP/1.1” 404 526 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003173)”
You can see there are several log entries that attempt to trigger the browser's JavaScript alert notice.

Command Injection

This attack tries to disguise commands with URL (percent) encoding. For example, %20 is an encoded space and %60 is an encoded backtick, so a fragment such as echo%20%60id%20%60 decodes to echo `id `, which would execute the id command on the server and echo its output. The following query includes keywords for some common examples.
SELECT * FROM apachelog
WHERE LOWER(request) LIKE '%&comma%'
OR LOWER(request) LIKE '%20echo%'
OR LOWER(request) LIKE '%60id%';
hive> SELECT ∗ FROM apachelog
 > WHERE LOWER(request) LIKE ‘%&comma%’
 > OR LOWER(request) LIKE ‘%20echo%’
 > OR LOWER(request) LIKE ‘%60id%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305071923_0005, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305071923_0005>
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305071923_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-07 22:28:42,343 Stage-1 map = 0%, reduce = 0%
2013-05-07 22:28:51,378 Stage-1 map = 83%, reduce = 0%
2013-05-07 22:28:52,384 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.09 sec
2013-05-07 22:28:53,394 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.09 sec
MapReduce Total cumulative CPU time: 8 seconds 90 msec
Ended Job = job_201305071923_0005
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 8.09 sec HDFS Read: 159723693 HDFS Write: 6080 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 90 msec
OK
host identity user time request status size referer agent
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET / forumscalendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 536 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003039)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET / forumzcalendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 536 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003040)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET / htforumcalendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 537 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003041)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /vbcalendar. php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22;echo%20
’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 532 “-” 
“Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003042)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET / vbulletincalendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 539 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003043)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /cgi-bin/ calendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 538 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003044)”
Time taken: 13.51 seconds

MySQL Charset Switch and MS-SQL DoS Attack

This attack involves altering the character set to evade the built-in validation functionality of databases and can be used in a denial-of-service (DoS) attack. Keywords in this attack that are related to changing the character set of a table or column include "alter," "character," and "set." Other keywords that are commonly found in DoS attacks include "waitfor," "time," and "goto."
SELECT * FROM apachelog
WHERE LOWER(request) LIKE '%alter%'
AND LOWER(request) LIKE '%character%'
AND LOWER(request) LIKE '%set%';
SELECT * FROM apachelog
WHERE LOWER(request) LIKE '%waitfor%'
AND LOWER(request) LIKE '%time%';
SELECT * FROM apachelog
WHERE LOWER(request) LIKE '%goto%';
All three queries were run together at the same time from the Hive command line. The results are shown below. You will see a single log entry, where the attacker changed the character set to gbk_chinese_ci. We did not have any examples with which to test the other two queries, but they should function similarly.
hive> SELECT ∗ FROM apachelog
 > WHERE LOWER(request) LIKE ‘%alter%’
 > AND LOWER(request) LIKE ‘%character%’
 > AND LOWER(request) LIKE ‘%set%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0005, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0005
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 21:07:52,856 Stage-1 map = 0%, reduce = 0%
2013-05-06 21:07:57,880 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.82 sec
2013-05-06 21:07:58,886 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.82 sec
2013-05-06 21:07:59,897 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.82 sec
MapReduce Total cumulative CPU time: 2 seconds 820 msec
Ended Job = job_201305061901_0005
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.82 sec HDFS Read: 53239663 HDFS Write: 277 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 820 msec
OK
216.185.64.79 - - [18/Sep/2009:00:00:55 -0800] “GET /ALTER TABLE ‘users’ CHANGE ‘password’ ‘password’ VARCHAR(255) CHARACTER SET gbk COLLATE gbk_chinese_ci NOT NULL HTTP/1.1” 200 3164 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/ feedfetcher.html)”
Time taken: 9.927 seconds
hive>
 > SELECT ∗ FROM apachelog
 > WHERE LOWER(request) LIKE ‘%waitfor%’
 > AND LOWER(request) LIKE ‘%time%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0006, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0006
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 21:08:02,294 Stage-1 map = 0%, reduce = 0%
2013-05-06 21:08:08,318 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.75 sec
2013-05-06 21:08:09,328 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.75 sec
MapReduce Total cumulative CPU time: 2 seconds 750 msec
Ended Job = job_201305061901_0006
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.75 sec HDFS Read: 53239663 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 750 msec
OK
Time taken: 9.437 seconds
hive>
 > SELECT ∗ FROM apachelog
 > WHERE LOWER(request) LIKE ‘%goto%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0007, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0007
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 21:08:23,417 Stage-1 map = 0%, reduce = 0%
2013-05-06 21:08:28,438 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.72 sec
2013-05-06 21:08:29,450 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.72 sec
MapReduce Total cumulative CPU time: 2 seconds 720 msec
Ended Job = job_201305061901_0007
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.72 sec HDFS Read: 53239663 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 720 msec
OK
Time taken: 8.413 seconds

Tallying and Tracking Failed Request Statuses

As was seen in one of the query results above, attack attempts can result in status codes indicating both success and failure. However, often an attacker may experience a number of failures before hitting upon the right combination that results in a success.
We can sort the hosts sending requests to determine which IP addresses produce the most failures. While a high number of failures is not a definitive indication that an attack has taken place, it can serve as a starting point for further investigation. Aside from security concerns, it can also help to identify where users are having difficulty with the system, as well as which IP addresses may be putting the system under the greatest stress.

Hosts with the Most Failed Requests

The status indicators in server logs are shown as three-digit codes. Codes in the 100, 200, and 300 ranges indicate informational responses, successful requests, and redirections, respectively; none of these represent failures. Codes in the 400 range indicate failed requests, while the 500 range indicates failed responses due to problems on the server side. A series of failed requests could be an indication of an attacker using trial-and-error tactics until an attack succeeds. Failed server responses could be an indication that an attack has succeeded in doing server-side damage.
The following query groups failed requests by host, using Hive's CASE statements and the substring function, SUBSTR(). The function is written as substr(status,1,1): the first argument is the field to read, the second indicates the character position at which to begin, and the third indicates how many characters to include, from left to right. Here, it extracts just the first digit of the status code. Cases where that first digit is either a "4" or a "5" are coded as a "1" in a newly added "failedaccess" column, while all other status codes are coded as a "0." As you will see, the resulting new view enables easy tallying of all failed requests as 1's and all successful requests as 0's.
You will see how a view is created in Hive in the same manner as a table. However, unlike a table, the view does not store any data within the Hadoop Distributed File System (HDFS); it only stores the query code. A view is used in this case as a step within a series of queries that generate the final results, with each query building upon the others. Also, you will see that Hive has a setting to produce CSV output when storing results externally: set hive.io.output.fileformat = CSVTextFile. This setting is used in the second statement below.
CREATE VIEW IF NOT EXISTS statusgroupings AS
SELECT host, identity, user, time, request, status, size, referer, agent, CASE substr(status,1,1)
WHEN '1' THEN '0'
WHEN '2' THEN '0'
WHEN '3' THEN '0'
WHEN '4' THEN '1'
WHEN '5' THEN '1'
ELSE '0'
END
AS failedaccess
FROM apachelog;
--store the results of the preceding view
set hive.io.output.fileformat = CSVTextFile;
INSERT OVERWRITE LOCAL DIRECTORY '/mnt/hgfs/BigDataAnalytics/Project1/ApacheLog' SELECT * FROM statusgroupings;
--sum total failed access attempts by host
CREATE VIEW IF NOT EXISTS FailedHostAttempts AS
SELECT host, SUM(failedaccess) AS failedAttempts
FROM statusgroupings
GROUP BY host
ORDER BY failedAttempts DESC;
--count total host attempts to access
CREATE VIEW IF NOT EXISTS TotalHostAttempts AS
SELECT host, count(host) AS hostAccessAttempts
FROM statusgroupings
GROUP BY host
ORDER BY hostAccessAttempts DESC;
From these views we can calculate the percentage of failed requests by host, as shown in the query below.
--top 20 proportions of failed attempts
SELECT a.host, failedAttempts, hostAccessAttempts, failedAttempts / hostAccessAttempts AS percentFailed
FROM TotalHostAttempts a
JOIN FailedHostAttempts b
ON a.host = b.host
WHERE failedAttempts / hostAccessAttempts > 0
ORDER BY percentFailed DESC
LIMIT 20;
Since the chain of views above produces a large amount of command-line, diagnostic-related output from MapReduce, only the actual results generated by the final query are shown below. As one might imagine, the highest percentages of failed requests tend to occur with hosts that made only a few requests, or even a single request. In most cases, these entries are not likely to be an issue. However, if we were to see a very large number of these requests, it might be an indication of a distributed DoS attack, where a bot is programmed to send access requests from many different IP addresses. As you can see, no such pattern is evident here in Amazon's data, and no such patterns were loaded into the sample attack log, access_log_7.
However, the second listed host, IP 10.255.255.124, does appear a bit suspicious, with 12 access attempts and all 12 failing. This could warrant further investigation with an ad hoc query to drill into the requests made by this host; a sketch of such a query follows the results below.
Total MapReduce CPU Time Spent: 17 seconds 260 msec
OK
host failedattempts hostaccessattempts percentfailed
189.106.160.140 1.0 1 1.0
10.255.255.124 12.0 12 1.0
79.11.25.143 1.0 1 1.0
99.157.209.86 1.0 1 1.0
24.21.134.171 3.0 4 0.75
81.193.181.206 2.0 3 0.6666666666666666
87.238.130.200 4.0 7 0.5714285714285714
165.166.103.33 1.0 2 0.5
124.177.134.93 2.0 4 0.5
88.38.60.152 2.0 4 0.5
84.203.44.110 2.0 4 0.5
115.74.146.80 1.0 2 0.5
71.49.39.235 1.0 2 0.5
68.96.161.201 1.0 2 0.5
62.150.156.229 1.0 2 0.5
58.69.168.155 1.0 2 0.5
58.68.8.190 2.0 4 0.5
41.100.138.220 1.0 2 0.5
4.131.17.243 1.0 2 0.5
201.43.250.154 1.0 2 0.5
Time taken: 64.786 seconds
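As a follow-up to the observation above, a minimal drill-down sketch might simply pull every log entry for the suspicious host, so the requests, statuses, and agents can be inspected directly:
--drill into all requests made by the suspicious host
SELECT time, request, status, agent
FROM apachelog
WHERE host = '10.255.255.124';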
The following query modifies the query above so that only hosts having more than 20 log entries are shown. The top listed IP, 221.221.9.60, has had 28% of its requests fail. This could also warrant further investigation. If, upon further querying, we were to find many failed access attempts followed by a string of successes, this could be an indication of an attack.
--top 20 proportions of failed attempts where more than 20 attempts total
SELECT a.host, failedAttempts, hostAccessAttempts, failedAttempts / hostAccessAttempts AS percentFailed
FROM TotalHostAttempts a
JOIN FailedHostAttempts b
ON a.host = b.host
WHERE failedAttempts / hostAccessAttempts > 0
AND hostAccessAttempts > 20
ORDER BY percentFailed DESC
LIMIT 20;
Total MapReduce CPU Time Spent: 16 seconds 880 msec
OK
host failedattempts hostaccessattempts percentfailed
221.221.9.60 14.0 50 0.28
41.202.75.35 5.0 26 0.19230769230769232
41.178.112.197 17.0 89 0.19101123595505617
121.205.226.76 63.0 334 0.18862275449101795
114.200.199.144 5.0 27 0.18518518518518517
78.178.233.22 5.0 27 0.18518518518518517
216.24.131.152 12.0 66 0.18181818181818182
68.42.128.66 4.0 22 0.18181818181818182
201.2.78.10 4.0 24 0.16666666666666666
194.141.2.1 4.0 24 0.16666666666666666
194.110.194.1 5.0 30 0.16666666666666666
69.122.96.121 4.0 24 0.16666666666666666
68.117.200.60 4.0 24 0.16666666666666666
76.122.81.132 4.0 25 0.16
79.66.21.60 6.0 38 0.15789473684210525
24.183.197.231 3.0 21 0.14285714285714285
68.193.125.219 4.0 28 0.14285714285714285
149.6.164.150 6.0 43 0.13953488372093023
99.141.88.139 3.0 22 0.13636363636363635
15.203.233.77 3.0 22 0.13636363636363635
Time taken: 65.271 seconds
For those who may be curious, we also ran a query on IP 221.221.9.60, ordering the results by month and day. Time aggregations will be discussed later in this chapter. As it turned out, all of the failed requests were "404 file not found" errors that occurred repeatedly while trying to access "favicon.ico." Modern browsers automatically request a small icon, known as a favicon, to display to the left of the URL. If the browser cannot find a favicon, an error is generated, and this is the likely cause of these "404 file not found" errors. The drill-down query was in the form "select * from by_month where host in('221.221.9.60') order by monthday;". The by_month view and the monthday field will be discussed in more detail later in the chapter.

Bot Activity

A great deal of bot activity can be found within server logs, and most of it is legitimate. Bots deployed by search engines such as Google and Microsoft crawl the Internet to build indexes, and these bots advertise themselves as bots in the agent field. While an illegitimate bot may be a bit more clandestine, masquerading as a legitimate bot can also be an effective attack strategy.
The next query searches for the key term "bot" in the agent field. It returns a list of bot agents, along with a count of the number of log entries for each agent. Even if the bots are legitimate, it is interesting to see the volume of bot activity on these servers. Some of the agents are easily recognizable, while others are less so; the latter may warrant further research into their origins.
--bot activity
SELECT agent, count(agent) AS hits
FROM apachelog
WHERE agent LIKE '%bot%'
GROUP BY agent
ORDER BY hits DESC;
Total MapReduce CPU Time Spent: 6 seconds 920 msec
OK
agent hits
“Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” 45528
“msnbot/2.0b (+http://search.msn.com/msnbot.htm)” 9248
“msnbot/1.1 (+http://search.msn.com/msnbot.htm)” 8400
“Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)” 4571
“Googlebot-Image/1.0” 4543
“Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1; aggregator: Spinn3r (Spinn3r 3.1);http://spinn3r.com/robot) Gecko/20021130” 2431
“Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) VoilaBot BETA 1.2 ([email protected])” 382
“OOZBOT/0.20 (http://www.setooz.com/oozbot.html; agentname at setooz dot_com )” 169
“Mozilla/5.0 (compatible; Tagoobot/3.0; +http://www.tagoo.ru)” 102
“Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)” 56
“Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
http://www.changedetection.com/bot.html)” 49
“SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/ 2.1; +http://www.google.com/bot.html)” 32
“Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)” 29
“Gaisbot/3.0+([email protected];+http://gais.cs.ccu.edu.tw/robot.php)” 27
“FollowSite Bot (http://www.followsite.com/bot.html)” 18
“Gigabot/3.0 (http://www.gigablast.com/spider.html)” 13
“MLBot (www.metadatalabs.com/mlbot)” 10
“Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)” 10
“Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)” 9
“psbot/0.1 (+http://www.picsearch.com/bot.html)” 8
“Mozilla/5.0 (compatible; seexie.com_bot/4.1; +http://www.seexie.com)” 6
“Mozilla/5.0 (compatible; Exabot-Images/3.0; +http://www.exabot.com/go/robot)” 5
“Mozilla/5.0 (compatible; BuzzRankingBot/1.0; +http://www.buzzrankingbot.com/)” 4
“DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)” 4
Time taken: 23.632 seconds
We can modify this query to show only instances where bot agent requests resulted in either a failed client request or a failed server response. In this case, the agents having the most failed requests are prominent, well-known bots with high volumes, such as Google or MSN. Some of these requests apparently occurred at a time when there was an internal server error, as given by the status code "500." You will see, though, that this query only extracts the maximum code as a sample. The query could easily be modified to search for specific error types, such as "404," which indicates that the file was not found; a sketch of that modification follows the query results below. We did not have known bot threats loaded into the sample logs. However, bot activity may be worth investigating, as there have been known cases where, for example, attackers have tried to exploit a WordPress plugin by uploading a shell. While trying to find where the shell is located, the attacker can generate numerous "404 file not found" errors. This query could be useful in beginning a search for such potential threats, should they occur. Also, it can be useful to monitor the volume of bot activity and any related issues it may cause for the server, even if the activity is not intended to be malicious.
SELECT agent, count(agent) AS hits, MAX(status) AS sampleStatus
FROM apachelog
WHERE agent LIKE '%bot%'
AND substr(status,1,1) IN ('4','5')
GROUP BY agent
ORDER BY hits DESC
LIMIT 20;
Total MapReduce CPU Time Spent: 7 seconds 190 msec
OK
agent hits samplestatus
“Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” 272 500
“Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)” 225 500
“msnbot/2.0b (+http://search.msn.com/msnbot.htm)” 88 500
“msnbot/1.1 (+http://search.msn.com/msnbot.htm)” 82 500
“Googlebot-Image/1.0” 74 500
Time taken: 31.764 seconds
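As suggested above, a minimal sketch of the "404-only" variation simply filters on the exact status code rather than the leading digit:
SELECT agent, count(agent) AS hits
FROM apachelog
WHERE agent LIKE '%bot%'
AND status = '404'
GROUP BY agent
ORDER BY hits DESC
LIMIT 20;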

Time Aggregations

The queries in this section parse the time field in the server log, so that server activity can be evaluated over time. The time field appears in the log file as "[20/Jul/2009:20:12:22 -0700]." In this case, the first number, "20," is the day, which is followed by the month, "Jul," and the year, "2009." Three numbers separated by colons follow the year; these represent the hour, minute, and second, respectively. The final four digits following the dash represent the time zone offset.
The query below uses the SUBSTR() function to parse out the day, month, and year. We then convert the month to a two-digit format and reassemble the order, so that the year precedes the month and the month precedes the day. This reassembly is accomplished with the CONCAT() function, which concatenates strings into the desired order. The new ordering allows for easier sorting of the date. As you will see, the result of the query allows subsequent queries to easily aggregate results by year, month, day, or any combination of these.
CREATE VIEW IF NOT EXISTS by_month AS
SELECT host, identity, user, time, CASE substr(time,5,3)
WHEN 'Jan' THEN '01'
WHEN 'Feb' THEN '02'
WHEN 'Mar' THEN '03'
WHEN 'Apr' THEN '04'
WHEN 'May' THEN '05'
WHEN 'Jun' THEN '06'
WHEN 'Jul' THEN '07'
WHEN 'Aug' THEN '08'
WHEN 'Sep' THEN '09'
WHEN 'Oct' THEN '10'
WHEN 'Nov' THEN '11'
WHEN 'Dec' THEN '12'
ELSE '00'
END
AS month, substr(time,9,4) AS year, concat(substr(time,9,4), CASE substr(time,5,3)
WHEN 'Jan' THEN '01'
WHEN 'Feb' THEN '02'
WHEN 'Mar' THEN '03'
WHEN 'Apr' THEN '04'
WHEN 'May' THEN '05'
WHEN 'Jun' THEN '06'
WHEN 'Jul' THEN '07'
WHEN 'Aug' THEN '08'
WHEN 'Sep' THEN '09'
WHEN 'Oct' THEN '10'
WHEN 'Nov' THEN '11'
WHEN 'Dec' THEN '12'
ELSE '00'
END) AS yearmonth, concat(CASE substr(time,5,3)
WHEN 'Jan' THEN '01'
WHEN 'Feb' THEN '02'
WHEN 'Mar' THEN '03'
WHEN 'Apr' THEN '04'
WHEN 'May' THEN '05'
WHEN 'Jun' THEN '06'
WHEN 'Jul' THEN '07'
WHEN 'Aug' THEN '08'
WHEN 'Sep' THEN '09'
WHEN 'Oct' THEN '10'
WHEN 'Nov' THEN '11'
WHEN 'Dec' THEN '12'
ELSE '00'
END,substr(time,2,2)) AS monthday, request, status, size, referer, agent
FROM apachelog;
We used the “by_month” query view to parse the time column to produce several new columns, as shown by running the select query below. There is now a column for month, year, monthday, and yearmonth.
hive> SELECT time, month, year, monthday, yearmonth FROM by_month LIMIT 1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0045, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0045>
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-07 13:17:27,444 Stage-1 map = 0%, reduce = 0%
2013-05-07 13:17:31,458 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.36 sec
2013-05-07 13:17:32,464 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.36 sec
2013-05-07 13:17:33,471 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_201305061901_0045
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.36 sec HDFS Read: 66093 HDFS Write: 490 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 360 msec
OK
time month year monthday yearmonth
[20/Jul/2009:20:12:22 -0700] 07 2009 0720 200907
Time taken: 9.398 seconds
We can now use this view to produce a variety of time aggregations. We would like to share a few ideas to serve as possible starting points for follow-up queries. As with any of the queries suggested in this chapter, the results can be written to a local directory for further analysis, if desired. To do so, you simply precede the SELECT statement with an INSERT OVERWRITE LOCAL DIRECTORY statement, as listed below. Of course, the local directory specified should be changed to match the directory on your system where the results will be stored.
INSERT OVERWRITE LOCAL DIRECTORY '/mnt/hgfs/BigDataAnalytics/Project1/TopHostFailedLogongsByDay'
For the purpose of our demonstration, these queries simply write results to the screen, and the result sets are limited to only 10 rows.

Hosts with the Most Failed Requests per Day or per Month

Next we will use a query to show the number of failed requests, grouped by day and host. The below-listed query shows the hosts that had the highest numbers of failed requests in a day, along with the day and month in which they occurred. A very high number of failed requests from a given host IP on any given day could be an indication of an attack attempt, such as a DoS attack. In this case, since the log samples were not loaded with any known DoS attack examples, we see that the IP hosts with the highest counts of failed requests per day are also the same hosts that tend to produce a great many requests overall.
--Show hosts sorted by frequency of failed calls to server by day
SELECT monthday, host, COUNT(host) AS host_freq
FROM by_month
WHERE substr(status,1,1) IN ('4','5')
GROUP BY monthday, host
ORDER BY host_freq DESC
LIMIT 10;
You can see the result of this query, which is shown below. (Spacing was added to the columns to make them easier to read.)
monthday host host_freq
0724 121.205.226.76 57
0807 66.249.67.3 31
0727 66.249.67.3 27
0723 38.99.44.101 27
0724 72.30.142.87 25
0910 66.249.67.3 24
0724 66.249.67.3 24
0927 66.249.67.3 22
0926 66.249.67.3 20
0723 72.30.142.87 20
Time taken: 21.697 seconds
Below is a similar query, except that it aggregates results by month, rather than day.
--Show hosts sorted by frequency of failed calls to server by month
SELECT yearmonth, host, COUNT(host) AS host_freq
FROM by_month
WHERE substr(status,1,1) IN ('4','5')
GROUP BY yearmonth, host
ORDER BY host_freq DESC
LIMIT 10;
yearmonth host host_freq
200908 66.249.67.3 437
200909 66.249.67.3 433
200909 74.125.74.193 184
200908 64.233.172.17 180
200907 66.249.67.3 178
200909 74.125.16.65 172
200908 72.14.192.65 171
200908 74.125.74.193 169
200908 66.249.67.87 169
200908 74.125.16.65 169
Time taken: 21.694 seconds

Failed Requests Presented as a Monthly Time Series

Next we will examine a query view aggregating the number of failed requests, regardless of the host source, as a time series by month and year. This query is stored as a view for our later use, to calculate a ratio of failed to successful requests.
--Unsuccessful server calls as a time series by year and month
CREATE VIEW FailedRequestsTimeSeriesByMonth AS
SELECT yearmonth, COUNT(yearmonth) AS failedrequest_freq
FROM by_month
WHERE substr(status,1,1) IN ('4','5')
GROUP BY yearmonth
ORDER BY yearmonth ASC;
SELECT * FROM FailedRequestsTimeSeriesByMonth;
The result is shown below. As you can see, there is a spike in the number of failed requests in August and September of 2009.
yearmonth failedrequest_freq
200907 1861
200908 2848
200909 2706
200910 55
Time taken: 23.682 seconds
It is possible that the increase in failed requests in August and September was due to an increase in Web traffic, in general. To determine this, we can run the same query without the filter or we can also filter on just the successful requests. We use the below-listed query to tally the successful requests by month.
--Successful server calls as a time series by year and month
CREATE VIEW SuccessfulRequestsTimeSeriesByMonth AS
SELECT yearmonth, COUNT(yearmonth) AS successfulrequest_freq
FROM by_month
WHERE substr(status,1,1) IN('1','2','3')
GROUP BY yearmonth
ORDER BY yearmonth ASC;
SELECT * FROM SuccessfulRequestsTimeSeriesByMonth;
Based on the results below, we can see that the volume of successful requests did indeed go up substantially in August and September. So, the number of failed requests for those months may simply have been in line with the overall request volume. We will check this assumption next.
yearmonth  successfulrequest_freq
200907 57972
200908 88821
200909 83185
200910 1902
Time taken: 21.619 seconds

Ratio of Failed to Successful Requests as a Time Series

To see if the volume of failed requests is, indeed, proportional to what would be expected, given the number of successful requests, we use the query below to calculate a ratio of failed requests to successful requests. We combine the two views above using a JOIN statement.
SELECT a.yearmonth, failedrequest_freq / successfulrequest_freq AS failratio
FROM FailedRequestsTimeSeriesByMonth a
JOIN SuccessfulRequestsTimeSeriesByMonth b
ON a.yearmonth = b.yearmonth
ORDER BY yearmonth ASC;
yearmonth  failratio
200907 0.03210170427102739
200908 0.03206448925366749
200909 0.032529903227745384
200910 0.028916929547844375
Time taken: 66.867 seconds
We note that the ratio has held rather constant from month to month at around 0.03. This suggests that the number of failed requests each month is about what we would expect, given the volume of successful requests.
Of course, we could also run the query by day, which could reveal activity that is obscured in a monthly aggregation. We use the query below, which combines the year, month, and day into a single sortable string of the form “20090720,” where the first four digits represent the year, the next two the month, and the last two the day.
--enable the creation of a time series by day over multiple years and months
CREATE VIEW by_day AS
SELECT host, identity, user, time, concat(year, monthday) AS yearmonthday, request, status, size, referer, agent
FROM by_month;
SELECT * FROM by_day LIMIT 10;
The results of our query are shown below, giving the top 10 rows produced by the view. We can see that we now have a “yearmonthday” field in the desired format to enable aggregating and sorting by day, across multiple years and months.
host identity user time yearmonthday request status size referer agent
66.249.67.3 - - [20/Jul/2009:20:12:22 -0700] 20090720 “GET / gallery/main.php?g2_controller=exif.SwitchDetailMode&g2_mode= detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D15741&g2_returnName =photo HTTP/1.1”302 5 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/bot.html)
66.249.67.3 - - [20/Jul/2009:20:12:25 -0700] 20090720 “GET /gallery/ main.php?g2_itemId=15741&g2_fromNavId=x8fa12efc HTTP/1.1” 200 8068 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/bot.html)
64.233.172.17 - - [20/Jul/2009:20:12:26 -0700] 20090720 “GET /gwidgets/ alexa.xml HTTP/1.1” 200 2969 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+>http://www.google.com/feedfetcher.html)
74.125.74.193 - - [20/Jul/2009:20:13:01 -0700] 20090720 “GET /gwidgets/ alexa.xml HTTP/1.1” 200 2969 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+>http://www.google.com/feedfetcher.html)
192.168.1.198 - - [20/Jul/2009:20:13:18 -0700] 20090720 “GET / HTTP/1.1” 200 17935 “-” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17”
192.168.1.198 - - [20/Jul/2009:20:13:18 -0700] 20090720 “GET / style.css HTTP/1.1” 200 1504 “http://example.org/” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17”
192.168.1.198 - - [20/Jul/2009:20:13:19 -0700] 20090720 “GET / favicon.ico HTTP/1.1” 404 146 “http://example.org/” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17”
66.249.67.3 - - [20/Jul/2009:20:13:21 -0700] 20090720 “GET / gallery/main.php?g2_controller=exif.SwitchDetailMode&g2_mode= detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D30893&g2_ returnName=photo HTTP/1.1”302 5 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/bot.html)
66.249.67.3 - - [20/Jul/2009:20:13:24 -0700] 20090720 “GET /gallery/ main.php?g2_itemId=30893&g2_fromNavId=xfc647d65 HTTP/1.1” 200 8196 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/bot.html)
66.249.67.3 - - [20/Jul/2009:20:13:29 -0700] 20090720 “GET / gallery/main.php?g2_view=search.SearchScan&g2_form%5BuseDefaultSettings%5D =1&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D15789& g2_returnName=photo HTTP/1.1” 200 6360 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/ bot.html)
Time taken: 9.102 seconds
As you will see, combined with the two additional views shown below, these results can be used to produce a ratio of failed to successful requests by day, across the period of months and years within the server logs.
--Unsuccessful server calls as a time series by year, month, and day
CREATE VIEW FailedRequestsTimeSeriesByDay AS
SELECT yearmonthday, COUNT(yearmonthday) AS failedrequest_freq
FROM by_day
WHERE substr(status,1,1) IN('4','5')
GROUP BY yearmonthday
ORDER BY yearmonthday ASC;
--Successful server calls as a time series by year, month, and day
CREATE VIEW SuccessfulRequestsTimeSeriesByDay AS
SELECT yearmonthday, COUNT(yearmonthday) AS successfulrequest_freq
FROM by_day
WHERE substr(status,1,1) IN('1','2','3')
GROUP BY yearmonthday
ORDER BY yearmonthday ASC;
Finally, you will see these two views are joined in the query below, to produce the failed-to-successful request ratio by day.
--Calculate ratio of failed to successful requests by year, month, and day
SELECT a.yearmonthday, a.failedrequest_freq / b.successfulrequest_freq AS failratio
FROM FailedRequestsTimeSeriesByDay a
JOIN SuccessfulRequestsTimeSeriesByDay b
ON a.yearmonthday = b.yearmonthday
ORDER BY yearmonthday ASC;
The results are displayed below.
yearmonthday failratio
20090720 0.023759608665269043
20090721 0.024037482175595846
20090722 0.029298848252172157
20090723 0.032535684298908484
20090724 0.04544235924932976
20090725 0.030345800988002825
20090726 0.031446540880503145
20090727 0.03494060097833683
20090728 0.031545741324921134
20090729 0.03138373751783167
20090730 0.03590285110876452
20090731 0.034519956850053934
20090801 0.024278676988036593
20090802 0.0297029702970297
20090803 0.0314026517794836
20090804 0.030692362598144184
20090805 0.039501779359430604
20090806 0.030526315789473683
20090807 0.03762418093426337
20090808 0.029632274187790075
20090809 0.029971791255289138
20090810 0.03563941299790356
20090811 0.036671368124118475
20090812 0.03349788434414669
20090813 0.03076923076923077
20090814 0.031578947368421054
20090815 0.03191862504384427
20090816 0.03364550815123136
20090817 0.029210241615578794
20090818 0.030576789437109102
20090819 0.033402922755741124
20090820 0.034220532319391636
20090821 0.032474408753971055
20090822 0.03897944720056697
20090823 0.029098651525904896
20090824 0.028070175438596492
20090825 0.02638058389025677
20090826 0.029650547123190964
20090827 0.029627047751829907
20090828 0.039628704034273474
20090829 0.035426166257453526
20090830 0.02492102492102492
20090831 0.032418952618453865
20090901 0.02949438202247191
20090902 0.032688927943760986
20090903 0.028690662493479395
20090904 0.029954719609892023
20090905 0.02907180385288967
20090906 0.031042901988140914
20090907 0.03449477351916376
20090908 0.035181236673773986
20090909 0.037141846480367884
20090910 0.03450679679330777
20090911 0.03566433566433566
20090912 0.031282952548330405
20090913 0.030218825981243487
20090914 0.03377437325905292
20090915 0.025804171085189113
20090916 0.030892051371051717
20090917 0.030978934324659233
20090918 0.028441011235955056
20090919 0.02912280701754386
20090920 0.027392510402219142
20090921 0.03273381294964029
20090922 0.031751570132588974
20090923 0.03167898627243928
20090924 0.03349964362081254
20090925 0.0420377627360171
20090926 0.03863716192483316
20090927 0.0328042328042328
20090928 0.040757954951734
20090929 0.030975008799718408
20090930 0.03368794326241135
20091001 0.028916929547844375
Time taken: 68.21 seconds
We can also export the results and analyze them in a program such as R, Excel, or SAS. Aggregating data into a time series is one way to turn big data into small data that can be analyzed with more conventional tools. Rerunning the last query with the INSERT OVERWRITE LOCAL DIRECTORY command exports the results for further analysis. The Hive input is shown next; most of the output was omitted, due to the size of the data. Before using this command, be aware that it will overwrite any files that already exist within the target folder. A good way to avoid clobbering an existing folder is to provide a new folder name at the end of the location string, such as "FailedRequestsByDay." Also, one trade-off worth noting with views is that each time a view is called in another query, the view's query is rerun, which lengthens run times and increases the verbosity of the command-line output.
hive> INSERT OVERWRITE LOCAL DIRECTORY '/mnt/hgfs/BigDataAnalytics/Project1/
FailedRequestsByDay’
 > SELECT a.yearmonthday
 > ,a.failedrequest_freq / b.successfulrequest_freq AS failratio
 > FROM FailedRequestsTimeSeriesByDay a
 > JOIN SuccessfulRequestsTimeSeriesByDay b
 > ON a.yearmonthday = b.yearmonthday
 > ORDER BY yearmonthday ASC;
Total MapReduce jobs = 6
Launching Job 1 out of 6
Number of reduce tasks not specified. Estimated from input data size: 1

Total MapReduce CPU Time Spent: 18 seconds 800 msec
OK
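If rerunning the view queries becomes a bottleneck, one possible workaround is to materialize a view's results into a table with CREATE TABLE ... AS SELECT, so that downstream queries read stored data instead of recomputing the view. The sketch below does this for the daily failed-request view; the table name is illustrative only.
--Sketch: materialize the daily failed-request counts into a table (table name is illustrative)
CREATE TABLE failed_requests_by_day_tbl AS
SELECT yearmonthday, failedrequest_freq
FROM FailedRequestsTimeSeriesByDay;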
As you will see, the output file is easily imported into R, and we can even specify the standard delimiter used by Hive. This delimiter, which appears as ^A (the Control-A character) in graphical form, is actually quite useful: unlike a comma or other typical delimiters, it is highly unlikely to occur within the data itself, which is, of course, why the developers of Hive chose it.
The snippet below shows the R code used to import the data and perform a simple control-limits test to see which days had ratios exceeding a threshold, defined here as the average plus two standard deviations.
> rm(list=ls()) #Remove any objects
>
> library(fBasics)
>
> #Import failed requests file, using Hive's standard delimiter (Control-A, "\001")
> failedRequests <- read.table("FailedRequestsByDay.txt",sep="\001")
>
> #Add column headings
> colnames(failedRequests) <- c("Date","FailedRequestsRatio")
> stdev <- sd(failedRequests$FailedRequestsRatio) #calculate the standard deviation
> avg <- mean(failedRequests$FailedRequestsRatio) #calculate the average
> avgPlus2Stdev <- avg + 2*stdev #mean plus 2 standard deviations
>
> #Identify the days with failed request ratios exceeding the mean plus 2 standard deviations
> failedRequests[failedRequests[,2]>avgPlus2Stdev,]
 Date FailedRequestsRatio
5 20090724 0.04544236
68 20090925 0.04203776
71 20090928 0.04075795
>
> #Produce a plot and save it as a PDF
> pdf("PlotOfFailedRequestRatioSeries.pdf")
> plot(failedRequests[,2],type='l',main="Ratio of Failed Server Requests to Successful Requests by Day"
+ ,xlab="Day",ylab="Ratio of Failed Requests to Successful Requests")
> lines(rep(avg,length(failedRequests[,2])))
> lines(rep(avgPlus2Stdev,length(failedRequests[,2])),lty=2)
> legend("topright",c("Average","2 X Standard Deviation"),lty=c(1,2))
> dev.off()
null device 1
>
> #Create autocorrelation plot to test for seasonality or other autocorrelation effects
> pdf("FailedRequestsAutoCorrelation.pdf")
> acfPlot(failedRequests[,2],lag.max=60)
> dev.off()
null device 1
In the excerpted output below, we can see that there were three occurrences that exceeded the threshold: July 24, 2009; September 25, 2009; and September 28, 2009.
 Date FailedRequestsRatio
5 20090724 0.04544236
68 20090925 0.04203776
71 20090928 0.04075795
The control plot is shown below (Figure 3.1).
Since it is possible that some of the variation could be due to seasonality or other causes, we also perform an analysis of autocorrelation with the acfPlot() function, which is part of the “fBasics” package. The resulting plot, however, gives no indication of seasonal effects. The tallest and only statistically significant bar in the graph is at the zero lag point on the x-axis. Any bars appearing above the dotted line would have indicated a statistically significant autocorrelation effect at that lag point (Figure 3.2).
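Incidentally, if the fBasics package is not available, a comparable correlogram can be produced with the acf() function in base R. A minimal sketch, using the same data frame created above:
> #Autocorrelation check using base R instead of fBasics
> acf(failedRequests$FailedRequestsRatio, lag.max=60,
+ main="Autocorrelation of Daily Failed Request Ratio")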
Figure 3.1 Ratio of failed server requests to successful requests by day.
We merely suggest this as a starting point to demonstrate a means of organizing, aggregating, and sorting the time field. Further parsing could also be done on hours, minutes, and seconds, as sketched below. The Hive output can be used in conjunction with other tools to visualize the data, as well as to perform analysis that is generally unavailable within the Hadoop and MapReduce environment.
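As an example of hourly parsing, the sketch below assumes the same bracketed time format seen in the log samples (for instance, “[20/Jul/2009:20:12:22 -0700]”), in which the hour occupies characters 14 and 15 of the time field; the view name is illustrative only.
--Sketch: count failed requests by hour of day
--Assumes the time field format [dd/Mon/yyyy:HH:MM:SS -0700]
CREATE VIEW failed_requests_by_hour AS
SELECT substr(time,14,2) AS hourofday, COUNT(*) AS failedrequest_freq
FROM by_day
WHERE substr(status,1,1) IN('4','5')
GROUP BY substr(time,14,2)
ORDER BY hourofday ASC;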

Hive for Producing Analytical Data Sets

We use the “statusgroupings” view, which was reviewed earlier in this chapter, to produce an analytical data set that could be useful for analysis in other tools. As an example, a logistic regression was run to determine whether we could find any patterns among the variables that might be predictive of request failure or success (as indicated by the status codes). However, there are too many categorical variables across too many dimensions to produce any apparently meaningful results. We have reproduced the method and results here, mainly as an exercise.
Figure 3.2 Testing for autocorrelation effects in the time series, such as seasonality.
We created the analytical data set by using the following Hive code, which calls the view and exports the data to a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/mnt/hgfs/BigDataAnalytics/Project1/ApacheLog'
SELECT * FROM statusgroupings;
We ran the logistic regression by using Mahout. While Hive's standard delimiter of “^A” can be useful for a number of data sets and analysis tools (such as R), Mahout prefers data in traditional CSV format, with a comma. You may be able to use the Hive command “set hive.io.output.fileformat = CSVTextFile” prior to running the export snippet above. However, this does not always seem to work, perhaps depending upon your environment. Failing that, you can do a find-and-replace using standard Unix command-line tools, such as awk, sed, or tr (as sketched below), or with an editor such as Emacs or Vi.
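As one possible approach, the Control-A delimiter can be translated to commas with tr. The input file name 000000_0 below is a typical Hive output file name and is an assumption here, so adjust it to match your export.
#Translate Hive's Control-A (octal 001) delimiter to commas to produce a CSV
tr '\001' ',' < /mnt/hgfs/BigDataAnalytics/Project1/ApacheLog/000000_0 > statusgroupings.csv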
The logistic regression below simply attempts to identify whether there is a relationship between hosts and whether or not they tend to produce successful or failed requests.
Below is the Mahout command we used.
/usr/bin/mahout trainlogistic \
--input statusgroupings.csv \
--output ./model \
--target failedaccess \
--categories 2 \
--predictors host \
--types word \
--features 50 \
--passes 20 \
--rate 50
The full command line output is far too lengthy to list here. However, segments of it are shown below. Incidentally, the term “failedaccess” is the variable name to be predicted and is not an error message.
[cloudera@localhost Project1]$ /usr/bin/mahout trainlogistic \
> --input statusgroupings.csv \
> --output ./model \
> --target failedaccess \
> --categories 2 \
> --predictors host identity user \
> --types word \
> --features 50 \
> --passes 20 \
> --rate 50
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP _CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.2.0-job.jar
50
failedaccess ∼ -3.757∗Intercept Term + -19.927∗host=112.200.11.174 + -18.807∗host=112.200.200.192 + …

Intercept Term -3.75710
host=112.200.11.174 -19.92741
host=112.200.200.192 -18.80668
host=112.202.3.173 -0.08841
host=112.202.43.173 1.03845
host=114.111.36.26 0.33822
Figure 3.3 Logistic regression coefficients for each host.
host=114.127.246.36 3.49805
host=114.200.199.144 0.47374
host=114.39.145.56 -3.32575

13/05/07 18:28:30 INFO driver.MahoutDriver: Program took 70195 ms (Minutes: 1.1699166666666667)
Next, we copy the long list of hosts and their respective coefficients into a spreadsheet program for some quick visualization. The graph in Figure 3.3 shows how a few top hosts tend to drive a negative prediction of request failure. In other words, some hosts are strongly associated with consistently successful requests. On the right side of the graph, the gradual slope suggests less predictive potential for hosts that can be associated with failed requests. The numbers on the x-axis are index numbers, with one number per host.
These results are not that surprising, given that we already know that this data set does not exhibit strong attack patterns, aside from a few isolated attack cases that were added to it. If we had a data set containing a series of events leading up to an attack, there might, indeed, be some hosts that could be characterized by failed requests. However, a sorted list of top hosts by failed requests, as was produced earlier, would also reveal this. If we had a larger number of positively identified breaches, logistic regression and other classifiers might prove more useful. Nevertheless, this exercise shows how Hive is useful for producing data sets suitable for analysis with other tools, such as Mahout (Figure 3.3).
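As an aside, the same quick look at the coefficients can be done in R rather than a spreadsheet. The sketch below assumes the host and coefficient pairs have been saved to a hypothetical two-column CSV file named coefficients.csv.
#Sketch: plot sorted logistic regression coefficients, one point per host
coefs <- read.csv("coefficients.csv", header=FALSE, col.names=c("host","coefficient"))
coefs <- coefs[order(coefs$coefficient),] #sort from most negative to most positive
plot(coefs$coefficient, type="h",
 main="Logistic Regression Coefficients by Host",
 xlab="Host index (sorted)", ylab="Coefficient")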

Another Potential Analytical Data Set: Unstacked Status Codes

Some analytical tools and methods require categories to be unstacked. That is, each category within a column must be converted into its own separate column. Each new category column contains a one for every row to which the category applies and a zero otherwise. Converting categorical data into columns in this way is sometimes referred to as creating dummy variables, and it is a convenient way to turn a categorical variable into numerical variables.
The following code unstacks the status codes in this manner. The idea here is that each of these status codes could have predictive potential. For instance, we might be able to do a cross-correlation over time to see whether a number of failed requests from a particular host is predictive of a string of successes later. Failed requests might also precede large file sizes being transferred. The Hive code below could serve as a starting point for such an analysis.
--Unstack status codes into individual columns, create date fields, and show all other columns as well
CREATE VIEW IF NOT EXISTS unstacked_status_codes AS
SELECT host, identity, user, time, CASE substr(time,5,3)
WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’
END
AS month, substr(time,9,4) AS year, concat(substr(time,9,4), CASE substr(time,5,3) WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’
END) AS yearmonth, concat(CASE substr(time,5,3)
WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’ END,substr(time,2,2)) AS monthday, request, CASE status WHEN ‘100’ THEN 1 ELSE 0 END AS 100Continue, CASE status WHEN ‘101’ THEN 1 ELSE 0 END AS 101SwitchingProtocols, CASE status WHEN ‘102’ THEN 1 ELSE 0 END AS 102Processing, CASE status WHEN ‘200’ THEN 1 ELSE 0 END AS 200OK, CASE status WHEN ‘201’ THEN 1 ELSE 0 END AS 201Created, CASE status WHEN ‘202’ THEN 1 ELSE 0 END AS 202Accepted, CASE status WHEN ‘203’ THEN 1 ELSE 0 END AS 203NonAuthoritativeInformation, CASE status WHEN ‘204’ THEN 1 ELSE 0 END AS 204NoContent, CASE status WHEN ‘205’ THEN 1 ELSE 0 END AS 205ResetContent, CASE status WHEN ‘206’ THEN 1 ELSE 0 END AS 206PartialContent, CASE status WHEN ‘207’ THEN 1 ELSE 0 END AS 207MultiStatus, CASE status WHEN ‘208’ THEN 1 ELSE 0 END AS 208AlreadyReported, CASE status WHEN ‘226’ THEN 1 ELSE 0 END AS 226IMUsed, CASE status WHEN ‘300’ THEN 1 ELSE 0 END AS 300MultipleChoices, CASE status WHEN ‘301’ THEN 1 ELSE 0 END AS 301MovedPermanently, CASE status WHEN ‘302’ THEN 1 ELSE 0 END AS 302Found, CASE status WHEN ‘303’ THEN 1 ELSE 0 END AS 303SeeOther, CASE status WHEN ‘304’ THEN 1 ELSE 0 END AS 304NotModified, CASE status WHEN ‘305’ THEN 1 ELSE 0 END AS 305UseProxy, CASE status WHEN ‘306’ THEN 1 ELSE 0 END AS 306SwitchProxy, CASE status WHEN ‘307’ THEN 1 ELSE 0 END AS 307TemporaryRedirect, CASE status WHEN ‘308’ THEN 1 ELSE 0 END AS 308PermanentRedirect, CASE status WHEN ‘400’ THEN 1 ELSE 0 END AS 400BadRequest, CASE status WHEN ‘401’ THEN 1 ELSE 0 END AS 401Unauthorized, CASE status WHEN ‘402’ THEN 1 ELSE 0 END AS 402PaymentRequired, CASE status WHEN ‘403’ THEN 1 ELSE 0 END AS 403Forbidden, CASE status WHEN ‘404’ THEN 1 ELSE 0 END AS 404NotFound, CASE status WHEN ‘405’ THEN 1 ELSE 0 END AS 405MethodNotAllowed, CASE status WHEN ‘406’ THEN 1 ELSE 0 END AS 406NotAcceptable, CASE status WHEN ‘407’ THEN 1 ELSE 0 END AS 407ProxyAuthentication Required, CASE status WHEN ‘408’ THEN 1 ELSE 0 END AS 408RequestTimeout, CASE status WHEN ‘409’ THEN 1 ELSE 0 END AS 409Conflict, CASE status WHEN ‘410’ THEN 1 ELSE 0 END AS 410Gone, CASE status WHEN ‘411’ THEN 1 ELSE 0 END AS 411LengthRequired, CASE status WHEN ‘412’ THEN 1 ELSE 0 END AS 412PreconditionFailed, CASE status WHEN ‘413’ THEN 1 ELSE 0 END AS 413RequestEntityTooLarge, CASE status WHEN ‘414’ THEN 1 ELSE 0 END AS 414RequestUriTooLong, CASE status WHEN ‘415’ THEN 1 ELSE 0 END AS 415UnsupportedMediaType, CASE status WHEN ‘416’ THEN 1 ELSE 0 END AS 416RequestedRangeNotSatisfiable, CASE status WHEN ‘417’ THEN 1 ELSE 0 END AS 417ExpectationFailed, CASE 
status WHEN ‘418’ THEN 1 ELSE 0 END AS 418ImATeapot, CASE status WHEN ‘420’ THEN 1 ELSE 0 END AS 420EnhanceYourCalm, CASE status WHEN ‘422’ THEN 1 ELSE 0 END AS 422UnprocessableEntity, CASE status WHEN ‘423’ THEN 1 ELSE 0 END AS 423Locked, CASE status WHEN ‘424’ THEN 1 ELSE 0 END AS 424FailedDependency, CASE status WHEN ‘424’ THEN 1 ELSE 0 END AS 424MethodFailure, CASE status WHEN ‘425’ THEN 1 ELSE 0 END AS 425UnorderedCollection, CASE status WHEN ‘426’ THEN 1 ELSE 0 END AS 426UpgradeRequired, CASE status WHEN ‘428’ THEN 1 ELSE 0 END AS 428PreconditionRequired, CASE status WHEN ‘429’ THEN 1 ELSE 0 END AS 429TooManyRequests, CASE status WHEN ‘431’ THEN 1 ELSE 0 END AS 431RequestHeaderFieldsTooLarge, CASE status WHEN ‘444’ THEN 1 ELSE 0 END AS 444NoResponse, CASE status WHEN ‘449’ THEN 1 ELSE 0 END AS 449RetryWith, CASE status WHEN ‘450’ THEN 1 ELSE 0 END AS 450BlockedByWindowsParentalControls, CASE status WHEN ‘451’ THEN 1 ELSE 0 END AS 451UnavailableForLegalReasonsOrRedirect, CASE status WHEN ‘494’ THEN 1 ELSE 0 END AS 494RequestHeaderTooLarge, CASE status WHEN ‘495’ THEN 1 ELSE 0 END AS 495CertError, CASE status WHEN ‘496’ THEN 1 ELSE 0 END AS 496NoCert, CASE status WHEN ‘497’ THEN 1 ELSE 0 END AS 497HttpToHttps, CASE status WHEN ‘499’ THEN 1 ELSE 0 END AS 499ClientClosedRequest, CASE status WHEN ‘500’ THEN 1 ELSE 0 END AS 500InternalServerError, CASE status WHEN ‘501’ THEN 1 ELSE 0 END AS 501NotImplemented, CASE status WHEN ‘502’ THEN 1 ELSE 0 END AS 502BadGateway, CASE status WHEN ‘503’ THEN 1 ELSE 0 END AS 503ServiceUnavailable, CASE status WHEN ‘504’ THEN 1 ELSE 0 END AS 504GatewayTimeout, CASE status WHEN ‘505’ THEN 1 ELSE 0 END AS 505HttpVersionNotSupported, CASE status WHEN ‘506’ THEN 1 ELSE 0 END AS 506VariantAlsoNegotiates, CASE status WHEN ‘507’ THEN 1 ELSE 0 END AS 507InsufficientStorage, CASE status WHEN ‘508’ THEN 1 ELSE 0 END AS 508LoopDetected, CASE status WHEN ‘509’ THEN 1 ELSE 0 END AS 509BandwidthLimitExceeded, CASE status WHEN ‘510’ THEN 1 ELSE 0 END AS 510NotExtended, CASE status WHEN ‘511’ THEN 1 ELSE 0 END AS 511NetworkAuthenticationRequired, CASE status WHEN ‘598’ THEN 1 ELSE 0 END AS 598NetworkReadTimeoutError, CASE status WHEN ‘599’ THEN 1 ELSE 0 END AS 599NetworkConnectTimeoutError, size, referer, agent
FROM apachelog;
As you will see, the output from creating the view indicates that all of these variables were created.
OK
host identity user time month year yearmonth monthday
 request 100continue 101switchingprotocols
 102processing 200ok 201created 202accepted
 203nonauthoritativeinformation 204nocontent
 205resetcontent 206partialcontent 207multistatus
 208alreadyreported 226imused 300multiplechoices
 301movedpermanently 302found 303seeother 304notmodified
 305useproxy 306switchproxy 307temporaryredirect
 308permanentredirect 400badrequest 401unauthorized
 402paymentrequired 403forbidden 404notfound 
405methodnotallowed 406notacceptable
 407proxyauthenticationrequired 408requesttimeout
 409conflict 410gone 411lengthrequired
 412preconditionfailed 413requestentitytoolarge
 414requesturitoolong 415unsupportedmediatype
 416requestedrangenotsatisfiable 417expectationfailed
 418imateapot 420enhanceyourcalm 422unprocessableentity
 423locked 424faileddependency 424methodfailure
 425unorderedcollection 426upgraderequired
 428preconditionrequired 429toomanyrequests
 431requestheaderfieldstoolarge 444noresponse
 449retrywith 450blockedbywindowsparentalcontrols
 451unavailableforlegalreasonsorredirect
 494requestheadertoolarge 495certerror 496nocert
 497httptohttps 499clientclosedrequest
 500internalservererror 501notimplemented 502badgateway
 503serviceunavailable 504gatewaytimeout
 505httpversionnotsupported 506variantalsonegotiates
 507insufficientstorage 508loopdetected
 509bandwidthlimitexceeded 510notextended
 511networkauthenticationrequired
 598networkreadtimeouterror
 599networkconnecttimeouterrorsize referer agent
Time taken: 2.128 seconds
We then sent the results as output to a text file so that they are available for further analysis with external tools, presuming you, as the analyst, have enough memory on your machine. If not, further aggregation with Hive would be desirable, perhaps grouping counts of all the status codes by day (a sketch of such an aggregation appears at the end of this section). We used the following command to export the file.
INSERT OVERWRITE LOCAL DIRECTORY '/mnt/hgfs/BigDataAnalytics/Project1/UnstackedStatusCodes'
SELECT *
FROM unstacked_status_codes;
We have provided the first row of the output below. You will see that although this output may look like a table, it is a single, wide row that is wrapped.
66.249.67.3 - - [20/Jul/2009:20:12:22 -0700] 07 2009 200907 
0720 “GET /gallery/main.php?g2_controller=exif.SwitchDetailMode
&g2_mode=detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_
itemId%3D15741&g2_returnName=photo HTTP/1.1” 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 5 “-” 
“Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/
bot.html)
We should see only a single numeral one, with all other status categories being zero. Since that is what we find, it appears our query was successful.
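As a possible follow-up, sketched below, the unstacked columns can be aggregated into daily counts, producing a small data set suitable for the cross-correlation idea mentioned earlier (for example, with R's ccf() function). The choice of status columns here is illustrative, and some Hive versions may require backticks around column names that begin with a digit.
--Sketch: daily counts of selected status codes from the unstacked view
SELECT concat(year, monthday) AS yearmonthday,
 SUM(200OK) AS ok_count,
 SUM(404NotFound) AS notfound_count,
 SUM(500InternalServerError) AS servererror_count
FROM unstacked_status_codes
GROUP BY concat(year, monthday)
ORDER BY yearmonthday ASC;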

Other Applicable Security Areas and Scenarios

Although these examples involved server logs, the same methods can be applied to forensic analysis of data from other systems sources as well. For instance, data collected on network traffic, such as from router log files, can also accumulate in very large amounts. Big data tools and analysis methods can similarly prove useful for these scenarios as well.
One of the main challenges in analyzing other kinds of log files is parsing all of the text data, as there is an enormous variety of file formats that an analyst could encounter. Generally, these files are parsed using regular expressions, such as the one shown at the beginning of this chapter. However, writing regular expressions is notoriously challenging and generally requires a significant amount of patience and trial and error. Do not give up. There are plenty of examples for most common server log types available from your favorite Web search engine; you can start with an example and tweak it to adapt it to your specific format, as sketched below. It is also helpful to have a regular expression editing tool. There are many of these to choose from on the Internet, and quite a few of them are free. One favorite among many analysts is www.regexr.com.
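As an illustration only, the regular expression below is a generic pattern for the Apache combined log format (not the one used earlier in this chapter); the R sketch splits a single log line into its component fields.
#Sketch: parse one Apache combined-format log line with a regular expression
logline <- '192.168.1.198 - - [20/Jul/2009:20:13:18 -0700] "GET / HTTP/1.1" 200 17935 "-" "Mozilla/5.0"'
pattern <- '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] "([^"]*)" (\\d{3}) (\\S+) "([^"]*)" "([^"]*)"'
fields <- regmatches(logline, regexec(pattern, logline))[[1]][-1] #drop the full match
names(fields) <- c("host","identity","user","time","request","status","size","referer","agent")
print(fields)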
An alternative to writing regular expressions from scratch is to try a parsing tool. A popular tool on Windows operating systems is Microsoft's Log Parser. Do not expect a nice graphical interface, though, as it is command-line based. Reviewing all the possible parsing tools is beyond the scope of this book, but there are a number of them available, and new ones continue to be developed.
Of course, log files are not the only sources of data available for analysis. For example, e-mail and network traffic sniffers can also be valuable sources of security-related data. Many of the same techniques can be applied as with server logs. For instance, e-mail messages have various components that can be parsed, stored, and queried, such as the recipient, sender, date stamp, and so on. The same can be said of the output from network sniffers. However, e-mail messages tend to have larger amounts of unstructured data within the body section, and this requires different methods of analysis than what we have discussed in this chapter. Text mining techniques are particularly useful for unstructured data. We will cover some text mining techniques in Chapter 6.

Summary

Hive provides a very useful framework for analyzing large amounts of server log data. As attack vectors are so varied, a flexible tool that enables drilldowns and ad hoc analysis on the fly is very useful. However, it can also be useful to have a collection of queries and methods for analyzing common attack vectors as a starting point. We hope that the ideas offered here may serve as a catalyst for further creativity and research into this topic.