This chapter focuses on the analysis of data so that you have the skillset to analyze the large volume of most organizations already collect. We cover the use of open-source tools, such as Hadoop, MapReduce, R, and Mahout, so that you have a solid foundation for understanding the analytical process on any platform. Specifically, we demonstrate the power of these tools by applying them to a typical security scenario—analyzing very large collections of server log files. The tools and methods can be applied to the analysis of data from other systems sources as well.
[cloudera@localhost Project1]$ hive.
Logging initialized using configuration in jar:file:/usr/lib/hive/ lib/hive-common-0.10.0-cdh4.2.0.jar!/hive-log4j.properties
Hive history file=/tmp/cloudera/hive_job_log_cloudera_201305061902_843121553.txt
hive>
hive> add jar /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.2.0.jar;
Added /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.2.0.jar to class path
Added resource: /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.2.0.jar
hive>
hive> set hive.cli.print.header=true;
hive> CREATE TABLE apachelog (
> host STRING,
> identity STRING,
> user STRING,
> time STRING,
> request STRING,
> status STRING,
> size STRING,
> referer STRING,
> agent STRING)
> ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
> WITH SERDEPROPERTIES (“input.regex” = “([^ ]∗) ([^ ]∗) ([^ ]∗) (-
> |\[[^\]]∗\]) ([^ ”]∗|”[^”]∗”) (-|[0-9]∗) (-|[0-9]∗)(?: ([^
> ”]∗|”[^”]∗”) ([^ ”]∗|”[^”]∗”))?”, “output.format.string” =
> “%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s”)
> STORED AS TEXTFILE;
OK
Time taken: 0.029 seconds
hive> LOAD DATA LOCAL INPATH “ApacheLogData/access∗” INTO TABLE apachelog;
Copying data from file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access∗
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_1
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_2
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_3
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_4
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_5
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_6
Copying file: file:/mnt/hgfs/BigDataAnalytics/Project1/ApacheLogData/access_log_7
Loading data to table default.apachelog
Table default.apachelog stats: [num_partitions: 0, num_files: 7, num_rows: 0, total_size: 53239106, raw_data_size: 0]
OK
Time taken: 0.614 seconds
hadoop@domU-12-31-39-00-88-72:∼$ hadoop dfs -mkdir temp
hadoop@domU-12-31-39-00-88-72:∼$ hadoop dfs -cp ‘s3n:// elasticmapreduce/samples/pig-apache/input/∗’ temp
13/04/22 19:17:43 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_1’ for reading
13/04/22 19:17:46 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_2’ for reading
13/04/22 19:17:48 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_3’ for reading
13/04/22 19:17:49 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_4’ for reading
13/04/22 19:17:50 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_5’ for reading
13/04/22 19:17:52 INFO s3native.NativeS3FileSystem: Opening ‘s3n:// elasticmapreduce/samples/pig-apache/input/access_log_6’ for reading
hadoop@domU-12-31-39-00-88-72:∼$ hadoop dfs -cp temp/∗ ‘s3n://Project1E185/’
13/04/22 19:19:36 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_1’ in S3
13/04/22 19:19:40 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_1 8754118
13/04/22 19:19:40 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_2’ in S3
13/04/22 19:19:42 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_2 8902171
13/04/22 19:19:42 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_3’ in S3
13/04/22 19:19:44 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_3 8896201
13/04/22 19:19:44 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_4’ in S3
13/04/22 19:19:46 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_4 8886636
13/04/22 19:19:46 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_5’ in S3
13/04/22 19:19:48 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_5 8902365
13/04/22 19:19:48 INFO s3native.NativeS3FileSystem: Creating new file ‘s3n://Project1E185/access_log_6’ in S3
13/04/22 19:19:50 INFO s3native.Jets3tNativeFileSystemStore: s3.putObject Project1E185 access_log_6 8892828
SELECT ∗ FROM apachelog
WHERE LOWER(request) LIKE ‘% like %’
OR LOWER(request) LIKE ‘%select %’
OR LOWER(request) LIKE ‘% from %’
OR LOWER(request) LIKE ‘% where %’
OR LOWER(request) LIKE ‘% if %’
OR LOWER(request) LIKE ‘% having %’
OR LOWER(request) LIKE ‘% case %’
OR LOWER(request) LIKE ‘% when %’;
hive> SELECT ∗ FROM apachelog
> WHERE LOWER(request) LIKE ‘% like %’
> OR LOWER(request) LIKE ‘%select %’
> OR LOWER(request) LIKE ‘% from %’
> OR LOWER(request) LIKE ‘% where %’
> OR LOWER(request) LIKE ‘% if %’
> OR LOWER(request) LIKE ‘% having %’
> OR LOWER(request) LIKE ‘% case %’
> OR LOWER(request) LIKE ‘% when %’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0002, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0002
Kill Command = /usr/lib/hadoop/bin/hadoop job-kill job_201305061901_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 20:16:31,416 Stage-1 map = 0%,reduce = 0%
2013-05-06 20:16:39,459 Stage-1 map = 100%,reduce = 0%, Cumulative CPU 4.82 sec
2013-05-06 20:16:40,471 Stage-1 map = 100%,reduce = 100%, Cumulative CPU 4.82 sec
MapReduce Total cumulative CPU time: 4 seconds 820 msec
Ended Job = job_201305061901_0002
MapReduce Jobs Launched:
Job 0: Map: 1Cumulative CPU: 4.82 sec HDFS Read: 53239663 HDFS Write: 218 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 820 msec
OK
host identity user time size referer status agent
216.185.64.79 - - [18/Sep/2009:00:00:55 -0800] “GET /SELECT ∗ FROM users WHERE username = ‘’ having 1=1-- HTTP/1.1” 200 3164 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)”
Time taken: 11.572 seconds
hive>
SELECT ∗ FROM apachelog
WHERE LOWER(request) LIKE ‘%usr/%’
OR LOWER(request) LIKE ‘%∼/%’
OR LOWER(request) LIKE ‘%.exe%’
OR LOWER(request) LIKE ‘%.ini%’
OR LOWER(request) LIKE ‘%usr/%’
OR LOWER(request) LIKE ‘%etc/%’
--OR LOWER(request) LIKE ‘%home/%’
--OR LOWER(request) LIKE ‘%bin/%’
OR LOWER(request) LIKE ‘%dev/%’
OR LOWER(request) LIKE ‘%opt/%’
OR LOWER(request) LIKE ‘%root/%’
OR LOWER(request) LIKE ‘%sys/%’
OR LOWER(request) LIKE ‘%boot/%’
OR LOWER(request) LIKE ‘%mnt/%’
OR LOWER(request) LIKE ‘%proc/%’
OR LOWER(request) LIKE ‘%sbin/%’
OR LOWER(request) LIKE ‘%srv/%’
OR LOWER(request) LIKE ‘%var/%’
OR LOWER(request) LIKE ‘%c:\%’
OR LOWER(request) LIKE ‘%..%’;
hive> SELECT ∗ FROM apachelog
> WHERE LOWER(request) LIKE ‘%usr/%’
> OR LOWER(request) LIKE ‘%∼/%’
> OR LOWER(request) LIKE ‘%.exe%’
> OR LOWER(request) LIKE ‘%.ini%’
> OR LOWER(request) LIKE ‘%usr/%’
> OR LOWER(request) LIKE ‘%etc/%’
> --OR LOWER(request) LIKE ‘%home/%’
> --OR LOWER(request) LIKE ‘%bin/%’
> OR LOWER(request) LIKE ‘%dev/%’
> OR LOWER(request) LIKE ‘%opt/%’
> OR LOWER(request) LIKE ‘%root/%’
> OR LOWER(request) LIKE ‘%sys/%’
> OR LOWER(request) LIKE ‘%boot/%’
> OR LOWER(request) LIKE ‘%mnt/%’
> OR LOWER(request) LIKE ‘%proc/%’
> OR LOWER(request) LIKE ‘%sbin/%’
> OR LOWER(request) LIKE ‘%srv/%’
> OR LOWER(request) LIKE ‘%var/%’
> OR LOWER(request) LIKE ‘%c:\%’
> OR LOWER(request) LIKE ‘%..%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0003>
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 20:32:02,894 Stage-1 map = 0%, reduce = 0%
2013-05-06 20:32:10,931 Stage-1 map = 83%, reduce = 0%
2013-05-06 20:32:11,937 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.58 sec
2013-05-06 20:32:12,944 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.58 sec
2013-05-06 20:32:13,956 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.58 sec
MapReduce Total cumulative CPU time: 7 seconds 580 msec
Ended Job = job_201305061901_0003
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 7.58 sec HDFS Read: 53239663 HDFS Write: 855 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 580 msec
OK
host identity user time size referer status agent
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /cgi-bin/ powerup/r.cgi?FILE=../../../../../../../../../../etc/passwd HTTP/1.1” 404 539 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003175)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /cgi-bin/ r.cgi?FILE=../../../../../../../../../../etc/passwd HTTP/1.1” 404 531 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003176)”
216.185.64.79 - - [18/Sep/2009:00:00:55 -0800] “GET /example.com/ doc/..%5c../Windows/System32/cmd.exe?/c+dir+c: HTTP/1.1” 200 3164 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+”>http://www. google.com/feedfetcher.html)”
216.185.64.79 - - [18/Sep/2009:00:00:55 -0800] “GET /example.com/ example.asp?display=../../../../../Windows/system.ini HTTP/1.1” 200 3164 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+”>http:// www.google.com/feedfetcher.html)”
Time taken: 13.626 seconds
hive>
SELECT ∗ FROM apachelog
WHERE LOWER(request) LIKE ‘%>alert%’
OR LOWER(request) LIKE ‘%vulnerable%’;
hive> SELECT ∗ FROM apachelog
> WHERE LOWER(request) LIKE ‘%>alert%’
> OR LOWER(request) LIKE ‘%vulnerable%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305071923_0007, Tracking URL = http:// localhost.localdomain:50030/jobdetails.jsp?jobid=job_201305071923_0007
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305071923_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-07 22:37:45,751 Stage-1 map = 0%, reduce = 0%
2013-05-07 22:37:53,784 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.01 sec
2013-05-07 22:37:54,796 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.01 sec
MapReduce Total cumulative CPU time: 7 seconds 10 msec
Ended Job = job_201305071923_0007
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 7.01 sec HDFS Read: 159723693 HDFS Write: 2428 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 10 msec
OK
host identity user time request status size referer agent
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /options. php?optpage=<script>alert(‘Vulnerable!’)</script> HTTP/1.1” 404 529 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003171)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /search.php?ma ilbox=INBOX&what=x&where=<script>alert(‘Vulnerable!’)</script>& submit=Search HTTP/1.1” 404 528 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003172)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /help. php?chapter=<script>alert(‘Vulnerable’)</script> HTTP/1.1” 404 526 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003173)”
SELECT ∗ FROM apachelog
WHERE LOWER(request) LIKE ‘%&comma%’
OR LOWER(request) LIKE ‘%20echo%’
OR LOWER(request) LIKE ‘%60id%’;
hive> SELECT ∗ FROM apachelog
> WHERE LOWER(request) LIKE ‘%&comma%’
> OR LOWER(request) LIKE ‘%20echo%’
> OR LOWER(request) LIKE ‘%60id%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305071923_0005, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305071923_0005>
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305071923_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-07 22:28:42,343 Stage-1 map = 0%, reduce = 0%
2013-05-07 22:28:51,378 Stage-1 map = 83%, reduce = 0%
2013-05-07 22:28:52,384 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.09 sec
2013-05-07 22:28:53,394 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.09 sec
MapReduce Total cumulative CPU time: 8 seconds 90 msec
Ended Job = job_201305071923_0005
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 8.09 sec HDFS Read: 159723693 HDFS Write: 6080 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 90 msec
OK
host identity user time request status size referer agent
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET / forumscalendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 536 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003039)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET / forumzcalendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 536 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003040)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET / htforumcalendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 537 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003041)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /vbcalendar. php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22;echo%20
’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 532 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003042)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET / vbulletincalendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 539 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003043)”
10.255.255.124 - - [25/Apr/2013:15:31:46 -0400] “GET /cgi-bin/ calendar.php?calbirthdays=1&action=getday&day=2001-8-15&comma=%22; echo%20’’;%20echo%20%60id%20%60;die();echo%22 HTTP/1.1” 404 538 “-” “Mozilla/4.75 (Nikto/2.1.4) (Evasions:None) (Test:003044)”
Time taken: 13.51 seconds
SELECT ∗ FROM apachelog
WHERE LOWER(request) LIKE ‘%alter%’
AND LOWER(request) LIKE ‘%character%’
AND LOWER(request) LIKE ‘%set%’;
SELECT ∗ FROM apachelog
WHERE LOWER(request) LIKE ‘%waitfor%’
AND LOWER(request) LIKE ‘%time%’;
SELECT ∗ FROM apachelog
WHERE LOWER(request) LIKE ‘%goto%’;
hive> SELECT ∗ FROM apachelog
> WHERE LOWER(request) LIKE ‘%alter%’
> AND LOWER(request) LIKE ‘%character%’
> AND LOWER(request) LIKE ‘%set%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0005, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0005
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 21:07:52,856 Stage-1 map = 0%, reduce = 0%
2013-05-06 21:07:57,880 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.82 sec
2013-05-06 21:07:58,886 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.82 sec
2013-05-06 21:07:59,897 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.82 sec
MapReduce Total cumulative CPU time: 2 seconds 820 msec
Ended Job = job_201305061901_0005
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.82 sec HDFS Read: 53239663 HDFS Write: 277 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 820 msec OK
216.185.64.79 - - [18/Sep/2009:00:00:55 -0800] “GET /ALTER TABLE ‘users’ CHANGE ‘password’ ‘password’ VARCHAR(255) CHARACTER SET gbk COLLATE gbk_chinese_ci NOT NULL HTTP/1.1” 200 3164 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/ feedfetcher.html)”
Time taken: 9.927 seconds
hive>
> SELECT ∗ FROM apachelog
> WHERE LOWER(request) LIKE ‘%waitfor%’
> AND LOWER(request) LIKE ‘%time%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0006, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0006
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 21:08:02,294 Stage-1 map = 0%, reduce = 0%
2013-05-06 21:08:08,318 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.75 sec
2013-05-06 21:08:09,328 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.75 sec
MapReduce Total cumulative CPU time: 2 seconds 750 msec
Ended Job = job_201305061901_0006
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.75 sec HDFS Read: 53239663 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 750 msec
OK
Time taken: 9.437 seconds
hive>
> SELECT ∗ FROM apachelog
> WHERE LOWER(request) LIKE ‘%goto%’;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0007, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0007
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-06 21:08:23,417 Stage-1 map = 0%, reduce = 0%
2013-05-06 21:08:28,438 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.72 sec
2013-05-06 21:08:29,450 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.72 sec
MapReduce Total cumulative CPU time: 2 seconds 720 msec
Ended Job = job_201305061901_0007
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.72 sec HDFS Read: 53239663 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 720 msec
OK
Time taken: 8.413 seconds
CREATE VIEW IF NOT EXISTS statusgroupings AS
SELECT host, identity, user, time, request, status, size, referer, agent, CASE substr(status,1,1)
WHEN ‘1’ THEN ‘0’
WHEN ‘2’ THEN ‘0’
WHEN ‘3’ THEN ‘0’
WHEN ‘4’ THEN ‘1’
WHEN ‘5’ THEN ‘1’
ELSE ‘0’
END
AS failedaccess
FROM apachelog;
--store the results of the preceding view
set hive.io.output.fileformat = CSVTextFile;
INSERT OVERWRITE LOCAL DIRECTORY ‘/mnt/hgfs/BigDataAnalytics/Project1/ ApacheLog’ SELECT ∗ FROM statusgroupings;
--sum total failed access attempts by host
CREATE VIEW IF NOT EXISTS FailedHostAttempts AS
SELECT host, SUM(failedaccess) AS failedAttempts
FROM statusgroupings
GROUP BY host
ORDER BY failedAttempts DESC;
--count total host attempts to access
CREATE VIEW IF NOT EXISTS TotalHostAttempts AS
SELECT host, count(host) AS hostAccessAttempts
FROM statusgroupings
GROUP BY host
ORDER BY hostAccessAttempts DESC;
--top 20 proportions of failed attempts
SELECT a.host, failedAttempts, hostAccessAttempts, failedAttempts / hostAccessAttempts AS percentFailed
FROM TotalHostAttempts a
JOIN FailedHostAttempts b
ON a.host = b.host
WHERE failedAttempts / hostAccessAttempts > 0
ORDER BY percentFailed DESC
LIMIT 20;
Total MapReduce CPU Time Spent: 17 seconds 260 msec
OK
host failedattempts hostaccessattempts percentfailed
189.106.160.140 1.0 1 1.0
10.255.255.124 12.0 12 1.0
79.11.25.143 1.0 1 1.0
99.157.209.86 1.0 1 1.0
24.21.134.171 3.0 4 0.75
81.193.181.206 2.0 3 0.6666666666666666
87.238.130.200 4.0 7 0.5714285714285714
165.166.103.33 1.0 2 0.5
124.177.134.93 2.0 4 0.5
88.38.60.152 2.0 4 0.5
84.203.44.110 2.0 4 0.5
115.74.146.80 1.0 2 0.5
71.49.39.235 1.0 2 0.5
68.96.161.201 1.0 2 0.5
62.150.156.229 1.0 2 0.5
58.69.168.155 1.0 2 0.5
58.68.8.190 2.0 4 0.5
41.100.138.220 1.0 2 0.5
4.131.17.243 1.0 2 0.5
201.43.250.154 1.0 2 0.5
Time taken: 64.786 seconds
--top 20 proportions of failed attempts where more than 20 attempts total
SELECT a.host, failedAttempts, hostAccessAttempts, failedAttempts / hostAccessAttempts AS percentFailed
FROM TotalHostAttempts a
JOIN FailedHostAttempts b
ON a.host = b.host
WHERE failedAttempts / hostAccessAttempts > 0
AND hostAccessAttempts > 20
ORDER BY percentFailed DESC
LIMIT 20;
Total MapReduce CPU Time Spent: 16 seconds 880 msec
OK
host failedattempts hostaccessattempts percentfailed
221.221.9.60 14.0 50 0.28
41.202.75.35 5.0 26 0.19230769230769232
41.178.112.197 17.0 89 0.19101123595505617
121.205.226.76 63.0 334 0.18862275449101795
114.200.199.144 5.0 27 0.18518518518518517
78.178.233.22 5.0 27 0.18518518518518517
216.24.131.152 12.0 66 0.18181818181818182
68.42.128.66 4.0 22 0.18181818181818182
201.2.78.10 4.0 24 0.16666666666666666
194.141.2.1 4.0 24 0.16666666666666666
194.110.194.1 5.0 30 0.16666666666666666
69.122.96.121 4.0 24 0.16666666666666666
68.117.200.60 4.0 24 0.16666666666666666
76.122.81.132 4.0 25 0.16
79.66.21.60 6.0 38 0.15789473684210525
24.183.197.231 3.0 21 0.14285714285714285
68.193.125.219 4.0 28 0.14285714285714285
149.6.164.150 6.0 43 0.13953488372093023
99.141.88.139 3.0 22 0.13636363636363635
15.203.233.77 3.0 22 0.13636363636363635
Time taken: 65.271 seconds
--bot activity
SELECT agent, count(agent) AS hits
FROM apachelog
WHERE agent LIKE ‘%bot%’
GROUP BY agent
ORDER BY hits DESC;
Total MapReduce CPU Time Spent: 6 seconds 920 msec
OK
agent hits
“Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” 45528
“msnbot/2.0b (+http://search.msn.com/msnbot.htm)” 9248 “msnbot/1.1 (+http://search.msn.com/msnbot.htm)” 8400.
“Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)” 4571
“Googlebot-Image/1.0” 4543
“Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1; aggregator: Spinn3r (Spinn3r 3.1);http://spinn3r.com/robot) Gecko/20021130” 2431
“Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) VoilaBot BETA 1.2 ([email protected])” 382
“OOZBOT/0.20 (http://www.setooz.com/oozbot.html; agentname at setooz dot_com )” 169“Mozilla/5.0 (compatible; Tagoobot/3.0; +http://www.tagoo.ru)” 102
“Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)” 56
“Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
http://www.changedetection.com/bot.html)” 49
“SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/ 2.1; +http://www.google.com/bot.html)” 32
“Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)” 29
“Gaisbot/3.0+([email protected];+http://gais.cs.ccu.edu.tw/robot.php)” 27
“FollowSite Bot (http://www.followsite.com/bot.html)” 18
“Gigabot/3.0 (http://www.gigablast.com/spider.html)” 13
“MLBot (www.metadatalabs.com/mlbot)” 10
“Mozilla/5.0 (compatible; MJ12bot/v1.2.5; http://www.majestic12.co.uk/bot.php?+)” 10
“Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)” 9
“psbot/0.1 (+http://www.picsearch.com/bot.html)” 8
“Mozilla/5.0 (compatible; seexie.com_bot/4.1; +http://www.seexie.com)” 6
“Mozilla/5.0 (compatible; Exabot-Images/3.0; +http://www.exabot.com/go/robot)” 5
“Mozilla/5.0 (compatible; BuzzRankingBot/1.0; +http://www.buzzrankingbot.com/)” 4
“DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)” 4
Time taken: 23.632 seconds
SELECT agent, count(agent) AS hits, MAX(status) AS sampleStatus
FROM apachelog
WHERE agent LIKE ‘%bot%’
AND substr(status,1,1) IN (‘4’,’5’)
GROUP BY agent
ORDER BY hits DESC
LIMIT 20;
Total MapReduce CPU Time Spent: 7 seconds 190 msec
OK
agenthitssamplestatus
“Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”272500
“Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)”225500
“msnbot/2.0b (+http://search.msn.com/msnbot.htm)”88500
“msnbot/1.1 (+http://search.msn.com/msnbot.htm)”82500
“Googlebot-Image/1.0”74500
Time taken: 31.764 seconds
CREATE VIEW IF NOT EXISTS by_month AS
SELECT host, identity, user, time, CASE substr(time,5,3)
WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’
END
AS month, substr(time,9,4) AS year, concat(substr(time,9,4), CASE substr(time,5,3)
WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’
END) AS yearmonth, concat(CASE substr(time,5,3)
WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’
END,substr(time,2,2)) AS monthday, request, status, size, referer, agent
FROM apachelog;
hive> SELECT time, month, year, monthday, yearmonth FROM by_month LIMIT 1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there’s no reduce operator
Starting Job = job_201305061901_0045, Tracking URL = http://localhost. localdomain:50030/jobdetails.jsp?jobid=job_201305061901_0045>
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201305061901_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-05-07 13:17:27,444 Stage-1 map = 0%, reduce = 0%
2013-05-07 13:17:31,458 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.36 sec
2013-05-07 13:17:32,464 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.36 sec
2013-05-07 13:17:33,471 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_201305061901_0045
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.36 sec HDFS Read: 66093 HDFS Write: 490 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 360 msec
OK
time month year monthday yearmonth
[20/Jul/2009:20:12:22 -0700] 07 2009 0720 200907
Time taken: 9.398 seconds
INSERT OVERWRITE LOCAL DIRECTORY ‘/mnt/hgfs/BigDataAnalytics/Project1/ TopHostFailedLogongsByDay’
--Show hosts sorted by frequency of failed calls to server by day
SELECT monthday, host, COUNT(host) AS host_freq
FROM by_month
WHERE substr(status,1,1) IN(‘4’,’5’)
GROUP BY monthday, host
ORDER BY host_freq DESC
LIMIT 10;
monthdayhosthost_freq
0724 121.205.226.76 57
0807 66.249.67.3 31
0727 66.249.67.3 27
0723 38.99.44.101 27
0724 72.30.142.87 25
0910 66.249.67.3 24
0724 66.249.67.3 24
0927 66.249.67.3 22
0926 66.249.67.3 20
0723 .30.142.87 20
Time taken: 21.697 seconds
--Show hosts sorted by frequency of failed calls to server by month
SELECT yearmonth, host, COUNT(host) AS host_freq
FROM by_month
WHERE substr(status,1,1) IN(‘4’,’5’)
GROUP BY yearmonth, host
ORDER BY host_freq DESC
LIMIT 10;
yearmonthhosthost_freq
200908 66.249.67.3 437
200909 66.249.67.3 433
200909 74.125.74.193 184
200908 64.233.172.17 180
200907 66.249.67.3 178
200909 74.125.16.65 172
200908 72.14.192.65 171
200908 74.125.74.193 169
200908 66.249.67.87 169
200908 74.125.16.65 169
Time taken: 21.694 seconds
--Unsuccessful server calls as a time series by year and month
Create VIEW FailedRequestsTimeSeriesByMonth AS
SELECT yearmonth, COUNT(yearmonth) AS failedrequest_freq
FROM by_month
WHERE substr(status,1,1) IN(‘4’,’5’)
GROUP BY yearmonth
ORDER BY yearmonth ASC;
SELECT ∗ FROM FailedRequestsTimeSeriesByMonth;
yearmonthfailedrequest_freq
200907 1861
200908 2848
200909 2706
200910 55
Time taken: 23.682 seconds
--Successful server calls as a time series by year and month
Create VIEW SuccessfulRequestsTimeSeriesByMonth AS
SELECT yearmonth, COUNT(yearmonth) AS successfulrequest_freq
FROM by_month
WHERE substr(status,1,1) IN(‘1’,’2’,’3’)
GROUP BY yearmonth
ORDER BY yearmonth ASC;
SELECT ∗ FROM SuccessfulRequestsTimeSeriesByMonth;
yearmonthfailedrequest_freq
200907 57972
200908 88821
200909 83185
200910 1902
Time taken: 21.619 seconds
SELECT a.yearmonth, failedrequest_freq / successfulrequest_freq AS failratio
FROM FailedRequestsTimeSeries a
JOIN SuccessfulRequestsTimeSeries b
ON a.yearmonth = b.yearmonth
ORDER BY yearmonth ASC;
yearmonthfailratio
200907 0.03210170427102739
200908 0.03206448925366749
200909 0.032529903227745384
200910 0.028916929547844375
Time taken: 66.867 seconds
--enable the creation of a time series by day over multiple years and months
CREATE VIEW by_day AS
SELECT host, identity, user, time, concat(year, monthday) AS yearmonthday, request, status, size, referer, agent
FROM by_month;
SELECT ∗ FROM by_day LIMIT 10;
host identity user time yearmonthday request status size referer agent
66.249.67.3 - - [20/Jul/2009:20:12:22 -0700] 20090720 “GET / gallery/main.php?g2_controller=exif.SwitchDetailMode&g2_mode= detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D15741&g2_returnName =photo HTTP/1.1”302 5 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/bot.html)
66.249.67.3 - - [20/Jul/2009:20:12:25 -0700] 20090720 “GET /gallery/ main.php?g2_itemId=15741&g2_fromNavId=x8fa12efc HTTP/1.1” 200 8068 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/bot.html)
64.233.172.17 - - [20/Jul/2009:20:12:26 -0700] 20090720 “GET /gwidgets/ alexa.xml HTTP/1.1” 200 2969 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+>http://www.google.com/feedfetcher.html)
74.125.74.193 - - [20/Jul/2009:20:13:01 -0700] 20090720 “GET /gwidgets/ alexa.xml HTTP/1.1” 200 2969 “-” “Mozilla/5.0 (compatible) Feedfetcher-Google; (+>http://www.google.com/feedfetcher.html)
192.168.1.198 - - [20/Jul/2009:20:13:18 -0700] 20090720 “GET / HTTP/1.1” 200 17935 “-” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17”
192.168.1.198 - - [20/Jul/2009:20:13:18 -0700] 20090720 “GET / style.css HTTP/1.1” 200 1504 “http://example.org/” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17”
192.168.1.198 - - [20/Jul/2009:20:13:19 -0700] 20090720 “GET / favicon.ico HTTP/1.1” 404 146 “http://example.org/” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17”
66.249.67.3 - - [20/Jul/2009:20:13:21 -0700] 20090720 “GET / gallery/main.php?g2_controller=exif.SwitchDetailMode&g2_mode= detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D30893&g2_ returnName=photo HTTP/1.1”302 5 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/bot.html)
66.249.67.3 - - [20/Jul/2009:20:13:24 -0700] 20090720 “GET /gallery/ main.php?g2_itemId=30893&g2_fromNavId=xfc647d65 HTTP/1.1” 200 8196 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/bot.html)
66.249.67.3 - - [20/Jul/2009:20:13:29 -0700] 20090720 “GET / gallery/main.php?g2_view=search.SearchScan&g2_form%5BuseDefaultSettings%5D =1&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D15789& g2_returnName=photo HTTP/1.1” 200 6360 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/ bot.html)
Time taken: 9.102 seconds
--Unsuccessful server calls as a time series by year, month, and day
Create VIEW FailedRequestsTimeSeriesByDay AS
SELECT yearmonthday, COUNT(yearmonthday) AS failedrequest_freq
FROM by_day
WHERE substr(status,1,1) IN(‘4’,’5’)
GROUP BY yearmonthday
ORDER BY yearmonthday ASC;
--Successful server calls as a time series by year, month, and day
Create VIEW SuccessfulRequestsTimeSeriesByDay AS
SELECT yearmonthday, COUNT(yearmonthday) AS successfulrequest_freq
FROM by_day
WHERE substr(status,1,1) IN(‘1’,’2’,’3’)
GROUP BY yearmonthday
ORDER BY yearmonthday ASC;
--Calculate ratio of failed to successful requests by year, month, and day
SELECT a.yearmonthday, a.failedrequest_freq / b.successfulrequest_freq AS failratio
FROM FailedRequestsTimeSeriesByDay a
JOIN SuccessfulRequestsTimeSeriesByDay b
ON a.yearmonthday = b.yearmonthday
ORDER BY yearmonthday ASC;
yearmonthday failratio
20090720 0.023759608665269043
20090721 0.024037482175595846
20090722 0.029298848252172157
20090723 0.032535684298908484
20090724 0.04544235924932976
20090725 0.030345800988002825
20090726 0.031446540880503145
20090727 0.03494060097833683
20090728 0.031545741324921134
20090729 0.03138373751783167
20090730 0.03590285110876452
20090731 0.034519956850053934
20090801 0.024278676988036593
20090802 0.0297029702970297
20090803 0.0314026517794836
20090804 0.030692362598144184
20090805 0.039501779359430604
20090806 0.030526315789473683
20090807 0.03762418093426337
20090808 0.029632274187790075
20090809 0.029971791255289138
20090810 0.03563941299790356
20090811 0.036671368124118475
20090812 0.03349788434414669
20090813 0.03076923076923077
20090814 0.031578947368421054
20090815 0.03191862504384427
20090816 0.03364550815123136
20090817 0.029210241615578794
20090818 0.030576789437109102
20090819 0.033402922755741124
20090820 0.034220532319391636
20090821 0.032474408753971055
20090822 0.03897944720056697
20090823 0.029098651525904896
20090824 0.028070175438596492
20090825 0.02638058389025677
20090826 0.029650547123190964
20090827 0.029627047751829907
20090828 0.039628704034273474
20090829 0.035426166257453526
20090830 0.02492102492102492
20090831 0.032418952618453865
20090901 0.02949438202247191
20090902 0.032688927943760986
20090903 0.028690662493479395
20090904 0.029954719609892023
20090905 0.02907180385288967
20090906 0.031042901988140914
20090907 0.03449477351916376
20090908 0.035181236673773986
20090909 0.037141846480367884
20090910 0.03450679679330777
20090911 0.03566433566433566
20090912 0.031282952548330405
20090913 0.030218825981243487
20090914 0.03377437325905292
20090915 0.025804171085189113
20090916 0.030892051371051717
20090917 0.030978934324659233
20090918 0.028441011235955056
20090919 0.02912280701754386
20090920 0.027392510402219142
20090921 0.03273381294964029
20090922 0.031751570132588974
20090923 0.03167898627243928
20090924 0.03349964362081254
20090925 0.0420377627360171
20090926 0.03863716192483316
20090927 0.0328042328042328
20090928 0.040757954951734
20090929 0.030975008799718408
20090930 0.03368794326241135
20091001 0.028916929547844375
Time taken: 68.21 seconds
hive> INSERT OVERWRITE LOCAL DIRECTORY ‘/mnt/hgfs/BigDataAnalytics/Project1/ FailedRequestsByDay’
> SELECT a.yearmonthday
> ,a.failedrequest_freq / b.successfulrequest_freq AS failratio
> FROM FailedRequestsTimeSeriesByDay a
> JOIN SuccessfulRequestsTimeSeriesByDay b
> ON a.yearmonthday = b.yearmonthday
> ORDER BY yearmonthday ASC;
Total MapReduce jobs = 6
Launching Job 1 out of 6
Number of reduce tasks not specified. Estimated from input data size: 1
…
Total MapReduce CPU Time Spent: 18 seconds 800 msec
OK
> rm(list=ls()) #Remove any objects
>
> library(fBasics)
>
> #Import failed requests file, using standard delimeter in Hive
> failedRequests <- read.table(“FailedRequestsByDay.txt”,sep=””)
>
> #Add column headings
> colnames(failedRequests) <- c(“Date”,”FailedRequestsRatio”)
> stdev <- sd(failedRequests$FailedRequestsRatio) #calculate the standard deviation
> avg <- mean(failedRequests$FailedRequestsRatio) #calculate the average
> avgPlus2Stdev <- avg + 2∗stdev #mean plus 2 standard deviations
>
> #Identify the days that had failed requests in excess of 2X the standard deviation
> failedRequests[failedRequests[,2]>avgPlus2Stdev,]
Date FailedRequestsRatio
5 20090724 0.04544236
68 20090925 0.04203776
71 20090928 0.04075795
>
> #Produce a plot and save it as a PDF
> pdf(“PlotOfFailedRequestRatioSeries.pdf”)
> plot(failedRequests[,2],type=’l’,main=”Ratio of Failed Server Requests to Successful Requests by Day”
+ ,xlab=”Day”,ylab=”Ratio of Failed Requests to Successful Requests”)
> lines(rep(avg,length(failedRequests[,2])))
> lines(rep(avgPlus2Stdev,length(failedRequests[,2])),lty=2)
> legend(“topright”,c(“Average”,”2 X Standard Deviation”),lty=c(1,2))
> dev.off()
null device 1
>
> #Create autocorrelation plot to test for seasonality or other autocorrelation effects
> pdf(“FaileRequestsAutoCorrelation.pdf”)
> acfPlot(failedRequests[,2],lag.max=60)
> dev.off()
null device 1
Date FailedRequestsRatio
5 20090724 0.04544236
68 20090925 0.04203776
71 20090928 0.04075795
INSERT OVERWRITE LOCAL DIRECTORY ‘/mnt/hgfs/BigDataAnalytics/Project1/ApacheLog’
SELECT ∗ FROM statusgroupings;
/usr/bin/mahout trainlogistic
--input statusgroupings.csv
--output ./model
--target failedaccess
--categories 2
--predictors host
--types word
--features 50
--passes 20
--rate 50
[cloudera@localhost Project1]$ /usr/bin/mahout trainlogistic
> --input statusgroupings.csv
> --output ./model
> --target failedaccess
> --categories 2
> --predictors host identity user
> --types word
> --features 50
> --passes 20
> --rate 50
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP _CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.2.0-job.jar
50
failedaccess ∼ -3.757∗Intercept Term + -19.927∗host=112.200.11.174 + -18.807∗host=112.200.200.192 + …
…
Intercept Term -3.75710
host=112.200.11.174 -19.92741
host=112.200.200.192 -18.80668
host=112.202.3.173 -0.08841
host=112.202.43.173 1.03845
host=114.111.36.26 0.33822
host=114.127.246.36 3.49805
host=114.200.199.144 0.47374
host=114.39.145.56 -3.32575
…
13/05/07 18:28:30 INFO driver.MahoutDriver: Program took 70195 ms (Minutes: 1.1699166666666667)
--Unstack status codes into individual columns, create date fields, and show all other columns as well
CREATE VIEW IF NOT EXISTS unstacked_status_codes AS
SELECT host, identity, user, time, CASE substr(time,5,3)
WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’
END
AS month, substr(time,9,4) AS year, concat(substr(time,9,4), CASE substr(time,5,3) WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’
END) AS yearmonth, concat(CASE substr(time,5,3)
WHEN ‘Jan’ THEN ‘01’
WHEN ‘Feb’ THEN ‘02’
WHEN ‘Mar’ THEN ‘03’
WHEN ‘Apr’ THEN ‘04’
WHEN ‘May’ THEN ‘05’
WHEN ‘Jun’ THEN ‘06’
WHEN ‘Jul’ THEN ‘07’
WHEN ‘Aug’ THEN ‘08’
WHEN ‘Sep’ THEN ‘09’
WHEN ‘Oct’ THEN ‘10’
WHEN ‘Nov’ THEN ‘11’
WHEN ‘Dec’ THEN ‘12’
ELSE ‘00’ END,substr(time,2,2)) AS monthday, request, CASE status WHEN ‘100’ THEN 1 ELSE 0 END AS 100Continue, CASE status WHEN ‘101’ THEN 1 ELSE 0 END AS 101SwitchingProtocols, CASE status WHEN ‘102’ THEN 1 ELSE 0 END AS 102Processing, CASE status WHEN ‘200’ THEN 1 ELSE 0 END AS 200OK, CASE status WHEN ‘201’ THEN 1 ELSE 0 END AS 201Created, CASE status WHEN ‘202’ THEN 1 ELSE 0 END AS 202Accepted, CASE status WHEN ‘203’ THEN 1 ELSE 0 END AS 203NonAuthoritativeInformation, CASE status WHEN ‘204’ THEN 1 ELSE 0 END AS 204NoContent, CASE status WHEN ‘205’ THEN 1 ELSE 0 END AS 205ResetContent, CASE status WHEN ‘206’ THEN 1 ELSE 0 END AS 206PartialContent, CASE status WHEN ‘207’ THEN 1 ELSE 0 END AS 207MultiStatus, CASE status WHEN ‘208’ THEN 1 ELSE 0 END AS 208AlreadyReported, CASE status WHEN ‘226’ THEN 1 ELSE 0 END AS 226IMUsed, CASE status WHEN ‘300’ THEN 1 ELSE 0 END AS 300MultipleChoices, CASE status WHEN ‘301’ THEN 1 ELSE 0 END AS 301MovedPermanently, CASE status WHEN ‘302’ THEN 1 ELSE 0 END AS 302Found, CASE status WHEN ‘303’ THEN 1 ELSE 0 END AS 303SeeOther, CASE status WHEN ‘304’ THEN 1 ELSE 0 END AS 304NotModified, CASE status WHEN ‘305’ THEN 1 ELSE 0 END AS 305UseProxy, CASE status WHEN ‘306’ THEN 1 ELSE 0 END AS 306SwitchProxy, CASE status WHEN ‘307’ THEN 1 ELSE 0 END AS 307TemporaryRedirect, CASE status WHEN ‘308’ THEN 1 ELSE 0 END AS 308PermanentRedirect, CASE status WHEN ‘400’ THEN 1 ELSE 0 END AS 400BadRequest, CASE status WHEN ‘401’ THEN 1 ELSE 0 END AS 401Unauthorized, CASE status WHEN ‘402’ THEN 1 ELSE 0 END AS 402PaymentRequired, CASE status WHEN ‘403’ THEN 1 ELSE 0 END AS 403Forbidden, CASE status WHEN ‘404’ THEN 1 ELSE 0 END AS 404NotFound, CASE status WHEN ‘405’ THEN 1 ELSE 0 END AS 405MethodNotAllowed, CASE status WHEN ‘406’ THEN 1 ELSE 0 END AS 406NotAcceptable, CASE status WHEN ‘407’ THEN 1 ELSE 0 END AS 407ProxyAuthentication Required, CASE status WHEN ‘408’ THEN 1 ELSE 0 END AS 408RequestTimeout, CASE status WHEN ‘409’ THEN 1 ELSE 0 END AS 409Conflict, CASE status WHEN ‘410’ THEN 1 ELSE 0 END AS 410Gone, CASE status WHEN ‘411’ THEN 1 ELSE 0 END AS 411LengthRequired, CASE status WHEN ‘412’ THEN 1 ELSE 0 END AS 412PreconditionFailed, CASE status WHEN ‘413’ THEN 1 ELSE 0 END AS 413RequestEntityTooLarge, CASE status WHEN ‘414’ THEN 1 ELSE 0 END AS 414RequestUriTooLong, CASE status WHEN ‘415’ THEN 1 ELSE 0 END AS 415UnsupportedMediaType, CASE status WHEN ‘416’ THEN 1 ELSE 0 END AS 416RequestedRangeNotSatisfiable, CASE status WHEN ‘417’ THEN 1 ELSE 0 END AS 417ExpectationFailed, CASE
status WHEN ‘418’ THEN 1 ELSE 0 END AS 418ImATeapot, CASE status WHEN ‘420’ THEN 1 ELSE 0 END AS 420EnhanceYourCalm, CASE status WHEN ‘422’ THEN 1 ELSE 0 END AS 422UnprocessableEntity, CASE status WHEN ‘423’ THEN 1 ELSE 0 END AS 423Locked, CASE status WHEN ‘424’ THEN 1 ELSE 0 END AS 424FailedDependency, CASE status WHEN ‘424’ THEN 1 ELSE 0 END AS 424MethodFailure, CASE status WHEN ‘425’ THEN 1 ELSE 0 END AS 425UnorderedCollection, CASE status WHEN ‘426’ THEN 1 ELSE 0 END AS 426UpgradeRequired, CASE status WHEN ‘428’ THEN 1 ELSE 0 END AS 428PreconditionRequired, CASE status WHEN ‘429’ THEN 1 ELSE 0 END AS 429TooManyRequests, CASE status WHEN ‘431’ THEN 1 ELSE 0 END AS 431RequestHeaderFieldsTooLarge, CASE status WHEN ‘444’ THEN 1 ELSE 0 END AS 444NoResponse, CASE status WHEN ‘449’ THEN 1 ELSE 0 END AS 449RetryWith, CASE status WHEN ‘450’ THEN 1 ELSE 0 END AS 450BlockedByWindowsParentalControls, CASE status WHEN ‘451’ THEN 1 ELSE 0 END AS 451UnavailableForLegalReasonsOrRedirect, CASE status WHEN ‘494’ THEN 1 ELSE 0 END AS 494RequestHeaderTooLarge, CASE status WHEN ‘495’ THEN 1 ELSE 0 END AS 495CertError, CASE status WHEN ‘496’ THEN 1 ELSE 0 END AS 496NoCert, CASE status WHEN ‘497’ THEN 1 ELSE 0 END AS 497HttpToHttps, CASE status WHEN ‘499’ THEN 1 ELSE 0 END AS 499ClientClosedRequest, CASE status WHEN ‘500’ THEN 1 ELSE 0 END AS 500InternalServerError, CASE status WHEN ‘501’ THEN 1 ELSE 0 END AS 501NotImplemented, CASE status WHEN ‘502’ THEN 1 ELSE 0 END AS 502BadGateway, CASE status WHEN ‘503’ THEN 1 ELSE 0 END AS 503ServiceUnavailable, CASE status WHEN ‘504’ THEN 1 ELSE 0 END AS 504GatewayTimeout, CASE status WHEN ‘505’ THEN 1 ELSE 0 END AS 505HttpVersionNotSupported, CASE status WHEN ‘506’ THEN 1 ELSE 0 END AS 506VariantAlsoNegotiates, CASE status WHEN ‘507’ THEN 1 ELSE 0 END AS 507InsufficientStorage, CASE status WHEN ‘508’ THEN 1 ELSE 0 END AS 508LoopDetected, CASE status WHEN ‘509’ THEN 1 ELSE 0 END AS 509BandwidthLimitExceeded, CASE status WHEN ‘510’ THEN 1 ELSE 0 END AS 510NotExtended, CASE status WHEN ‘511’ THEN 1 ELSE 0 END AS 511NetworkAuthenticationRequired, CASE status WHEN ‘598’ THEN 1 ELSE 0 END AS 598NetworkReadTimeoutError, CASE status WHEN ‘599’ THEN 1 ELSE 0 END AS 599NetworkConnectTimeoutError, size, referer, agent
FROM apachelog;
OK
host identity user time month year yearmonth monthday
request 100continue 101switchingprotocols
102processing 200ok 201created 202accepted
203nonauthoritativeinformation 204nocontent
205resetcontent 206partialcontent 207multistatus
208alreadyreported 226imused 300multiplechoices
301movedpermanently 302found 303seeother 304notmodified
305useproxy 306switchproxy 307temporaryredirect
308permanentredirect 400badrequest 401unauthorized
402paymentrequired 403forbidden 404notfound
405methodnotallowed 406notacceptable
407proxyauthenticationrequired 408requesttimeout
409conflict 410gone 411lengthrequired
412preconditionfailed 413requestentitytoolarge
414requesturitoolong 415unsupportedmediatype
416requestedrangenotsatisfiable 417expectationfailed
418imateapot 420enhanceyourcalm 422unprocessableentity
423locked 424faileddependency 424methodfailure
425unorderedcollection 426upgraderequired
428preconditionrequired 429toomanyrequests
431requestheaderfieldstoolarge 444noresponse
449retrywith 450blockedbywindowsparentalcontrols
451unavailableforlegalreasonsorredirect
494requestheadertoolarge 495certerror 496nocert
497httptohttps 499clientclosedrequest
500internalservererror 501notimplemented 502badgateway
503serviceunavailable 504gatewaytimeout
505httpversionnotsupported 506variantalsonegotiates
507insufficientstorage 508loopdetected
509bandwidthlimitexceeded 510notextended
511networkauthenticationrequired
598networkreadtimeouterror
599networkconnecttimeouterrorsize referer agent
Time taken: 2.128 seconds
INSERT OVERWRITE LOCAL DIRECTORY ‘/mnt/hgfs/BigDataAnalytics/Project1/UnstackedStatusCodes’
SELECT count()
FROM unstacked_status_codes;
66.249.67.3 - - [20/Jul/2009:20:12:22 -0700] 07 2009 200907 0720 “GET /gallery/main.php?g2_controller=exif.SwitchDetailMode &g2_mode=detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_ itemId%3D15741&g2_returnName=photo HTTP/1.1” 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 5 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +>http://www.google.com/ bot.html)