Chapter 6

Security and Text Mining

Abstract

Massive amounts of unstructured data are being collected from online sources, such as e-mails, call center transcripts, wikis, online bulletin boards, blogs, tweets, Web pages, and so on. The R programming language contains a rich collection of packages and functions for analyzing unstructured text data. Functions include those for identifying unique words and their corresponding occurrence frequencies, a process known as tokenizing. Other functions provide a means for cleansing text data, such as removing white space and punctuation, converting to lowercase, and removing less meaningful words through a stop word list. Apache Hive functions also provide the means for tokenizing large amounts of text using the Hadoop MapReduce framework. Text data that have been reduced to a more manageable size using Hive functions can be further analyzed using the R language’s vast array of text processing and advanced analytical functions.

Keywords

Big data; CRAN; Hadoop; Hive; R; Stop word list; Text mining; Token; Unstructured text
Information in this chapter
▪ Scenarios and challenges in security analytics with text mining
▪ Use of text mining techniques to analyze and find patterns in unstructured data
▪ Step by step text mining example in R
▪ Other applicable security areas and scenarios

Scenarios and Challenges in Security Analytics with Text Mining

Massive amounts of unstructured data are being collected from online sources, such as e-mails, call center transcripts, wikis, online bulletin boards, blogs, tweets, Web pages, and so on. Also, as noted in a previous chapter, large amounts of data are also being collected in semistructured form, such as log files containing information from servers and networks. Semistructured data sets are not quite as free form as the body of an e-mail, but are not as rigidly structured as tables and columns in a relational database. Text mining analysis is useful for both unstructured and semistructured textual data.
There are many ways that text mining can be used for security analytics. E-mails can be analyzed to discover patterns in words and phrases that may indicate a phishing attack. Call center recordings can be converted to text, which can then be analyzed for patterns and phrases that may indicate attempts to use a stolen identity, gain passwords to secure systems, or commit other fraudulent acts. Web sites can be scraped and analyzed to find trends in security-related themes, such as the latest botnet threats, malware, and other Internet hazards.
There has been a proliferation of new tools available to deal with the challenge of analyzing unstructured, textual data. While there are many commercial tools available, often at significant cost, some tools are free and open source. We will focus on open source tools in this chapter. This is not to say, of course, that the commercial tools offer less value. Many offer significant ease-of-use advantages and come with a wide array of analytical methods. In some cases, the benefits may outweigh the costs. Open source software tools, however, can be accessed by most readers regardless of budget constraints, and are useful for learning about text mining analysis methods in general.
Popular open source software for analyzing small- to moderate-sized bodies of text includes R, Python, and Weka. In the case of Big Data, popular tools used to mine for relationships in text include Hadoop/MapReduce, Mahout, Hive, and Pig, to name a few. Since R has a particularly comprehensive set of text mining tools available in a package called “tm,” we will mainly focus on R for this chapter. The “tm” package can be downloaded from the CRAN repository, at www.cran.r-project.org.

Use of Text Mining Techniques to Analyze and Find Patterns in Unstructured Data

The phrase “text mining” refers to the process of analyzing textual data to find hidden relationships between important themes. Regardless of the toolset or language used, certain methods are common to all text mining. This section provides a brief description of some of the more common text mining methods and data transformations. It is not meant to be comprehensive coverage of the topic; rather, these concepts cover some of the basics required to follow the examples provided later in this chapter.

Text Mining Basics

In order to analyze textual data with computers, it is necessary to convert it from text to numbers. The most basic means of accomplishing this is by counting the number of times that certain words appear within a document. The result is generally a table, or matrix of word frequencies.
A common way to represent word frequencies in a table is to have each column represent a single word that appears in any one of a collection of documents. This arrangement is known as a “document-term matrix.” It is also possible to transpose the matrix, so that each document name becomes a column header and each row represents a different word. This arrangement is called a “term-document matrix.” For our example later in this chapter, we will focus on the document-term matrix format, with each column representing a word.
Each unique word heading in a column can be referred to as a “term,” “token,” or simply “word,” interchangeably. This can be confusing to text mining newcomers, but just remember that “term” and “token” merely refer to a single representation of a word.
Each row in the document-term matrix table represents a single document. For instance, a row might represent a single blog with other rows in the table representing other blogs.
The numerical values within the body of the table represent the number of times each word appears in a given document. There may be many terms that appear in only one document or just a few documents. These frequency numbers can be further transformed to account for such things as differences in document sizes, reducing words to their root forms, and inversely weighting the frequencies by how commonly certain words appear within a given language.
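To make this concrete, here is a minimal sketch (an illustration using the “tm” package introduced later in this chapter, with two made-up example documents) that builds a small document-term matrix:
#A minimal illustration: build a document-term matrix from two tiny documents
library(tm)
docs <- Corpus(VectorSource(c("attack on the web site",
  "the site was hacked in the attack")))
dtmExample <- DocumentTermMatrix(docs)
inspect(dtmExample) #rows are documents, columns are terms, cells are counts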

Common Data Transformations for Text Mining

Text data is notoriously messy. Many of the common data transformations used in text mining do nothing more than make the data analyzable. For instance, text often contains extra characters that are generally meaningless for analyzing word frequencies. These characters, and even extra white space, all must be removed.
Some of the most frequently occurring words in any language are also meaningless for most kinds of analysis and must be removed. These words are collected in a “stop word list,” also known as a “stop word corpus.” This list commonly includes words such as “the,” “of,” “a,” “is,” “be,” and hundreds of others that, while frequently appearing in documents, are not very interesting for most text mining projects. Computers can compare the words that appear in documents to a stop word list. Only those words that do not appear in the stop word list will be included in the word frequency table.
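As a small illustration (a sketch using the “tm” package’s built-in English stop word list), filtering a vector of tokens against a stop word list might look like this:
#Sketch: keep only tokens that are not in the stop word list
library(tm)
tokens <- c("the", "hacker", "gained", "access", "to", "the", "database")
stopWordList <- stopwords("english") #built-in English stop word list
tokens[!tokens %in% stopWordList] #returns "hacker" "gained" "access" "database"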

Step by Step Text Mining Example in R

For our example, suppose we wish to know common themes in reported system vulnerabilities in the Web Hacking Incident Database (WHID), maintained by the Web Application Security Consortium at webappsec.org. This database is convenient for illustration purposes due to its moderate size and ease of data collection. The database Web site offers users the ability to download data in CSV format, making it easy to import into R or almost any other text mining software for analysis. The complete URL from which these sample data sets were retrieved is: https://www.google.com/fusiontables/DataSource?snapid=S195929w1ly.
The data consists of 19 columns. Following is a complete list of all the column headings.
▪ “EntryTitle”
▪ “WHIDID”
▪ “DateOccured”
▪ “AttackMethod”
▪ “ApplicationWeakness”
▪ “Outcome”
▪ “AttackedEntityField”
▪ “AttackedEntityGeography”
▪ “IncidentDescription”
▪ “MassAttack”
▪ “MassAttackName”
▪ “NumberOfSitesAffected”
▪ “Reference”
▪ “AttackSourceGeography”
▪ “AttackedSystemTechnology”
▪ “Cost”
▪ “ItemsLeaked”
▪ “NumberOfRecords”
▪ “AdditionalLink”
For the purpose of illustration, we will focus on two columns: “DateOccured” (sic) and “IncidentDescription.” These are columns 3 and 9, respectively. Note that, for the purpose of analysis, we removed the spaces from the column names. Spaces in column names can create problems with some analytical algorithms. If you wish to download your own data set to follow along with this analysis, you will need to remove the spaces so that the column names appear exactly as in the preceding list.
Most of our R code will be devoted to the IncidentDescription column, since that is where the bulk of the text exists. This column contains highly detailed narratives pertaining to each incident. Each description can be as much as a couple hundred words or more, providing more text to analyze than any other column in the data set. Text analysis could be performed on other columns as well, and might even reveal some interesting relationships between different column variables. However, by focusing on the “IncidentDescription” column, we will keep this analysis as simple as possible, to serve as an introduction to basic text mining methods and techniques.

R Code Walk-through

The first line in our R code example removes any objects that might already exist in your R environment. If you are running the code for the first time, this step is not necessary, but it will not hurt anything to include it, either. When developing code, you may find yourself modifying and rerunning code repeatedly. For this reason, you will want to be sure there are no unwanted remnant variable values or other objects remaining in your R environment before you run your code anew.

Initiating the Package Libraries and Importing the Data

rm(list=ls()) #Start with a clean slate: remove all objects
Next, we load all the libraries needed for our analysis. Recall that the “#” symbol denotes comments. The code is liberally commented to improve understanding as to what the code does.
#Load libraries
library(tm) #for text mining functions
library(corrplot) #for creating plots of correlation matrices
The “tm” package contains many more functions than we can cover in a single chapter. They are worth exploring, however. A full index of all the available functions within this package can be produced in R, with the following command.
#See tm package documentation and index of functions
library(help=tm)
Now, we can import the data from the CSV file, using R’s “read.csv” function. We set the header parameter to a value of TRUE, since the first row of the CSV file contains the headers. The next line of code then takes the imported data and converts it to a text corpus. The “Corpus” function, together with “VectorSource,” extracts the text data from the column we specify, which is “IncidentDescription” in this case. It also attaches metadata that is necessary for the text mining functions we will be applying later in the analysis.
rawData <- read.csv("DefaultWHIDView2.csv", header=TRUE)
data <- Corpus(VectorSource(rawData[,"IncidentDescription"]))

Text Data Cleansing

At this point, we are ready to begin applying data transformations that will clean up the text and put it into an analyzable format. Here, we use a variety of functions in the “tm” package. The “tm_map” function maps a specified transformation to the text data. This function allows us to specify a nested transformation function.
The first use of the “tm_map” function calls the “stripWhitespace” function from within it. As the function’s name implies, this combination will strip all white space from the text.
The second tm_map function calls a data cleansing function called “stemDocument.” This combination will reduce all words to their root forms. For example, the terms “reading” and “reads” would be converted to the stem word “read,” which can also be referred to as a root word. This can be a handy transformation, depending upon the objective of your analysis. However, the transformation can also create problems. Sometimes the extra characters after a stem word are not superfluous after all; in some cases, they completely change the meaning of a word. You should experiment with your own text data to see how stemming affects your results. In the case of the data for this example, the results appear valid for the most part without the use of stemming. Thus, the stemming step is commented out in the code, but is left for reference purposes.
The next line uses the “tolower” function. This also is self-explanatory, as it simply changes all letters to lowercase.
At this point in the code, we assemble a list of stopwords, using the “stopwords” function in the “tm” package. Notice the parameter, “english,” which specifies a list of stopwords for the English language. Other languages are supported by this function as well. See the function’s built-in help for other language parameters that can be used. Recall that help for a function can be called by simply typing a question mark in front of the function name, such as “?stopwords.” Running the stopwords function returns a vector of stopwords, which we assign to a variable we call “stopWords.” On the next line of code, the “stopWords” variable provides the word list to the “removeWords” function, which is embedded within the tm_map function. This removes all words that appear in the “stopWords” list from the text corpus. (Note that corpus just means “body.” So, “text corpus” is the same as “text body.”)
The last two lines in the code section remove all punctuation and numerals from the text. They utilize the functions, “removePunctuation” and “removeNumbers,” respectively.
#Cleanup text
data2 = tm_map(data, stripWhitespace)
#data2 = tm_map(data2, stemDocument)
data2 = tm_map(data2, tolower)
stopWords = c(stopwords("english"))
data2 = tm_map(data2, removeWords, stopWords)
data2 = tm_map(data2, removePunctuation)
data2 = tm_map(data2, removeNumbers)
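If you would like to preview what stemming would do before uncommenting that line, a quick check with the “wordStem” function from the “SnowballC” package (an assumption here; it is a common stemming backend for “tm”) looks like this:
#Preview stemming on a few sample words (assumes the SnowballC package is installed)
library(SnowballC)
wordStem(c("reading", "reads", "injection", "injected"))
#"reading" and "reads" reduce to "read"; "injection" and "injected" reduce to "inject"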
Let’s review a sample of what all this data cleansing has done to our text. To do this, we use the “inspect” function. Notice how the results maintain the individual documents and their index numbers. Only the first two documents are shown here to save space. The results also begin with a brief metadata description. Metadata are sometimes described as simply being “data about the data.”
> inspect(data2[1:5])
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data frame.
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
department health hospitals spokeswoman lisa faust said bureau emergency medical services personnel discovered database breach unauthorized entry gave hacker access individuals name personal information including social security numbers dont know whether hacker able access information faust said computer screen displayed message hacked faust said since dont know one way sent notices people s potential information compromised wasc whid note portal login page httpsemsophdhhlagovemsloginasp looks vulnerable sql injection
[[2]]
boingboingnet popular blog directory wonderful things hacked home page replaced message containing vulgar language pictures site pulled administrators shortly attack suspected executed via sql injection techcrunch reports

Creating a Document-Term Matrix

Now that we have cleaned up the text, we are ready to convert it into a word frequency matrix. We will use the document-term matrix format, which lists the words, or tokens, as column headers, and each document as a separate row. The number of times each token appears within a document is given at the intersection between a row number and a token column heading.
We use the “DocumentTermMatrix” function to create the matrix, and assign the results to a variable we are calling, “dtm.” Lastly, we inspect the first five rows and the first five columns using the “inspect” function. Notice how the results give the metadata for the matrix. The metadata description says that there are no nonsparse entries. This means that all five of the returned terms have zero entries in the five documents shown. If we wanted to see a value returned for these, we would have to view more than just five entries.
> #Make a word frequency matrix, with documents as rows, and terms as columns
> dtm = DocumentTermMatrix(data2)
> inspect(dtm[1:5,1:5])
A document-term matrix (5 documents, 5 terms)
Non-/sparse entries: 0/25
Sparsity: 100%
Maximal term length: 10
Weighting: term frequency (tf)
Terms
Docs aapl abandoned abc abcnewscom abell
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0

Removing Sparse Terms

It is common in text mining to see many terms that appear in only a few documents, or sometimes even one document. These are called sparse entries. In the next code sample, we will remove sparse terms from our document-term matrix using the “removeSparseTerms” function. This function has a parameter for setting a percentage threshold. Any terms whose proportion of zero entries exceeds this threshold will be removed. In this case, we set the threshold to “0.90,” which is 90%. Note that you must type the percentage as a decimal. You should experiment with different thresholds to see what produces the best results for your analysis. If the threshold is set too high, your results may contain too many isolated terms and will not reveal any significant patterns. However, set the threshold too low and you could end up removing words that have significant meaning. Lastly, we inspect the first five rows and columns and can see some frequency numbers appearing, now that many of the sparsest terms have been removed.
> #Remove sparse terms above a given percentage of sparse (i.e., zero) occurrences
> dtm = removeSparseTerms(dtm, 0.90)
> inspect(dtm[1:5,1:5])
A document-term matrix (5 documents, 5 terms)
Non-/sparse entries: 6/19
Sparsity: 76%
Maximal term length: 6
Weighting: term frequency (tf)
Terms
Docs access attack can data hacked
1 2 0 0 0 1
2 0 1 0 0 1
3 0 0 0 0 0
4 0 2 0 0 0
5 0 1 0 0 0
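One quick way to experiment (a sketch; “dtmFull” is just an illustrative name) is to rebuild the full matrix and compare how many terms survive at different thresholds:
#Sketch: compare term counts at several sparsity thresholds
dtmFull <- DocumentTermMatrix(data2) #rebuild, since dtm was overwritten above
ncol(removeSparseTerms(dtmFull, 0.99)) #a high threshold keeps many sparse terms
ncol(removeSparseTerms(dtmFull, 0.90)) #the threshold used in this example
ncol(removeSparseTerms(dtmFull, 0.80)) #a low threshold keeps only widespread terms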

Data Profiling with Summary Statistics

Now that the data set has been cleansed and transformed, and we have a document-term matrix in hand, we are ready to begin some analysis. One of the first things you should be in the habit of doing with any kind of data analysis is to look at a data profile. A data profile contains descriptive statistics that help give you a feel for your data. It is a great way of giving your data a reasonableness check and looking for possible anomalies. R makes it easy to get descriptive statistics, using a standard function called “summary.” Descriptive statistics produced by this function include the minimum, maximum, mean, and median, as well as the first and third quartiles. These can really help you to understand the shape of the distribution of your data.
Notice that we are embedding the “inspect” function within the “summary” function. This is necessary, since the document-term matrix, “dtm,” uses text corpus data. As noted earlier, a text corpus includes metadata required for a number of text analytical functions in the “tm” package. The “inspect” function separates the plain frequency data from the metadata. In other words, it returns a data frame table type, which is what is required of the summary function. Any time you may want to use a standard R function, other than a function within the “tm” package, you will likely need to use the “inspect” function in a similar manner, since most standard R functions require a data frame format, rather than a text corpus format. Notice also, that within the “inspect” function, we have used R’s indexing functionality, by specifying within square brackets that we only want the first three columns returned.
The result indicates that these first three terms are still fairly sparse. For instance, the word “access” appears an average of only 0.14 times per document, with a maximum frequency of four occurrences in a single document.
summary(inspect(dtm[,1:3]))
access attack can
Min.:0.0000 Min.:0.0000 Min.:0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median:0.0000 Median:0.0000 Median:0.0000
Mean:0.1403 Mean:0.3073 Mean:0.1332
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max.:4.0000 Max.:8.0000 Max.:4.0000

Finding Commonly Occurring Terms

One of the more common things we want to do in text analysis is to determine which terms appear most frequently. The “findFreqTerms” function within the “tm” package is useful for this purpose. This function has a parameter that sets a minimum frequency threshold. Only words that occur at least that many times will be returned by the function. This example sets the threshold at 100, since this reduces the list to something more printable in a book.
Notice in the output from this function that there is an odd word appearing as “ppadditional.” This is an anomaly resulting from HTML tags that crept into the CSV file, in this case, a pair of paragraph tags denoted as “<p>” in the text. When you find such anomalies, you will need to determine whether they affect your analysis enough to be worth the extra effort of cleaning them up. In this case, as we will see later, the impact of this anomaly is inconsequential and easy to spot.
> #Find terms that occur at least n times
> findFreqTerms(dtm, 100)
[1] “attack” “hacked” “hacker” “hackers”
[5] “information” “informationp” “injection” “one”
[9] “ppadditional” “security” “site” “sites”
[13] “sql” “used” “vulnerability” “web”
[17] “website”

Word Associations

Often, we may be interested in knowing whether there are any terms associated with a particular word or theme. In other words, we want to know which terms in our collection of documents may be correlated with a given term. The “tm” package’s “findAssocs” function can help us with this. In the example below, we give the “findAssocs” function three parameters: (1) the object containing the document-term matrix, or “dtm” in this case; (2) the word for which we want to find associated terms, which is the word “attack” in this example; and (3) the correlation threshold, also referred to as an r-value, which is 0.1 in this case. There are no hard and fast rules for setting the correlation threshold. This one is set very low to include more associated words in the result. However, if your document set has many associated words in it, you might set this threshold much higher to reduce the number of words returned by the function. It is a subjective, iterative process to find the balance you need for your analysis.
In the output for this function, notice that eight words were returned that at least met the correlation threshold. Their respective correlation values are shown in a vector of numbers. In this case, the word “attack” is most closely associated with the word “injection.” This, no doubt, reflects the use of the phrase “injection attack,” which refers to a means of injecting malicious commands into a system, such as by SQL injection into a database. Notice that “sql” is the second most associated term. Next in line among the commonly associated terms are “sites” and “service.” This most likely refers to reports of SQL injection against Web sites and Web services.
> findAssocs(dtm, "attack", 0.1)
attack
injection 0.30
sql 0.29
sites 0.24
service 0.23
access 0.16
incident 0.12
new 0.10
web 0.10
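If you want to examine several themes at once, the same call can be looped over a vector of terms (a sketch; the term list and threshold here are illustrative choices, not from the original analysis):
#Sketch: look up associations for several terms of interest in one pass
termsOfInterest <- c("attack", "sql", "hacked")
lapply(termsOfInterest, function(w) findAssocs(dtm, w, 0.2))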

Transposing the Term Matrix

At this point, it might be interesting to see whether our results for the above word association exercise might have changed, had we elected to transpose our matrix to a term document matrix, instead of a document-term matrix. In other words, what if the columns represented documents instead of words and the rows represented words instead of documents? Notice that we store the resulting matrix in a variable called “tdm” for “term-document matrix” to compare against the variable “dtm” for “document-term matrix.”
Running the code below, we can see that it makes no difference in this case. All the preceding steps for the document-term matrix were repeated here, only the data are transposed. If we had far more documents than words, and we wished to focus on the words, we might find it more convenient to transpose our matrix in this way.
> #Build the term-document matrix (terms as rows) from the cleaned corpus
> tdm = TermDocumentMatrix(data2)
> inspect(tdm[1:5,1:5])
A term-document matrix (5 terms, 5 documents)
Non-/sparse entries: 0/25
Sparsity: 100%
Maximal term length: 10
Weighting: term frequency (tf)
Docs
Terms 1 2 3 4 5
aapl 0 0 0 0 0
abandoned 0 0 0 0 0
abc 0 0 0 0 0
abcnewscom 0 0 0 0 0
abell 0 0 0 0 0
>
> tdm = removeSparseTerms(tdm, 0.90)
> inspect(tdm[1:5,1:5])
A term-document matrix (5 terms, 5 documents)
Non-/sparse entries: 6/19
Sparsity: 76%
Maximal term length: 6
Weighting: term frequency (tf)
Docs
Terms 1 2 3 4 5
access 2 0 0 0 0
attack 0 1 0 2 1
can 0 0 0 0 0
data 0 0 0 0 0
hacked 1 1 0 0 0
>
> findAssocs(tdm, "attack", 0.1)
attack
injection 0.30
sql 0.29
sites 0.24
service 0.23
access 0.16
incident 0.12
new 0.10
web 0.10
As we see from the above, the most associated terms, along with their respective correlations, are exactly the same in this transposed matrix as with the original document-term matrix.

Trends over Time

Earlier, we said that we wanted to see whether there were any patterns in the text over time. We will now convert the “DateOccured” (sic) column from a string format to a date format, from which we can extract year and month elements. For this example, we will analyze the trends in the usage of key terms by year.
For this conversion, we use the standard R function “as.Date.” We use square bracket indexing to identify the date column and use the actual name of the column “DateOccured” (sic). Alternatively, we could have used a number for the index instead of the name. That is, we could have shown it as rawData[,3], instead of rawData[,”DateOccured”]. Also, the empty space before the comma within the index brackets refers to the row index, and it means that we want to return all rows. The “as.Date” function also requires a date format string to specify the native format in which the date appears within the data. In this case, the date is in numeric format, with one to two digits given for the month and day, and four digits for the year, with each element separated by a forward slash. For instance, a typical date in the source data appears in the structure, 10/12/2010. We specify the format as “%m/%d/%Y,” where the lower case “m” coupled with the percent sign means a single or double digit month, and likewise with %d for the day. The capital “Y” in %Y refers to a four digit year.
Once we have our dates converted to a date object, stored in the variable “dateVector,” we can use the “month” function to extract the months from the date vector and the “year” function to extract the years. Note that these two helper functions are not part of base R; they are provided by add-on packages such as “lubridate,” which must be loaded first.
> #Convert to date format
> library(lubridate) #assumed source of the month() and year() helpers
> dateVector <- as.Date(rawData[,"DateOccured"], "%m/%d/%Y")
> mnth <- month(dateVector)
> yr <- year(dateVector)
Next, we create a data frame by using the standard R “data.frame” function. We add the vectors we created for date, month, and year, to the document-term matrix to create a single table. Notice again, that we used the “inspect” function to retrieve a data frame from within the text corpus object, “dtm.”
> dtmAndDates <- data.frame(dateVector, mnth,yr,inspect(dtm))
A document-term matrix (563 documents, 30 terms)
Non-/sparse entries: 3039/13851
Sparsity: 82%
Maximal term length: 13
Weighting: term frequency (tf)
Terms
Docs access attack can data hacked hacker hackers…
1 2 0 0 0 1 2 0 0 3
2 0 1 0 0 1 0 0 0 0
3 0 0 0 0 0 2 0 0 0
4 0 2 0 0 0 0 0 0 0

> head(dtmAndDates)
dateVector mnth yr access attack can data hacked hacker
1 2010-09-17  9 2010 2 0 0 0 1 2
2 2010-10-27 10 2010 0 1 0 0 1 0
3 2010-10-25 10 2010 0 0 0 0 0 2
4 2010-10-28 10 2010 0 2 0 0 0 0
5 2010-10-27 10 2010 0 1 0 0 0 0
6 2010-10-25 10 2010 0 1 2 0 0 0
We see, however, that the results for the daily and even the monthly trends are rather sparse. So, we will make a new data frame in the same manner that we created the one above, but with data aggregated by year. This will enable us to analyze annual trends in key terms. First, we build a data frame that joins the “yr” vector we created with the document-term matrix, “dtm,” and put the resulting data frame in an object we call “dtmAndYr.”
> dtmAndYr <- data.frame(yr,inspect(dtm))
A document-term matrix (563 documents, 30 terms)
Non-/sparse entries: 3039/13851
Sparsity: 82%
Maximal term length: 13
Weighting: term frequency (tf)
Terms
Docs access attack can data hacked hacker hackers incident…
1 2 0 0 0 1 2 0 0 3
2 0 1 0 0 1 0 0 0 0
3 0 0 0 0 0 2 0 0 0
4 0 2 0 0 0 0 0 0 0
5 0 1 0 0 0 0 1 0 0
Next, we use the standard R function, “aggregate,” to sum all the frequencies for each word by year. Notice that we drop the first column from the “dtmAndYr” data frame by indicating a -1 in the column index position within the square brackets. The first column contains the year value, which would be pointless to sum within the aggregate function, so we remove it from the values being summed. However, the year column is used for the “by” parameter in the “aggregate” function, as we are summing by year. This point might be confusing, so perhaps we should restate it this way: we are summing by the year, not summing the year itself.
The “by” parameter references the “yr” column by using the dollar sign after the “dtmAndYr” data frame reference, as “dtmAndYr$yr.” This is a standard way that R allows for referencing a single column within a data frame. We could have alternatively referenced this column by using square bracket indexing, but we use this method here just to show another way it can be done.
Notice also that the “list” function is used to convert the “yr” column to a list format. Otherwise, it would be extracted into a data frame format. R can be very finicky with its unusual data types. In this case the aggregate function requires that the variable that is referenced in its “by” parameter be formatted as a single list.
Lastly, the parentheses that surround the entire line are used to tell R to show the output. Otherwise, without the parentheses, the output would be assigned to the “sumByYear” variable, but would not show in R’s output. This works for many functions, but not all. For instance, it will not work for the “rownames” function that we will use in the next section. We will show the output by simply stating the variable name on a line by itself, in that case.
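For instance, as a minimal illustration of the parentheses idiom:
x <- 5 #assigns silently; nothing is printed
(x <- 5) #the surrounding parentheses make R print the assigned value: [1] 5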
The resulting summations by year indicate that word frequencies have been going up for each year since the WHID began collecting data. This is likely due to a general increase in reporting of attacks overall. The data shown here are truncated to save space. However, the same pattern can be seen to exist for nearly all of the terms. A few exceptions can be noted, however. Although not shown here, the term “xss,” a common abbreviation for cross-site scripting attacks, peaked in 2006 and has come down significantly since that time. This could be because Web sites have tightened their code considerably in more recent years to prevent these kinds of attacks, so cross-site scripting reports have moderated and leveled off.
> (sumByYear <- aggregate(dtmAndYr[,-1], by = list(dtmAndYr$yr), sum))
Group.1 access attack can data hacked hacker…xss
1 2001  0  0  0  0  0  0… 0
2 2004  0  0  0  0  0  0… 0
3 2005  2  5  0  5 10  4… 9
4 2006  4 12  8  6  6  9…39
5 2007  8 23  8 17 13 24…14
6 2008 10 27  7 14 22 28…11
7 2009  7 27 21 19 10 26… 9
8 2010 42 79 31 17 64 57…11

Correlation Analysis of Time Series Trends

We might be interested in doing a correlation analysis on these annual trends to see whether any particular trends are related. Including the years column in the correlation would not be meaningful, since it is a grouping label rather than a term frequency. So, we will remove the years column and turn the year labels into row names instead. The correlation function we will be using will not try to correlate years if they exist only as row names.
We use the “rownames” function to append the years to the data as row headers. Notice the unusual use of this function on the left-hand side of the assignment operator. When the “rownames” function is used by itself, or on the right-hand side of the assignment operator, it returns all the rownames. However, when on the left side of the assignment operator, the “rownames” function will push new row names into the table, as provided on the right-hand side of the assignment operator.
> selTermsByYr <- sumByYear[,-1]
> rownames(selTermsByYr) <- sumByYear[,1]
> selTermsByYr
access attack can data hacked hacker hackers incident…
2001  0  0  0  0  0  0  0  0
2004  0  0  0  0  0  0  0  0
2005  2  5  0  5 10  4  3  0
2006  4 12  8  6  6  9  7  6
2007  8 23  8 17 13 24  7 16
2008 10 27  7 14 22 28 17 30
2009  7 27 21 19 10 26 18 35
2010 42 79 31 17 64 57 70 10
We can now create a correlation matrix on the data by using the “cor” function. Notice again, the use of the surrounding parentheses, so that the output will be shown in the R window, in addition to assigning the resulting matrix to the “corData” variable.
A small sample of the correlation matrix output is shown here. Each value in the matrix represents a correlation coefficient. The closer a value is to 1.0, the more strongly the two terms’ trends are related. The closer a value is to -1.0, the more often one term tends to be absent when the other term is present. The 1.0 values on the diagonal indicate, as we would expect, that each value is perfectly correlated with itself. The matrix is symmetrical, with values mirroring each other on opposite sides of the diagonal. In this sample view, the highest off-diagonal correlation is 0.98, indicating that the trend for the term “attack” is strongly related to the trend for the term “access.” This could be an indication of an increase in access attacks, such as through stolen passwords, social hacking in call centers, and so on. Further research would need to be conducted to confirm such hypotheses, of course.
> (corData <- cor(selTermsByYr))
access attack can data…
access 1.00000000 0.98036383 0.8619558 0.5848456
attack 0.98036383 1.00000000 0.9226145 0.7270269
can    0.86195584 0.92261449 1.0000000 0.7719258
data   0.58484558 0.72702694 0.7719258 1.0000000
The entire output from the trends correlation analysis is not only much too large to include here, it would be hard to compare all of the many term correlations even if we could show them. In cases where there are many correlations to examine, it can be helpful to convert the correlation matrix into a graphical image for visualization.
In the code sample that follows, we use the “png” function to start the png image device in R. Then, we run the “corrplot” function. This function generates the visualization, which is converted into an image file by the png image device we just started. Lastly, we turn off the image device, with the function, “dev.off.”
png("Figure1.png")
corrplot(corData, method="ellipse")
dev.off()
The resulting correlation plot can be seen in Figure 6.1. This correlation plot shows the strength of each correlation in two ways: ellipse width and color shading. The narrower an ellipse, and the more it begins to appear as a straight line, the stronger the correlation. Also, the darker blue shadings indicate a stronger positive relationship, while the darker pink shadings indicate a stronger negative correlation. Remember, a negative correlation is not the same as having no correlation. A zero correlation value indicates no correlation. A negative correlation approaching -1 would indicate a strong correlation in which the two variables tend to move in opposite directions. In this case, a negative correlation suggests that when one term is present, the other term tends to be absent.
In Figure 6.1, we can see that some of the strongest correlations occur where terms are commonly used together in a phrase, such as “SQL injection.” One interesting example is the strong positive correlation between the words “attack” and “service.” Web services are growing in usage and are increasingly targets of attack. Let’s see what terms may be associated with the term, “service.”
Figure 6.1 Visualization of correlation matrix of term frequency trends.
> #Find associated terms with the term, "service", and a correlation of at least r
> findAssocs(dtm, "service", 0.1)
service
attack  0.23
hackers 0.12
online  0.11
using   0.10
Here we see that terms more commonly associated with the term, “service,” include “hackers,” “online,” and “using,” all suggesting usage of online services that got hacked.

Creating a Terms Dictionary

Sometimes, it can be helpful to focus analysis on a select subset of terms that are relevant to the analysis. This can be accomplished through the use of a terms dictionary. In this example, we will use the “dictionary” parameter within the “DocumentTermMatrix” function. You will recall that this function was used to build our initial document-term matrix. We will again use this function, but will limit the terms in it by specifying the dictionary parameter to include only the words “attack,” “security,” “site,” and “web.” These terms are somewhat arbitrarily chosen. However, they were among the top terms remaining when another analysis, not shown, was done on data that had been reduced with a higher sparsity threshold. Notice in the code that the dictionary parameter is converted to a list format by using the “list” function, similar to how we converted the years column to a list in an earlier processing step.
> #Create a dictionary: a subset of terms
> d = inspect(DocumentTermMatrix(data2, list(dictionary = c("attack", "security", "site", "web"))))
A document-term matrix (563 documents, 4 terms)
Non-/sparse entries: 756/1496
Sparsity: 66%
Maximal term length: 8
Weighting: term frequency (tf)
Terms
Docs attack security site web
1 0 1 0 0
2 1 0 1 0
3 0 0 0 0
4 2 0 1 0
5 1 1 0 0
We can create a correlation matrix on just these terms, in the same manner as we produced a correlation matrix on the trended data. The output is shown below. The only terms that are highly correlated in our dictionary set are “web” and “site.” Of course, these terms are commonly used together to form the phrase, “Web site.”
> #Correlation matrix
> cor(d) #correlation matrix of dictionary terms only
attack security site web
attack   1.00000000 0.06779848 0.05464950 0.1011398
security 0.06779848 1.00000000 0.06903378 0.1774329
site     0.05464950 0.06903378 1.00000000 0.5582861
web      0.10113977 0.17743286 0.55828610 1.0000000
We can create a scatterplot graph to visually inspect any correlations that may appear strong. Sometimes correlations appear due to anomalies, such as outliers. A scatterplot can help us identify these situations. We would not necessarily expect that to be the case in our correlation between “Web” and “site,” but we will plot these terms just to demonstrate.
As before, we are using the “png” function to save our graph as a png image. There are similar functions for other image file types, including “jpeg” and “pdf.” They are all set up in the same manner, and all require use of the “dev.off” function to turn the device off after the file is created.
Within the “plot” function, we use another function called “jitter.” The “jitter” function adds random noise to each frequency value, so that we can see how densely each of the frequency values occurs. Otherwise, without adding jitter, every document occurrence of a given frequency value would appear as a single point on the graph. For example, there are many documents where the term “attack” appears only twice. Without adding jitter, all of these documents would appear as a single point at that location. Jitter moves, or unstacks, these points just enough so that we can see how many documents share the same value.
You will notice a tilde between the variables in the “plot” function. This tilde is also commonly used in various sorts of regression models, as we will see in the regression function that creates the diagonal line in this graph. You can mentally replace the tilde with the word “by.” So, if you are trying to see whether a variable, “x,” predicts a variable, “y,” you would say “y by x,” which would be written as “y ~ x” in the function. Here, we are saying “site by web.” Replacing the tilde with a comma would reverse the order of the axes. This could be done, but the result would not be consistent with the order of the variables we used in the regression function, “lm,” that follows. More will be said about the “lm” function in a moment.
We can see in the scatterplot output in Figure 6.2 that co-occurrences of “web” and “site” are indeed correlated, as indicated by the general diagonal shape of all the points, from the bottom left corner toward the upper right corner. We can also see from the jitter that higher frequencies of co-occurrence within the same document are fairly rare, while lower frequencies of co-occurrence are quite common. We can also see that there are many cases where the two terms occur independently as well. These findings do not yield any great epiphanies, but more important is what the graph does not show: there is no evidence of outliers or anomalies that might skew our results.
The “lm” function name stands for “linear model” and is the most common way of creating a regression model in R. We embedded the “lm” function within a function called “abline,” which adds a line to a plot. Here, we have added a regression line to the plot, to more clearly show that there is a positive relationship between the terms “Web” and “site.”
Figure 6.2 Strong evidence of frequent co-occurrence of the terms, “web,” and “site,” with little evidence of any unexpected anomalies.
> #Create a scatterplot comparing "site" and "web"
> png("Figure2.png")
> plot(jitter(d[,"site"])~jitter(d[,"web"]), main="Scatterplot: 'site' vs 'web'", xlab="term: web", ylab="term: site")
> abline(lm(d[,"site"]~d[,"web"]))
> dev.off()

Cluster Analysis

Suppose we wish to determine which reported incidents are related and group them together. Such groupings may reveal patterns that might be difficult to spot otherwise. This kind of analysis is referred to as “cluster analysis.” Two common types of cluster analysis are “hierarchical clustering” and “k-means” clustering.
Hierarchical clustering groups data by successively merging or splitting groups so as to maximize the differences between them. K-means clustering partitions the data elements into a predefined number of groupings, a parameter commonly referred to as “k,” and minimizes the sum of squared distances from each data element to its group’s center point.
K-means is very useful for assigning each data element a group membership value. In other words, we can create a new column in our data, which provides a group number that each element is placed within. A common problem with k-means, however, is figuring out how many groupings, “k,” to specify in advance. Hierarchical clustering is useful for this purpose and can be used in combination with k-means. Once we have decided how many cluster groupings make sense in our data through the use of hierarchical clustering, we can perform k-means clustering to assign the groupings. While it is possible to also assign groupings using hierarchical clustering, k-means is less computationally intensive for this process and is often preferred for larger data sets.
The following code sample generates a hierarchical cluster, using the “hclust” function. Notice that we used the “dist” function within the “hclust” function, in order to calculate the differences between the data elements. The result of running “hclust” is stored in a variable, “hClust.” This variable is then plotted using R’s generic “plot” function.
As an aside, you may have noticed that the “plot” function adjusts its output and graph types depending upon the data that it receives. The “plot” function automatically knows, in this case, that it needs to plot a hierarchical cluster as a tree type of diagram, known as a “dendrogram.”
> #Hierarchical cluster analysis
> png("Figure3.png")
> hClust <- hclust(dist(dtm))
> plot(hClust, labels=FALSE)
> dev.off()
We can see in the resulting dendrogram of the hierarchical cluster in Figure 6.3 that there are many levels of groupings. We want to try to choose a level that produces groupings with a fairly good representation of data elements in each. We see that the second level at the top of the tree is split into three groupings. The first grouping, on the far left, results in very sparse membership if you follow that fork down to its end points. These end points are known as “leaf nodes.” Each leaf node relates to an individual data element, or row. The next grouping to the right at the second level, indicated by the next vertical line at that level, has more data elements within it. The third grouping, however, has by far the most data elements. Choosing three clusters appears to offer a relatively good balance between interpretability and group representation.
The choice of how many clusters to use is highly subjective, and is a balance between gaining good representation of data points in each cluster and having enough clusters to find meaningful relationships in the data. Basically, too few clusters may lack meaningful detail, while too many clusters can be impossible to interpret.
Figure 6.3 Dendrogram of the document-term matrix.
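As a quick check on this balance (a sketch, not part of the original listing), the “cutree” function can cut the hierarchical tree into a chosen number of groups so you can count how many documents fall into each:
#Sketch: cut the tree into three groups and count the members of each
table(cutree(hClust, k=3))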
Based on our hierarchical cluster analysis, we will choose a “k” number of groupings equal to three for our k-means cluster analysis. We use the “kmeans” function on our document-term matrix, “dtm,” and set the “centers” parameter to three. We then store the results in a variable we call “kmClust.” We can see the results stored in “kmClust” either by entering “kmClust” on a line by itself or by using the “print” function. Either method does the same thing, but some feel that using the “print” function is more explicit and readable in the code. This time, we use the “print” function just to demonstrate it. Notice that the results indicate how many members are within each cluster. The results are a bit different from those of hierarchical clustering, since the methods differ. However, hierarchical clustering can still provide a good estimate for the purpose of guessing a decent value for “k.” There is one cluster that has far fewer members than the other two, although it appears to have more members than the smallest cluster given by the hierarchical method. The representation of data elements in each cluster looks decent overall, though.
Only a sample of the output is shown, as the cluster means are given for each and every term in the document-term matrix in the actual output. Also, every incident row in the data is assigned a cluster membership value. Again, only a portion of the membership output is shown, due to size constraints on the page. Lastly, the output includes some diagnostic statistics regarding the sum of squares fit, and it lists the available values that can be extracted from the output. See the help for the “kmeans” function for additional details on these.
> kmClust <- kmeans(dtm,centers=3)
> print(kmClust)
K-means clustering with 3 clusters of sizes 216, 264, 83
Cluster means:
access attack can data hacked…
1 0.07407407 0.1111111 0.06018519 0.15740741 0.1203704
2 0.20454545 0.3977273 0.14393939 0.07954545 0.2727273
3 0.10843373 0.5301205 0.28915663 0.38554217 0.3614458

Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14…
2 2 2 2 2 2 2 2 2  2  2  2  2  2

Within cluster sum of squares by cluster:
[1] 1194.569 2160.356 1744.482
(between_SS/total_SS = 16.3%)
Available components:
[1] “cluster” “centers” “totss” “withinss” “tot.withinss”
[6] “betweenss” “size”
We can now append the cluster memberships as an additional column to our document-term matrix. The sample code below creates a new data frame that combines the “dtm” document-term matrix with the cluster memberships, by calling the “cluster” component from the cluster results stored in the “kmClust” object we created. Notice how we use the dollar sign in R to call a specific component from the list of components available in the output. Again, the list of available components is shown at the end of the previous k-means output that we discussed. You can see “cluster” in that list.
> #Assign cluster membership to the original data
> dtmWithClust <- data.frame(inspect(dtm), kmClust$cluster)
Now, we can print the “dtmWithClust” data frame object we just created, and see how the additional last column containing the cluster memberships was appended. In this case, the first five documents all belong to cluster number three.
> print(dtmWithClust)
Docs access attack can data hacked…kmClust.cluster
1 2 0 0 0 1…3
2 0 1 0 0 1…3
3 0 0 0 0 0…3
4 0 2 0 0 0…3
5 0 1 0 0 0…3
With the cluster memberships assigned, we can use this information to build a predictive model to categorize new hacking reports that are entered into the Web Hacking Incident Database. Of course, we may want to know how to generalize the characteristics that tend to define each cluster. That can be a bit of a challenge, with each word in our document-term matrix representing a predictive variable. Methods that can be helpful here include decision tree classification analysis using the “rpart” function, or the “randomForest” function, which combines the results of many decision tree classifiers and is found within the “randomForest” package. A detailed discussion of decision tree analysis gets into another realm of data mining and is beyond the scope of this chapter. However, a brief sketch of the “rpart” approach is shown next, followed by a random forest example.
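As a hedged sketch of the single-tree alternative mentioned above (assuming the “rpart” package is installed), one could treat the cluster number as a categorical response:
#Sketch: a single classification tree on the cluster memberships
library(rpart)
treeClust <- rpart(factor(kmClust.cluster) ~ ., data=dtmWithClust, method="class")
print(treeClust) #the split rules suggest which terms characterize each cluster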
Before running the random forest example, you will need to be sure that you have the “randomForest” package installed and that the library is loaded. We show these steps first. Notice in specifying the model in the “randomForest” function that we use the tilde along with a period. Again, the tilde can be read as “by,” and the period as “all other variables.” Thus, our model specification of “kmClust.cluster ~ .” can be read as “classify the cluster variable by all other variables.” The “importance=TRUE” parameter specifies that the output should include a list of all the predictive variables and their relative importance as classifiers. In this case, the output shows all of the words, along with statistics giving each word’s relative importance. The larger the number, the more important the word is in characterizing each incident within a cluster.
> install.packages("randomForest", dependencies=TRUE)

> library("randomForest")

> rfClust <- randomForest(kmClust.cluster~., data=dtmWithClust, importance=TRUE, proximity=TRUE)
Warning message:
In randomForest.default(m, y,…):
The response has five or fewer unique values. Are you sure you want to do regression?
> print(rfClust)
Call:
randomForest(formula = kmClust.cluster ~ ., data = dtmWithClust, importance = TRUE, proximity = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.05231353
% Var explained: 90.23
> importance(rfClust)
%IncMSE IncNodePurity
access  5.21554027 2.1754015
attack  0.73614019 2.2021336
can  0.43914970 1.2053006
data  2.12488252 1.3255387
hacked  2.22153755 3.0291678
hacker  0.59934964 2.1058682
hackers  0.69140487 1.4719546
incident  9.35312755 6.7942922
information  0.92171115 1.9432987
informationp  1.10089894 1.3157080
injection  0.98117200 1.4235802
new  1.89448271 2.0425851
one  1.38418946 1.8823775
online  1.13857428 0.9484073
padditional  1.87159685 0.7986896
ppadditional  2.35617800 0.9213330
said  7.12029542 1.8712096
security  0.84890151 2.2249081
service  2.18474469 0.4115085
site  101.31981296 144.1131055
sites  5.38968388 5.5126839
sql  1.97680424 2.0980965
time  0.05651688 2.7443020
used  1.26504782 1.3189609
users  2.15274360 0.9221132
using  0.67911496 2.0038781
vulnerability 0.42220017 1.5492946
web  60.89116556 92.5641831
Website  4.98273660 1.4276663
xss  4.11442275 1.6486677

Other Applicable Security Areas and Scenarios

We used rather generic data, but these same techniques can be applied to a variety of security scenarios. It is difficult to get actual data pertaining to real security breaches, as such data sets tend to be proprietary. Also, company leaders generally are not keen to release their data to the public. Such proprietary data sets often contain personally identifiable information that may be difficult to remove entirely. They also can reveal weaknesses in a company’s defenses, which may be exploited by other attackers. The data used in this chapter do not reveal the full power of the text analysis tools in R as actual, detailed corporate data would.
Nevertheless, it is hoped that the analysis provided here is sufficient to get readers thinking and using their own creativity as to how to use these methods on their own data. Following are a few suggestions to think about, in hopes that they will be a catalyst for further creativity on the part of the reader.

Additional Security Scenarios for Text Mining

Most call centers record the telephone conversations that they have with their customers. These can provide a breadcrumb trail as to the patterns and techniques used by social hackers. Social engineering involves coaxing people into giving up security-related information, such as usernames, passwords, account numbers, and credit card details, generally by falsely impersonating a legitimate stakeholder. These recordings may be bulk transformed into text by using automated voice dictation software, such as the Dragonfly package in Python (http://code.google.com/p/dragonfly/). Some call centers utilize chat sessions or e-mail instead of phone calls, in which case voice-to-text conversion would be unnecessary. Text mining techniques may be used to find patterns in the resulting text files, with each call representing a document. In addition to applying the same processes as discussed in this chapter, you might also have call center employees flag calls in the database that seem suspicious. You could then use decision trees, such as those discussed in this chapter, to find combinations of words that may predict a high probability of fraudulent intent from a caller. Even if a caller uses different phone numbers to make phone calls, which is a common technique among social engineers, the word patterns they use may give them away. A rough sketch of this idea follows.
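This is a heavily hedged sketch; the data frame “callData” and its “transcript” and “flagged” columns are hypothetical names, and the cleansing steps simply mirror those used earlier in this chapter:
#Hypothetical sketch: predicting a fraud flag from call transcript terms
library(tm)
library(rpart)
calls <- Corpus(VectorSource(callData[,"transcript"])) #callData is hypothetical
calls <- tm_map(calls, tolower)
calls <- tm_map(calls, removePunctuation)
callDtm <- removeSparseTerms(DocumentTermMatrix(calls), 0.95)
callFrame <- data.frame(flagged=factor(callData[,"flagged"]), as.matrix(callDtm))
callTree <- rpart(flagged ~ ., data=callFrame, method="class")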
Another potential use of text mining is with log files, such as the network router and server logs we analyzed in Chapter 3. In that chapter, we used more direct means of finding potential security breaches and potentially malicious activity. However, text mining techniques could be applied to log data, such as the user, agent, or referrer fields. This could yield patterns that might otherwise be obscured. For instance, a referrer might use several variants of URLs from which to make requests of the server. This would be hard to catch with a direct query. However, cluster analysis on word frequencies could find enough similarities in the URLs used to reveal that they are associated. A brief sketch of this idea follows.
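As a hedged sketch (the “logData” object and its “referrer” column are hypothetical names), one could tokenize referrer strings by replacing URL punctuation with spaces and then cluster the resulting term frequencies:
#Hypothetical sketch: clustering referrer strings from a parsed log file
library(tm)
refs <- Corpus(VectorSource(gsub("[/._?=-]", " ", logData[,"referrer"])))
refDtm <- removeSparseTerms(DocumentTermMatrix(refs), 0.99)
refClust <- kmeans(as.matrix(refDtm), centers=5) #group similar referrer URLs
table(refClust$cluster) #how many log entries fall into each group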
Corporate e-mail contains very large collections of unstructured text. Most corporations have software that helps them sift through e-mails to identify spam. However, it is more unusual to find systems that identify phishing attacks. Text mining could be used to find patterns in phishing attack e-mails, and predictive models could be used to identify new e-mails that have a high probability of being a phishing attack. Predictive models could include the decision trees discussed here, or any number of a wide variety of other predictive modeling algorithms that are beyond the scope of this text. Once potential new phishing attack threats are identified, employees could be warned about them as a training measure and the senders could be placed on a watch list. If suspicious e-mail senders identified by the algorithms are later verified as actual attackers, they could be put on a block list, preventing further e-mails from getting through. Of course, there are ways to get around this, such as through bots that send e-mails from multiple accounts. Never-ending vigilance is required.

Text Mining and Big Data

When data sets to be analyzed become too big to be handled by analytical tools such as R, there are a variety of techniques that can be applied. Some techniques are used to turn big data into smaller data through aggregation, so that it can be more readily digested by analytical tools such as R. Hadoop and MapReduce, as discussed in Chapter 3, were used in a similar manner, to aggregate a large amount of data into a much smaller time series, which was then analyzed in R as well as a spreadsheet.
Similar aggregation and data size reduction techniques can be used for text mining, although you will need a means of tokenizing the data. That is, you will need some way of isolating each word occurrence and pairing it with an occurrence frequency for each document. We discussed Hive as a means of using familiar SQL-like code to aggregate data. Hive happens to also have several built-in functions for tokenizing data. For basic application of a stop word list, and calculation of word frequencies, the “ngrams” function in Hive is very useful. This function also has the capability of only selecting words that occur above a frequency threshold, further reducing the size of the data. Another interesting Hive function for string tokenizing is “context_ngrams,” which tokenizes data within a sentence context. For instance, you might want to only find words that occur within the context of a sentence containing the words “username” and “password.” If you want more information and examples as to the usage of these functions, you are encouraged to examine Apache’s resources online such as the Apache wiki: https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining.
Once the data are tokenized and reduced to a more manageable size, the previously discussed analytical methods can be applied in R. The R language has a vast number of analytical packages and functions of all sorts, including others for text mining that we do not have space to cover here. So, it makes sense to try to do as much of your analysis as you can using R.
Alternatively, instead of aggregating and reducing your data size, another means of handling Big Data is to run your analytical algorithms utilizing parallel processing. There are many ways to do this, although many may add significant additional complexity to your processes and code. The CRAN R Web site at http://www.cran.r-project.org has an entire page listing and describing R packages for utilizing parallel processing with R. This page can be found by clicking the “Task Views” link in the left margin of the CRAN R Web site, followed by selecting the “HighPerformanceComputing” link. In addition to packages that run R on multiple processors, such as “multicore” and “snow,” there are also packages for running R jobs in a Hadoop, MapReduce framework such as “HadoopStreaming,” as well as Revolution Analytics’ “rmr” package. Before further investigating Big Data techniques, however, you should be sure you understand the foundational text mining concepts provided in the examples in this chapter.