19.1 Introduction

In this chapter, we discuss insider threat detection for sequence data. A sequence is an ordered list of objects (or events); its members are also called elements or terms. In a set, the order of elements does not matter. In a sequence, by contrast, order matters, and hence exactly the same elements can appear multiple times at different positions [QUMR13]. For example, (U, T, D) is a sequence of letters with the letter “U” first and “D” last. This sequence differs from (D, T, U).

The sequence (U, T, D, A, L, L, A, S), which contains the letter “A” at two different positions, is a valid sequence. Figure 19.1 illustrates sample sequences of the movement patterns of a user. The first row represents one movement pattern sequence of a particular user: student center, office, and media lab (ml). In this sequence, the user was first at the student center and last at the media lab [EAGL06].


Figure 19.1 Example of sequence data related to movement pattern.

The organization of this chapter is as follows. Classifying sequence data is discussed in Section 19.2. Unsupervised stream-based sequence learning is discussed in Section 19.3. Anomaly detection aspects are discussed in Section 19.4. Complexity analysis is provided in Section 19.5. The chapter is summarized in Section 19.6.

19.2 Classifying Sequence Data

As the length of a sequence is defined as the number of its ordered elements, sequence data can be finite or infinite in length. Infinite sequences are known as stream sequence data. Insider threat detection-related sequence data is stream-based in nature. Sequence data may be gathered over time, perhaps even years. In this case, we assume the data stream is converted into a number of chunks. For example, each chunk may represent a week and contain the sequence data that arrived during that time period.
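To make the chunking concrete, here is a minimal sketch in Python; the `chunk_stream` helper and the (timestamp, command) input format are illustrative assumptions, not part of the system described in this chapter:

```python
from datetime import datetime, timedelta

def chunk_stream(events, period_days=7):
    """Partition a time-ordered stream of (timestamp, command) pairs into
    consecutive fixed-period chunks, e.g., one chunk per week."""
    chunks, current, boundary = [], [], None
    for ts, cmd in events:
        if boundary is None:
            boundary = ts + timedelta(days=period_days)
        while ts >= boundary:          # close chunks until ts fits
            chunks.append(current)
            current = []
            boundary += timedelta(days=period_days)
        current.append(cmd)
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be mined independently for normative patterns.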

Figure 19.2 demonstrates how the classifier decision boundary changes over time (from one chunk to the next). Data points are associated with two classes (normal and anomalous). In particular, a user command or a subsequence of commands may form a pattern/phrase that we call here a data point. A nonrepetitive pattern/phrase for a user may form an anomaly. Three contiguous data chunks are shown in Figure 19.2. The dark straight line represents the decision boundary of its own chunk, whereas the dotted straight line represents the decision boundary of the previous chunk. If there were no concept drift in the data stream, the decision boundary would be the same for both the current chunk and its previous chunk (the dotted and solid lines would coincide). White dots represent normal data (true negatives), blue dots represent anomalous data (true positives), and striped dots represent instances misclassified as a result of concept drift.


Figure 19.2 Concept drift in stream data.

The two cases illustrated in Figure 19.2 are as follows:

Case 1: The decision boundary of the second chunk moves upward compared to that of the first chunk. As a result, more normal data will be classified as anomalous by the decision boundary of the first chunk; thus, the number of false positives (FPs) will go up. Recall that a test point whose true category is benign (normal) but which is classified as anomalous by a classifier is known as an FP.

Case 2: The decision boundary of the third chunk moves downward compared to that of the first chunk. So, more anomalous data will be classified as normal by the decision boundary of the first chunk; thus, the number of false negatives (FNs) will go up. Recall that a test point whose true category is malicious but which is classified as benign by a classifier is known as an FN.

In the more general case, the decision boundary of the current chunk can vary, which causes the decision boundary of the previous chunk to misclassify both normal and anomalous data. Therefore, both FP and FN may go up at the same time.

This suggests that a model built from a single chunk will not suffice. This motivates the adoption of adaptive learning. In particular, we will exploit two approaches as follows:

Incremental Learning: A single dictionary is maintained [PARV12a]. When a normative sequence pattern is learned from a chunk, it is simply added to the dictionary. To find the normative patterns, we exploit unsupervised stream-based sequence learning (USSL). The incremental learning classification procedure is illustrated in Figure 19.3. Here, patterns are first extracted from a new chunk; next, these patterns are merged with the old quantized dictionary (QD) from previous chunks. Finally, the merged dictionary is quantized, as shown in Figure 19.4.


Figure 19.3 Unsupervised stream sequence learning using incremental learning.


Figure 19.4 Block diagram of incremental learning.

Ensemble Learning: A number of dictionaries are maintained [PARV12b]. In the ensemble, we maintain K models, each with its own dictionary. Our ensemble approach classifies data using an evolving set of K models. The ensemble classification procedure is illustrated in Figure 19.5. Recall that we use USSL to train models from individual chunks. USSL identifies the normative patterns in a chunk and stores them in a QD. The literature ([MASU08], [MASU10], [MASU11]) shows that ensemble-based learning is more effective than incremental learning, so we focus on ensemble-based learning here. To identify an anomaly, a test point is compared against each model of the ensemble.


Figure 19.5 Ensemble-based unsupervised stream sequence learning.

Recall that a model declares test data anomalous based on how much the test differs from the model’s normative patterns. Once all models cast their votes, we apply majority voting to make the final decision as to whether the test point is anomalous (as shown in Figure 16.2).
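A rough sketch of this voting scheme, under the simplifying assumptions that each model is just a set of normative patterns and that deviation is measured by edit distance against a 30% length threshold (`edit_distance` and `ensemble_is_anomaly` are illustrative names, not the authors' code):

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def ensemble_is_anomaly(models, pattern, threshold=0.3):
    """Each model (a set of normative patterns) votes 'anomalous' if `pattern`
    deviates from every one of its patterns by more than `threshold` of that
    pattern's length; the majority of model votes decides."""
    votes = sum(
        all(edit_distance(pattern, p) > threshold * len(p) for p in qd)
        for qd in models)
    return votes > len(models) / 2
```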

Model Update: We always keep an ensemble of fixed size (K models in this case). Hence, when a new chunk is processed, we already have K models in the ensemble, and the (K + 1)st model is created from the current chunk. We need to update the ensemble by replacing a victim model with this new model. Victim selection can be done in a number of ways. One approach is to calculate the prediction error of each model on the most recent chunk relative to the majority vote. Here, we assume that ground truth for the most recent chunk is not available; if ground truth were available, we could exploit that knowledge for training. The new model replaces the existing model that gives the maximum prediction error.
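The victim selection just described might be sketched as follows; `update_ensemble` and the `classify(model, point)` callback are hypothetical, and disagreement with the majority vote stands in for the prediction error:

```python
def update_ensemble(ensemble, new_model, recent_chunk, K, classify):
    """Temporarily add the model built from the newest chunk, then drop the
    model that disagrees most with the majority vote on the most recent
    chunk (maximum prediction error, since ground truth is assumed
    unavailable). classify(model, x) returns 1 for anomalous, 0 for normal."""
    candidates = ensemble + [new_model]
    if len(candidates) <= K:
        return candidates
    # Majority-vote label for each point in the most recent chunk.
    majority = {x: int(sum(classify(m, x) for m in candidates)
                       > len(candidates) / 2)
                for x in recent_chunk}
    # Prediction error of a model = number of disagreements with the majority.
    def error(model):
        return sum(classify(model, x) != majority[x] for x in recent_chunk)
    victim = max(candidates, key=error)
    candidates.remove(victim)
    return candidates
```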

19.3 Unsupervised Stream-based Sequence Learning (USSL)

A normal user profile consists of repetitive daily or weekly activities, that is, frequent sequences of commands, system calls, etc. These repetitive command sequences are called normative patterns. These patterns reveal the regular, or normal, behavior of a user. When a user suddenly demonstrates unusual activities that indicate a significant excursion from normal behavior, an alarm is raised for potential insider threat.

So, in order to identify an insider threat, we first need to characterize normal user behavior. For that, we need to collect sequences of commands and find the potential normative patterns within these command sequences in an unsupervised fashion. This unsupervised approach also needs to identify normal user behavior in a single pass. One major challenge with these repetitive sequences is their variability in length. To combat this problem, we generate a dictionary that contains any combination of possible normative patterns existing in the gathered data stream. Potential variations that could emerge within the data include the commencement of new events, the omission or modification of existing events, or the reordering of events in the sequence. For example, liftliftliftliftliftcomecomecomecomecomecome is a sequence of commands, represented by letters, given in a data stream. We consider all patterns li, if, ft, tl, lif, ift, ftl, lift, iftl, etc., as possible normative patterns. However, the huge size of the resulting dictionary presents another significant challenge.

We address the above two challenges in the following ways. First, we extract possible patterns from the current data chunk using a single-pass algorithm (e.g., the Lempel–Ziv–Welch (LZW) algorithm [ZIV77]) to prepare a dictionary that we call the LZW dictionary. This dictionary holds a set of patterns and their corresponding weights according to

w_i = f_i / Σ_{i=1}^{n} f_i     (19.1)

where w_i is the weight of a particular pattern p_i in the current chunk, f_i is the number of times the pattern p_i appears in the current chunk, and n is the total number of distinct patterns found in that chunk.
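A minimal illustration of Equation 19.1; the `pattern_weights` helper and the example frequencies are assumptions for illustration only:

```python
def pattern_weights(frequencies):
    """Normalize raw pattern frequencies into the weights of Equation 19.1:
    w_i = f_i / (sum of f_i over all distinct patterns in the chunk)."""
    total = sum(frequencies.values())
    return {pattern: f / total for pattern, f in frequencies.items()}
```

For instance, hypothetical chunk frequencies {lift: 5, come: 6, li: 1} yield a weight of 5/12 for lift, and the weights of all patterns sum to 1.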

Next, we compress the dictionary by keeping only the longest, frequent, unique patterns according to their associated weight and length, while discarding the other, subsumed patterns. This technique is called the compression method (CM), and the new dictionary is a QD. The QD holds a set of patterns and their corresponding weights. Here, we use the edit distance to find the longest pattern. The edit distance is a measure of similarity between pairs of strings ([BORG11], [VLAD66]). It is the minimum number of actions required to transform one string into another, where an action can be the substitution, addition, or deletion of a character in the string. In the case of the earlier example, the best normative patterns in the QD would be lift, come, etc.
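The edit distance itself can be computed with the standard dynamic-programming recurrence; this is a textbook implementation rather than the authors' code:

```python
def edit_distance(a, b):
    """Minimum number of single-character substitutions, insertions, and
    deletions needed to transform string a into string b (Levenshtein)."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i            # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j            # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]
```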

This process is a lossy compression, but it is sufficient to extract the meaningful normative patterns. The reason is that the patterns we keep are supersets of the subsumed patterns. Moreover, as frequency is another control parameter in our experiments, patterns that do not appear often cannot be regular user patterns.

Data relevant to insider threat is typically accumulated over many years of organizational and system operations, and is therefore best characterized as an unbounded data stream. As our data is a continuous stream, we use ensemble-based learning to continuously update our compressed dictionary. The continuous data stream is partitioned into a sequence of discrete chunks. For example, each chunk might comprise a day’s or a week’s worth of data and may contain several user sessions. We generate a QD and the associated weights from each chunk, where weight is measured as the normalized frequency of a pattern within that chunk.

When a new chunk arrives, we generate a new QD model and update the ensemble as mentioned earlier. Figure 19.6 shows the flow diagram of our dynamic, ensemble-based, unsupervised stream sequence learning method. Algorithm 19.1 shows the basic building block for updating the ensemble. It takes the most recent data chunk S, the ensemble E, and the test chunk T. Lines 3–4 generate a new QD model from the most recent chunk S and temporarily add it to the ensemble E. Lines 5–9 test chunk T for anomalies against each model in the ensemble. Lines 13–24 find and label the anomalous patterns in test chunk T according to the majority vote of the models in the ensemble. Finally, line 29 updates the ensemble by discarding the model with the lowest accuracy. An arbitrary model is discarded in the case of multiple models having the same lowest performance.


Figure 19.6 Unsupervised stream-based sequence learning (USSL) from a chunk in ensemble-based case.

19.3.1 Construct the LZW Dictionary by Selecting the Patterns in the Data Stream

At the outset, our data is not annotated (i.e., the setting is unsupervised). In other words, we do not know the possible sequence of future operations by the user. So, we use the LZW algorithm [ZIV77] to extract the possible sequences to add to our dictionary. These can be commands like liftliftliftliftliftcomecomecomecomecomecome, where each unique letter represents a unique system call or command. We use a unique one-character code to index each command; for example, ls, cp, and find are indexed as l, c, and f, respectively. The possible patterns or sequences added to our dictionary would be li, if, ft, tl, lif, ift, ftl, lift, iftl, ftli, tc, co, om, mc, com, come, and so on. When the sequence li is seen in the data stream for the second time, it is not added to the LZW dictionary again; instead, we increase its frequency by 1 and extend the pattern by concatenating it with the next character in the data stream, thus turning up a new pattern lif. We continue this process until we reach the end of the current chunk. Figure 19.7 demonstrates how we generate an LZW dictionary from the data stream.
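The single-pass construction just described might be sketched as follows; `lzw_dictionary` is an illustrative simplification of the LZW-style scan, not the exact implementation:

```python
def lzw_dictionary(stream):
    """Build an LZW-style dictionary of patterns and frequencies in one pass.
    When the current candidate pattern is already known, its count is
    incremented and the pattern is extended with the next character;
    otherwise it is added and the scan restarts from the next character."""
    dictionary = {}
    current = ""
    for ch in stream:
        candidate = current + ch
        if candidate in dictionary:
            dictionary[candidate] += 1
            current = candidate     # extend the repeated pattern
        else:
            dictionary[candidate] = 1
            current = ""            # start a fresh pattern
    return dictionary
```

For example, scanning the stream "lili" yields counts {l: 2, i: 1, li: 1}: the second l is recognized, so its count is bumped and the pattern is extended to li.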


Figure 19.7 Quantization of dictionary.

19.3.2 Constructing the Quantized Dictionary

Once we have our LZW dictionary, we keep the longest and most frequent patterns and discard all their subsumed patterns. Algorithm 19.2 shows step by step how a QD is generated from the LZW dictionary. The input of this algorithm is the LZW dictionary D, which contains a set of patterns P and their associated weights W. Line 5 picks a pattern (e.g., li). Lines 7–9 find all the closest patterns that are 1 edit distance away. Lines 13–16 keep the pattern that has the highest weight multiplied by its length and discard the other patterns. We repeat these steps (lines 5–16) until we find the longest, frequent pattern (lift). After that, we start with a completely different pattern (co) and repeat the steps until we have explored all the patterns in the dictionary. Finally, we end up with a more compact dictionary that contains many meaningful and useful sequences. We call this dictionary our QD. Figure 19.7 demonstrates how we generate a QD from the LZW dictionary. Once we identify the different patterns lift, come, etc., any pattern with more than X% deviation (30% in our implementation) from all these patterns is considered an anomaly. Here, we use edit distance to measure the deviation.
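A sketch of the compression method under the description above: patterns that are transitively 1 edit apart are grouped, and only the member maximizing weight times length is kept. `quantize` and the compact `edit_distance` helper are illustrative, not the book's implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance (rolling one-row table)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def quantize(lzw_dict):
    """Compress an LZW dictionary: grow a group of patterns that are
    transitively 1 edit apart, then keep only the member that maximizes
    weight * length, discarding the subsumed patterns."""
    remaining = dict(lzw_dict)
    quantized = {}
    while remaining:
        group = {next(iter(remaining))}   # seed with an unvisited pattern
        grew = True
        while grew:                       # transitive closure of 1-edit links
            grew = False
            for p in list(remaining):
                if p not in group and any(edit_distance(p, g) == 1
                                          for g in group):
                    group.add(p)
                    grew = True
        best = max(group, key=lambda p: remaining[p] * len(p))
        quantized[best] = remaining[best]
        for p in group:
            del remaining[p]
    return quantized
```

On a toy LZW dictionary containing li, lif, lift, co, com, and come, only the longest, heaviest patterns lift and come survive.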

Algorithm 19.1 Update the Ensemble

1: Input: E (ensemble), T (test chunk), S (chunk)
2: Output: E (updated ensemble)
3: M ← NewModel(S)
4: E ← E ∪ {M}
5: for each model M in ensemble E do
6:  A_M ← T
7:  for each pattern p in M and each pattern x in A_M do
8:   if EditDistance(x, p) ≤ α, where α = 1/3 of the length of p, then
9:    A_M ← A_M − {x}
10:   end if
11:  end for
12: end for
13: for each candidate a in ∪_{M∈E} A_M do
14:  if round(WeightedAverage(E, a)) = 1 then
15:   A ← A ∪ {a}
16:   for each model M in ensemble E do
17:    if a ∈ A_M then
18:     c_M ← c_M + 1
19:    end if
20:   end for
21:  else
22:   for each model M in ensemble E do
23:    if a ∉ A_M then
24:     c_M ← c_M + 1
25:    end if
26:   end for
27:  end if
28: end for
29: E ← E − {choose(arg min_M (c_M))}

Algorithm 19.2 Quantized Dictionary

1: Input: D = {Pattern, Weight} (LZW dictionary)
2: Output: QD (quantized dictionary)
3: Visited ← ∅
4: while D ≠ ∅ do
5:  X ← D_j | j ∉ Visited, D_j ∈ D
6:  Visited ← Visited ∪ {j}
7:  for each pattern i in D do
8:   if EditDistance(X, D_i) = 1 then
9:    P ← P ∪ {D_i}
10:   end if
11:  end for
12:  D ← D − {X}
13:  if P ≠ ∅ then
14:   X ← choose(arg max_i (w_i × l_i)) | l_i = Length(P_i), w_i = Weight(P_i), P_i ∈ P
15:   QD ← QD ∪ {X}
16:   D ← D − P
17:  end if
18:  X ← D_j | j ∉ Visited, D_j ∈ D
19:  Visited ← Visited ∪ {j}
20: end while

19.4 Anomaly Detection

Given a QD, we need to find the sequences in the data stream that may indicate a potential threat. To formulate the problem: given the data stream S and ensemble E, where E = {QD_1, QD_2, QD_3, …} and QD_i = {qd_i1, qd_i2, …}, any pattern in the data stream is considered an anomaly if it deviates from all the patterns qd_ij in E by more than X% (say, 30%). In order to find the anomalies, we first find the matching patterns and delete them from the stream S. In particular, we find each pattern in the data stream S that is an exact match of, or at most α edit distance away from, some pattern qd_ij in E. Such a pattern is considered a matching pattern. α can be one-half, one-third, or one-fourth of the length of that particular pattern qd_ij. The patterns remaining in the stream are then considered anomalies.

In order to identify the nonmatching patterns in the data stream S, we compute a distance matrix L that contains the edit distance between each pattern qd_ij in E and the data stream S. If we have a perfect match, that is, an edit distance of 0 between a pattern qd_ij and S, we can move backward exactly the length of qd_ij to find the starting point of that pattern in S, and then delete it from the data stream. On the other hand, if the match error is greater than 0 but less than α, then to find the starting point of that pattern in the data stream we traverse left, diagonally, or up within the matrix according to which of the values L[i, j−1], L[i−1, j−1], and L[i−1, j], respectively, is the minimum. Finally, once we find the starting point, we delete that pattern from the data stream. The patterns remaining in the data stream are considered anomalous.
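A simplified, forward-scanning sketch of this matching step (not the backward distance-matrix traversal described above); `find_anomalies` and its fixed-length windows are illustrative assumptions:

```python
def edit_distance(a, b):
    """Levenshtein distance (rolling one-row table)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def find_anomalies(stream, ensemble, alpha_fraction=1 / 3):
    """Scan the stream left to right; a window that matches some normative
    pattern qd (edit distance <= alpha_fraction * len(qd)) is consumed as
    normal, and everything not covered by a match is reported as anomalous."""
    patterns = sorted((p for qd in ensemble for p in qd),
                      key=len, reverse=True)   # prefer the longest match
    anomalies, i = [], 0
    while i < len(stream):
        for p in patterns:
            window = stream[i:i + len(p)]
            if len(window) == len(p) and \
                    edit_distance(window, p) <= alpha_fraction * len(p):
                i += len(p)                    # consume the matched pattern
                break
        else:
            anomalies.append(stream[i])        # no pattern matched here
            i += 1
    return anomalies
```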

19.5 Complexity Analysis

Here, we report the time complexity of QD construction. Calculating the edit distance between two patterns of length at most K has worst-case time complexity O(K²).

Suppose we have n patterns in our LZW dictionary, from which we must construct a QD. To do this, we need to find the patterns in the LZW dictionary that are 1 edit distance away from a particular pattern (say, p), which requires calculating the edit distance between p and every other pattern. Recall that the time complexity of finding the edit distance between two patterns is O(K²). Since there are n distinct patterns in total, the total time complexity of comparing p with the rest of the patterns is O(n × K²). Note that p is itself one of the n patterns; therefore, the total time complexity over all pairs of patterns is O(n² × K²). This holds for a single user. If there are u distinct users, the total time complexity across all u users is O(u × n² × K²) (see Table 19.1).

Table 19.1 Time Complexity of Quantized Dictionary Construction

Description          Time Complexity
Pair of patterns     O(n² × K²)
u distinct users     O(u × n² × K²)

19.6 Summary and Directions

Insider threat detection-related sequence data is stream-based in nature. Sequence data may be gathered over time, perhaps even years. In this case, we assume a continuous data stream is converted into a number of chunks. For example, each chunk may represent a week and contain the sequence data that arrived during that time period. We described incremental and ensemble-based unsupervised learning techniques for mining sequence data streams. Experimental results for our techniques are provided in Chapter 20.

Future work should examine how our techniques can be enhanced to provide better accuracy and fewer false positives and negatives. Scalability of the techniques to handle massive data streams also needs to be investigated.

References

[BORG11]. E. N. Borges, M. G. de Carvalho, R. Galante, M. A. Gonçalves, A. H. F. Laender, “An Unsupervised Heuristic-Based Approach for Bibliographic Metadata Deduplication,” Information Processing and Management 47 (5), 706–718, 2011.

[EAGL06]. N. Eagle and A. (Sandy) Pentland, “Reality Mining: Sensing Complex Social Systems,” Personal and Ubiquitous Computing 10 (4), 255–268, 2006.

[MASU08]. M. M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, “A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data,” In ICDM’08: Proceedings of the IEEE International Conference on Data Mining, Pisa, Italy, pp. 929–934, 2008.

[MASU09]. M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, A Multi-Partition Multi-Chunk Ensemble Technique to Classify Concept-Drifting Data Streams, Advances in Knowledge Discovery and Data Mining, Springer, Berlin, pp. 363–375, 2009.

[MASU10]. M. M. Masud, Q. Chen, J. Gao, L. Khan, C. Aggarwal, J. Han, B. Thuraisingham, “Addressing Concept-Evolution in Concept-Drifting Data Streams,” In ICDM’10: Proceedings of the IEEE International Conference on Data Mining, Sydney, Australia, pp. 929–934, 2010.

[MASU11]. M. M. Masud, J. Gao, L. Khan, J. Han, B. M. Thuraisingham, “Classification and Novel Class Detection in Concept-Drifting Data Streams Under Time Constraints,” IEEE Transactions on Knowledge and Data Engineering 23 (6), 859–874, 2011.

[PARV12a]. P. Parveen and B. Thuraisingham, “Unsupervised Incremental Sequence Learning for Insider Threat Detection,” In IS’2012: Proceedings of the IEEE International Conference on Intelligence and Security, June, Washington, D.C., 2012.

[PARV12b]. P. Parveen, N. McDaniel, B. Thuraisingham, L. Khan, “Unsupervised Ensemble Based Learning for Insider Threat Detection,” In PASSAT’2012: Proceedings of the 4th IEEE International Conference on Information Privacy, Security, Risk and Trust, September, Amsterdam, The Netherlands, 2012.

[QUMR13]. S. M. Qumruzzaman, L. Khan, B. M. Thuraisingham, “Behavioral Sequence Prediction for Evolving Data Stream,” In IRI’2013: Proceedings of the 14th IEEE Conference on Information Reuse and Integration, August 14–16, San Francisco, CA, pp. 482–488, 2013.

[VLAD66]. V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics—Doklady 10 (8), 707–710, 1966.

[ZIV77]. J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Transactions on Information Theory 23 (3), 337–343, 1977.
