4.1 Overview

Data mining has many applications in security, both in national security (e.g., surveillance) and in cyber security (e.g., virus detection). The threats to national security include attacks on buildings and the destruction of critical infrastructures such as power grids and telecommunication systems [BOLZ05]. Data mining techniques are being investigated to identify suspicious individuals and those capable of carrying out terrorist activities [THUR03]. Cyber security is concerned with protecting computer and network systems against corruption due to malicious software such as Trojan horses and viruses. Data mining is also being applied to provide solutions such as intrusion and malware detection and auditing [MASU11]. In this chapter, we will focus mainly on data mining for cyber security applications.

To understand the mechanisms to be applied to safeguard the nation and its computers and networks, we need to understand the types of threats. In [THUR03], we described real-time threats as well as nonreal-time threats. A real-time threat is a threat that must be acted upon within a certain time to prevent some catastrophic situation. Note that a nonreal-time threat could become a real-time threat over time. For example, one could suspect that a group of terrorists will eventually perform some act of terrorism. However, when we set time bounds, such as a threat that will likely occur before, say, July 1, 2018, then it becomes a real-time threat and we have to take action immediately. If the time bounds are tighter, such as “a threat will occur within 2 days,” then we cannot afford to make any mistakes in our response.

There has been a lot of work on applying data mining for both national security and cyber security and our previous books have focused on both aspects (e.g., [THUR03] and [MASU11]). Our focus in this chapter will be mainly on applying data mining for cyber security. In Section 4.2, we will discuss data mining for cyber security applications. In particular, we will discuss the threats to computers and networks and describe the applications of data mining to detect such threats and attacks. Some of the data mining tools for security applications developed at The University of Texas at Dallas will be discussed in Section 4.3. We are reimplementing some of our tools to analyze massive amounts of data. That is, we are developing big data analytics tools for cyber security applications and some of our current work will be discussed later in this book. This chapter is summarized in Section 4.4. Figure 4.1 illustrates data mining applications in security.

Figure 4.1 Data mining applications in security.

4.2 Data Mining for Cyber Security

4.2.1 Cyber Security Threats

This section discusses the various cyber threats, including cyber terrorism, insider threats, and external attacks. Figure 4.2 illustrates the various types of cyber security threats.

Figure 4.2 Cyber security threats.

4.2.1.1 Cyber Terrorism, Insider Threats, and External Attacks

Cyber terrorism is one of the major terrorist threats posed to our nation today. As we have mentioned earlier, there is now so much information available electronically and on the web. Attacks on our computers as well as on networks, databases, and the Internet could be devastating to businesses. We hear almost daily about cyber attacks on businesses. It is estimated that cyber terrorism could cost businesses billions of dollars. For example, consider a banking information system. If terrorists attack such a system and deplete accounts of their funds, then the bank could lose millions and perhaps billions of dollars. By crippling the computer system, millions of hours of productivity could be lost, and that equates to money in the end. Even a simple power outage at work through some accident could cause several hours of productivity loss and, as a result, a major financial loss. Therefore, it is critical that our information systems be secure. We discuss various types of cyber terrorist attacks. One is spreading malware that can wipe out files and other important documents, and another is intruding into computer networks.

Note that threats can occur from outside or from the inside of an organization. Outside attacks are attacks on computers from someone outside the organization. We hear of hackers breaking into computer systems and causing havoc within an organization. These hackers infect the computers with malware that can not only cause great damage to the files stored in the systems but also spread to other systems via the networks. But a more sinister problem is the insider threat problem. People inside an organization who have studied the business practices develop schemes to cripple the organization’s information assets. These people could be regular employees or even those working at computer centers and contractors. The problem is quite serious as someone may be masquerading as someone else and causing all kinds of damage. Malicious processes in the system can also masquerade as benign processes and cause damage. Data mining techniques have been applied to detect the various attacks. We discuss some of these attacks next. Part III will elaborate on applying data mining for the insider threat problem.

4.2.1.2 Malicious Intrusions

Malicious intrusions may include intrusions into systems, networks, web clients and servers, and databases and applications. Many cyber terrorism attacks are due to malicious intrusions. We hear much about network intrusions. Here, intruders try to tap into the networks and obtain the information that is being transmitted. These intruders may be humans or malicious processes. Intrusions could also happen on files. For example, a malicious individual can masquerade as an employee, log into the corporation’s computer systems and network, and access the files. Intrusions can also occur on databases. Intruders pretending to be legitimate users can pose queries, such as Structured Query Language (SQL) queries, and access data that they are not authorized to see.

Essentially, cyber terrorism includes malicious intrusions as well as sabotage through malicious intrusions or otherwise. Cyber security consists of security mechanisms that attempt to provide solutions to cyber attacks or cyber terrorism. When we discuss malicious intrusions or cyber attacks, it is useful to think about the noncyber world and then translate those attacks to attacks on computers and networks. For example, a thief could enter a building through a trap door. In the same way, a computer intruder could enter the computer or network through some sort of trap door that has been intentionally built by a malicious insider or left unattended, perhaps through careless design. Another example is a thief entering a bank with a mask and stealing the money. The analogy here is an intruder who masquerades as a legitimate user and takes control of the information assets. Money in the real world translates to information assets in the cyber world. More recently, we are hearing about ransomware, where hackers not only steal the data but also hold the data to ransom by encrypting it. The owner of the data then has to pay a ransom, usually in the form of bitcoins, to retrieve his/her data. That is, there are many parallels between what happens in the real world and in the cyber world.

4.2.1.3 Credit Card Fraud and Identity Theft

Credit card fraud and identity theft are common security problems. In the case of credit card fraud, others get hold of a person’s credit card numbers through electronic means (e.g., when the card is swiped at gas stations) or otherwise and make all kinds of purchases; by the time the owner of the card finds out, it may be too late. A more serious problem is identity theft. Here, one assumes the identity of another person, say by getting hold of the social security number, and essentially carries out all transactions under the other person’s name. This could even include selling houses and depositing the income in a fraudulent bank account. By the time the owner finds out, it will be far too late. The owner may well have lost millions of dollars due to the identity theft.

We need to explore the use of data mining for both credit card fraud detection and identity theft detection. There have been some efforts on detecting credit card fraud [CHAN99]. However, detecting identity theft remains a challenge.

4.2.1.4 Attacks on Critical Infrastructures

Attacks on critical infrastructures could cripple a nation and its economy. Infrastructure attacks include attacking the telecommunication lines, the power grid, gas pipelines, reservoirs and water and food supplies, and other basic entities that are critical for the operation of a nation.

Attacks on critical infrastructures could occur due to malware or by physical means such as bombs. For example, one could attack the software that runs the telecommunication systems and close down all the telecommunication lines. Similarly, software that runs the power grid could be attacked. Infrastructures could also be damaged by natural disasters such as hurricanes and earthquakes. Our main interest here is in malicious attacks on infrastructures. While some progress has been made on developing solutions to such attacks, much remains to be done. One of the directions we are pursuing is to examine the use of data mining to detect such infrastructure attacks. Figure 4.3 illustrates attacks on critical infrastructures.

Figure 4.3 Attacks on critical infrastructures.

4.2.2 Data Mining for Cyber Security

Data mining is being applied to problems such as intrusion and malware detection and auditing. For example, anomaly detection techniques could be used to detect unusual patterns and behaviors. Link analysis may be used to trace viruses to the perpetrators. Classification may be used to group various cyber attacks and then use the resulting profiles to detect an attack when it occurs. Prediction may be used to determine potential future attacks, based in part on information learned about terrorists through email and phone conversations. Also, for some threats, nonreal-time data mining may suffice, while for certain other threats, such as network intrusions, we may need real-time data mining. Many researchers are investigating the use of data mining for intrusion detection [AWAD09]. While we need some form of real-time data mining where the results have to be generated in real time, we also need to build models in real time. For example, credit card fraud detection is a form of real-time processing. However, here, models are usually built ahead of time. Building models in real time remains a challenge. Data mining can also be used for analyzing web logs as well as audit trails. Based on the results of the data mining tool, one can then determine whether any unauthorized intrusions have occurred and/or whether any unauthorized queries have been posed [MASU11].

Other applications of data mining for cyber security include analyzing audit data. One could build a repository or a warehouse containing the audit data and then conduct an analysis using various data mining tools to see if there are potential anomalies. For example, a certain user group may access the database between 3 and 5 a.m. If this group works the night shift, there may be a valid explanation. However, if this group normally works between, say, 9 a.m. and 5 p.m., then this may be an unusual occurrence. Another example is when a person always accesses the database between 1 and 2 p.m., but for the last 2 days he/she has been accessing the database between 1 and 2 a.m. This could then be flagged as an unusual pattern that needs further investigation.
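The access-time example above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not one of the tools discussed in this chapter: the audit records are reduced to hypothetical (user, hour) pairs, the user names are invented, and an access is flagged merely because its hour has never been seen before for that user. A practical system would use richer features and a statistical or learned model.

```python
from collections import defaultdict

def build_hour_profile(history):
    """Map each user to the set of hours (0-23) seen in past audit records."""
    profile = defaultdict(set)
    for user, hour in history:
        profile[user].add(hour)
    return profile

def flag_unusual(profile, recent):
    """Return the recent (user, hour) events whose hour was never seen for that user."""
    return [(user, hour) for user, hour in recent
            if hour not in profile.get(user, set())]

# Hypothetical audit history: alice works afternoons, bob works the night shift.
history = [("alice", 13)] * 20 + [("bob", 3), ("bob", 4)] * 10
profile = build_hour_profile(history)

# alice suddenly accesses the database at 1 a.m.; bob's 3 a.m. access is routine.
recent = [("alice", 1), ("alice", 13), ("bob", 3)]
flagged = flag_unusual(profile, recent)
print(flagged)
```

Note that a user with no history at all is always flagged under this rule, which may or may not be desirable; the flagged events are candidates for further investigation rather than confirmed intrusions.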

Insider threat analysis is also a problem both from a national security and from a cyber security perspective. That is, those working in a corporation who are considered to be trusted could commit espionage. Similarly, those with proper access to the computer system could insert malicious code that behaves like benign code until an opportunity arrives to steal the data. Catching such terrorists is far more difficult than catching terrorists outside of an organization. One may need to monitor the access patterns of all the individuals of a corporation even if they are system administrators, to see whether they are carrying out cyber terrorism activities. However, this could result in privacy violations. Our approach to applying data mining for insider threat detection is discussed in Part III.

While data mining can be used to detect and possibly prevent cyber attacks, data mining also exacerbates some security problems such as the inference and privacy problems. With data mining techniques, one could infer sensitive associations from legitimate responses. Figure 4.4 illustrates data mining services for cyber security. For a high-level overview, we refer the reader to [THUR04] and [THUR05].

Figure 4.4 Data mining services for cyber security.

4.3 Data Mining Tools

Over the past decade, we have developed a number of data mining tools for cyber security applications at The University of Texas at Dallas. In one of our previous books, we discussed one such tool for intrusion detection [AWAD09]. In another book, we discussed a number of data mining tools for malware detection [MASU11]. In this section, we provide an overview of these tools.

An intrusion can be defined as any set of actions that attempts to compromise the integrity, confidentiality, or availability of a resource. As systems become more complex, there are always exploitable weaknesses due to design and programming errors, or through the use of various “socially engineered” penetration techniques. Computer attacks are split into two categories, host-based attacks and network-based attacks. Host-based attacks target a machine and try to gain access to privileged services or resources on that machine. Host-based detection usually uses routines that obtain system call data from an audit process which tracks all system calls made on behalf of each user.

Network-based attacks make it difficult for legitimate users to access various network services by purposely occupying or sabotaging network resources and services. This can be done by sending large amounts of network traffic, exploiting well-known faults in networking services, overloading network hosts, and so on. Network-based attack detection uses network traffic data (e.g., as captured by tcpdump) to look at traffic addressed to the machines being monitored. Intrusion detection systems are split into two groups: anomaly detection systems and misuse detection systems.

Anomaly detection is the attempt to identify malicious traffic based on deviations from established normal network traffic patterns. Misuse detection is the ability to identify intrusions based on a known pattern for the malicious activity. These known patterns are referred to as signatures. Anomaly detection is capable of catching new attacks. However, new legitimate behavior can also be falsely identified as an attack, resulting in a false positive. The focus of current state-of-the-art technology is on reducing the false negative and false positive rates. In [AWAD09], we discussed in detail an innovative data mining technique called DGSOT (dynamically growing self-organizing tree) for intrusion detection. We showed through experimentation that DGSOT performed much better and gave more accurate results than other tools in the literature at that time.
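The difference between misuse detection and anomaly detection can be illustrated with a small Python sketch. It is a toy example under stated assumptions and is not the DGSOT technique: the signature strings, the baseline request counts, and the threshold of three standard deviations are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev

# Hypothetical signatures: known patterns of malicious activity.
SIGNATURES = ["/etc/passwd", "' OR '1'='1"]

def misuse_detect(payload):
    """Misuse detection: flag a payload that contains any known signature."""
    return any(sig in payload for sig in SIGNATURES)

def anomaly_detect(baseline, value, threshold=3.0):
    """Anomaly detection: flag a value that deviates from the baseline
    by more than `threshold` sample standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical requests per minute observed during normal operation.
baseline = [100, 104, 98, 101, 97, 102, 99, 103]

print(misuse_detect("GET /../../etc/passwd"))  # matches a known signature
print(misuse_detect("GET /index.html"))        # no known signature
print(anomaly_detect(baseline, 180))           # far outside the baseline
print(anomaly_detect(baseline, 120))           # possibly legitimate, still flagged
```

The last call shows the trade-off described above: a modest but legitimate increase in traffic is flagged because it deviates from the baseline, which is how anomaly detection produces false positives, whereas the signature check can never flag an attack whose pattern it has not seen.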

Following DGSOT, we designed a number of data mining tools for malware detection; this work is presented in [MASU11]. These include tools for email worm detection, malicious code detection, buffer overflow detection, and botnet detection, as well as for analyzing firewall policy rules. Figure 4.5 illustrates the various tools we have developed. For example, for email worm detection, we examine emails and extract features such as “number of attachments,” and then train data mining tools with techniques such as support vector machine (SVM) or Naïve Bayesian classifiers to develop a model. Then, we test the model and determine whether an email contains a virus/worm or not. We use training and testing datasets posted on various web sites. Similarly, for malicious code detection, we extract n-gram features from both assembly code and binary code. We first train the data mining tool using the SVM technique and then test the model. The classifier determines whether the code is malicious or not. For buffer overflow detection, we assume that malicious messages contain code while normal messages contain data. We train an SVM and then test to see whether a message contains code or data.
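To make the n-gram feature extraction step concrete, the following Python sketch counts byte n-grams in a binary sample and converts them into a binary feature vector over a fixed vocabulary. It is a simplified illustration and not the implementation described in [MASU11]: the sample bytes, the vocabulary, and the choice of n = 2 are hypothetical, and the classifier training step (e.g., with an SVM) is omitted.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous n-byte sequence (n-gram) in a binary sample."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def feature_vector(data: bytes, vocabulary, n: int = 2):
    """Binary features: 1 if the n-gram occurs in the sample, else 0."""
    present = byte_ngrams(data, n)
    return [1 if gram in present else 0 for gram in vocabulary]

# Hypothetical five-byte sample and a hypothetical two-entry vocabulary.
sample = bytes.fromhex("90909090eb")
vocabulary = [b"\x90\x90", b"\xeb\xfe"]

grams = byte_ngrams(sample, 2)
vector = feature_vector(sample, vocabulary, 2)
print(grams)
print(vector)
```

In practice the vocabulary would be selected from the n-grams observed in the training set, and vectors of this form, one per executable, together with malicious/benign labels, would be the input to the classifier.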

Figure 4.5 Data mining tools at The University of Texas at Dallas.

We have also reimplemented some of our data mining tools to operate in a cloud. Essentially, we have applied big data analytics techniques to malware detection and shown the significant improvement that can be obtained by using big data analytics rather than conventional data mining. This is the approach we have taken for the insider threat detection problems discussed in this book. That is, we discuss stream analytics techniques that we have developed and show how they can be implemented in the cloud for detecting insider threats. We believe that, because malware data are very large in volume as well as dynamic and heterogeneous in nature, we need big data mining tools to analyze such data and detect security violations.

4.4 Summary and Directions

In this chapter, we provided an overview of data mining for cyber security applications. In particular, we discussed the threats to computers and networks and described the applications of data mining to detect such threats and attacks. Some of the data mining tools for security applications developed at The University of Texas at Dallas were also discussed.

Data mining for national security as well as for cyber security is a very active research area. Various data mining techniques, including link analysis and association rule mining, are being explored to detect abnormal patterns. Because of data mining, users can now make all kinds of correlations. In addition, in the past 5 years massive amounts of data have been collected. We need big data analytics techniques to detect potential security violations. This also raises privacy concerns. More details on privacy can be obtained from [THUR02]. Much of this book is devoted to big data management and analytics techniques for cyber security applications.

References

[AWAD09]. M. Awad, L. Khan, B. Thuraisingham, L. Wang, Design and Implementation of Data Mining Tools, CRC Press, Boca Raton, FL, 2009.

[BOLZ05]. F. Bolz, K. Dudonis, D. Schulz, The Counterterrorism Handbook: Tactics, Procedures, and Techniques, Third Edition, Practical Aspects of Criminal & Forensic Investigations, CRC Press, Boca Raton, FL, 2005.

[CHAN99]. P. Chan, W. Fan, A. Prodromidis, S. Stolfo, “Distributed Data Mining in Credit Card Fraud Detection,” IEEE Intelligent Systems, 14, #6, 67–74, 1999.

[MASU11]. M. Masud, L. Khan, B. Thuraisingham, Data Mining Tools for Malware Detection, CRC Press, Boca Raton, FL, 2011.

[THUR02]. B. Thuraisingham, Data Mining, National Security, Privacy and Civil Liberties, SIGKDD Explorations, Vol. 4, #2, New York, NY, December 2002.

[THUR03]. B. Thuraisingham, Web Data Mining Technologies and Their Applications in Business Intelligence and Counter-Terrorism, CRC Press, Boca Raton, FL, 2003.

[THUR04]. B. Thuraisingham, Data mining for security applications. Managing Threats to Web Databases and Cyber Systems, Issues, Solutions and Challenges, V. Kumar, J. Srivastava, Al. Lazarevic, editors, Kluwer, MA, 2004.

[THUR05]. B. Thuraisingham, Database and Applications Security, CRC Press, Boca Raton, FL, 2005.
