Chapter 5

Access Analytics

Abstract

Since the technologies that let us access our IT systems remotely can be manipulated by malicious actors, it is important to have a security program that can quickly identify the misuse of systems. We do this by using access analytics. In this chapter, we use Python, a programming language well suited to building detection strategies, and apply it to a scenario involving virtual private network logs.

Keywords

Access analytics; Knowledge engineering; MaxMind GeoIP; Programming detection strategies; Python; Remote access technologies; Scripting; Security analytics; Virtual private network; VPN
Information in This Chapter
▪ Scenarios and Challenges in User Access
▪ Use of Analytics in Identifying Anomalies or Misuse in Access
▪ Case Study: Step-by-step guide on how to “code” access anomaly detection in user access
▪ Other Applicable Security Areas and Scenarios

Introduction

There are so many ways that malicious users can access IT systems right now. In fact, the very technologies affording us the convenience to remotely access our IT systems are the ones that are being manipulated by malicious users. In today’s IT environment, physical access is no longer a hindrance to gaining access to internal resources and data.
Remote access technologies such as virtual private networks (VPNs) are commonly used in business environments. While these technologies increase productivity, they also introduce another level of risk into an organization. Many recent incidents have stemmed from remote access intrusions. In fact, several studies indicate that the majority of data breaches were linked to third-party components of IT system administration.
It is important to have a security program where we can quickly identify misuse of system access. In so doing, we are able to limit any damage that could be done through an unauthorized access. But how can you, as a security professional, track anomalous behavior and detect attacks? We need to have efficient ways of monitoring remote access data.
Unfortunately, many current products for third-party remote access do not offer granular security settings and comprehensive audit trails. Even when they do, they often lack the advanced misuse or anomaly detection capabilities that would help security professionals identify potential unauthorized access scenarios.
In this chapter, we would like to provide some techniques and tools that could help you in these types of scenarios. Some of the things we will explore include knowledge engineering, by means of programming detection strategies. If you do not know how to program, do not worry. We will provide simple techniques and step-by-step walk-through instruction to get you going.

Technology Primer

First off, we will provide a brief background of the technologies involved in our scenario. As you can tell by the introduction, we will be focusing on detecting unauthorized access in remote access technologies.
You may already be familiar with some of the technologies that we will be using in our scenario: they include remote access, VPN, and Python. Our main data set in our scenario is VPN logs. We will use Python to create a program that will process the VPN logs. Our goal is to use a variety of techniques to identify anomalies in our data set.
Let us begin with our data and the technology behind it.

Remote Access and VPN

What is VPN?

Basically, VPN is a generic term to describe a combination of technologies allowing one to create a secure tunnel through an unsecured or untrusted network, such as public networks like the Internet. This technology is used in lieu of a dedicated connection, commonly referred to as a dedicated line, from which the technology derives its “virtual” name. By using this technology, traffic appears to be running through a “private” network.

How does VPN work?

Data in VPN are transmitted via tunneling. Packets are encapsulated, or wrapped, in another packet with a new header that provides routing information. The route that these packets travel through is what is considered the tunnel. There are different tunneling protocols, but since they are outside the scope of this book, we will not cover them. Another thing to note about VPNs is that the data are encrypted. Basically, data going through the tunnel, which passes through a public network, are unreadable without the proper decryption keys. This ensures that data confidentiality and integrity are maintained.

What are the Dangers of VPN?

Using VPN is generally considered good practice for remote access. Encryption makes packets going through a public network such as the Internet unreadable without the proper decryption keys, and it ensures that data are not disclosed or changed during transmission. However, by default, VPN generally does not provide or enforce strong user authentication. Current VPN technologies support add-on two-factor authentication mechanisms, such as tokens. Without them, a username and password are all that is needed to gain access to the internal network. This can present a significant risk because there could be scenarios whereby an attacker gains access to these credentials and subsequently to your internal resources. Here are a few examples:
▪ A user can misplace their username and password.
▪ A user can purposely share their username and password.
▪ A user can fall victim to a spear phishing attack.
▪ A user might be using a compromised machine with malware harvesting credentials.
In any of the above scenarios, once an attacker obtains the user’s credentials, assuming there is no two-factor authentication, the attacker would be able to gain access to all internal resources to which the user currently has access via the user’s remote profiles and access rights. Thus, determining the access rights is a major factor in determining the potential extent of the compromise.

Monitoring VPN

As this chapter is about detecting potential unauthorized remote access, it is important to provide you with a brief background on logging VPN access. Most VPN solutions have, in one form or another, logging capabilities. Although much of the logging capability is dependent on the vendor, at the very least, your VPN logs should contain the following information:
▪ user ID of the individual,
▪ date and time of access,
▪ what resources were accessed, and
▪ the external IP from which the access was made.
There are many VPN solutions, so it would be impossible to outline all the necessary instructions to obtain your organization’s VPN log data, but your network administrator should be able to provide log data to you. For the purposes of this chapter, we will be providing you with a sample data set that contains the aforementioned data.
In general, log data are fairly easy to obtain. However, monitoring the logs to ensure that the people who are logging on are actually employees of your organization is another matter. Let us say your organization has 5000 employees and one-quarter of them are given VPN access. There are still over 1000 connections that you will have to review. Obviously, you will not be able to ask each and every employee if they made the connection, right? We certainly do not lack the data; however, we are limited by our analysis capabilities. This lack of analysis is what we will be focusing on in this chapter.

Python and Scripting

In most cases, we are stuck with whatever data that we have. If your VPN software provides robust detection and analytics capability helping you to identify potentially anomalous access cases, then your organization is off to a great start. Oftentimes, you just have a spreadsheet of VPN access, similar to what we will be providing to you in this chapter. Therefore, we will show you how to build this capability, with a little bit of programming, so that you may conduct your own analysis.
Typically, programming is not what 99% of security professionals do for a living. Unless you work directly in recreating vulnerabilities or exploits in software, it is a skill that most of us know about but rarely use. We believe that learning to program is a valuable and useful skill for security professionals. You do not need to know how to program complex software, but programming can help you to automate efforts that would otherwise take a lot of time. For example, let us say we wanted to review all of our VPN logs. This could be a significant task, so providing some degree of automation, particularly if the logic is repetitive, would really help you. In this regard, knowing how to program or use a “scripting language” would greatly benefit you in making the process more efficient.

What is a Scripting Language?

There is still some ambiguity on what can be considered a scripting language. In principle, any programming language can be used as a scripting language. A scripting language is designed as an extension language for specific environments. Typically, a scripting language is a programming language used for task automation, as opposed to tasks executed one-by-one by a human operator. For example, these could be tasks a system administrator can be doing in an operating system. For our purpose, you can think of a scripting language as a general-purpose language.
Scripting languages are often used to connect system components, and are sometimes called “glue languages.” One good example is Perl, which has been used a lot for this purpose. Scripting languages are also used as a “wrapper” program for various executables. Additionally, scripting languages are intended to be simple to pick up and easy to write. A good example of a scripting language that is fairly easy to pick up is Python. So, this is the language that we are going to use in our scenario.

Python

Python is relatively easy to learn while being a powerful programming language. The syntax allows programmers to create programs in fewer lines than would be possible in many other languages. It also features a fairly large, comprehensive standard library and many third-party tools. It has interpreters for multiple operating systems, so if you are using a Windows-, Mac-, or Linux-based machine, you should be able to access and use Python. Finally, Python is free, and since it is open source, it may be freely distributed.
Python is an interpreted language, meaning you do not have to compile it, unlike more traditional languages like C or C++. Python is geared for rapid development, saving you considerable time in program development. As such, it is perfect for simple automation tasks, such as those we have planned for in our scenario for this chapter. Aside from this, the interpreter can be used interactively, providing an interface for easy experimentation.

Resources

As this book is not a Python tutorial, we will point you to some really good resources that will help you to start using Python. The following is a list of recommended resources:

Codecademy

A great resource that we highly recommend to start with is the Python track of Codecademy:
Codecademy is an online interactive Web site for learning programming languages. One of the key resources is Codecademy’s online tool, which provides a sandbox in your browser, where you can actually test your code. The site also has a forum for coding enthusiasts and beginners, which is helpful when you encounter problems.
Python.org
Python.org is the official Web site for Python. Python is a very well-documented language—it is apparent in the amount of documentation available on the site. The full documentation for Python 3.4 (the stable version during the time of this writing) is available on the following link:
As you will see, the documentation is comprehensive. When you become more experienced with Python, this will be a great source of information. However, before you go too deep, you should go to this link for a basic tutorial to first get your feet wet:
Learning Python the Hard Way
Contrary to the title, this is actually a really good resource for learning Python. It is a beginner's programming course that includes videos and a downloadable book. Following is the main Web site:
But if you do not want to pay for the videos and the downloadable book, the content is also available on an online version here:
The course consists of about 52 exercises. Depending on your skill level and the amount of time you want to invest in learning the language, the author claims it can take as little as one week or as long as six months. Nonetheless, it is a very good resource and one that you should consider reviewing.

Things to Learn

At the very least, you should consider learning the following Python topics: Python syntax, strings, conditionals, control flow, functions, lists, and loops.
If this is your first time with a scripting language, do not worry. You do not have to be an expert in Python to be able to continue with this chapter. As we go through the scenario, we will be explaining what each piece of the sample code is doing. But before that, let us go into more detail on our scenario and the techniques we will actually use to solve the problem.

Scenario, Analysis, and Techniques

Let us discuss the overall scenario we will be using. We will break this down based on the questions we need to answer:
▪ What is the problem?
▪ What are the data that we will be using and how do we collect them?
▪ How will we analyze the data? What techniques are we using?
▪ How will we be able to practically apply the analysis technique to the data?
▪ How will we deliver the results?

Problem

In our scenario, we want to show how to identify potentially unauthorized remote accesses to an organization.

Data Collection

The data we will be using for our scenario are the VPN access logs. At a minimum, the data will contain the following information:
▪ User ID
▪ Date and time of access
▪ Internal resource accessed (internal IP)
▪ Source IP (external IP)
We will assume that these data were provided to us as a spreadsheet, as this is the most common way of exporting data. For now, you can leverage the data set provided as part of this book. Here is a sample extract from that data set (Figure 5.1):

Data Analysis

Before we go into identifying potentially anomalous VPN logins, let us think about a simpler scenario. If you were going through your credit card transaction statements and saw the below-listed events, what would you have concluded?
▪ Your credit card was used at the same time at two different locations;
▪ Your credit card was used in Russia (and you have never been there);
▪ Your credit card was used in two different physical locations in the same hour when it is physically impossible to get there in an hour; and
▪ Your credit card was used a hundred different times in the course of the week.
These are indicators that your credit card may have been compromised. While this is a simplistic example, we will be extending this type of analysis in our scenario by looking for anomalous behavior indicating a compromise.
So now, let us review our VPN access logs. Let us assume that you only had to review your access. How would you review the VPN access logs manually? What would you look for? It would be fairly straightforward, right? Let us use the same fact pattern we used for the credit card transactions.
▪ Your user ID logged in concurrently from two different IP addresses;
▪ Your user ID logged in from Russia (and you have never been there);
▪ Your user ID was used twice in an hour from your office and your home when it is physically impossible to get there in an hour; and
▪ Your user ID logged in from a hundred different IP addresses in the course of the week.
It makes sense, right? This is just plain logic and common sense, assuming we are only looking for the narrow fact patterns listed above. If you think about it, there could be other scenarios in which you could look for similar anomalous behavior. For example, listed below are sample questions that could lead us to finding anomalous user connections:
Figure 5.1 Sample data set: VPN logs.
▪ How much time does a user’s session usually take?
▪ What time does a given user usually log in?
▪ At what time does a given user’s connection usually originate?
▪ At what time does a given source IP address usually originate?
▪ At what time do all connections usually originate from?
▪ At what time do connections from a certain city (based on the IP address) usually originate?
▪ What is the relationship between log-in time and access time of an internal system?
▪ What time does a given user usually log off?
▪ What time does a source IP address usually log off?
▪ What time does a user’s country usually log off?
▪ What time does a user’s city usually log off?
▪ What time does an internally accessed system have in common with the log-off time from the VPN?
▪ From what source IP address does a given user originate?
▪ From what country does a given user originate?
▪ From what city does a given user originate?
▪ What internal system does a given username usually access?
▪ What is the IP address with which a country is usually associated?
▪ What is the IP address with which a city is usually associated?
▪ What users connected to an internal system?
▪ With what country is a given city associated?
▪ Which internal systems are accessed from which country?
▪ Which internal systems are accessed from certain cities?
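One of the questions above, "From what source IP address does a given user originate?", can be answered with a few lines of Python. The following is only a minimal sketch: the login records are invented for illustration, and the function name source_ips_by_user is hypothetical rather than part of the chapter's program.

```python
from collections import defaultdict

# Hypothetical sample records: (user ID, source IP) pairs invented for illustration.
logins = [
    ("user1", "108.178.181.38"),
    ("user1", "108.178.181.38"),
    ("user1", "203.0.113.7"),
    ("user2", "198.51.100.23"),
]


def source_ips_by_user(records):
    """Map each user ID to the set of source IPs it has logged in from."""
    ips = defaultdict(set)
    for user, source_ip in records:
        ips[user].add(source_ip)
    return dict(ips)


profile = source_ips_by_user(logins)
for user in sorted(profile):
    # A user with an unusually large number of distinct IPs deserves a closer look.
    print(user, len(profile[user]), sorted(profile[user]))
```

A profile like this gives a baseline per user; a login from an address outside the usual set is not proof of compromise, but it is a reasonable candidate for review.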
As you can see, we have raised multiple questions that could indicate a potentially suspicious connection. But for now, let us focus on one potentially critical factor: distance of connection. Obviously, even if a user were working remotely, it would be suspicious if the user logged in from multiple locations when it is physically impossible to be in all of them. Of course, there could be exceptions. For example, a user could log in from one machine at a particular location, log off that machine, and then log in from a different machine at a different location; however, this is suspicious in itself.
So, first we need to ask ourselves what would be a good way to determine if the distance between locations is significant. For this, we can use haversine distances.

Haversine Distances

Haversine distance is the great-circle distance between a pair of latitude–longitude coordinates, computed with the haversine formula. Basically, it is a calculation of geographic distance that measures along the surface of a sphere rather than along a straight line (it treats the Earth as a perfect sphere, which is a close approximation). This formula is important in navigation, but it can be applied in other areas. For example, it can be used to determine accessibility of health-care facilities within a certain geographical area. The haversine distance technique can also be used in crime analysis applications, such as finding incidents taking place within a particular distance.
We will not go through the math involved in calculating a haversine distance, but we will cover how we can apply this to our problem. Simply put, the greater the haversine distance, the greater the distance between the sources of the remote logins. And, the greater the distance between the remote logins of one particular user in a given time span, the greater are the chances that this was a potentially anomalous user access.
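For readers who would like to see the calculation anyway, here is a minimal sketch of the standard haversine formula in Python. The function name haversine_km and the mean Earth radius of 6371 km are assumptions made for this illustration; the chapter's own program may organize the computation differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the Earth is treated as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    # haversine of the central angle between the two points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # clamp to 1.0 to guard against floating-point rounding before asin
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate coordinates for Honolulu and Moscow, used only as an example.
print(round(haversine_km(21.31, -157.86, 55.76, 37.62)), "km")
```

The function takes latitude and longitude in degrees, which is the form in which geolocation databases usually report them.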

Data Processing

So now, we have the data (the VPN logs) and we have our analysis technique (haversine distances). But how do we put these together? This is when our scripting or “glue” language comes into play. In order to process the data, we will have to create a script that will do the following things:
▪ Import the data: First, we will need to be able to import the VPN logs so that our program can process them. For example, if the data are in the form of a spreadsheet, then we will need to be able to load the data from the spreadsheet into memory so that we can preprocess the data and then apply our analysis technique.
▪ Preprocess the data: “Preprocessing” is making the data better structured, so it can be used by our analysis technique. For example, our VPN logs would only have source IPs. In order to actually get the haversine distance, we will need to be able to get the latitude and longitude values. Aside from that, we will need to do some error checking and validation to make sure the data we are entering for analysis are valid. As they say, “garbage in, garbage out.”
▪ Apply the analysis technique: Once we have all the necessary data, we will then use our analysis technique, which in this case is the haversine distance.
▪ Generate the results: Finally, once we get the haversine distance, we will need to determine a threshold for what is unusual for a certain amount of time. Obviously, the greater the haversine distance and the shorter the time between two logins, the more suspicious the pair of logins is.
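The last step above can be sketched as follows. This is one possible way to turn a distance and a time gap into a decision, not necessarily the approach taken later in the chapter: it computes the travel speed that two consecutive logins would imply and compares it with a cutoff. The names implied_speed_kmh and is_suspicious, and the 900 km/h threshold (roughly the cruising speed of an airliner), are assumptions chosen for illustration.

```python
from datetime import datetime

MAX_PLAUSIBLE_KMH = 900.0  # assumed threshold, roughly airliner cruising speed


def implied_speed_kmh(distance_km, earlier, later):
    """Speed a user would need to cover distance_km between two login times."""
    hours = (later - earlier).total_seconds() / 3600.0
    if hours <= 0:
        # simultaneous logins: any nonzero distance is physically impossible
        return float("inf") if distance_km > 0 else 0.0
    return distance_km / hours


def is_suspicious(distance_km, earlier, later, max_kmh=MAX_PLAUSIBLE_KMH):
    """Flag a pair of logins whose implied travel speed exceeds the threshold."""
    return implied_speed_kmh(distance_km, earlier, later) > max_kmh


t1 = datetime(2013, 4, 3, 14, 5, 20)
t2 = datetime(2013, 4, 3, 15, 5, 20)  # one hour later
print(is_suspicious(5000.0, t1, t2))  # 5000 km in one hour
print(is_suspicious(20.0, t1, t2))    # 20 km in one hour
```

The threshold is a tuning decision. A lower value catches more cases but also flags legitimate travel and users who switch between networks that geolocate far apart.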
We have covered the basic steps we will be following in developing our Python program. In the next section, we start diving into the innards of our Python program. If you have some programming knowledge and can follow a program’s flow (i.e., loops and conditions), you should be able to follow the case study even without any Python knowledge. If you do not have the programming knowledge, feel free to go through the primer resources provided in the previous section.

Case Study

Importing What You Need

import argparse
import re
import csv
import math
from datetime import datetime
Now let us go over the code. First off, you will see several import statements. In most programming languages, a programmer is not expected to do everything from scratch. For example, if someone has already built scripts to handle processing of date and time, typically one does not have to write them from scratch. Oftentimes, there are “modules” a programmer can “import,” so they can reuse the scripts and incorporate them into other programs or scripts. This is basically what is happening with the programming code outlined above.
Python code gains access to the functionality provided by a module through the process of importing it. The import statement, as seen here, is the most common way of invoking the import functionality.
Let us go through each of the modules that we are importing:
▪ The argparse module is used to create command-line interfaces for your script like:
python yourprogramname.py arguments
With this module, the program defines what arguments it requires, and argparse works out how to parse them from the command line. The module also automatically generates help and usage messages, and issues errors when a user gives the program invalid arguments. We will use this module to accept arguments from our command line, such as the name of the VPN log file that we are going to process.
▪ The re module provides regular expression support to Python programs. A regular expression specifies a set of strings that matches it. Basically the functions in this module allow you to check if there are particular string matches that correspond to the given regular expression. If you have limited exposure to regular expressions, there is a good amount of reference material available from the Web. Since our VPN logs are mostly unstructured text, we will be using this module to parse the events in our VPN logs to produce a more structured data set.
▪ The csv module provides support and various functionalities for reading, writing, and manipulating CSV or “comma separated values.” The CSV format is probably the most common import and export format for spreadsheets and databases. It should be noted though that there is no standard CSV format, so it can vary from application to application. There are CSV files where delimiters are not even commas—they can be spaces, tabs, semicolons (;), carets (^), or pipes (|). The overall format is similar enough for this module to read and write tabular data. We will be using this module in our scenario to process VPN logs formatted using CSV and we will produce the results in the same format, as well.
▪ The math module provides access to the mathematical functions defined by the C standard. We need the math module for the computations we will be doing in the script, particularly when we use the haversine distance formula.
▪ The datetime module supplies classes for manipulating dates and times, in both simple and complex ways. While date and time arithmetic is supported, the focus of this implementation is on efficient attribute extraction for output formatting and manipulation. For related functionalities, see the time and calendar modules.
# requires maxmind geoip database and library

# http://dev.maxmind.com/geoip/legacy/install/city/
import GeoIP
We will also be using a third-party module called GeoIP for our program. This is MaxMind's GeoIP module, which will enable our program to identify geographic information from an IP address. Most importantly, we are concerned with the latitude and longitude for our haversine distance computation, but it also allows us to identify the location, organization, and connection speed. MaxMind's GeoIP database is one of the more popular geolocation databases. More information can be seen in this link:
For our scenario, we will be using the GeoLite 2 database, which is a free geolocation database also from MaxMind. It is comparable to, but less accurate than, the company's premier product, the GeoIP2 database.
To get started with MaxMind GeoIP, go through this link and install it into your system:
The link above provides a brief outline of the steps needed to install GeoIP City on Linux or Unix systems. The installation on Windows is similar: You will just need to use WinZip or a similar ZIP program. The outline provides the following steps:
▪ Download database
▪ Install database
▪ Query database

Program Flow

def main():
    """ Main program function """
    args = parse_args()
    # read report
    header, rows = read_csv(args.report)
    # normalize event data
    events = normalize(header, rows)
    # perform analytics
    events = analyze(events)
    # write output
    write_csv(args.out, events)

if __name__ == '__main__':
    main()
The main function provides the flow of the actual program. The diagram below illustrates how the program will work (Figure 5.2).
The flow is fairly straightforward, since it is a very simple program. Here is an additional description of the overall program flow.
▪ The program will read and parse the command-line argument. This is how the program knows which VPN log it will need to process.
▪ Once the name of the file has been passed through the argument, the program will then read the file.
▪ While reading the file, the program will start normalizing the contents of the VPN logs. This means that the data are converted to the format that will be more conducive for processing.
▪ Once the data are normalized, the program will then run the analysis which in this case consists of GeoIP processing, which includes identifying the latitude and longitude, as well as the computation for the haversine distance.
▪ Finally, we will generate the report that will show the accounts that have the highest haversine distance.
Figure 5.2 The remote access Python analytics program flow.
In the subsequent sections, we will go through a more detailed review of each process and code snippets one-by-one.

Parse the Arguments

Let us go through the code that reads and parses the command-line argument. We parse the arguments using the “call” from the main.
args = parse_args()
The function we are calling is called parse_args():
def parse_args():
    # parse commandline options
    parser = argparse.ArgumentParser()
    parser.add_argument('report',
                        type=argparse.FileType('rb'),
                        help='csv report to parse')
    parser.add_argument('-o', '--out', default='out.csv',
                        type=argparse.FileType('w'),
                        help='csv report output file')
    return parser.parse_args()
Basically, this code snippet allows the program to be able to take a command-line argument. In our case, there are two arguments that we would like to be able to pass:
▪ The name of the VPN log file that we would like to process
▪ The name of the output file where the results will be written
The important part here, in the code, is the parser.add_argument method. You will notice that we have two statements corresponding to the two arguments we need to take.
Overall, this would allow us to issue a command in the following manner:
python analyze.py vpn.csv -o out.csv
You will also see that the “-o” option is not required, because it defaults to “out.csv,” as you can see in the second add_argument statement in the program.
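To see the default in action without touching any files, the parser can be exercised with an explicit argument list. The sketch below is a simplified variant of parse_args(): it keeps the same argument names but accepts plain strings instead of argparse.FileType, an assumption made here so that nothing is opened on disk. The helper name build_parser is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with the same two arguments as the chapter's program."""
    parser = argparse.ArgumentParser(description="Analyze a VPN log report")
    # Plain strings are used here instead of argparse.FileType (a simplification).
    parser.add_argument("report", help="csv report to parse")
    parser.add_argument("-o", "--out", default="out.csv",
                        help="csv report output file")
    return parser


parser = build_parser()

# Equivalent to: python analyze.py vpn.csv
args = parser.parse_args(["vpn.csv"])
print(args.report, args.out)

# Equivalent to: python analyze.py vpn.csv -o results.csv
explicit = parser.parse_args(["vpn.csv", "-o", "results.csv"])
print(explicit.out)
```

Passing a list to parse_args() is a convenient way to try out a command-line interface from the interactive interpreter before running the full script.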

Read the VPN Logs

Let us go through the code that reads the file that is containing the VPN logs. This is done through the following statements in main:
# read report
 header, rows = read_csv(args.report)
The function that is called is read_csv():
def read_csv(file):
    """ Reads a CSV file and returns the header and rows """
    with file:
        reader = csv.reader(file)
        # next(reader) works in both Python 2 and Python 3
        header = next(reader)
        rows = list(reader)
    return header, rows
This snippet of code allows the program to read the CSV file. Here are the various processes the code implements:
▪ A CSV object called “reader” is created. This uses the CSV module that was imported previously. The CSV module provides methods to manipulate tabular data.
▪ The reader object iterates over the lines in the given CSV file. Each row read from the CSV file is returned as a list of strings.
▪ Since the first row of our data file contains a header (the titles of the columns), the program reads the first line and gets the header information. This is stored in the “header” variable.
▪ The contents of the file, that is, the log entries themselves, are then loaded into the “rows” variable.
At the end of all this, we loaded the entire content of the VPN log file into memory and returned it to the program for further processing.
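The same header-and-rows pattern can be tried on an in-memory string, which is handy for experimenting without a log file. This is a minimal sketch: the two sample rows are invented, the column names are the ones the chapter's normalize() function looks up, and read_csv_text is a hypothetical helper.

```python
import csv
import io

# Hypothetical two-row report; the message text is a placeholder.
SAMPLE = (
    "ReceiveTime,RawMessage\n"
    '"Apr 3, 2013 2:05:20 PM HST",first raw message\n'
    '"Apr 3, 2013 3:10:00 PM HST",second raw message\n'
)


def read_csv_text(text):
    """Return (header, rows) from CSV text, mirroring read_csv()."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)   # first line holds the column titles
    rows = list(reader)     # remaining lines are the log entries
    return header, rows


header, rows = read_csv_text(SAMPLE)
# Look a column up by name rather than by a hard-coded position.
first_time = rows[0][header.index("ReceiveTime")]
print(header)
print(first_time)
```

Note that the timestamp contains a comma, so it must be quoted in the file; the csv module handles the quoting and returns the field as a single string.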

Normalize the Event Data from the VPN logs

After we have loaded all the data into memory, the next step is to normalize the event data. This is done by calling the following code from main:
# normalize event data
events = normalize(header, rows)
The function to normalize data is called normalize():
def normalize(header, rows):
    """ Normalizes the data """
    events = []
    for row in rows:
        timestamp = row[header.index('ReceiveTime')]
        raw_event = row[header.index('RawMessage')]
        event = Event(raw_event)
        event.timestamp = datetime.strptime(timestamp, TIME_FMT)
        events.append(event)
    return sorted(events, key=lambda x: (x.user, x.timestamp))
The code snippet above normalizes the data from the VPN logs. We normalize the data because VPN logs, as most logs, are typically unstructured text similar to the one listed below.
<164>%ASA-4-722051: Group < VPN_GROUP_POLICY>
User < user1> IP <108.178.181.38> Address <10.10.10.10> assigned to session
Typically, if you would want to analyze data, you would want to process it so that it can be in a usable format. We use the normalize() method to do just that. In our case, we would like to structure our data so that we are able to separate the data into the following elements:
▪ the User ID,
▪ the external IP address,
▪ the internal IP address, and
▪ date and time.
Let us go through the code and see what it does:
▪ The program loads the “ReceiveTime” and “RawMessage” columns. We obtained these columns through the reader object via the CSV module.
▪ Then, the program converts the timestamp to a more usable format. Some formats do not work well for manipulating data. In this case, the format in our VPN logs, such as “Apr 3, 2013 2:05:20 PM HST,” is a string that is not conducive to data manipulation (e.g., sorting operations). We use the datetime.strptime() class method to convert the string to an actual date/time object, allowing us to perform date/time manipulation on the data.
▪ The program passes the “RawMessage” value to an Event object. First let us look at the Event class. The Event class looks like the below code:
class Event(object):
    """ Basic event class for handling log events """
    _rules = []
    _rules.append(Rule('ASA-4-722051', 'connect', CONNECT))
    _rules.append(Rule('ASA-5-722037', 'disconnect', DISCONNECT))

    def __init__(self, raw_event):
        # apply every rule whose key appears in the raw message
        for rule in self._rules:
            if rule.key in raw_event:
                self._match_rule(rule, raw_event)
                self.key = rule.title

    def _match_rule(self, rule, raw_event):
        match = rule.regex.match(raw_event)
        if match is None:
            # the message has the rule key but not the expected layout
            return
        # copy each named regex group onto the event as an attribute
        for key, value in match.groupdict().items():
            setattr(self, key, value)

    def __str__(self):
        return str(self.__dict__)

    def __repr__(self):
        return repr(self.__dict__)
The Event class then utilizes the Rule class, which looks like the following:
class Rule(object):
    """ Basic rule object class """
    def __init__(self, key, title, regex):
        self.key = key
        self.title = title
        self.regex = re.compile(regex)
▪ What do the Event and Rule classes do? Basically, these classes are used to parse the VPN logs into “structured” events. This is done via the Rule class, which uses regular expressions to break down the string. For example, “connect” events in the VPN logs are parsed using this pattern:
CONNECT = (r'.*> User <(?P<user>.*)> IP <(?P<external>.*)> '
           r'Address <(?P<internal>.*)> assigned to session')
▪ Using the regular expression stored in the CONNECT variable above, the program is able to extract the user, the external IP, and the internal IP information from the raw message of the VPN log.
▪ Finally, once we have parsed and normalized all the needed information, we sort the events based on users and timestamp. By doing this, we will be able to compare the following:
when and where the user is currently logged in, and
when and where the user previously logged in before the current login.
The reason for this will be more readily apparent as we go through the analysis of the data.
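The normalization steps described above can be tried out in isolation. The following is a minimal sketch, not the chapter's actual script: it stores each event as a plain dictionary instead of an Event object, the sample rows are made up, and the TIME_FMT value is an assumption about the log's timestamp layout (the zone abbreviation is matched literally because strptime's %Z directive only recognizes a few zone names).

```python
import re
from datetime import datetime

# Assumed layout for values such as "Apr 3, 2013 2:05:20 PM HST".
TIME_FMT = "%b %d, %Y %I:%M:%S %p HST"

CONNECT = re.compile(
    r".*> User <(?P<user>.*)> IP <(?P<external>.*)> "
    r"Address <(?P<internal>.*)> assigned to session"
)

# Template for a made-up "connect" message, written on a single line.
MESSAGE = ("<164>%ASA-4-722051: Group <VPN_GROUP_POLICY> User <{0}> "
           "IP <{1}> Address <{2}> assigned to session")

# Made-up (ReceiveTime, RawMessage) pairs, deliberately out of order.
SAMPLE_ROWS = [
    ("Apr 3, 2013 2:05:20 PM HST",
     MESSAGE.format("user2", "108.178.181.38", "10.10.10.10")),
    ("Apr 3, 2013 11:30:00 AM HST",
     MESSAGE.format("user1", "66.175.72.33", "10.10.10.11")),
    ("Apr 2, 2013 9:15:45 AM HST",
     MESSAGE.format("user1", "64.134.237.89", "10.10.10.12")),
]


def parse_row(timestamp, raw_message):
    """Return a dict of parsed fields, or None if the message does not match."""
    match = CONNECT.match(raw_message)
    if match is None:
        return None
    event = match.groupdict()
    event["timestamp"] = datetime.strptime(timestamp, TIME_FMT)
    return event


parsed = (parse_row(t, m) for t, m in SAMPLE_ROWS)
events = [e for e in parsed if e is not None]

# Sort by user, then by time, so each user's logins sit next to each other.
events.sort(key=lambda e: (e["user"], e["timestamp"]))

for e in events:
    print(e["user"], e["timestamp"], e["external"], e["internal"])
```

After sorting, both logins for user1 appear in chronological order ahead of the login for user2, which is the arrangement the later comparison between consecutive events relies on.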

Run the Analytics

def analyze(events):
    """ Main event analysis loops """
    gi = GeoIP.open(GEOIP_DB, GeoIP.GEOIP_STANDARD)
    for i, event in enumerate(events):
        # calculate the geoip information
        if event.external:
            record = gi.record_by_addr(event.external)
            events[i].geoip_cc = record['country_code']
            events[i].geoip_lat = record['latitude']
            events[i].geoip_long = record['longitude']
        # calculate the haversine distance
        if i > 0:
            if events[i].user == events[i-1].user:
                origin = (events[i-1].geoip_lat, events[i-1].geoip_long)
                destination = (events[i].geoip_lat, events[i].geoip_long)
                events[i].haversine = distance(origin, destination)
            else:
                events[i].haversine = 0.0
        else:
            events[i].haversine = 0.0
    return events
This is the “meat” of the script we are creating. This is where we compute the haversine distance we will be using to detect unusual VPN connections. First, we need the location of each connection, which we look up using the MaxMind GeoIP API:
gi = GeoIP.open(GEOIP_DB, GeoIP.GEOIP_STANDARD)
for i, event in enumerate(events):
    # calculate the geoip information
    if event.external:
        record = gi.record_by_addr(event.external)
        events[i].geoip_cc = record['country_code']
        events[i].geoip_lat = record['latitude']
        events[i].geoip_long = record['longitude']
Here you see that we create a GeoIP object. Then, we go through all the events and pass the external IP address (using event.external) to get the following GeoIP information:
▪ country code,
▪ latitude, and
▪ longitude.
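The lookup step can be sketched without a GeoIP database at hand. In the illustration below, the dictionary FAKE_GEOIP, the lookup() function, and the SimpleEvent class are all hypothetical stand-ins, not part of the MaxMind API, and the coordinates are rough made-up values attached to documentation-only IP addresses. The sketch also assumes that a lookup can come back empty (for example, for an internal address) and leaves the location fields as None in that case.

```python
# Stand-in for the GeoIP database: a dict keyed by IP address.
# Addresses are from documentation-only ranges; coordinates are illustrative.
FAKE_GEOIP = {
    "198.51.100.7": {"country_code": "US", "latitude": 21.3, "longitude": -157.9},
    "203.0.113.9": {"country_code": "US", "latitude": 34.1, "longitude": -118.2},
}


def lookup(address):
    """Hypothetical replacement for the GeoIP record lookup; may return None."""
    return FAKE_GEOIP.get(address)


class SimpleEvent(object):
    """Bare-bones event whose location fields default to None."""

    def __init__(self, user, external=None):
        self.user = user
        self.external = external
        self.geoip_cc = None
        self.geoip_lat = None
        self.geoip_long = None


def annotate(events):
    """Attach country code, latitude, and longitude where a record exists."""
    for event in events:
        record = lookup(event.external) if event.external else None
        if record is None:
            continue  # no location known; keep the None defaults
        event.geoip_cc = record["country_code"]
        event.geoip_lat = record["latitude"]
        event.geoip_long = record["longitude"]
    return events


sample = annotate([
    SimpleEvent("user1", "198.51.100.7"),
    SimpleEvent("user1", "10.0.0.1"),   # not in the stand-in database
    SimpleEvent("user2"),               # no external address at all
])
for event in sample:
    print(event.user, event.geoip_cc, event.geoip_lat, event.geoip_long)
```

Giving every event default location fields is a design choice of this sketch: later steps can then test for None instead of failing on a missing attribute.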
The latitude and longitude are the essential elements we need to compute the haversine distance here:
# calculate the haversine distance
if i > 0:
    if events[i].user == events[i-1].user:
        origin = (events[i-1].geoip_lat, events[i-1].geoip_long)
        destination = (events[i].geoip_lat, events[i].geoip_long)
        events[i].haversine = distance(origin, destination)
    else:
        events[i].haversine = 0.0
else:
    events[i].haversine = 0.0
In this section, we compare each connection with the previous connection by the same user. Here is pseudocode describing how the code operates:
▪ Is the previous event from the same user?
▪ If yes, then:
Where did the user’s current connection come from?
Where did the connection before this current one come from?
Compute the haversine distance.
▪ If no, then:
Zero out the haversine computation.
Pretty simple, is it not? So now, how is the haversine distance computed? The code uses the distance() function:
def distance(origin, destination):
    """ Haversine distance calculation
    https://gist.github.com/rochacbruno/2883505
    """
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371 # km
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = (math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1))
         * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c
    return d
The math behind this is a little hard to explain without a detour into trigonometry, so we will not cover the details in this book. The important thing to know about the code here is the technique we are using, and that we know how to use Google!
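For reference only, and without derivation, the distance() function corresponds line by line to the standard haversine formula. Here the symbols are our own notation rather than the book's: φ₁ and φ₂ are the two latitudes and Δφ and Δλ are the differences in latitude and longitude, all in radians, and R is the Earth radius of 6371 km used in the code.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\,c
\end{aligned}
```

The variables a, c, and d in the code hold exactly these three quantities, so the returned distance d is in kilometers.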
In this case, a simple search for “Haversine Python” would lead you to a ton of resources. We are crediting Wayne Dyck for a piece of code made available on GitHub for the haversine calculation. And that is the code we will be using! It is now time to run it and analyze the results.
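To get a feel for the numbers the function produces, it can be run on a pair of known locations. The sketch below restates the same calculation in a self-contained form; the two coordinate pairs are approximate city-center values chosen only for illustration and do not come from the VPN data.

```python
import math


def distance(origin, destination):
    """Haversine distance in kilometers between two (lat, lon) pairs."""
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371  # Earth radius in km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return radius * c


# Approximate coordinates, for illustration only.
honolulu = (21.31, -157.86)
los_angeles = (34.05, -118.24)

print(round(distance(honolulu, honolulu)))     # same point, so 0
print(round(distance(honolulu, los_angeles)))  # a little over 4,100 km
```

Two logins from the same place give a distance of 0, while a login from Hawaii followed by one from California gives a distance of several thousand kilometers.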

Analyzing the Results

To run the code, all you really need to do is to type in the following command:
python analyze.py vpn.csv -o out.csv
When the program is run, it will do the following:
▪ Load the VPN log information from vpn.csv.
▪ Run the analytics we discussed in the previous section.
▪ Write the results to a file called out.csv.
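How the script reads its command-line arguments is not shown in this section. One plausible way to accept an input file and an -o option is Python's argparse module, sketched below; the argument names infile and outfile and the default value are assumptions made for illustration, not the script's actual names.

```python
import argparse


def build_parser():
    """Build a parser for: analyze.py INFILE -o OUTFILE (names are assumed)."""
    parser = argparse.ArgumentParser(description="Analyze VPN log events")
    parser.add_argument("infile", help="CSV export of the VPN log")
    parser.add_argument("-o", "--outfile", default="out.csv",
                        help="CSV file that receives the results")
    return parser


# Parse the same arguments used on the command line above.
args = build_parser().parse_args(["vpn.csv", "-o", "out.csv"])
print(args.infile, args.outfile)
```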
Let us open up the out.csv file in a spreadsheet and look at the results. The results should look similar to the following (Figure 5.3):
Figure 5.3 Sample output of the remote access script.
The important information here is the last column, which contains the haversine distance. This should be the focus of your review. We want to look for the larger haversine distances because they mean that the locations of consecutive logins are farther apart. Therefore, the greater the haversine distance, the more suspicious the connection. Let us go through some examples to make it clearer. First off, here are some quick guidelines for doing the review:
Figure 5.4 Reviewing the access behavior of User8.
▪ Disregard haversine distances that are 0.
▪ Look for haversine distances that are large (e.g., greater than 1000 km). The cutoff is generally up to your discretion, but most of it is common sense. For example, let us look at “user8” (Figure 5.4):
User8 has a fairly large haversine distance. If you do a GeoIP lookup, for example, using http://www.geoiptool.com, it shows that the connections are coming from the same state (Hawaii) but from different towns. You can also see that the logins are one day apart, so it is not as suspicious as it seems. But, based on your level of tolerance, you can develop a policy to call and verify whether a user’s login was valid for that day.
▪ Let us look for larger haversine distances in the list. You will see some that are fairly large, such as this one for “user90” (Figure 5.5).
There are several fairly large haversine distances here. If you use a GeoIP locator, you will be able to piece together the connection behavior of this user:
▪ 64.134.237.89 (Hawaii)
▪ 66.175.72.33 (California)
▪ 64.134.237.89 (Hawaii)
▪ 66.175.72.33 (California)
Note that this is in the span of one day. Actually, the first three logins were in the span of a couple of hours. This is obviously something worth investigating and, at the very least, having a security officer question user90 about these logins. Of course, this does not automatically mean that the connections are malicious. There could be valid reasons causing a user to connect through remote machines. In any case, this is something worth investigating.
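The manual review described above, looking for large distances and then checking how close together the logins were, can also be scripted. The following is a minimal sketch under stated assumptions: the column names, the timestamp format, and the sample rows are all hypothetical and may not match the real out.csv, and the 1000 km cutoff is simply the example threshold mentioned in the guidelines.

```python
import csv
import io
from datetime import datetime

THRESHOLD_KM = 1000.0

# Hypothetical extract of the results file; real column names may differ.
SAMPLE = """\
user,timestamp,external,haversine
userA,2013-04-03 08:00:00,198.51.100.7,0.0
userA,2013-04-03 10:30:00,203.0.113.9,4120.0
userA,2013-04-04 09:00:00,203.0.113.9,0.0
userB,2013-04-03 09:00:00,192.0.2.15,0.0
userB,2013-04-04 09:00:00,192.0.2.44,150.0
"""


def flag_rows(rows, threshold=THRESHOLD_KM):
    """Return (user, distance_km, hours_since_previous_login) for large jumps."""
    flagged = []
    previous = {}  # user -> timestamp of that user's previous row
    for row in rows:
        when = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        dist = float(row["haversine"])
        last = previous.get(row["user"])
        if dist > threshold and last is not None:
            hours = (when - last).total_seconds() / 3600.0
            flagged.append((row["user"], dist, hours))
        previous[row["user"]] = when
    return flagged


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = flag_rows(rows)
for user, dist, hours in flagged:
    print("%s: %.0f km within %.1f hours" % (user, dist, hours))
```

Reporting the elapsed time next to the distance reflects the reasoning in the examples: a large distance a day apart is less of a concern than the same distance within a couple of hours.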
Figure 5.5 Reviewing the access behavior of User90.
Figure 5.6 Reviewing the access behavior of User91.
Let us look at another one (Figure 5.6). This one has an even bigger haversine distance.
If we investigate this further, we see this connection behavior in the span of one day:
▪ 72.235.23.189 (Hawaii)
▪ 198.23.71.73 (Texas)
▪ 198.23.71.73 (Texas)
▪ 198.23.71.73 (Texas)
▪ 66.175.72.33 (Hawaii)
As we already discussed, even though these connections happened within a span of a few hours, this is not an absolute indication of a malicious connection. Plausible reasons for these types of connections include the following:
▪ The user is connecting through a remote machine.
▪ The user is using some sort of proxy or mobile service.
▪ Some users are sharing accounts.
▪ The account is compromised and a malicious user is connecting as the user.
In any of these scenarios, it is worthwhile to verify if these are valid connections. Ultimately, this type of review can be incorporated as a regular remote access review program, whereby the goal is to identify potentially malicious remote connections. Aside from checking for haversine distances, you could use the script as a foundation for creating other analysis methods to identify other misuse of remote access connections. You could consider expanding your script by including the following:
▪ concurrent connection of the same user,
▪ concurrent users,
▪ connection between two times,
▪ connections from certain countries,
▪ more than x connections per day,
▪ user connects at unusual times,
▪ user connects from unusual locations,
▪ the frequency of connections, and many more…
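As a starting point for such extensions, two of the checks listed above (too many connections per day and connections from certain countries) might be sketched as follows. The event tuples, the limit of three connections per day, and the placeholder country code “ZZ” are all made up for illustration; a real version would work on the normalized events produced by the script.

```python
from collections import Counter
from datetime import datetime

# Made-up normalized events: (user, timestamp, country code).
EVENTS = [
    ("user1", datetime(2013, 4, 3, 8, 0), "US"),
    ("user1", datetime(2013, 4, 3, 9, 30), "US"),
    ("user1", datetime(2013, 4, 3, 13, 15), "US"),
    ("user1", datetime(2013, 4, 3, 18, 45), "US"),
    ("user2", datetime(2013, 4, 3, 22, 0), "ZZ"),
]

MAX_PER_DAY = 3              # assumed tolerance, adjust to local policy
WATCHED_COUNTRIES = {"ZZ"}   # placeholder country code


def too_many_per_day(events, limit):
    """Return (user, date) pairs with more than `limit` connections."""
    counts = Counter((user, when.date()) for user, when, _ in events)
    return sorted(key for key, n in counts.items() if n > limit)


def from_watched_countries(events, watched):
    """Return (user, timestamp) pairs for connections from watched countries."""
    return [(user, when) for user, when, cc in events if cc in watched]


print(too_many_per_day(EVENTS, MAX_PER_DAY))
print(from_watched_countries(EVENTS, WATCHED_COUNTRIES))
```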
The principles discussed here can also be applied to other data sets. For example, this technique can be used to examine server or database access logs. The scripts can also be easily tweaked to review physical access logs, for example to identify physical access to facilities at unusual times or frequencies.