Chapter 6

Understanding and Forecasting Events

Appropriate analytical methodologies can help you understand how security events and issues develop over time and identify the factors that influence them. Some methodologies go further, letting you use the identified factors to forecast how an event will evolve or whether similar events will take place. Specifically, language and sentiment analysis (LSA), correlation and regression analysis (CRA), and volumetric analysis (VA) can help you use social media and other data to understand and forecast a variety of security events, including famines and illicit behavior. These analyses can help you discover relationships between content on social media and events or actions on the ground, track how those relationships evolve, and detect emerging security problems. This chapter begins by illustrating how simple monitoring of social media data can help you detect and understand events more quickly than traditional methods. It then defines forecasting, explains how it differs from predicting, and describes what it entails. The chapter ends by explaining how to conduct LSA, CRA, and VA through examples and walkthroughs.

Introduction to Analyzing Events

This chapter presents formal and informal methods you can use to analyze security events so you can understand them and forecast their development and emergence in the near future. When we say security events, we also mean the corresponding behaviors, issues, problems, and actions. Events include terrorist attacks, natural disasters, homicides, riots, drug smuggling, sex trafficking, and much more. The majority of this chapter deals with using various analytical tools and formal methodologies to analyze social media and other data to understand and forecast security events. However, before we delve into the tools, we first want to discuss the role of social media data in intelligence analysis and how simple, informal monitoring of social media data can often help you anticipate and understand events better than traditional means.

Social Media Data as Intelligence

Publications about intelligence analysis discuss the intelligence cycle, which describes how intelligence is processed in a civilian or military intelligence agency.1 You can formulate the cycle in many ways, and we are not all that interested in discussing the best way. We do, however, want to discuss the role of social media data in the intelligence cycle in general. This section helps you understand where social media data fits in the intelligence cycle and how you can integrate the rapid influx of social media data into your analysis.

Whichever type of intelligence cycle you prefer, it involves some variation of planning, collection, processing, evaluation, analysis, and dissemination. It also involves the different intelligence collection disciplines that we list in Table 6.1. Assessing how social media data fits in with and compares to the other collection disciplines can help you appreciate the role it plays in illuminating events and their causes.

Social media data is often superior to other collection disciplines because it cuts across and amalgamates them. Social media is a form of OSINT because it is derived from publicly available sources. OSINT is a neglected collection discipline, and many people ignore it in favor of data that is stamped "Secret" or "Top Secret." However, because of the explosion of data available on the Internet and social media, OSINT can prove very valuable. If used correctly and in the right cases, OSINT can even outperform intelligence that is not publicly available.2

Although social media is often categorized as a form of OSINT, it also has elements of the other collection disciplines and combines them in powerful ways. Intercepting and collecting social media data entails, for example, collecting communications from phones (COMINT), geospatial information from crowdmaps (GEOINT), large amounts of imagery (IMINT), and data about Internet traffic (ELINT), and it can enable collection from humans through crowdsourcing (HUMINT). Social media also has its fair share of RUMINT, or rumor intelligence. Additionally, you can often collect social media data in real time, which you often cannot do with other forms of OSINT or the other INTs. The example in the following section illustrates the utility of social media data as an intelligence collection discipline and the advantages it can confer over more traditional disciplines.

Table 6.1 Intelligence Collection Disciplines

Collection Discipline Description (Interception and/or collection of…)
SIGINT (SIGnals INTelligence) Electronic signals
COMINT (COMmunications INTelligence) A subset of SIGINT, involving electronic signals between people that are directly used in communication
ELINT (ELectronic INTelligence) A subset of SIGINT, involving electronic signals not directly used in communication
MASINT (Measurement And Signature INTelligence) Emissive byproducts such as radiation or heat
IMINT (Imagery INTelligence) Imagery via satellite and aerial photography
GEOINT (GEOspatial INTelligence) Imagery and geospatial information that depicts physical features and geographically references activities
HUMINT (HUMan INTelligence) Intelligence gathered via means of interpersonal contact as opposed to via technology
OSINT (Open Source INTelligence) Publicly available information

Monitoring Attacks through Social Media

During the writing of this book, a tragic security event illustrated how simply monitoring social media can help you anticipate and understand events in real time better than the other disciplines, including traditional OSINT. On September 11, 2012, the U.S. consulate in Benghazi came under attack while protests were taking place across the Middle East against a video produced in the U.S. that mocked Islam. Four Americans, including Ambassador Chris Stevens, died in the attack. Because of the attention drawn by the protests, and possibly because the events unfolded late in the day, many mainstream news outlets failed to fully address the scale of the attack on the consulate and the death of the ambassador until the following day, September 12. However, information about the attack, including pictures, started to appear on social media almost immediately following the attack.

The Americans were allegedly attacked at around 2200 local Benghazi time (GMT+2, or 1600 EST) on September 11. At 1945 EST, the Associated Press reported the death of one American in the Benghazi attack. Other news outlets communicated this fact but failed to grasp and communicate the full extent of the attack or how it came about. The next day at 0721 EST, President Obama announced the death of Ambassador Stevens in the attack.3 Five hours before the president's announcement, at 0200 EST, we had already found gruesome pictures of Ambassador Stevens' body on social media.4 The pictures were in fact hours old and had been posted soon after the attack. They showed him being pulled from the building and carried by Libyans trying to save him, and included close-ups of his injuries and face.

Meanwhile, reports quickly appeared stating that a group of Salafi Islamists had carried out the attacks. The reports did not come from news media outlets, but from known anti-American Twitter accounts.5 The involvement of the Salafis was initially denied, confirmed a week later, and is under investigation as of this writing.6 News media outlets and others did not begin reporting on the full scale of the events in Benghazi until 0900 EST on September 12, after President Obama's press conference. However, information depicting the full scale of the attack and the death of Ambassador Stevens had already appeared on Twitter two hours after the attack, at 1800 EST on September 11.

Subsequent investigations revealed that the DoD operation centers around the region, aside from the one that retrieved the embassy personnel's bodies, were not fully aware of the attack until the day after. They were consuming information about the protests through the mainstream news outlets and were not monitoring social media for information about the attacks. This means that even American military personnel in the region were working with information that was several hours older than what was available on social media, and so they had little sense of the reality on the ground. The centers also did not push out information about the attack until well after the actual event because of the formalities and rigor involved in traditional intelligence collection and dissemination. They had to collect information on the events, write reports on what was collected, submit the reports for release approval, and enter them into their internal databases before releasing them to the community. By then, thanks to social media, we already knew who was being attacked and even had a clue about who may have carried out the attack.

Our intention is not to establish exactly what happened and who knew what when. The events are confusing, and the politics surrounding the U.S. presidential campaign exacerbated the confusion. We only want to point out that simply monitoring and filtering social media data can help you figure out what is happening and anticipate hotspots and events. In this case, social media would not have helped anticipate a seemingly well-planned clandestine attack. However, it could have helped discover and validate essential facts earlier, mitigate confusion, and speed the response.

Social media will not supplant traditional collection disciplines or overtake OSINT, but it can augment them immensely. If you are sitting in an operations center that has the tools we discuss in this book deployed, you can use social media to help lift the fog. Social media data can serve as a first line of reporting for events that may otherwise go unnoticed and bring you closer to total information awareness. Traditional collection disciplines and news outlets can then confirm the events. Through social media, you can gain a much quicker and fuller understanding of what is going on, and task other collection resources such as aerial vehicles and HUMINT assets to confirm or add to the information appearing on social media.

Of course, as we repeatedly mention throughout this book, you should not blindly trust anything you see on social media; hence, the need to always confirm using other intelligence sources. Fortunately, social media often regulates and validates itself. An example involving attacks in Somalia illustrates the intelligence validating and refining loop inherent in the social media environment.

In mid-to-late 2012, the Kenyan government launched an incursion against Somalia's al-Shabaab forces, which it claimed were becoming a threat to the country. On October 28, the Kenyans carried out an amphibious assault on Kismayo, Somalia, an al-Shabaab stronghold in southern Somalia. During the assault, the Kenyans began tweeting from their official accounts, including @KDFInfo, that they had taken the city. Al-Shabaab immediately countered by tweeting through its official account, @HSMPress, that it still controlled the city and that the Kenyans were not even in the city yet. The Kenyans later "adjusted" their statements to reflect the situation more accurately and were more forthcoming about their actions in Somalia.7 During this episode, we saw two battles—one on the ground and one on social media. If you can watch and integrate information from both, you are much more likely to discern fact from fiction, and to do it at a much faster rate. You can then task your resources and organize responses more efficiently and effectively. You can also produce more robust and insightful intelligence analyses and forecasts. The following section focuses on how to use social media data to produce those analyses and forecasts.

Understanding Forecasting

Before you can start using tools to analyze social media data, and understand and forecast events, you need to have a firm grasp of what constitutes forecasting and the limitations it faces. Forecasting is the process of determining what will likely happen in the near future based on past experience and knowledge, and current trends of behavior. You cannot forecast an event if you do not understand it to some extent. To forecast, you typically complete the following steps:

1. Find relationships between one or numerous factors and an event.
2. Assess how accurately the factors indicate or cause the presence of an event by looking at past and current data.
3. Determine the presence of the factors in the future and changes in the relationships between them and future events.
4. Use the determinations to identify the likelihood of the event's presence in the future.

Simply put, you identify factors that have indicated an event in the past, and then determine the presence of those factors, and thus the event, in the future. For a forecast to be useful, it should ideally state what, when, and where an event will occur, and the likelihood that the forecast is correct.

Forecasting vs. Predicting

Forecasting is different from predicting, which also involves determining what will happen in the future based on a priori knowledge and behavioral trends. The difference between a forecast and a prediction is somewhat subjective. To us, the main difference is that a prediction has details and certainty, whereas in most cases a forecast lacks some details and is humble about its certainty. The determination that the sun will rise tomorrow at exactly 5:45 A.M. in Washington D.C. is a prediction. The determination that there is a 70 percent chance a famine will occur in Somalia in the next two weeks is a forecast.

Do not believe anyone who claims they can predict security events. They may get a few things right here and there, but their accuracy will not be impressive. Prediction concerning security events, issues, and behaviors is difficult for three reasons. One, black swans—or seemingly random and unlikely events that have immense influence—can happen at any time and significantly impact security. By definition, past information cannot tell you when and where black swans will likely occur; they are so rare, and so little information is available about them, that you cannot determine their trend of occurrence and predict them. The assassination of John F. Kennedy was a black swan—no one expected it and it changed American history. The uprising in Tunisia in 2011 was also a black swan. A few expected small protests and popular disgruntlement due to the poor economy in the area, but no one expected the populace to overthrow their autocratic government so quickly. No trend or event in recent Middle East history gave an indication that such a thing could or would occur. Ironically, people involved with security issues tend to be most interested in anticipating the black swans. Two, we do not know enough about human behavior to predict it. Security events are a function of human behavior, which is often complicated, confusing, unconstrained, and difficult to comprehend. Three, security events also involve hundreds of other factors and variables, which we often do not know much about either. Accurately predicting a terrorist attack against a shopping mall involves assessing information about the terrorists, their capabilities and weapons, their willingness to commit an attack, the security in the mall, the behavior of shoppers at the mall, the counter-terrorism activities of local law enforcement and other agencies, and much more. Too many factors exist to account for, and predicting with certainty requires accounting for most, if not all, of them.

Although predicting security issues with certainty is nearly impossible, forecasting them with some accuracy is possible and useful. In many cases, you have enough data and constraints to develop a robust hypothesis about if and/or when something will happen. You also have enough computing power and tools to create analytical models that can use the data to forecast events. The Internet and social media have made more data available than ever before. Also, the popularity of quantitative analytics and the widespread availability of analytical tools and methods have encouraged and enabled people in all sorts of fields to come up with forecasting models that are impressive in their accuracy. For example, Princeton's Sam Wang (http://election.princeton.edu) and the New York Times' Nate Silver (http://fivethirtyeight.blogs.nytimes.com) have built models that forecast American election results with impressive accuracy using somewhat different methods. Lastly, the Internet has provided a voice to people who think differently or process different information and come up with more accurate forecasts. What appears to be a black swan to most people is not a black swan to them. The economic crisis in 2008 appeared to be a black swan, but many people saw it coming and warned others about it.8 By listening to voices outside the mainstream on security issues and expanding your diet of information, you can improve the chances that you are using the right information, and processing it the right way, to come up with your forecasts.

Overall, forecasts are not always correct or precise, but they provide a good idea about what could happen and can spur preparations. In the security field, even a somewhat hazy forecast can save lives.

Forecasting Properly

Because forecasting is a function of available data, some events are more appropriate than others for forecasting with social media and related data. Generally, the best events to understand and forecast are the ones that more people use social media to talk about or organize. Currently, such events include protests and riots, natural and other major disasters, disease epidemics, and to a lesser extent, terrorist attacks and gang or drug violence. As social media use expands globally, the list of events will likely grow. As discussed in Chapter 2, social media use differs by country and region, and so people in different areas use social media to talk about and organize events differently. Thus, you will be better able to forecast certain events in certain areas.

Forecasting involves the following distinct process regardless of the type of analysis you use:

1. Determine the event you want to forecast. The more detailed the event, the less likely your forecast will be correct because you need to be much more precise. However, do not pick something that is too broad or vague, because it will not prove helpful. Conflict on a continent within the next few years is too broad; the time and location of the initial fight that sparks the conflict is too detailed; a clan conflict in a specific area within a range of days is just right.
2. Collect the appropriate data and refine the goal of your forecast based on the availability of the right data. You cannot forecast something you do not have data about. However, if you have a lot of data about a subject, you can either add details to your forecast and/or improve its accuracy, which is the likelihood that it comes true.
3. Derive a theory from the data and then use the theory to generate a forecast. Forecasting involves first coming up with a theory about when, where, and why certain events occur by discovering trends or factors in the data that indicate the presence of the events. In some cases, this process is not formal, and is based on experience and knowledge instead of hard data. Next, if possible, check the theory against other, unused past data to see if it holds. Lastly, apply a validated theory to emerging data and extrapolate from it to uncover what will likely happen in the future. Identify how the relationship between an event and its factors or indicators is likely to develop in the future, based on how it has developed in the past. The method of analysis depends heavily on the type of data available:
a. If the data consists of a lot of text or social media content, use LSA.
b. If it contains a lot of statistics and details about variables over a long period of time, use CRA.
c. If it contains data about the data, or changes in data or volume, use VA.
d. If the data is mixed, which it likely will be, use a combination of analyses.

Other considerations also may affect the type of analysis you employ. Keep in mind the following guidelines that differentiate good and useful forecasts from poor and useless ones:

  • Forecast only a short time in advance. Forecasts that tell you what will happen in the next few hours, days, or weeks are more likely to be correct than forecasts that tell you what will happen years from now. The correct temporal range depends on what you are trying to forecast. If your event of concern involves many factors and is heavily dependent on individual human behavior, forecast very little in advance. In such a case, the subject is too complex, and it will be difficult to account for all of the factors months or years in advance.
  • Use a variety of data types other than social media data. The more diverse the data, the more likely you are to find relationships and trends between events and other things, and indicators of future events.
  • Always provide the likelihood that your forecast is correct. Be humble, and appreciate that no one, including you, completely understands the world. Black swans can pop up anywhere at any time, and all sorts of strange things can happen that completely ruin your forecast. Either use a formal method or use your judgment to come up with a percentage that describes the probability that your forecast will come true. Do not bother issuing forecasts whose probabilities hover around 50 percent. If the probability an event will occur is 50 percent, the event is equally likely to occur and not occur. Such a forecast is useless. You might as well flip a coin and save yourself the hard work. A forecast that says an event is 75 percent or 25 percent likely to occur is far more useful. People can make decisions based on such forecasts; they cannot make decisions based on forecasts of 50 percent.

The next few sections detail how to conduct each of the three types of analyses to uncover relationships between disparate objects or behaviors, identify and track security issues, and forecast security issues. Some of the examples and analyses are more appropriate for illustrating how to understand events and others are more appropriate for forecasting.

Conducting Language and Sentiment Analysis

What a person says and how she says it can be very revealing at times. Understanding the overt and hidden meaning in linguistic content, whether it be written or spoken, can help you understand and anticipate numerous security events. Language and sentiment analysis is the process of analyzing linguistic content such as forum posts and text messages to reveal answers to a variety of questions, such as:

  • Is the same person writing these two separate blogs?
  • Are certain groups communicating using secretive hashtags on Twitter?
  • Are certain groups using Facebook to plan an event?
  • Are certain groups increasingly worried about an event occurring based on the sentiment of their text messages?
  • What is the emotional state of the author of a forum post?
  • Which topics do forum members tend to discuss together?

LSA, which draws on the broader field of natural language processing (NLP), is a burgeoning area that will only grow in strength as the Internet and social media make linguistic content more easily available for analysis. Numerous LSA tools exist, and they differ in their accuracy, method of analysis, customizability, cost, usefulness, and output.

Some LSA tools tend to be more deductive. They analyze content based on assumptions about how people use language and output firm answers to your questions. Creators of the tools program them based on assumptions derived from a combination of past knowledge, experience, and scientific studies. Some such LSA tools are accurate and useful, but others fail to correctly analyze language and, thus, offer little benefit. LSA tools that detect sentiment are a popular example of tools that often fail. Sentiment detection tools analyze a piece of content, such as a person's status update on Facebook, and then try to determine the person's emotional state or intended emotional effect on the reader. In other words, the tools determine the emotional polarity of the content and the emotion the author of the content wishes to convey through it. Even more simply put, the tools tell you whether the author made a statement that most people would consider emotionally positive, negative, or neutral. For example, the statement “I dislike my neighbors and find them annoying” conveys negative emotion and sentiment. The statement “I really like the weather in October” conveys positive emotion and sentiment. The statement “I own a black guitar” conveys a lack of emotion and neutral sentiment. By using such a tool, you can discern the emotional state of lots of people at the same time on a social media platform, which can help you determine whether a positive or negative event is taking place.

Most people are naturally very good at detecting sentiment in the statements others make. They can instinctively discern whether the author of a statement is conveying positive, negative, or neutral sentiment by looking at the words in the statement, the author's tone, the context, the environment, and various other variables. Computers and LSA tools, however, are not as smart as people when it comes to figuring out sentiment. They tend to use theories and shortcuts that often fail because they cannot understand the nuances in language. Most sentiment detection tools focus only on the types of words in a statement to discern sentiment. The tools rely on the theory that some words are inherently emotionally charged and correlated with a statement's sentiment. For instance, people typically use the words "dislike," "hate," "angry," and "annoying" when they want to convey negative emotion and sentiment. The tools assume that if a person uses such words, then in many cases she is talking about something she sees negatively. However, that is not always the case. People often use such words sarcastically, atypically, or to describe what others have said. For example, most people would consider the statement "I hate it when I win!" to be sarcastic and to convey a positive emotion. Sentiment detection tools, however, would see that it contains the negatively charged word "hate" and classify it as conveying negative emotion. They fail to pick up on these nuances and so incorrectly classify statements as having negative sentiment when they really have positive or neutral sentiment.
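To make the word-counting approach concrete, the short sketch below implements a bare-bones lexicon-based scorer of the kind just described. The word lists and example statements are our own illustrations rather than the internals of any real tool, and the sarcastic example is misclassified for exactly the reason explained above.

# Bare-bones lexicon-based sentiment scorer, illustrating the word-counting
# approach described above. The word lists are illustrative only and are not
# taken from any real sentiment tool.

POSITIVE_WORDS = {"like", "love", "enjoy", "great", "happy"}
NEGATIVE_WORDS = {"dislike", "hate", "angry", "annoying", "terrible"}

def score_sentiment(statement):
    """Return 'positive', 'negative', or 'neutral' based only on word counts."""
    words = statement.lower().replace("!", "").replace(".", "").split()
    positive = sum(1 for word in words if word in POSITIVE_WORDS)
    negative = sum(1 for word in words if word in NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"

statements = [
    "I dislike my neighbors and find them annoying",  # scored negative, correctly
    "I really like the weather in October",           # scored positive, correctly
    "I own a black guitar",                           # scored neutral, correctly
    "I hate it when I win!",                          # scored negative because of
                                                      # "hate"; humans read the
                                                      # sarcasm and call it positive
]

for statement in statements:
    print(f"{score_sentiment(statement):>8}  <-  {statement}")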


Warning
Always be wary of social media analysis systems or tools that claim to detect emotional sentiment. As of this writing, most sentiment detection tools are terrible at picking up on nuances, and so regularly fail at accurately detecting sentiment.

Despite their flaws, some deductive LSA tools are built on correct and tested assumptions and so are useful and accurate. We explore one such tool later in this section.

Other LSA tools tend to be more inductive. They find patterns of correlation in large amounts of content and let you make sense of the patterns to answer your questions in your own way. Such tools usually output only various statistics about the content they analyze, never direct answers to your questions. They will tell you how many times a certain word appeared in the content, which other words it tended to appear near, and how such statistics changed over time. You then use the outputted statistics to detect changes in behavior that help you formulate a theory and answer your question. We explore this method of LSA later in this section.
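The inductive style is easy to sketch because it amounts to counting things and leaving the interpretation to you. The snippet below, which uses only Python's standard library and a few invented posts, computes word frequencies and within-post co-occurrence counts; a real tool would add stemming, stopword removal, time slicing, and far larger corpora.

# Sketch of the "inductive" style of LSA: output word frequencies and
# co-occurrence counts, and leave the interpretation to the analyst.
# The posts are invented for illustration.
from collections import Counter
from itertools import combinations

posts = [
    "protest planned near the market tomorrow",
    "police blocking roads near the market",
    "market closed, protest moving toward city hall",
]

word_counts = Counter()
pair_counts = Counter()

for post in posts:
    words = post.lower().replace(",", "").split()
    word_counts.update(words)
    # Count unordered pairs of distinct words appearing in the same post.
    pair_counts.update(combinations(sorted(set(words)), 2))

print("Most frequent words:", word_counts.most_common(5))
print("Most frequent word pairs:", pair_counts.most_common(5))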

As you probably noticed by now, most LSA tools and methods are both deductive and inductive. Many LSA tools keep track of how accurate they are over time and refine their assumptions and theories based on data to become smarter and more accurate. Many people create deductive LSA tools based on theories they induce after looking at large amounts of content and conducting scientific studies on language. The difference between the two types of LSA tools is somewhat artificial, but thinking about them in this way can help you discern which tool is more appropriate to answer your question. Neither type of tool is necessarily more useful than the other, but one is often more applicable for a particular problem depending on the kind of data you have, the time and resources you have, the accuracy of the LSA tool, and the answer you want. We now explore two ways you can use LSA tools and methods to understand and forecast security events and related behavior.

Determining Authorship

We are often asked about tools that can help identify the author of a blog post, document, tweet, or text message. In this section, we teach you how to use such a tool to identify authorship.

Correctly identifying the author of a piece of content can prove very valuable for a number of reasons. Knowing the identity of the author of the content you are analyzing to understand an event or behavior will only improve your analysis. For example, say you are analyzing posts on an online forum that violent extremists frequent to determine whether the extremists are planning violence against a specific target. You know that the majority of the forum members have no idea about operational plans and are merely followers of the violent ideology. They post misinformation about attacks that will never take place. You label them as the followers. However, you also know that a few members actually take part in operations and know about the future attacks. They post information on the forum that provides hints about future attacks. You label them as the operators. You also possess documents such as letters written by some of the operators that do not appear on the forum. However, you do not know which members of the forum are the operators and which ones are the followers. Because you do not know the identity of the forum members, you do not know which members' posts you should analyze and which posts you should ignore.

Authorship identification tools can help you compare the letters you possess with the posts of the forum members to identify the operator members. They help you figure out which of the individuals who wrote the letters are writing certain forum posts. The tools rely on the well-tested assumption that people have a distinct way of writing. The distinctions emerge in how people use words, spell certain words, structure sentences, and in many other traits. By comparing the letters with the forum posts, the tools can help you determine whether the authors of the letters also wrote some of the forum posts, and if so, which ones. You can then determine which of the forum members are the operators and analyze only their posts for intelligence about future attacks.

Numerous tools can help you determine authorship and, as you would expect, they differ in their features and abilities. In this section, we teach you how to use the Java Graphical Authorship Attribution Program (JGAAP), which is a free program available for download at http://www.jgaap.com. Throughout this section, we refer to JGAAP version 6.0.

Preparing for Authorship Analysis

The process of preparing for the analysis is fairly simple. Obviously, you first need to download and open the JGAAP program. Follow the instructions on the website to ensure you are opening the program correctly. Also make sure to download the program's user manual from the website, which shows you how to use the program with visuals and easy-to-understand instructions. Refer to the user manual as you go through this example.

Then you need to prepare all content for analysis. The content includes language whose authorship you do not know, known as the "test documents," and language whose authorship you do know, known as the "training documents." For our example, we will not use real social media content because of copyright issues; instead, we use content and documents recovered from Osama Bin Laden's compound in Abbottabad and subsequently translated. The documents are available at Jihadica's website (http://www.jihadica.com/abbottabad-documents/). We also provide links to the exact documents we use, in Microsoft Word format, on our website. Feel free to use other content or follow along with the documents we use to make sure you are using JGAAP properly. We have had trouble using PDF documents with JGAAP, so you may need to convert PDF documents into Microsoft Word format.


Note
Most LSA tools, including authorship identification tools, work best on English-language content or content in other widely used languages written in the Latin alphabet. Although analyzing translated documents is not ideal, it can still provide insights and answers as long as the translations are done consistently and carefully. JGAAP can analyze Arabic, but we have had trouble getting it to analyze Arabic in PDF documents.

In our example, we intend to discover whether Osama Bin Laden or his deputy "Atiyya" Abd al-Rahman is the author of a recovered and translated letter written to Nasir al-Wuhayshi, the leader of Al Qaeda in the Arabian Peninsula. The letter in question is the test document. The training documents are numerous lengthy letters written by Bin Laden and Atiyya to each other and others. If you do not have documents of your own, download the documents we use from this book's companion website. Make sure to save the documents in an easily accessible place on your computer.

Conducting the Analysis

Complete the following steps to conduct the authorship analysis:

1. Open and run JGAAP. Notice the tabs across the top of the JGAAP window. When you first open the program, the tab labeled Documents should be selected.
2. In the JGAAP window, make sure English is selected as the language in the drop-down menu on the top. Under the section labeled Unknown Authors, click Add Document. Select the appropriate test document, which in our case is Test_Doc.docx. Next, under the section labeled Known Authors, click Add Author. In the pop-up window, type in Bin Laden for the author name and click Add Document. Add the appropriate training document, which in our case is Training_Doc_BinLaden.docx. Repeat the process to add Attiya as an author and add the training document labeled Training_Doc_Attiya.docx. After adding the documents, click Next to go to the next tab.
3. The Canonicizers tab provides the option to remove elements from the documents that you believe are not related to the authors' writing styles, and so are not relevant to the analysis. Such elements include punctuation and letter case. Adding canonicizers is optional and will not make a big difference in our case, so we will not add any. Click Next to go to the next tab.
4. In the Event Drivers tab, you pick the elements of the documents that you want the program to analyze. Numerous event drivers are available to choose from. Feel free to select whichever events you think are appropriate. You can read each driver's description by selecting it and looking in the box labeled Event Driver Description. We select Word NGrams and Sentences. By selecting these events, you can analyze how the authors use combinations of words and sentences. To select the events, click on them in the scroll menu labeled Event Drivers and then click the arrow pointing toward the right. When you have selected them, they will appear in the box labeled Selected. The Parameters section to the right of the Selected box for the Word NGrams event driver should indicate by default that N equals 2. If not, set N to 2 for the Word NGrams selection. Otherwise, you do not need to change the Parameters of your event driver selections. After selecting your event drivers, click Next.
5. The Event Culling tab provides the option to filter the event drivers you selected in the previous tab. Because the documents and the content being analyzed are not large, you do not need to worry about filtering. We will not add any event culling; click Next. Note that if the documents you use are large, then you may need to use event culling. Try the analysis with and without culling to see if the analysis differs. In many cases, the event culling will not make a big difference to the analysis.
6. The Analysis Methods tab provides numerous methodologies and algorithms with which to conduct your analysis. Getting into how the algorithms work is beyond the scope of this book, but feel free to try out different options. We will abide by the advice on JGAAP's website and select the Nearest Neighbor Driver method. To select the method, find it in the menu labeled Analysis Methods and click on it. Then, in the menu labeled Distance Functions, find and click on Cosine Distance. After making both selections, click the arrow pointing toward the right next to the Analysis Methods menu box. After making the selection, click Next.
7. The Review & Process tab gives you a summary of the options you selected in the previous tabs. You can go back through the other tabs and change the options if you want. After reviewing your options, click Process, which is in the bottom-right side of the Review & Process tab window.
8. A new window pops up with your results. Figure 6.1 shows the analysis results for our documents based on our options.

Reading the results is straightforward. The results indicate, in order: the test document, the canonicizers, the event driver, the analysis method, a list of the training documents starting with the author most likely to have written the test document, and the numerical score output by the analysis next to the name of each training document. To interpret the results, you need to look both at which author appears first in the list and at the numerical scores. In the analysis for the event driver Sentences, the language in the test document was equally similar to the language in both training documents. Both are ranked 1 and both have a score of 1.0. The Sentences event driver is therefore not that helpful to our analysis.

Figure 6.1 Authorship analysis results


In the analysis for the event driver Word NGrams, the language and words in the test document were more similar to the language and words in the Bin Laden training document than in the Attiya training document. Because we are measuring a form of distance between the test document and the training document, a lower score means the documents are closer and thus more similar. In our case, the Bin Laden training document had a score of 0.577 and a rank of 1, and the Attiya training document had a score of 0.653 and a rank of 2. The Bin Laden training document was closer and more similar to the test document. This result suggests that, based on the analysis, Bin Laden was more likely to be the author of the test document than Attiya.
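If you are curious about what such an analysis looks like under the hood, the sketch below mimics the spirit of the Word NGrams (N = 2) event driver combined with the nearest-neighbor method and cosine distance: build a word-bigram frequency profile for each document and rank the known authors by their distance from the unknown document. It is a simplified stand-in for JGAAP's implementation, run on tiny invented snippets rather than the Abbottabad documents, so its scores will not match JGAAP's output.

# Simplified word-bigram authorship comparison using cosine distance,
# mirroring the idea behind JGAAP's Word NGrams (N=2), Nearest Neighbor, and
# Cosine Distance selections. The documents are tiny invented stand-ins.
import math
from collections import Counter

def bigram_profile(text):
    """Return a frequency count of adjacent word pairs in the text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def cosine_distance(profile_a, profile_b):
    """1 minus the cosine similarity of two bigram frequency profiles."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[b] * profile_b[b] for b in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

training_docs = {
    "Author A": "we must remain patient and we must remain careful in all matters",
    "Author B": "strike quickly, strike at dawn, and leave before they gather",
}
test_doc = "we must remain patient and strike only when the time is right"

test_profile = bigram_profile(test_doc)
ranking = sorted(
    (cosine_distance(bigram_profile(doc), test_profile), author)
    for author, doc in training_docs.items()
)
for rank, (distance, author) in enumerate(ranking, start=1):
    # Lower distance means the bigram profiles are more similar.
    print(f"{rank}. {author}  distance = {distance:.3f}")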

You need to be aware of several caveats with this type of analysis. One, as we mentioned before, analyzing translated documents is not ideal. Translators can significantly change a document's language and structure. Two, the analysis and the results are not perfect. We recommend you run the analysis a few more times, selecting different combinations of event drivers and analysis methods. If you keep getting the same result, you can be more confident in the analysis. Three, you need a lot more content to do the analysis properly. The more content the program can analyze, the more precise its answer is likely to be. This fact poses a problem when analyzing social media content. Blog posts, forum posts, tweets, and status updates are not always lengthy. You need to collect lots of them to do a meaningful analysis. Four, some social media content may not be as useful for the analysis, and you have to be careful that you are comparing similar types of language. For example, when tweeting, most people tend to use different types of words and grammar because of the 140-character limit that Twitter imposes. The way they write lengthy blog posts or letters is likely very different from how they write their tweets. You should compare tweets to tweets and blog posts to blog posts to make sure the analysis is consistent.

By taking these caveats into account, you can significantly improve your ability to do authorship identification analysis and understand certain security events and behavior. Knowing who wrote what or even who did not write what can help you make better sense of intelligence critical to your understanding of a related event.

Tracking and Forecasting the Behavior of Rioting Violent Crowds

The authorship analysis tool gave you a straightforward answer about the likely identity of the author, and so answered your question. Other LSA tools do not outright answer your questions in such a way. Instead, they help you explore the data so you can create and test your theory about possible answers to your question. With LSA tools and some ingenuity, you can analyze language on social media to track and forecast the behavior of crowds. In this section, we do not detail how to use specific software tools. Instead, we focus on how you can structure a solution to a problem by using any one of a variety of LSA tools.

As discussed before, many individuals and groups use social media to organize events involving complex crowd behavior. By analyzing the content these individuals post on social media, you can figure out how and where the crowds are moving. Imagine you need to track and forecast the behavior of violent rioters in a city. The rioters are using Twitter and other forms of social media to organize attacks against specific targets in the city and to share information about law enforcement. As is often the case, a few individuals are taking the lead in directing the other rioters. They are posting information about where, when, and how the other rioters should move. Through social network analysis you can identify these key individuals. You then simply need to track what they are saying to monitor the development of the riot. You can then create an early-warning system that gives you an idea of where the riot has been, where it is at the moment, and where it will likely be in the future.

Although this simple form of language analysis will help, it will not serve as a thorough and complete solution for several reasons. One, the list of key individuals may be long, and you may not have the resources to actively monitor all their messages. Two, others who are not on your list of key individuals may also be helping direct the riot. Three, people who are not involved with the riot may be posting information about the riot that can help you get a sense of the riot's behavior and evolution. To take these factors into account, you need to use methodologies and tools to build a system that can comprehensively track the riot's past, present, and future behavior. You can use the wide variety of methodologies and tools in numerous ways. Some of those ways are very complex and require sophisticated expertise in crowd behavior, linguistic and semantic processing, data processing, and cognitive science. We present one simpler way that you can use to understand the concept of LSA and get a head start on your own riot early-warning and tracking system. Research other ways and feel free to adapt and experiment with our way and those of others as you see fit.

Our way involves creating a system that rapidly analyzes enormous amounts of social media data to:

  • Identify social media data relevant to the riot and filter out the rest.
  • Analyze the relevant data to pinpoint key words and phrases that indicate past, present, or future riot behavior.
  • Categorize the content based on the presence of key words and phrases, the author of the content, and other factors.
  • Analyze the categories of content to reveal information about past riot behavior, flag present riot behavior, and provide useful early warnings of riot behavior.

Preparing for Riot Behavior Analysis

Preparing to build the comprehensive early-warning LSA system first entails figuring out what relevant data is available to you and then collecting it. The relevant data obviously includes the social media content that the rioters and key riot leaders are generating. For the sake of this case, let us assume that the people involved are predominantly using public social media platforms such as Twitter, along with other platforms from which you cannot get data. In some cases, law enforcement can gain access to messages on private social media platforms, but we will not worry about that here. The relevant data also includes social media content that people who are not involved with the riot, such as witnesses, are posting about it on public social media platforms. This amount of data can be quite large. Consider the data from the rioters and the witnesses to be the test data on which you will test your tools and do your analysis. Yet another, less obvious type of data you need to collect is social media content from past riots. This third type of data will serve as the training data with which you can train your tools. Your system needs to collect the data about the riot in question, and you should assemble the data about past riots yourself. Do not combine the two data sets. In Chapter 4, we covered how to collect, filter, and store the relevant data.

Apart from the data, you also need to acquire the appropriate LSA tools and integrate them into your system. You need tools that can analyze how frequently certain words and phrases appear in content, identify which words and phrases tend to appear together, and keep track of the source or author of particular content. These tasks are simple, and most LSA tools can easily complete them. We will not focus on which tools you should use because numerous ones exist. New versions of popular analysis programs such as SPSS tend to come with LSA tools. The website KDnuggets (http://www.kdnuggets.com/software/text.html) features a list of popular LSA tools. Most tools on that list offer more features than you will need for this case. The Internet is also full of free, open source LSA tools that provide all the features you need and that you can easily download, adapt, and use with minimal development skill. Apache's OpenNLP library (available at http://opennlp.apache.org) is one such source of free and appropriate LSA tools. For now, do not worry about which LSA tool to use, but about how you would use that type of LSA tool.

Designing the Riot Behavior Analysis Methodology

Acquiring the test and training data and the right LSA tools is the easy part of the analysis. The most difficult and critical part is formulating the analytical methodology that the system uses to make sense of the data and output appropriate warnings and indications. Formulating the methodology will be a primarily deductive process, and we will refine it through an inductive process using the training data.

Essentially, the methodology will determine which words and phrases in social media content are important and to what extent they tell you about the rioters' past, present, and future behavior. To do so, the system needs to be able to categorize various words and phrases according to their context and what they tell you about the rioters' behavior. The system then must apply weights to the content according to how important it is to the ultimate analysis. To help the system do this, you need to create an ontology, which describes how linguistic content should be categorized and what its relationships are to other content and behavior. The system's algorithms, which follow the ontology, process data and output early warnings and indications of the riot's movements and actions.

To create an ontology and methodology, draw on previous experience and knowledge to lay out the theories and assumptions that will underlie the methodology. First, categorize social media content based on what it tells you about the rioters' behavior. From our experience and knowledge, we surmise that certain textual content on social media tells you that people are reporting facts about events that are happening or have happened. Such content falls under the ontological category of Facts and indicates past or present behavior. Other content tells you that an event is about to take place; it falls under the ontological category of Possibilities and indicates future behavior. For example, the tweet "Three guys looting store on 15th" would fall under the Facts category. The tweet "Attack on 3rd at 1500, then wait for flare" falls under the Possibilities category. The categories of these tweets are self-evident to most people, because they can easily figure out the meaning and intent of each tweet by looking at certain characteristics of its language. For example, in the Facts tweet, we recognize that the verb "looting" is in its present participle form and indicates an action related to the riot that is happening right now. The words "store on 15th" indicate the target and its location. The same type of quick linguistic analysis applies to the Possibilities tweet. The verb "Attack" indicates a violent action that is about to take place, and the words "3rd at 1500" indicate the location and future time of the action.

You need to train, or program algorithms into, the LSA tools that make up your system so that they, too, can quickly analyze tweets and other social media content and put them in the Facts and Possibilities categories. Content that relates to neither, such as "I'm sick of campaign commercials, make them end already," will be filtered out. Programming or training the LSA tools to filter out irrelevant content and categorize relevant content into Facts and Possibilities is much easier said than done. We address how to train tools to do this and other complicated tasks later in this section.
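As a starting point, the sketch below shows what a naive, rule-based first cut at this categorization might look like. The cue-word lists and example tweets are our own simplifications; a deployed system would rely on trained models and a much richer ontology.

# Naive rule-based categorizer for the Facts / Possibilities ontology described
# above. The cue-word lists and example tweets are illustrative assumptions.

RIOT_TERMS = {"loot", "looting", "attack", "riot", "flare", "burn", "burning", "fire"}
PRESENT_CUES = {"looting", "burning", "moving", "happening", "now"}
FUTURE_CUES = {"tomorrow", "tonight", "later", "soon", "at", "in", "wait"}

def categorize(tweet):
    words = set(tweet.lower().replace(",", "").split())
    if not words & RIOT_TERMS:
        return "Irrelevant"        # filter out content unrelated to the riot
    if words & PRESENT_CUES:
        return "Facts"             # describes present or past behavior
    if words & FUTURE_CUES:
        return "Possibilities"     # hints at future behavior
    return "Ambiguous"             # relevant, but the rules cannot decide

tweets = [
    "Three guys looting store on 15th",
    "Attack on 3rd at 1500, then wait for flare",
    "I'm sick of campaign commercials, make them end already",
]
for tweet in tweets:
    print(f"{categorize(tweet):<13} {tweet}")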

Another part of the methodology is recognizing that not all related content is equally useful to you. We can surmise from experience that, for instance, a Possibilities tweet that comes from a key riot leader will have more influence on the behavior of the riot than a Possibilities tweet that comes from someone who is not at all involved with the riot. Also, a Facts tweet describing an event that is corroborated and validated by other Facts tweets is more trustworthy than a Facts tweet addressing an event that no other Facts tweet does. Based on these considerations, you need to weight the content you are analyzing. If you are not familiar with the concept of weights, think of it as boosting the importance of certain content over other content. You and the system then pay more attention to what the boosted content says and give it more influence over your decision. The weights of the content then influence the outputs, which are the early warnings and indications of riot behavior. You can and should analyze other aspects of the content in other ways, such as figuring out location from language. For the sake of time and understanding, we stop here for now.

The system should look at the content in each category, assign weights to it, and then use that information to produce outputs. For example, it can look at the tweet “Consulate building on fire” and put it in the Facts category. It can then output to a crowdmap, indicating the location (the consulate) and the action taking place (on fire). This helps you track the behavior of the rioters. If the tweet contains temporal information, you can then also output that on a timeline so you can track the past versus present behavior of the rioters. Consider another example—the tweet “in 2 hours move west.” The system puts the tweet in the Possibilities category and outputs a text warning to you indicating the action about to take place (riot moving west) and the time it will take place (2 hours). You then have an idea of how the riot may move in the near future. The system may also come across another tweet in the Possibilities category, but not output it to you because the system realizes that it came from someone who is not a key riot leader. To review, our methodology involves:

  • Categorizing content into Facts and Possibilities
  • Weighting Possibilities more if it comes from key riot leaders
  • Weighting Facts more if it is repeated by multiple sources

You should now have an understanding of how even simply analyzing language and assigning it appropriate weights can provide you with important intelligence. However, to complete the system, you still need to program and train the LSA tools.

Programming and Training LSA Tools

As mentioned before, numerous commercial and open source LSA tools can easily do the tasks required of them. Their capability rests on your ability to program and train them properly. By programming and training them, we mean teaching the LSA tools to follow specific algorithms so they know how to process the data and output the results you need. Unless you have a programming background, you will need a developer to help you program and train the LSA tools.

Programming an LSA tool is a two-step process (we refer to a single LSA tool, but in reality your LSA tool may consist of a combination of numerous tools). The first step is to use your own knowledge and that of others to tell the LSA tool what content it should focus on and where that content fits in the ontology. Part of that is telling it which sources to focus on, such as the list of key riot leaders and related hashtags. The other, more significant part is telling it which language is the most revealing about the riot. In the riot scenario, this language may include words and phrases that describe riotous action, the organization of logistics, times, dates, locations, and the names of major landmarks. Many LSA tools and repositories contain lists that conveniently categorize words and phrases by the type of behavior or event they describe. You then need to program the LSA tool to correlate certain words and phrases that describe riots with the Facts and Possibilities categories. You literally tell the LSA tool that if it comes across the word "attack" next to the word "tomorrow," it should categorize the content containing those words as Possibilities. You of course do not need to spell out every combination, because LSA tools usually come with options to quickly correlate lots of words together.

You also need to program the LSA tool to look at the sources of the content and assign appropriate weight depending on the source. Looking at the source can also help the LSA tool categorize the content and verify whether it really is related to the riot or is irrelevant. For example, the LSA tool may pick up and analyze the tweet "the #cloudatlas movie is going to #blowup tomorrow when it comes out" because it is looking at all content with "#blowup." The LSA tool can also download information about the source to see whether the author is on the list of key riot leaders, is tweeting other information related to the riot, is following or followed by other people involved in the riot, or is even located in the city where the riot is happening. After taking these factors into account, the LSA tool can recognize that the tweet is not relevant to the riot and discard it. In some cases, it may have trouble deciding. You can program the LSA tool to give scores to content based on how many of the criteria it meets (for example, its source, how many relevant words it contains, and so on) and then classify the content based on its score. In cases where the aggregate score for the content is not high enough to categorize it but not low enough to discard it, you can program the LSA tool to place it in a separate category called Ambiguous. You as the user can then go into the system, read through the content in the Ambiguous category, and help the LSA tool properly categorize it. Lastly, you need to program the LSA tool to check whether content in the Facts category is being repeated and validated by others and, if so, to give that content more weight. This check is as simple as programming the LSA tool to count how many times long phrases or sentences are repeated.
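One simple way to combine these ideas is an additive score with two thresholds, as sketched below: content earns points for relevant terms, for coming from a known key leader, and for being corroborated, and anything that falls between the discard and categorize thresholds lands in the Ambiguous bucket for a human to review. The point values, thresholds, account handles, and term list here are arbitrary placeholders that you would tune against training data.

# Score-based weighting sketch: boost content from key riot leaders and from
# corroborated reports, and route borderline content to an Ambiguous bucket.
# The handles, term list, point values, and thresholds are placeholders.

KEY_LEADERS = {"@riot_lead_1", "@riot_lead_2"}      # hypothetical account handles
RIOT_TERMS = {"attack", "looting", "move", "flare", "fire"}

DISCARD_BELOW = 2        # scores under this are treated as irrelevant
CATEGORIZE_ABOVE = 4     # scores over this are categorized automatically

def score(tweet, author, times_repeated):
    words = set(tweet.lower().split())
    total = len(words & RIOT_TERMS)      # +1 per relevant term
    if author in KEY_LEADERS:
        total += 3                       # source weight: key leaders matter more
    if times_repeated >= 2:
        total += 2                       # corroboration weight: repeated reports
    return total

def route(tweet, author, times_repeated):
    value = score(tweet, author, times_repeated)
    if value < DISCARD_BELOW:
        return "discard"
    if value > CATEGORIZE_ABOVE:
        return "categorize"
    return "ambiguous"                   # queue for a human analyst to review

print(route("move west after the flare", "@riot_lead_1", times_repeated=1))   # categorize
print(route("fire near the market", "@bystander88", times_repeated=3))        # ambiguous
print(route("great weather today", "@bystander99", times_repeated=1))         # discard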

The second step is to train the LSA tool to work effectively and learn from its mistakes. This step is significantly harder, but it is necessary if you want to create a system that can handle the messy and complex data found in the real world. You need to test the LSA tool against the training data about past riots whose movements you already know. Act as if the past riot is happening right now, and input the appropriate data into the system. The LSA tool will go through it and report on the past, present, and future behavior of the riot. You can then assess how well the LSA tool did. Most likely, the first time you test the LSA tool, it will not do very well. You then need to implement machine learning algorithms in the LSA tool so that it can learn from the training data and refine its programming. In simplistic terms, the LSA tool looks at which words and phrases actually came up during the past riot and actually indicated real riot movement. It then goes back and tweaks its own programming, because it now has a better idea of which words and phrases to look for and what weights to assign to them. Explaining this process in full is well beyond the scope of this book, but you can easily find resources online that explain it.9 Over time, you should train the LSA tool on more training data so its efficacy constantly improves.
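For readers who want to experiment with the training step, the sketch below uses scikit-learn (our choice for illustration; the tools mentioned earlier offer comparable components) to fit a bag-of-words naive Bayes classifier on a handful of invented, hand-labeled tweets from a past riot and then apply it to unseen examples. A real system would train on thousands of labeled posts, measure its error rate, and re-train as new riots supply more data.

# Training sketch: learn the Facts / Possibilities / Irrelevant categories from
# labeled past-riot content instead of hand-written rules. Uses scikit-learn;
# the labeled tweets are invented and far too few for a real system.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

labeled_tweets = [
    ("three guys looting store on 15th",              "Facts"),
    ("crowd burning cars near the station right now", "Facts"),
    ("attack on 3rd at 1500, then wait for flare",    "Possibilities"),
    ("move west in two hours, meet at the square",    "Possibilities"),
    ("loving this new album, can't stop listening",   "Irrelevant"),
    ("campaign commercials need to end already",      "Irrelevant"),
]
texts, labels = zip(*labeled_tweets)

# Bag-of-words (unigrams and bigrams) feeding a naive Bayes classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

# Tweets the model has not seen; a real system would evaluate on a large
# labeled test set and re-train whenever it makes mistakes.
for tweet in ["rioters burning shops on main street",
              "wait for the flare then attack the depot"]:
    print(model.predict([tweet])[0], "<-", tweet)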

This section should have given you a good idea about what LSA is and how you can use it to make sense of social media data to understand and forecast security events. A better understanding and forecasting capability requires you to create and use more complex tools and algorithms. You will find that to use LSA effectively, you need to understand correlation and regression analysis.

Correlation and Regression Analysis

A large number of variables influence any event in the real world, and it is usually impossible to account for all of them. However, a few are usually more influential than the others. Understanding their relationship with the event, and how those factors are likely to behave going forward, can help you forecast the appearance of the event in the future. Correlation and regression analysis (CRA) is one of the simplest ways to identify and understand the relationship between the most important factors and the event.

You likely are already familiar with correlations and, to a lesser extent, regressions. Humans are very good at intuitively doing correlations. We keep track of a number of things and behaviors in the real world and try to figure out how they influence an event of concern. This natural ability is the reason superstitions exist. For example, you notice that on days you wear red socks, the Boston Red Sox baseball team wins, and when you do not wear them, they lose. You correlate the wearing of your red socks with the record of the Red Sox. You then take it a step further and think that somehow your wearing red socks influences whether the Red Sox win, so you always wear your red socks on days the Red Sox play. From a rational perspective, this type of thinking is very wrong because correlation does not imply causation. It may only be a coincidence that your wearing of red socks correlates with the Red Sox winning. Most likely, you ignore the days when the correlation did not occur.

Statistical CRA helps you determine whether such patterns are more than coincidence and whether they mean anything. Although you can do CRA relatively easily, you can also misinterpret its results easily. Another hard part is getting the right data. In lab settings, getting the data is not difficult. However, in the real world, and when dealing with security events, you are not likely to get all the data you need. In this section, we address the concern of interpreting results correctly and humbly, and in the next section we address dealing with incomplete and frustratingly sparse data. We first show you how to do CRA through an example about creating a system that uses data collected through social media to provide early warnings of famines. The example involves made-up data that illustrates the process of doing CRA. In the next section, dealing with volumetric analysis, we use real-world data that shows that doing analyses for security events is not as easy as you may think.


When to Use Regression Analysis
You should use regression analysis when you want to determine the extent to which one variable can help you estimate the behavior (or value) of another variable. When doing regression, you assume one of the following:
  • The independent variable directly affects the dependent variable.
  • The independent variable indirectly affects the dependent variable through intermediate variables.
  • An unknown variable affects both the independent and dependent variable.
In contrast, use correlation analysis when you want to determine simply whether two variables share some sort of associative relationship. You do not need to have any idea about the causality relationship between the variables when doing correlation analysis.
For example, say you collect data from children about their shoe sizes and their basketball ability. You analyze the data and find a correlation—the larger the shoe size, the better the child is at basketball. The correlation is only telling you about the association thus far. You need to use regression analysis to see how that association may look in the future.
You can easily reason that shoe size clearly does not directly or even indirectly cause a change in basketball ability, nor does basketball ability cause a change in shoe size. Shoe size and basketball ability share a correlation because a third variable, the height of children, affects both. The taller the child, the larger his shoe size and the better he is at basketball. You then conduct regression analysis and conclude:
“For every increase of X in shoe size, basketball ability tends to increase by Y.”
Note that this conclusion is very different from concluding:
“An increase of X in shoe size tends to cause an increase of Y in basketball ability.”
Concluding that shoe size causes change in basketball ability is absurd.
Think of the output of regression analysis as a tool. The tool uses the expected behavior of one variable to tell you the likely future behavior of another variable. The output of a correlation analysis is not a tool, but insight. The insight tells you that the behavior of two variables has an association, such as one variable tending to go up whenever the other goes down. Simply put, regression analysis helps you forecast the future, whereas correlation analysis helps you make sense of the past. The short sketch that follows this sidebar illustrates the difference with made-up numbers.
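The following Python snippet illustrates the distinction, using invented shoe-size and basketball-ability numbers. The correlation coefficient is the insight; the fitted slope is the tool.

# Made-up shoe-size and basketball-ability data, for illustration only.
import numpy as np

shoe_size = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
ability   = np.array([2, 2, 4, 5, 5, 7, 8], dtype=float)

# Correlation analysis: insight about the association.
r = np.corrcoef(shoe_size, ability)[0, 1]
print(f"correlation coefficient: {r:.2f}")

# Regression analysis: a tool for estimating ability from shoe size.
slope, intercept = np.polyfit(shoe_size, ability, 1)
print(f"for every increase of 1 in shoe size, ability tends to "
      f"increase by about {slope:.2f}")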

Creating Tools to Provide Early Warnings for Famines with Artificial Data

Imagine you want to build a system that provides warnings about famine conditions (and food insecurity) in different parts of the world. The system will need to collect data, analyze it to forecast possible famine conditions, and then output warnings to you about those possible famine conditions. Providing early warnings about famines in any part of the world entails collecting and analyzing information concerning a wide variety of factors, ranging from rainfall amounts, to market prices, to the state of the infrastructure in the region, to the needs of the affected population. Your system will need to do a lot of things and do them well. Indeed, the United States Agency for International Development's Famine Early Warning Systems Network, known as FEWS NET, works in a similar fashion (http://www.fews.net). This example is largely inspired by FEWS NET.

For now, we focus on only part of the system and try to answer whether you can forecast famine conditions using only data about the price of grains in a region and the health status of the people in the region. Based on experience and knowledge, you deduce that the price and health status likely have a relationship to the famine. Rising grain prices indicate a lack of food and make it less likely that people can buy food, which in turn increases the level of famine. A population's health status (whether people are suffering from disease or not) obviously deteriorates when people are living through a major crisis like a famine. The analysis helps determine whether your deductions are correct. Social media plays a role in that it enables you to collect data about the grain prices and health status in near real time.


Cross-Reference
In Chapter 4, we briefly discussed an example where we collected crop price and health data from Colombian farmers. In Chapter 7, we describe a social media platform known as M-Farm that collects African crop price data. Part III of the book discusses how to build social media platforms that crowdsource price and health data from hard-to-reach populations. Later, we address other ways that social media data can play a role.

Assume you have access to a social media platform that collects grain price and population health data from a region in Africa. The social media platform is SMS based and regularly queries participants for information. You also have data from a non-profit organization that indicates the level of famine that a region is suffering at any time. This famine indicator is calculated as a function of many variables, including the level of malnutrition and mortality. Note that the non-profit uses other health data, collected through in-field surveys, to calculate the famine indicator level. The health status data you use is collected through other means and so is different. In this case, the grain price and population health status are independent variables and the famine indicator is the dependent variable. You need to figure out whether the independent variables affect the dependent variable, how, and to what extent.

Conducting Correlation Analysis

To uncover the relationships between the variables, first conduct correlation analysis to determine whether there is some sort of an associative relationship between them. If the independent variables do not correlate with the dependent variable in any way, they likely do not affect it. Also, if they do not correlate, it is unlikely that an intermediate or unknown variable affects both of them.

To conduct the analyses in this section, you need basic statistical software. We describe the process by using Microsoft Excel to keep things simple, but you can use whatever software you want. We assume you are familiar with Microsoft Excel and know how to create basic line charts and use formulas to calculate things. If not, you can easily find numerous resources and videos online that show you how to use Excel. To do only correlations, you can also use the free web-based Google Correlate (available at http://www.google.com/trends/correlate).


Cross-Reference
The data to do the analyses for the first example is available on our website as a comma-separated value (CSV) file with the name CRA_example.csv and an XLS file with the name CRA_example.xls. The XLS file also contains the outputs of our analyses.

The data file for the first example contains four columns. Each data point corresponds to a week; in other words, assume the data is collected at the beginning of each week.
  • Column A, labeled Date, indicates the dates for which the data was collected.
  • Column B, labeled Health, contains data about the health status of people in the region. The value of the Health variable ranges from 0 to 10, where 0 indicates very poor health (suffering from many dangerous conditions and diseases) and 10 indicates good health.
  • Column C, labeled Price, contains data about the price of grain in the region. The value of the Price variable is a numerical price equal to or greater than 0 (the currency is irrelevant).
  • Column D, labeled Famine, contains data about the level of famine in the region. The value of the Famine variable ranges from 0 to 10, where 0 indicates very low levels of famine and 10 indicates very high levels.
Twenty data points are given for each column. The amount of data is very small, but it serves instructive purposes. You need to determine whether the independent variables correlate with the dependent variable and to what extent. With your statistical software at hand, complete the following steps to conduct the analysis:

1. Open the data file in Excel.
2. Create time series line charts for all the variables. The line charts help you visually assess whether the variables are correlated and whether the relationship is linear. The X-axis values come from the Date column and the Y-axis values from the other columns. Create three line charts: one showing how the Health variable changes over time, one for the Price variable, and one for the Famine variable. Ideally, you would chart the lines for all variables in one chart, but to do so you would need to normalize the data so that the Y-axis range for each variable is the same. That step is unnecessary at this point because you can simply create three line charts and compare them visually. The line charts are displayed in Figure 6.2. Notice that the Health line does not follow the same pattern as the Famine or Price line. That is, when the Famine or Price line spikes or drops, the Health line does not. Also notice that the Price and Famine lines do tend to follow the same pattern. This suggests that there is likely a correlation only between the Price and Famine variables.

Figure 6.2 Line charts of Health, Price, and Famine

6.2
3. Compute the correlations for each variable with each other. Start with computing the correlation between the Health and Famine variables. Select an empty cell on the worksheet. In the cell, type the function =CORREL(D2:D21,B2:B21). Press Enter and the cell should display 0.20615415, which is the correlation coefficient. Excel compares all the data points in the two stipulated columns in the function to compute the correlation.
4. Calculate the correlation between the Price and Famine variables. Select an empty cell on the worksheet and type =CORREL(D2:D21,C2:C21). Press Enter and the cell should display 0.78758156.
5. Calculate the correlation between the Health and Price variables. Select an empty cell on the worksheet and type =CORREL(C2:C21,B2:B21). Press Enter and the cell should display 0.13626333.
6. Consider the outputs, or correlation coefficients, for each variable pair. Two variables correlate if their coefficient is near 1 or –1, and they do not correlate if their coefficient is near 0. If the coefficient is near 1, the variables are said to positively correlate. That is, an increase in one variable tends to accompany an increase in the other variable. If the coefficient is near –1, the variables negatively correlate; an increase in one variable tends to accompany a decrease in the other variable. The only pair of variables with a high correlation coefficient is Price and Famine, with a coefficient of about 0.79. This result makes sense given our cursory analysis of the line charts: when the Price variable increases, the Famine variable increases, and vice versa. If you prefer to work outside Excel, the short sketch after these steps shows one way to reproduce the same charts and coefficients.
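As a cross-check, the following Python sketch reproduces steps 2 through 6 with the pandas library, assuming you have downloaded CRA_example.csv from our website into your working directory. The exact coefficients you see will of course depend on the data in the file.

# Reproduce the line charts and pairwise correlations outside Excel.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("CRA_example.csv", parse_dates=["Date"])

# Step 2: time series line charts for visual inspection.
df.set_index("Date")[["Health", "Price", "Famine"]].plot(subplots=True)
plt.show()

# Steps 3-5: pairwise Pearson correlation coefficients.
print(df[["Health", "Price", "Famine"]].corr())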

Conducting Regression Analysis

You know that there is an associative relationship between the Price and Famine variables and have some idea of how they have been associated so far. You also know that there is little to no relationship between health and the famine level. This finding is surprising because the health status of a person suffering through a famine usually deteriorates. The finding suggests either that your initial deduction was partially incorrect or that you collected faulty data. Most likely, the population did not understand exactly what you meant when you queried them for health status data, or they happened to be the lucky ones who survived the famine. This highlights the need to collect data from a large population and to compare different data sets.

You now need to use regression analysis to establish how well the Price variable can help you forecast the Famine variable in the future. In geek-speak, you need to identify the predictive or forecasting value of the independent variable.

In this case, we use a priori knowledge to determine that there likely is a causal and predictive relationship between prices and famine level. Spikes in prices can noticeably increase the level of famine. By knowing how much the prices contribute to the level of famine, you can figure out how future spikes or drops in prices will affect the future levels of famine in the region. Complete the following steps to find out this information:

1. Return to the same data file in Excel. Make sure the Analysis ToolPak add-in is installed; it adds the Data Analysis option to the ribbon. Microsoft explains how to install the add-in at http://office.microsoft.com/en-us/excel-help/load-the-analysis-toolpak-HP001127724.aspx.
2. On the ribbon, select Data and at the far right of the ribbon options select Data Analysis.
3. In the pop-up window, select Regression.
4. In the new pop-up window, put a checkmark in the boxes next to Labels and Confidence Level. Also, select the New Worksheet Ply option to output the results to a new worksheet.
5. At the top of the pop-up window, for Input Y Range, select Column D with the famine level values. Make sure when selecting the column that you also select the row with the column label. For Input X Range, select Column C with the price values. After making your selections, click OK.
6. A new worksheet should appear with the results of the regression analysis. Figure 6.3 shows the results.

Figure 6.3 Regression analysis results for the first example

6.3
7. Reading the results consists of paying attention to three parts of the output. The first block at the top, labeled Regression Statistics, tells you about the overall regression. The R Square value tells you how much of the variance in the dependent variable the independent variable explains. In other words, it tells you to what extent the Price variable informs and forecasts the Famine variable. Think of it, loosely, as how much of the famine level's behavior the price can account for. In our case, the R Square value is 0.62, which is not bad. The closer the R Square value is to 1, the better. This result says that the Price variable explains about 62 percent of the variance in the Famine variable; other variables (and noise) account for the remaining 38 percent.
8. Next, pay attention to the block titled ANOVA. Take note of the number in the cell where the row labeled Regression meets the column labeled Significance F. The number in this case is 3.76E-05, or 3.76 x 10^-5, or 0.0000376. The Significance F number denotes the likelihood that the output appeared by random chance rather than because of an actual pattern and relationship. In other words, it says that there is roughly a 0.00376 percent likelihood that the regression output was due to random chance. The smaller the number, the better.
9. Then look at the last block of results. First, look at the column labeled Coefficients. The coefficient for the row labeled Intercept indicates the value of the famine level when the price is 0. In our case, the Intercept is about –1.73, which does not make much sense on its own because you cannot have negative famine levels. Think of it instead as a constant offset that pulls the estimated famine level down; it does not change how much each unit of price contributes. Next, look at the coefficient for the row labeled PRICE, which is about 0.3687. It says that for every increase of 1 in the price, the famine level goes up by about 0.3687. Essentially, it is the predictive value. These numbers allow you to write the equation:
Famine level = (Price x 0.3687) – 1.73
10. Second, look at the column labeled P-value. The P-values indicate how confident you should be in your coefficient values. Generally, you want a P-value of less than 0.05. The P-value for the row labeled Intercept is about 0.2159, which says that there is about a 21.59 percent likelihood that the Intercept value came about by chance rather than because of any relationship. The P-value for the row labeled PRICE is 3.76E-05, which is the same as the Significance F value. When you do regressions using only one independent variable, these numbers match. However, when you do regressions using multiple independent variables, known as multiple regression analysis, they generally differ, and you have to look at the P-value of each independent variable's coefficient to see how confident you can be in its predictive value. The sketch after these steps shows one way to reproduce this regression output outside Excel.
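For readers who prefer code, the following sketch runs the same single-variable regression with the statsmodels library, again assuming the CRA_example.csv file described earlier is in your working directory. The summary it prints reports R-squared, the coefficients, their p-values, and the F statistic, comparable to Excel's output; the price value used for the forecast is purely hypothetical.

# Run the Famine-on-Price regression programmatically.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("CRA_example.csv")

X = sm.add_constant(df["Price"])   # adds the intercept term
model = sm.OLS(df["Famine"], X).fit()
print(model.summary())             # R-squared, coefficients, p-values

# Forecast the famine level for a hypothetical future price of 25
# by applying the fitted equation directly.
forecast = model.params["const"] + model.params["Price"] * 25.0
print("forecast famine level at price 25:", forecast)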

You have now completed the regression analysis and can use the equation given in step 9 as a way of forecasting how changes in price will affect changes in the famine level. We went over only a simple way of doing CRA. You can use a variety of statistical tools and methods to do more complex and, in some cases, more accurate forms of CRA. No one way is necessarily better than the others; it all depends on context and the nature of the variables. Scour the Internet for more information on multiple and multivariate regressions to learn how to do more advanced CRA. Also, research more ways of testing the significance of your findings, such as t-tests. Lastly, compare different data sets and sources to ensure your results did not come about because of faulty data. The processes are basically the same, and once you understand how to do simple CRA, you can easily learn the more advanced analyses.

Volumetric Analysis

Separating volumetric analysis (VA) from CRA is somewhat artificial. The tools and methodologies you use to do VA are the same as the ones you use to do CRA. The difference lies in the type of data you analyze. VA entails finding associative relationships and the strength of causal relationships in data dealing with volumes of traffic. Traffic on the Internet includes the number of times people search for or mention a word on Twitter and the number of hits to a website. Traffic in the offline world includes the number of shipments between two countries and the number of transactions between two bank accounts. When doing VA, the focus is not on the nuances and make-up of the event or behavior. Instead, it is on how many times the event or behavior occurred and how that volume data corresponds to other events. Contrast that with LSA, where the nuances and make-up of the data and the text matter. In practice, when doing VA you will often need to use other types of data as well. Still, we separate VA out to highlight the importance of traffic volume data and how it can hold remarkable insights in cases where you cannot get access to all the data you want.

In this section, we expand on the example given in the CRA section. We now consider the fact that in the real world, you do not always have access to all the data you want. Additionally, the data you do get is usually incomplete or incompatible with other data sets. We consider how you can use volume traffic data, which in many cases is easier to get than complete data sets. In the following example, we continue to consider ways you can create tools to provide early warnings for famine. This time, we use real-world data and Twitter data.

Creating Tools to Provide Early Warnings for Famines with Real-World Data

Imagine you are tasked with creating another tool for your famine early-warning system that is distinct from the one we discussed in the CRA section. This new tool should complement your other tools and work in conjunction with them to provide more robust early warnings that take into account a number of factors. This tool should analyze different types of data and see if other behaviors can help you anticipate the onset and intensity of famine in a region.

One possibly useful data source is Twitter. Usually, when a major event happens, people take to Twitter to either report on it, as they did with the Benghazi consulate attack, or voice their opinion about it. Examining what people say on Twitter, when they say it, and how many people say it can tell you a lot about the event that is taking place. This view, however, does not always hold true, especially when dealing with events taking place in parts of the world that are not fully developed. Twitter is expanding into Latin America, Asia, and Africa aggressively and will likely be ubiquitous very soon. However, such widespread use of Twitter has not yet occurred. Information about events occurring in less developed areas such as Somalia may not appear on Twitter right away. You need to test whether you can use Twitter as a data source to analyze and provide early warnings about famines and food insecurity.

To test the viability of using Twitter as a valid data source, you need to examine how it has fared recently in anticipating famine and food insecurity events. The Somalia famine in mid-2011 serves as a good case study. It occurred relatively recently, and you can easily find data about it. Overall, you need to assess whether information traffic related to the famine appeared on Twitter when famine conditions started to take place in Somalia, and whether the levels of that traffic matched the intensity of the famine. If they did not, you need to determine the time lag between famine conditions taking place and information about them appearing on Twitter. If all else fails, you also need to find another data source that is more useful than Twitter.

Assemble Appropriate Data

We spent some time searching the Internet for relevant data. Apart from using the data sources we discussed in Chapter 4, we simply Googled the terms “data,” “Somalia,” and “famine” together and sifted through the links. We found some data that was useful and much that was incomplete, incorrect, or incompatible with other data sets.


Cross-Reference
All the data that you need for this example is available on our website in the file VA_example.xls or VA_example.csv. The XLS file also contains the outputs of our analyses.

First, you obviously need to find the Twitter data. The British newspaper the Guardian has exactly the data you need on its website (the original raw data file is available on our website for download as a spreadsheet).10 The Twitter data consists of the number of tweets per day over a 19-month period that mentioned “Somalia” and either “famine,” “drought,” or “food.” The Guardian also provides data about the number of mentions in newspapers and the total amount of aid flows by date.


Note
The Guardian is at the forefront of conducting social media analytics and using it to inform its reports. Check out http://www.guardian.co.uk/data for various data sets it makes available to the public. This example was largely inspired by the Guardian's work concerning the Somalia famine.

You also need data to compare the Twitter data against. This data should indicate the level of famine over the period during which the famine took place. We found some United Nations data on malnutrition and other famine indicators for Somalia, but it is not broken down by clear time periods and so is not as useful for analysis. Other data needs to serve as the famine-level indicator. The market price of red sorghum in Marka in Southern Somalia may do the trick. Red sorghum is a grain that is widely grown and consumed in Somalia. As discussed before, drastic increases in the market price of grains such as red sorghum can indicate food insecurity and famine conditions. It is not a perfect indicator, but a casual look at the data and comparisons to United Nations reports documenting the famine show that it is indeed a good indicator of food insecurity and famine conditions. The market price data is made available by the Food Security and Nutrition Analysis Unit (http://www.fsnau.org) and is broken up by month over several years.

Conduct Analysis

Determining whether the market prices and the Twitter mentions correlate can help you assess the utility of using Twitter data to anticipate famine conditions in Somalia. Comparing the two time series tells you not only whether they are correlated, but also by how much the Twitter data tends to lag behind the appearance of famine conditions.

First, you need to collapse the Twitter data into monthly periods. When doing CRA, you should compare apples to apples: if one data set gives statistics by month, the other data set should also give statistics by month. You also need to make sure you compare only the months for which both data sets exist. Data for both price and Twitter mentions exists from August 2010 to January 2012. Then draw two time series line charts, as depicted in Figure 6.4.
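The sketch below shows one way to do the collapsing and comparison in Python. The file names and column names are assumptions for illustration; adjust them to match however you have stored the daily Twitter counts and the monthly sorghum prices (for example, from the VA_example file on our website).

# Collapse daily Twitter mention counts into monthly totals and compare
# them with monthly sorghum prices. File and column names are hypothetical.
import pandas as pd

tweets = pd.read_csv("twitter_mentions.csv", parse_dates=["date"])   # daily counts
prices = pd.read_csv("sorghum_prices.csv", parse_dates=["month"])    # monthly prices

# Apples to apples: daily mentions become monthly totals, indexed by the
# first day of each month so they align with the monthly price series.
monthly_mentions = tweets.set_index("date")["mentions"].resample("MS").sum()

merged = pd.DataFrame({
    "mentions": monthly_mentions,
    "price": prices.set_index("month")["price"],
}).dropna()                                   # keep months present in both series

print("same-month correlation:", merged["mentions"].corr(merged["price"]))

# Shifting the mentions series shows whether Twitter traffic follows the
# famine conditions reflected in prices with a lag of one or two months.
for lag in (1, 2):
    lagged = merged["mentions"].shift(-lag)   # mentions lag months later
    print(f"price vs. mentions {lag} month(s) later:",
          merged["price"].corr(lagged))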

The line charts indicate that there indeed is a correlation between price and mentions, but the time lag is significant. When the price spikes, the number of mentions also goes up. However, it takes a long time for the change in price to be reflected in the mentions. Notice that the price of sorghum started to go up in August 2010 at a fairly high and consistent rate until it reached a peak in May 2011. The price then started to go down after July 2011. The United Nations' assessment of the famine matches this trend: it started to see signs of the famine in the latter part of 2010 and officially declared a famine in mid-2011.10

Figure 6.4 Line charts of Twitter mentions and sorghum prices

6.4

The number of Twitter mentions of the Somalia famine remained low until June 2011. The mentions started to spike thereafter, reaching a peak in September 2011, almost two months after the peak of the price surge and famine conditions. A closer look at the mentions data indicates that the number of mentions did start to go up slightly after September 2010. However, the numbers are very small and do not reach a significant amount until June-July 2011. The correlation coefficient backs up this finding: it is only 0.39, which is not very high.

Based on these findings, you can conclude that although Twitter mentions of the famine did go up during the famine, they are not very useful for providing early warnings about famine and food insecurity. People tweeted about the famine only after they heard about it from newspapers and traditional media and were responding to those news stories. By then, however, the famine was already well under way. You should conduct the same type of analysis on other famine cases to see whether your conclusion about the lack of utility of Twitter mentions as an early warning holds up.

You can also go back to the data provided by the Guardian to see whether you can use other data sources as indicators of famine conditions. A casual look at the aid flow data suggests that it may actually correlate closely with the emergence of famine conditions. We leave that analysis to you.

We touched on only a few ways you can use simple statistical tools to analyze social media data to understand and forecast security events. We could write entire books about any one of them. However, we hope you have gotten the point about how social media data can help you, and we encourage you to explore other approaches yourself. We now move on to Chapter 7, where we begin discussing how you can use social media in other ways, namely to crowdsource.

Summary

  • By using various analytical methodologies, you can uncover the relationships between events, behaviors, issues, problems, and actions concerning security.
  • Social media intelligence is a superior type of open source intelligence that also shares features with other types of intelligence.
  • Simply monitoring social media activity can help you recognize and anticipate security events, such as the attack on the Benghazi consulate, more quickly than consuming traditional media.
  • Intelligence is often validated, challenged, and disseminated through social media.
  • Apart from understanding events, social media data can also help you forecast the development of events.
  • Forecasting entails being humble about what you think may happen in the near future. Anyone who says they can predict complex events in the future is either being dishonest or naïve.
  • Language and sentiment analysis can help you discover the overt and hidden meaning in linguistic content, and its relationship to security events.
  • Through language and sentiment analysis, you can determine the authorship of anonymous documents and track and forecast the behavior of rioting violent crowds.
  • Correlation and regression analysis can help you understand and identify the associative and predictive relationships between different events and items.
  • Through correlation and regression analysis, you can create tools to provide early warnings for famines and food insecurity.
  • Volumetric analysis is a type of correlation and regression analysis that uses data concerning volume of activity and traffic on the Internet.
  • Through volumetric analysis, you can further refine tools to provide early warnings for famines and food insecurity.

Notes

1. Directorate of Intelligence (2012) “Intelligence Cycle.” U.S. Federal Bureau of Investigation. Accessed: 11 November 2012. http://www.fbi.gov/about-us/intelligence/intelligence-cycle; (2012) “How Intelligence Works” intelligence.gov. Accessed: 14 October 2012. http://intelligence.gov/about-the-intelligence-community/how-intelligence-works/

2. Scheuer, M. (2007) Through Our Enemies' Eyes. Potomac Books, Dulles.

3. Kelly, S. (2012) “Intelligence Official Offers New Timeline for Benghazi Attack.” CNN. Accessed: 11 November 2012. http://security.blogs.cnn.com/2012/11/01/intelligence-official-offers-new-timeline-for-benghazi-attack/; (2012) “Timeline on Libya and Egypt: Attacks and Response.” Washington Post. Accessed: 11 November 2012. http://www.washingtonpost.com/politics/decision2012/timeline-on-libya-and-egypt-attacks-and-response/2012/09/12/85288638-fd03-11e1-a31e-804fccb658f9_story_1.html

4. AKNN News (2012) “Timeline Photos.” Facebook. Accessed: 11 November 2012. https://www.facebook.com/photo.php?fbid=364777023602502&set=a.246675052079367.59435.246612585418947&type=1&theater; @zaidbenjamin (2012) Twitter. “A picture believed to be for…” Accessed: 11 November 2012. https://twitter.com/zaidbenjamin/status/245808824324333568/photo/1

5. @syriancommando (2012) Twitter. “American Embassy worker in…” Accessed: 11 November 2012. https://twitter.com/syriancommando/status/245753481649078272/photo/1

6. Kiely, E. (2012) “Benghazi Timeline.” Factcheck.org. Accessed: 11 November 2012. http://factcheck.org/2012/10/benghazi-timeline/

7. Check tweets of @Sefu_Africa on 28 September 2012, @kdfifno from 28 September 2012 to 31 October 2012, and @HSMPress on 28 September 2012.

8. Lewis, M. (2011) The Big Short: Inside the Doomsday Machine. W.W. Norton, New York.

9. Hastie, T., Tibshirani, R., and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.

10. Provost, C., et al. (2012) “Somalia Famine: How the World Responded - Interactive.” The Guardian. Accessed: 11 November 2012. http://www.guardian.co.uk/global-development/interactive/2012/feb/22/somalia-famine-aid-media-interactive
