Chapter 6
Understanding and Forecasting Events
Appropriate analytical methodologies can help you understand how security events and issues develop over time and identify the factors that influence them. Some methodologies go further, letting you use the identified factors to forecast how an event will evolve or whether similar events will take place. Specifically, language and sentiment analysis (LSA), correlation and regression analysis (CRA), and volumetric analysis (VA) can help you use social media and other data to understand and forecast a variety of security events, including famines and illicit behavior. The analyses can help you discover relationships between content on social media and events or actions on the ground, track the evolution of those relationships, and detect emerging security problems. This chapter begins by illustrating how simple monitoring of social media data can help you detect and understand events more quickly than traditional methods. It then defines forecasting, explains how it differs from predicting, and describes what it entails. The chapter ends by explaining how to conduct LSA, CRA, and VA through examples and walkthroughs.
This chapter provides a few ways you can use formal and some informal methods to analyze security events so you can understand them and forecast their development and emergence in the near future. When we say security events, we also refer to corresponding behaviors, issues, problems, and actions. Events include terrorist attacks, natural disasters, homicides, riots, the act of drug smuggling, the act of sex trafficking, and much more. The majority of this chapter deals with using various analytical tools and formal methodologies to analyze social media and other data to understand and forecast security events. However, before we delve into how you can use the tools, we first want to discuss the role of social media data in intelligence analysis and how simple, informal monitoring of social media data can often help you anticipate and understand events better than traditional means.
Publications about intelligence analysis discuss the intelligence cycle, which describes how intelligence is processed in a civilian or military intelligence agency.1 You can formulate the cycle in many ways, and we are not all that interested in discussing the best way. We do, however, want to discuss the role of social media data in the intelligence cycle in general. This section helps you understand where social media data fits in the intelligence cycle and how you can integrate the rapid influx of social media data into your analysis.
Whichever formulation of the intelligence cycle you prefer, it involves some variation of planning, collection, processing, evaluation, analysis, and dissemination. It also involves the different intelligence collection disciplines that we list in Table 6.1. Assessing how social media data fits in with and compares to the other collection disciplines can help you appreciate the role it plays in illuminating events and their causes.
Social media data is often superior to other collection disciplines because it cuts across and amalgamates the other disciplines. Social media is a form of OSINT because it is derived from publicly available sources. OSINT is a neglected collection discipline, and many people ignore it in favor of data that is stamped “Secret” or “Top-Secret.” However, because of the explosion of data available on the Internet and social media, OSINT can prove very valuable. If used correctly and in the right cases, OSINT can even outperform intelligence that is not publicly available.2
Although social media is often categorized as a form of OSINT, it also has elements of the other collection disciplines and combines them in powerful ways. Collecting social media data entails, for example, intercepting communications from phones (COMINT), gathering geotagged reports from crowdmaps (GEOINT), collecting large amounts of imagery (IMINT) and data about Internet traffic (ELINT), and enabling collection from people through crowdsourcing (HUMINT). It also has its fair share of RUMINT, or rumor intelligence. Additionally, you can often collect social media data in real time, which you often cannot do with other forms of OSINT or the other INTs. The example in the subsequent section illustrates the utility of social media data as an intelligence collection discipline and the advantages it can confer over other traditional disciplines.
Table 6.1 Intelligence Collection Disciplines

Collection Discipline | Description (interception and/or collection of…)
--- | ---
SIGINT (SIGnals INTelligence) | Electronic signals
COMINT (COMmunications INTelligence) | A subset of SIGINT, involving electronic signals between people that are directly used in communication
ELINT (ELectronic INTelligence) | A subset of SIGINT, involving electronic signals not directly used in communication
MASINT (Measurement And Signature INTelligence) | Emissive byproducts such as radiation or heat
IMINT (IMagery INTelligence) | Imagery via satellite and aerial photography
GEOINT (GEOspatial INTelligence) | Imagery and geospatial information that depicts physical features and geographically referenced activities
HUMINT (HUMan INTelligence) | Intelligence gathered via interpersonal contact as opposed to via technology
OSINT (Open Source INTelligence) | Publicly available information
During the writing of this book, a tragic security event illustrated how simply monitoring social media can help you anticipate and understand events in real time better than the other disciplines, including traditional OSINT. On September 11, 2012, the U.S. consulate in Benghazi came under attack while protests were taking place across the Middle East against a video produced in the U.S. that mocked Islam. Four Americans, including Ambassador Chris Stevens, subsequently died in the attack. Because of the attention drawn by the protests, and possibly the lateness of the events, many mainstream news outlets failed to fully address the scale of the attack on the consulate and the death of the ambassador until the following day, September 12. However, information about the attack, including pictures, started to appear on social media almost immediately.
The Americans were allegedly attacked at around 2200 local Benghazi time (GMT+2, or 1600 EST) on September 11. At 1945 EST, the Associated Press reported the death of one American in the Benghazi attack. Other news outlets communicated this fact but failed to grasp and communicate the full extent of the attack or how it came about. The next day at 0721 EST, President Obama announced the death of Ambassador Stevens in the attack.3 Five hours before the president's announcement, at 0200 EST, we had already found gruesome pictures of Ambassador Stevens' body on social media.4 The pictures were in fact hours old and had been posted soon after the attack. The pictures showed him being pulled from the building and carried by Libyans trying to save him, and showed close-ups of his injuries and face.
Meanwhile, reports quickly appeared stating that a group of Salafi Islamists had carried out the attacks. The reports came not from news media outlets but from known anti-American Twitter accounts.5 The involvement of the Salafis was initially denied, confirmed a week later, and remains under investigation as of this writing.6 News media outlets and others did not begin reporting on the full scale of the events in Benghazi until 0900 EST on September 12, after President Obama's press conference. However, information depicting the full scale of the attack and the death of Ambassador Stevens had already appeared on Twitter at 1800 EST on September 11, two hours after the attack.
Subsequent investigations revealed that the DoD operation centers around the region, aside from the one that retrieved the embassy personnel's bodies, were not fully aware of the attack until the day after. They were consuming information about the protests through the mainstream news outlets and were not monitoring social media for information about the attack. Even American military personnel in the region were thus working with information that was hours older than what was available on social media, and so had little sense of the reality on the ground. The centers also did not push out information about the attack until well after the event because of the formalities and rigor involved in traditional intelligence collection and dissemination. They had to collect information on the events, write reports on what was collected, submit the reports for release approval, and enter them into internal databases before releasing them to the community. By then, thanks to social media, we already knew who was being attacked and even had a clue about who may have carried out the attack.
Our intention is not to figure out exactly what happened when and who knew what when. The events are confusing and the politics surrounding the U.S. presidential campaigns were exacerbating the confusion. We only want to point out that simply monitoring and filtering social media data can help you figure out what is happening and anticipate hotspots and events. In this case, social media would not have helped anticipate a seemingly well-planned clandestine attack. However, it could have helped discover and validate essential facts earlier, mitigate confusion, and help with response.
Social media will not supplant traditional collection disciplines or overtake OSINT, but can augment them immensely. If you are sitting in an operational center that has the tools we discuss in this book deployed, you can use social media to help lift the fog. Social media data can serve as a first line of reporting for events that may otherwise go unnoticed and bring you closer to total information awareness. Traditional collecting disciplines and news outlets can then confirm the events. Through social media, you can gain a much quicker and fuller understanding of what is going on, and task other collection resources such as aerial vehicles and HUMINT assets to confirm or add to the information appearing on social media.
Of course, as we repeatedly mention throughout this book, you should not blindly trust anything you see on social media; hence, the need to always confirm using other intelligence sources. Fortunately, social media often regulates and validates itself. An example involving attacks in Somalia illustrates the intelligence validating and refining loop inherent in the social media environment.
In October 2011, the Kenyan government launched an incursion against Somalia's al-Shabaab forces, which it claimed were becoming a threat to its country. On September 28, 2012, Kenyan forces took part in an amphibious assault on Kismayo, an al-Shabaab stronghold in southern Somalia. During the assault, the Kenyans began tweeting from their official accounts, including @KDFInfo, that they had taken the city. Al-Shabaab immediately countered by tweeting through its official account, @HSMPress, that it still controlled the city and that the Kenyans had not even entered it yet. The Kenyans later “adjusted” their statements to reflect a more accurate reality and were more forthcoming about their actions in Somalia.7 During this episode, we saw two battles—one on the ground and one on social media. If you can watch and integrate information from both, you are much more likely to discern fact from fiction, and to do so at a much faster rate. You can then task your resources and organize responses more efficiently and effectively. You can also produce more robust and insightful intelligence analyses and forecasts. The following section focuses on how to use social media data to produce those analyses and forecasts.
Before you can start using tools to analyze social media data, and understand and forecast events, you need to have a firm grasp of what constitutes forecasting and the limitations it faces. Forecasting is the process of determining what will likely happen in the near future based on past experience and knowledge, and current trends of behavior. You cannot forecast an event if you do not understand it to some extent. To forecast, you typically complete the following steps:
Simply put, you identify factors that have indicated an event in the past, and then determine the presence of those factors, and thus the event, in the future. For a forecast to be useful, it should ideally state what, when, and where an event will occur, and the likelihood that the forecast is correct.
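The forecasting logic described above can be reduced to a small sketch: estimate how often an event followed each indicator in historical data, then score a new observation by the indicators present. All indicator names and records below are hypothetical, and a real model would use far more data and a proper statistical method.

```python
# A minimal, illustrative indicator-based forecast. The history records
# are hypothetical: each lists the indicators observed in a given week
# and whether the event (say, a riot) occurred the following week.
from collections import Counter

history = [
    ({"food_price_spike", "protest_chatter"}, True),
    ({"protest_chatter"}, True),
    ({"food_price_spike"}, False),
    ({"protest_chatter", "curfew_rumors"}, True),
    (set(), False),
    ({"curfew_rumors"}, False),
]

def indicator_rates(history):
    """Estimate P(event | indicator) by simple counting."""
    seen, hits = Counter(), Counter()
    for indicators, event in history:
        for ind in indicators:
            seen[ind] += 1
            if event:
                hits[ind] += 1
    return {ind: hits[ind] / seen[ind] for ind in seen}

def forecast(observed, rates):
    """Score a new observation: average historical hit rate of the
    indicators present (0.0 if none have been seen before)."""
    known = [rates[i] for i in observed if i in rates]
    return sum(known) / len(known) if known else 0.0

rates = indicator_rates(history)
likelihood = forecast({"protest_chatter", "curfew_rumors"}, rates)
print(f"Estimated likelihood: {likelihood:.2f}")  # 0.75
```

The output is the kind of hedged statement a forecast should make: a likelihood, not a certainty, tied explicitly to the indicators that produced it.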
Forecasting is different from predicting, which also involves determining what will happen in the future based on a priori knowledge and behavioral trends. The difference between a forecast and a prediction is somewhat subjective. To us, the main difference is that a prediction has details and certainty, whereas in most cases a forecast lacks some details and is humble about its certainty. The determination that the sun will rise tomorrow at exactly 5:45 A.M. in Washington D.C. is a prediction. The determination that there is a 70 percent chance a famine will occur in Somalia in the next two weeks is a forecast.
Do not believe anyone who claims they can predict security events. They may get a few things right here and there, but their accuracy will not be very impressive. Prediction concerning security events, issues, and behaviors is difficult for three reasons. One, black swans—or seemingly random and unlikely events that have immense influence—can happen at any time and significantly impact security. By definition, past information cannot tell you when and where black swans will likely occur because they are so rare and so little information is available about them, so you cannot determine their trend of occurrence and predict them. The assassination of John F. Kennedy was a black swan—no one expected it and it changed American history. The uprising in Tunisia in 2011 was also a black swan. A few expected small protests and popular disgruntlement due to the poor economy in the area, but no one expected the populace to overthrow their autocratic government so quickly. No trend or event in recent Middle East history gave an indication that such a thing could or would occur. Ironically, people involved with security issues tend to be most interested in anticipating the black swans. Two, we do not know enough about human behavior to predict it. Security events are a function of human behavior, which is often complicated, confusing, lacking constraints, and difficult to comprehend. Three, security events also involve hundreds of other factors and variables, which we also often do not know much about. Accurately predicting a terrorist attack against a shopping mall involves assessing information about the terrorists, their capabilities and weapons, their willingness to commit an attack, the security in the mall, the behavior of shoppers at the mall, the counter-terrorism activities of local law enforcement and other agencies, and much more. Too many factors exist to account for, and predicting with certainty requires accounting for at least some of them.
Although predicting security issues with certainty is nearly impossible, forecasting them with some accuracy is possible and useful. In many cases, you have enough data and constraints to develop a robust hypothesis about if and/or when something will happen. You also have enough computing power and tools to create analytical models that can use the data to forecast events. The Internet and social media have made more data available than ever before. Also, the popularity of quantitative analytics and the widespread availability of analytical tools and methods have encouraged and enabled people in all sorts of fields to come up with different forecasting models that are impressive in their accuracy. For example, Princeton's Sam Wang (http://election.princeton.edu) and the New York Times' Nate Silver (http://fivethirtyeight.blogs.nytimes.com) have built distinct models that forecast American election results with impressive accuracy. Lastly, the Internet has provided a voice to people who think differently or process different information and come up with more accurate forecasts. What may appear to be a black swan to most people is not so for such people. The economic crisis in 2008 appeared to be a black swan, but many people saw it coming and warned others about it.8 By listening to voices that are not in the mainstream concerning security issues and expanding your diet of information, you can also improve the chances that you are using appropriate information and ways of processing the information to come up with your forecasts.
Overall, forecasts are not always correct or precise, but they provide a good idea about what could happen and can spur preparations. In the security field, even a somewhat hazy forecast can save lives.
Because forecasting is a function of available data, some events are more appropriate for forecasting with social media and related data. Generally, the best events to understand and forecast are the ones where more people use social media to talk about or organize them. Currently, such events include protests and riots, natural and other major disasters, disease epidemics, and to a lesser extent, terrorist attacks and gang or drug violence. As social media use expands globally, the list of events will likely grow. As discussed in Chapter 2, social media use differs by country and region, and so people in different areas use social media to talk about and organize events differently. Thus, you will be more likely to forecast certain events in certain areas.
Forecasting involves the following distinct process regardless of the type of analysis you use:
Other considerations also may affect the type of analysis you employ. Keep in mind the following guidelines that differentiate good and useful forecasts from poor and useless ones:
The next few sections detail how to conduct each of the three types of analyses to uncover relationships between disparate objects or behaviors, identify and track security issues, and forecast security issues. Some of the examples and analyses are more appropriate for illustrating how to understand events and others are more appropriate for forecasting.
What a person says and how she says it can be very revealing at times. Understanding the overt and hidden meaning in linguistic content, whether it be written or spoken, can help you understand and anticipate numerous security events. Language and sentiment analysis is the process of analyzing linguistic content such as forum posts and text messages to reveal answers to a variety of questions, such as:
LSA, also known as natural language processing (NLP), is a burgeoning field that will only grow in strength as the Internet and social media make linguistic content more easily available for analysis. Numerous LSA tools exist and they differ in their accuracy, method of analysis, customizability, cost, usefulness, and output.
Some LSA tools tend to be more deductive. They analyze content based on assumptions about how people use language and output firm answers to your questions. Creators of the tools program them based on assumptions derived from a combination of past knowledge, experience, and scientific studies. Some such LSA tools are accurate and useful, but others fail to correctly analyze language and, thus, offer little benefit. LSA tools that detect sentiment are a popular example of tools that often fail. Sentiment detection tools analyze a piece of content, such as a person's status update on Facebook, and then try to determine the person's emotional state or intended emotional effect on the reader. In other words, the tools determine the emotional polarity of the content and the emotion the author of the content wishes to convey through it. Even more simply put, the tools tell you whether the author made a statement that most people would consider emotionally positive, negative, or neutral. For example, the statement “I dislike my neighbors and find them annoying” conveys negative emotion and sentiment. The statement “I really like the weather in October” conveys positive emotion and sentiment. The statement “I own a black guitar” conveys a lack of emotion and neutral sentiment. By using such a tool, you can discern the emotional state of lots of people at the same time on a social media platform, which can help you determine whether a positive or negative event is taking place.
Most people are naturally very good at detecting sentiment in the statements others make. They can instinctively discern whether the author of a statement is conveying positive, negative, or neutral sentiment by looking at the words in the statement, the author's tone, the context, the environment, and various other variables. Computers and LSA tools, however, are not as smart as people when it comes to figuring out sentiment. They tend to use theories and shortcuts that often fail because they cannot understand the nuances in language. Most sentiment detection tools focus only on the types of words in a statement to discern sentiment. The tools rely on the theory that some words are inherently emotionally charged and correlated with a statement's sentiment. For instance, people typically use the words “dislike,” “hate,” “angry,” and “annoying” when they want to convey negative emotion and sentiment. The tools assume that if a person uses such words, she is probably talking about something she sees negatively. However, that is not always the case. People often use such words sarcastically, atypically, or to describe what others have said. For example, most people would consider the statement “I hate it when I win!” to be sarcastic and to convey a positive emotion. Sentiment detection tools, however, would see that it contains the negatively charged word “hate” and classify it as conveying negative emotion. They fail to pick up on these nuances and so incorrectly classify statements as having negative sentiment when they really have positive or neutral sentiment.
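The word-list shortcut described above is easy to sketch. The following is a minimal, illustrative classifier with a tiny hypothetical lexicon (real tools use lexicons with thousands of entries), and it mishandles the sarcastic example for exactly the reason just explained:

```python
# Minimal sketch of lexicon-based sentiment detection. The word lists
# below are hypothetical and tiny; the failure mode, however, is the
# same one real word-counting tools exhibit.
NEGATIVE = {"dislike", "hate", "angry", "annoying"}
POSITIVE = {"like", "love", "great", "wonderful"}

def naive_sentiment(statement):
    """Classify by counting charged words; ties and no matches are neutral."""
    words = {w.strip(".,!?").lower() for w in statement.split()}
    pos = len(words & POSITIVE)
    neg = len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(naive_sentiment("I dislike my neighbors and find them annoying"))  # negative
print(naive_sentiment("I really like the weather in October"))           # positive
print(naive_sentiment("I own a black guitar"))                           # neutral
# The sarcastic statement is misclassified: the tool sees only "hate".
print(naive_sentiment("I hate it when I win!"))                          # negative
```

The last call shows the nuance problem: a human reads the statement as positive, but the word-counting approach labels it negative.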
Despite their flaws, some deductive LSA tools are built on correct and tested assumptions and so are useful and accurate. We explore one such tool later in this section.
Other LSA tools tend to be more inductive. They find patterns of correlation and regression in large amounts of content and let you make sense of the patterns to answer your questions in your own way. Such tools usually only output various statistics about the content they analyze, and never direct answers to your questions. They will tell you how many times a certain word appeared in content, what other words a word appeared near in a statement, and how such statistics differed over time. You then use the outputted statistics to detect changes in behaviors that help you formulate a theory and answer your question. We explore this method of LSA later in this section.
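A minimal sketch of this inductive approach follows, using hypothetical messages. Note that the code outputs only counts and co-occurrences; the analyst supplies the interpretation, here by noticing that “market” shifts from co-occurring with “calm” and “stable” to co-occurring with “crowd” and “police.”

```python
# Minimal sketch of inductive LSA: per-time-slice word counts and
# within-message co-occurrence pairs. The messages are hypothetical.
from collections import Counter
from itertools import combinations

messages = {
    "day1": ["the market is calm today", "prices at the market are stable"],
    "day2": ["crowd gathering near the market", "police near the market now"],
}

def word_stats(texts):
    """Return word frequencies and co-occurrence counts for a time slice."""
    words, pairs = Counter(), Counter()
    for text in texts:
        tokens = text.lower().split()
        words.update(tokens)
        # Count each unordered pair of distinct words in the same message.
        pairs.update(combinations(sorted(set(tokens)), 2))
    return words, pairs

for day, texts in messages.items():
    words, pairs = word_stats(texts)
    print(day, words.most_common(3))
```

A real system would add stop-word removal, stemming, and much larger time slices, but the principle is the same: the tool surfaces statistics, and the analyst forms the theory.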
As you probably noticed by now, most LSA tools and methods are both deductive and inductive. Many LSA tools keep track of how accurate they are over time and refine their assumptions and theories based on data to become smarter and more accurate. Many people create deductive LSA tools based on theories they induce after looking at large amounts of content and conducting scientific studies on language. The difference between the two types of LSA tools is somewhat artificial, but thinking about them in this way can help you discern which tool is more appropriate to answer your question. Neither type of tool is necessarily more useful than the other, but one is often more applicable for a particular problem depending on the kind of data you have, the time and resources you have, the accuracy of the LSA tool, and the answer you want. We now explore two ways you can use LSA tools and methods to understand and forecast security events and related behavior.
We are often asked about tools that can help identify the author of a blog post, document, tweet, or text message. In this section, we teach you how to use such a tool to identify authorship.
Correctly identifying the author of a piece of content can prove very valuable for a number of reasons. Knowing the identity of the author of the content you are analyzing to understand an event or behavior will only improve your analysis. For example, say you are analyzing posts on an online forum that violent extremists frequent to determine whether the extremists are planning violence against a specific target. You know that the majority of the forum members have no idea about operational plans and are merely followers of the violent ideology. They post misinformation about attacks that will never take place. You label them as the followers. However, you also know that a few members actually take part in operations and know about the future attacks. They post information on the forum that provides hints about future attacks. You label them as the operators. You also possess documents such as letters written by some of the operators that do not appear on the forum. However, you do not know which members of the forum are the operators and which ones are the followers. Because you do not know the identity of the forum members, you do not know which members' posts you should analyze and which posts you should ignore.
Authorship identification tools can help you compare the letters you possess with the posts of the forum members to identify the operator members. They help you figure out which of the individuals who wrote the letters are writing certain forum posts. They rely on the well-tested assumption that people have a distinct way of writing. The distinctions emerge in how people use words, spell certain words, structure sentences, and through many other traits. By comparing the letters with the forum posts, the tools can help you determine whether the authors of the letters also wrote some of the forum posts, and if so, which forum posts they wrote. You can then determine which of the forum members are the operators and analyze only their posts for intelligence about future attacks.
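The underlying idea can be sketched with a simple stylometric profile: measure each author's relative use of common function words, then rank the known authors by distance to the unknown text. The texts below are hypothetical stand-ins, and real authorship tools use far richer feature sets.

```python
# Minimal stylometry sketch: function-word frequency profiles compared
# by Manhattan distance (lower = more similar). Texts are hypothetical.
from collections import Counter

FUNCTION_WORDS = ["the", "and", "of", "to", "that", "in", "we", "i"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    """Manhattan distance between two profiles; lower means more similar."""
    return sum(abs(a - b) for a, b in zip(p, q))

known = {
    "author_a": "we believe that the plan and the timing of the action ...",
    "author_b": "i went to the town and i saw that ...",
}
unknown = "we think that the route and the timing of the strike ..."

target = profile(unknown)
ranked = sorted(known, key=lambda name: distance(profile(known[name]), target))
print("Most likely author:", ranked[0])
```

With these toy texts, author_a's heavy use of “we,” “that,” and “the” matches the unknown text more closely, so author_a ranks first. In practice you would need far longer texts for the profiles to be meaningful.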
Numerous tools can help you determine authorship and, as you would expect, they differ in their features and abilities. In this section, we teach you how to use the Java Graphical Authorship Attribution Program (JGAAP), which is a free program available for download at http://www.jgaap.com. Throughout this section, we refer to JGAAP version 6.0.
The process of preparing for the analysis is fairly simple. Obviously, you first need to download and open the JGAAP program. Follow the instructions on the website to ensure you are opening the program correctly. Also make sure to download the program's user manual from the website, which shows you how to use the program with visuals and easy-to-understand instructions. Refer to the user manual as you go through this example.
Then you need to prepare all content for analysis. The content includes language whose authorship you do not know, known as the “test documents,” and language whose authorship you do know, known as the “training documents.” For our example, we will not use real social media content because of copyright issues, but instead use content and documents recovered from Osama Bin Laden's compound in Abbottabad and subsequently translated. The documents are available at Jihadica's website (http://www.jihadica.com/abbottabad-documents/). We also provide links to the exact documents we use on our website in Microsoft Word format. Feel free to use other content or follow along with the documents we use to make sure you are using JGAAP properly. We have had trouble using PDF documents with JGAAP, so you may need to convert PDF documents into Microsoft Word document format.
In our example, we intend to discover whether Osama Bin Laden or his deputy “Atiyya” Abd al-Rahman is the author of a recovered and translated letter written to Nasir al-Wuhayshi, the leader of Al Qaeda in the Arabian Peninsula. The letter in question is the test document. The training documents are numerous lengthy letters written by Bin Laden and Atiyya to each other and others. If you do not have documents of your own, download the documents we use from this book's companion website. Make sure to save the documents in an easily accessible place on your computer.
Complete the following steps to conduct the authorship analysis:
Reading the results is straightforward. The results indicate, in order: the test document, the canonicizers, the event driver, the analysis method, a list of the training documents starting with the author most likely to have written the test document, and the numerical score output by the analysis next to the name of each training document. To understand the results, you need to look both at which author appears first in the list and at the numerical scores. In the analysis for the event driver Sentences, the language in the test document was similar to the language in both training documents. Both are ranked 1 and both have a score of 1.0. The Sentences event driver is not that helpful to our analysis.
In the analysis for the event driver Word NGrams, the language and words in the test document were more similar to the language and words in the Bin Laden training document than in the Atiyya training document. Because we are measuring a form of distance between the test document and the training document, a lower score means the documents are closer and thus more similar. In our case, the Bin Laden training document had a score of 0.577 and a rank of 1, and the Atiyya training document had a score of 0.653 and a rank of 2. The Bin Laden training document was closer and more similar to the test document. This result suggests that, based on the analysis, Bin Laden was more likely to be the author of the test document than Atiyya.
You need to be aware of several caveats with this type of analysis. One, as we mentioned before, analyzing translated documents is not ideal. Translators can significantly change a document's language and structure. Two, the analysis and the results are not perfect. We recommend you run the analysis a few more times by selecting different combinations of event drivers and analysis methods. If you keep getting the same result, you can be more confident in the analysis. Three, you need a lot more content to do the analysis correctly. The more content the program can analyze, the more precise its answer is likely to be. This fact poses a problem when analyzing social media content. Blog posts, forum posts, tweets, and status updates are not always lengthy. You need to collect lots of them to do a meaningful analysis. Four, some social media content may not be as useful for the analysis, and you have to be careful that you are comparing similar types of language. For example, when tweeting, most people tend to use different types of words and grammar because of the 140-character limit that Twitter imposes. The way they write lengthy blog posts or letters is likely very different from how they write their tweets. You should compare tweets to tweets and blog posts to blog posts to make sure the analysis is consistent.
By taking these caveats into account, you can significantly improve your ability to do authorship identification analysis and understand certain security events and behavior. Knowing who wrote what or even who did not write what can help you make better sense of intelligence critical to your understanding of a related event.
The authorship analysis tool gave you a straightforward answer about the likely identity of the author, and so answered your question. Other LSA tools do not outright answer your questions in such a way. Instead, they help you explore the data so you can create and test your theory about possible answers to your question. With LSA tools and some ingenuity, you can analyze language on social media to track and forecast the behavior of crowds. In this section, we do not detail how to use specific software tools. Instead, we focus on how you can structure a solution to a problem by using any one of a variety of LSA tools.
As discussed before, many individuals and groups use social media to organize events involving complex crowd behavior. By analyzing the content these individuals are posting on social media, you can figure out how and where the crowds are moving. Imagine you need to track and forecast the behavior of violent rioters in a city. The rioters are using Twitter and other forms of social media to organize attacks against specific targets in the city and share information about law enforcement. As is often the case, a few individuals take the lead in directing the other rioters. They post information about where, when, and how other rioters should move. Through social network analysis you can identify these key individuals. You then simply need to track what they are saying to monitor the development of the riot. You can then create an early-warning system that gives you an idea of where the riot has been, where it is at the moment, and where it will likely be in the future.
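A minimal sketch of this kind of monitoring in Python, assuming you have already identified the key individuals through social network analysis (the handles and messages below are hypothetical):

```python
# Filter a stream of (author, tweet) pairs down to messages from
# key riot leaders. Handles and tweets are hypothetical examples.
KEY_LEADERS = {"@leader_a", "@leader_b"}

stream = [
    ("@leader_a", "everyone move to the bridge at 9"),
    ("@bystander", "traffic is terrible downtown"),
    ("@leader_b", "police staging near the market"),
]

leader_messages = [(author, text) for author, text in stream
                   if author in KEY_LEADERS]
for author, text in leader_messages:
    print(author, "->", text)
```

A real system would pull the stream from an API and update the leader list as the network analysis evolves; the filtering logic stays this simple.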
Although this simple form of language analysis will help, it will not serve as a thorough and complete solution for several reasons. One, the list of key individuals may be long, and you may not have the resources to actively monitor all their messages. Two, others who are not on your list of key individuals may also be helping direct the riot. Three, people who are not involved with the riot may be posting information about it that can help you get a sense of the riot's behavior and evolution. To take these factors into account, you need to use methodologies and tools to build a system that can comprehensively track the riot's past, present, and future behavior. You can use a wide variety of methodologies and tools in numerous ways. Some of those ways are very complex and require sophisticated expertise in crowd behavior, linguistic and semantic processing, data processing, and cognitive science. We present one simpler way that helps you understand the concept of LSA and get a head start on your own riot early-warning and tracking system. Research other ways and feel free to adapt and experiment with our way and those of others as you see fit.
Our way involves creating a system that rapidly analyzes enormous amounts of social media data to categorize relevant content, weight it according to its source and trustworthiness, and output early warnings and indications of the riot's past, present, and future behavior.
Preparing to build the comprehensive early-warning LSA system first entails figuring out what relevant data is available to you and then getting it. The relevant data obviously includes the social media content that the rioters and key riot leaders are generating. For the sake of this case, let us assume that the people involved are predominantly using public social media platforms such as Twitter, along with other platforms from which you cannot get data. In some cases, law enforcement can gain access to messages on private social media platforms, but we will not worry about that here. The relevant data also includes social media content that people who are not involved with the riot, such as witnesses, are posting about it on public social media platforms. This amount of data can be quite large. Consider the data from the rioters and the witnesses to be the test data on which you will test your tools and do your analysis. Yet another type of data you need to collect, and one that is not as obvious, is social media content from past riots. This third type of data will serve as the training data with which you can train your tools. Your system needs to collect the data about the riot in question, and you should assemble the data about past riots. Do not combine the two data sets. In Chapter 4, we covered how to collect, filter, and store the relevant data.
Apart from the data, you also need to acquire the appropriate LSA tools and integrate them into your system. You need tools that can help you analyze the frequency with which certain words and phrases appear in content, identify which words and phrases tend to appear together, and keep track of the source or author of particular content. These tasks are simple, and most LSA tools can easily complete them. We will not focus on which tools you should use because numerous ones exist. New versions of popular analysis programs such as SPSS tend to already come with LSA tools. The website KDNuggets (http://www.kdnuggets.com/software/text.html) features a list of popular LSA tools. Most tools on that website offer more features than you will need for this case. The Internet is full of free, open source LSA tools that provide all the features you need and that you can easily download, adapt, and use with minimal development skill. Apache's OpenNLP library (available at http://opennlp.apache.org) is one such collection of free and appropriate LSA tools. For now, do not worry about which LSA tool to use, but about how you would use that type of LSA tool.
Acquiring the test and training data and the right LSA tools is the easy part of the analysis. The most difficult and critical part is formulating the analytical methodology that the system uses to make sense of the data and output appropriate warnings and indications. Formulating the methodology will be a primarily deductive process, and we will refine it through an inductive process using the training data.
Essentially, the methodology will determine which words and phrases in social media content are important and to what extent they tell you about the rioters' past, present, and future behavior. To do so, the system needs to be able to categorize various words and phrases according to their context and what they tell you about the rioters' behavior. The system then must apply weights to the content according to how important it is to the ultimate analysis. To help the system do this, you need to create an ontology, which describes how linguistic content should be categorized and what its relationships are to other content and behavior. The system's algorithms, which follow the ontology, process data and output early warnings and indications of the riot's movements and actions.
To create an ontology and methodology, draw on previous experience and knowledge to lay out the theories and assumptions that will underlie the methodology. First, categorize social media content based on what it tells you about the rioters' behavior. According to our experience and knowledge, we surmise that certain textual content on social media can tell you whether people are providing facts about events that are happening or have happened. Such content falls under the ontological category of Facts and indicates past or present behavior. We can also tell whether an event is about to take place, which falls under the ontological category of Possibilities and indicates future behavior. For example, the tweet “Three guys looting store on 15th” would fall under the Facts category. The tweet “Attack on 3rd at 1500, then wait for flare” falls under the Possibilities category. The categories of the tweets are self-evident to most people because they can easily figure out the meaning and intent of a tweet by looking at certain characteristics of its language. For example, in the Facts tweet, we realize that the verb “looting” is in its present progressive form and indicates an action related to the riots that is happening right now. The words “store on 15th” indicate the target and its location. The same type of quick linguistic analysis applies to the Possibilities tweet. The verb “Attack” indicates a violent action that is about to take place, and the words “3rd at 1500” indicate a location and future time concerning the action.
You need to train, or program algorithms into, the LSA tools that make up your system so that they too can quickly analyze tweets and other social media content and put them in the Facts and Possibilities categories. Content that relates to neither, such as “I'm sick of campaign commercials, make them end already,” will be filtered out. Programming or training the LSA tools to filter out irrelevant content and categorize relevant content into Facts and Possibilities is much easier said than done. We address how to train tools to do this and other complicated tasks later.
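To make the idea concrete, here is a minimal sketch of such a categorizer in Python. The cue lists are our own illustrative assumptions; a real system would derive far richer rules and weights from training data:

```python
# Minimal rule-based categorizer: assigns a tweet to the Facts or
# Possibilities category, or returns None so it can be filtered out.
# The cue lists below are illustrative assumptions only.
FACT_CUES = {"looting", "burning", "on fire"}          # ongoing/past action
POSSIBILITY_CUES = {"attack on", "move west", "wait for", "tomorrow"}  # future action

def categorize(tweet):
    """Return 'Facts', 'Possibilities', or None (irrelevant)."""
    text = tweet.lower()
    if any(cue in text for cue in POSSIBILITY_CUES):
        return "Possibilities"
    if any(cue in text for cue in FACT_CUES):
        return "Facts"
    return None  # filtered out as irrelevant

print(categorize("Three guys looting store on 15th"))            # Facts
print(categorize("Attack on 3rd at 1500, then wait for flare"))  # Possibilities
print(categorize("I'm sick of campaign commercials, make them end already"))  # None
```

Real tweets are far messier than these examples, which is exactly why the training step discussed later matters.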
Another part of the methodology is realizing that not all related content is equally useful to you. We can surmise from experience that, for instance, a Possibilities tweet that comes from a key riot leader will have more influence on the behavior of the riot than a Possibilities tweet that comes from someone who is not at all involved with the riot. Also, a Facts tweet describing an event that is corroborated and validated by other Facts tweets is more trustworthy than a Facts tweet that addresses an event that no other Facts tweet does. Based on these considerations, you need to weight the content you are analyzing. If you are not familiar with the concept of weights, think of it as boosting the importance of certain content over other content. You and the system then pay more attention to what the boosted content says and give it more weight when making your decision. The weights of the content will then influence the outputs, which are the early warnings and indications of the riot behavior. You can and should analyze other parts of the content in other ways, such as figuring out location from language. For the sake of time and understanding, we stop here for now.
The system should look at the content in each category, assign weights to it, and then use that information to produce outputs. For example, it can look at the tweet “Consulate building on fire” and put it in the Facts category. It can then output to a crowdmap, indicating the location (the consulate) and the action taking place (on fire). This helps you track the behavior of the rioters. If the tweet contains temporal information, you can also output it to a timeline so you can track the past versus present behavior of the rioters. Consider another example—the tweet “in 2 hours move west.” The system puts the tweet in the Possibilities category and outputs a text warning to you indicating the action about to take place (riot moving west) and the time it will take place (in 2 hours). You then have an idea of how the riot may move in the near future. The system may also come across another tweet in the Possibilities category, but not output it to you because the system realizes that it came from someone who is not a key riot leader. To review, our methodology involves categorizing relevant content as Facts or Possibilities, filtering out irrelevant content, weighting content according to its source and corroboration, and outputting early warnings and indications of the riot's past, present, and future behavior.
You should now have an understanding of how even simply analyzing language and assigning it appropriate weights can provide you with important intelligence. However, to complete the system, you still need to program and train the LSA tools.
As mentioned before, numerous commercial and open source LSA tools can easily do the tasks required of them. Their capability rests on your ability to program and train them properly. By programming and training them, we mean teaching the LSA tools to follow specific algorithms so they know how to process the data and output the results you need. Unless you have a programming background, you will need a developer to help you program and train the LSA tools.
Programming an LSA tool is a two-step process (we refer to a single LSA tool, but in reality your LSA tool may consist of a combination of numerous tools). The first step is to use your own knowledge and that of others to tell the LSA tool what content it should focus on and where that content fits in the ontology. Part of that is telling it which sources to focus on, such as the list of key riot leaders and related hashtags. The other, more significant part is telling it which language is the most revealing about the riots. In the riot scenario this language may include words and phrases that describe riotous action, the organization of logistics, times, dates, locations, and names of major landmarks. Many LSA tools and repositories contain lists that conveniently categorize words and phrases by the type of behavior or event they describe. You then need to program the LSA tool to correlate certain words and phrases that describe riots with the Facts and Possibilities categories. You literally tell the LSA tool that if it comes across the word “attack” next to the word “tomorrow,” it should categorize the content containing the words as Possibilities. You of course do not need to spell out all the combinations exactly, because LSA tools usually come with options to quickly correlate lots of words together.
You also need to program the LSA tool to look at the sources of the content and apply appropriate weight depending on the source. Looking at the source can also help the LSA tool categorize the content and verify whether it really is related to the riot or is irrelevant. For example, the LSA tool may pick up and analyze the tweet “the #cloudatlas movie is going to #blowup tomorrow when it comes out” because it is looking at all content with “#blowup.” The LSA tool can also download information about the source to see if the author is on the list of key riot leaders, is tweeting other information related to the riots, is following or followed by other people involved in the riot, or is even located in the city where the riot is happening. After taking these factors into account, the LSA tool can realize that the tweet is not relevant to the riots and discard it. In some cases, it may have trouble deciding. You can program the LSA tool to give scores to content based on how many of the criteria it meets (for example, its source, how many relevant words it contains, and so on) and then classify the content based on its score. In cases where the aggregate score for the content is not high enough to categorize it but not low enough to discard it, you can program the LSA tool to place it in a separate category called Ambiguous. You as the user can then go into the system, read through the content in the Ambiguous category, and help the LSA tool properly categorize it. Lastly, you need to program the LSA tool to see if content in the Facts category is being repeated and validated by others and then give that content more weight. This process is as simple as programming the LSA tool to scan how many times large phrases or sentences are repeated.
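The scoring scheme just described might be sketched as follows. The criteria, weights, and thresholds are all illustrative assumptions, as are the leader handles:

```python
# Sketch of criteria-based scoring: each criterion a piece of content
# meets adds to its score, and thresholds decide whether it is
# categorized, held as Ambiguous for human review, or discarded.
# All word lists, weights, and thresholds are illustrative.
RIOT_WORDS = {"attack", "looting", "flare", "move"}
KEY_LEADERS = {"@riotlead1", "@riotlead2"}   # hypothetical source list

def score(tweet, author, in_riot_city):
    s = 0
    words = [w.strip(",.") for w in tweet.lower().split()]
    s += 2 * sum(1 for w in words if w in RIOT_WORDS)  # relevant-word criterion
    if author in KEY_LEADERS:
        s += 3                                         # trusted-source criterion
    if in_riot_city:
        s += 1                                         # geolocation criterion
    return s

def classify(s, categorize_threshold=4, discard_threshold=2):
    if s >= categorize_threshold:
        return "categorize"
    if s >= discard_threshold:
        return "Ambiguous"   # held for human review
    return "discard"

s = score("move west and wait for flare", "@riotlead1", True)
print(s, classify(s))   # 8 categorize
```

The Ambiguous band between the two thresholds is where the analyst steps in, exactly as described above.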
The second step is to train the LSA tool to work effectively and learn from its mistakes. This step is significantly harder to do, but it is necessary if you want to create a system that can handle the messy and complex data found in the real world. You need to test the LSA tool against the training data about past riots whose movements you know about. Act as if the past riot is happening at the present time, and input the appropriate data into the system. The LSA tool will go through it and report on the past, present, and future behavior of the riot. You can then assess how well the LSA tool did. Most likely, the first time you test the LSA tool, it will not do very well. You need to then implement machine learning algorithms into the LSA tool so that it can learn from the training data and refine its programming. In simplistic terms, the LSA tool looks at what kinds of words and phrases actually came up during the past riot and actually indicated real riot movement. It then goes back into its own programming and tweaks it because it then has a better idea of which words and phrases it should look for and what weights it should assign to them. Explaining this process in full is well beyond the scope of this book, but you can easily find resources online that explain it.9 Over time, you should train the LSA tool on more training data so its efficacy constantly improves.
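The evaluate-and-refine cycle can be sketched with a toy example. Here a naive keyword model is scored against labeled tweets from a hypothetical past riot; after inspecting its misses, we add a keyword and re-evaluate, which is the kind of tweak a machine learning algorithm would make automatically:

```python
# Sketch of the train/evaluate cycle against labeled past-riot data.
# The labeled examples and the naive keyword model are illustrative.
training_data = [
    ("shops burning on main street", "Facts"),
    ("attack the depot tomorrow at dawn", "Possibilities"),
    ("crowd moving toward the square", "Facts"),
]

keywords = {"burning": "Facts", "attack": "Possibilities"}

def categorize(text):
    for word, category in keywords.items():
        if word in text.lower():
            return category
    return None

def accuracy(data):
    correct = sum(1 for text, label in data if categorize(text) == label)
    return correct / len(data)

print(accuracy(training_data))  # misses the third example

# "Training": inspect the errors and refine the keyword list.
keywords["moving"] = "Facts"
print(accuracy(training_data))  # all three now categorized correctly
```

A genuine machine learning approach would adjust thousands of such features and their weights at once, but the loop is the same: score against known outcomes, then refine.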
This section should have given you a good idea about what LSA is and how you can use it to make sense of social media data to understand and forecast security events. A better understanding and forecasting capability requires you to create and use more complex tools and algorithms. You will find that to use LSA effectively, you need to understand correlation and regression analysis.
A large number of variables influence any event in the real world, and it is usually impossible to account for all of them. However, a few are usually more influential than the others. Understanding their relationship with the event and how they appear in the future can help you forecast the appearance of the event in the future. Correlation and regression analysis (CRA) is one of the simplest ways to identify and understand the relationship between the most important factors and the event.
You likely are already familiar with correlations and, to a lesser extent, regressions. Humans are very good at intuitively doing correlations. We keep track of a number of things and behaviors in the real world and try to figure out how they influence an event of concern. This natural ability is the reason superstitions exist. For example, you notice that on days you wear red socks, the Boston Red Sox baseball team wins and when you do not wear them, they lose. You correlate the wearing of your red socks with the record of the Red Sox. You then take it a step further and think that somehow you wearing red socks influences whether the Red Sox win. You then always wear your red socks on the day that the Red Sox play. From a rational perspective, this type of thinking is very wrong because correlation does not imply causation. It may only be a coincidence that your wearing of red socks correlates with the Red Sox winning. Most likely, you ignore the days when the correlation did not occur.
Statistical CRA helps us determine whether such correlations are mere coincidences and whether they mean anything. Although you can do it relatively easily, you can also easily misinterpret its results. Another hard part about doing CRA is getting the right data. In lab settings, getting the data is not difficult. However, in the real world and when dealing with security events, you are not likely to get all the data you need. In this section, we address the concern of interpreting results correctly and humbly, and in the next section we address dealing with incomplete and frustrating amounts of data. We first show you how to do CRA through an example concerning the creation of a system that provides early warnings of famines using data collected through social media. The example involves made-up data that illustrates the process of doing CRA. In the next section, dealing with volumetric analysis, we use real-world data that shows that doing analyses for security events is not as easy as you may think.
Imagine you want to build a system that provides warnings about famine conditions (and food insecurity) in different parts of the world. The system will need to collect data, analyze it to forecast possible famine conditions, and then output warnings to you about the possible famine conditions. To provide early warnings about famines in any part of the world entails collecting and analyzing information concerning a wide variety of factors, ranging from rainfall amounts, to market prices, to the state of the infrastructure in the region, to the needs of the affected population. Your system will need to do a lot of things and do them well. Indeed, the United States Agency for International Development's Famine Early Warning Systems Network, known as FEWS NET, works in a similar fashion (http://www.fews.net). This example is largely inspired by FEWS NET.
For now, we will focus on only part of the system and try to answer whether you can forecast famine conditions using only data about the price of grains in a region and the health status of the people in the region. Based on experience and knowledge, you deduce that the price and health status likely have a relationship to the famine. Rising grain prices indicate a lack of food and make it less likely that people can buy food, which in turn increases the level of famine. A population's health status (whether people are suffering through disease or not) obviously deteriorates when people are suffering through a major crisis like a famine. The analysis helps determine whether your deductions are correct. Social media plays a role in that it enables you to collect data about the grain prices and health status in near real time.
Assume you have access to a social media platform that collects grain price and population health data from a region in Africa. The social media platform is SMS based and regularly queries participants for information. You also have data from a non-profit organization that indicates the level of famine that a region is suffering at any time. This famine indicator is calculated as a function of many variables, including the level of malnutrition and mortality. Note that the non-profit uses other health data, collected through in-field surveys, to calculate the famine indicator level. The health status data you use is collected through other means and so is different. In this case, the grain price and population health status are independent variables and the famine indicator is the dependent variable. You need to figure out whether the independent variables affect the dependent variable, how, and to what extent.
To uncover the relationships between the variables, first conduct correlation analysis to determine whether there is some sort of associative relationship between them. If the independent variables do not correlate with the dependent variable in any way, they likely do not affect it. Also, if they do not correlate, it is unlikely that some intermediate or unknown variable is affecting both of them.
To conduct the analyses in this section, you need basic statistical software. We describe the process by using Microsoft Excel to keep things simple, but you can use whatever software you want. We assume you are familiar with Microsoft Excel and know how to create basic line charts and use formulas to calculate things. If not, you can easily find numerous resources and videos online that show you how to use Excel. To do only correlations, you can also use the free web-based Google Correlate (available at http://www.google.com/trends/correlate).
The data file for the first example contains four columns. Column A, labeled Date, indicates the dates for which the data was collected. Each date point given corresponds to a week. In other words, assume the data is collected at the beginning of each week. Column B, labeled Health, contains data collected about the health status of people in the region. The value of the Health variable ranges from 0 to 10, where 0 indicates very poor health (suffering from lots of dangerous conditions and diseases) and 10 indicates good health. Column C, labeled Price, contains data collected about the price of grain in the region. The value of the Price variable is a numerical price that is equal to or greater than 0 (the currency is irrelevant). Column D, labeled Famine, contains data collected about the level of famine in the region. The value of the Famine variable ranges from 0 to 10, where 0 indicates very low levels of famine and 10 indicates very high levels. Twenty data points are given for each column. The amount of data is very small, but it serves instructive purposes. You need to determine whether the independent variables correlate with the dependent variable and to what extent. With your statistical software at hand, complete the following steps to conduct the analysis:
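The worksheet steps aside, the same correlation coefficient (the equivalent of Excel's CORREL function) can be computed directly in Python. The data points below are illustrative stand-ins for the spreadsheet columns, not the example data file itself:

```python
# Pearson correlation coefficient, the equivalent of Excel's CORREL.
# The two series below are illustrative stand-ins for the Price and
# Famine columns of the example data file.
from math import sqrt

def correl(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

price = [10, 12, 15, 20, 26, 33]   # made-up weekly grain prices
famine = [1, 2, 2, 4, 5, 9]        # made-up famine levels, 0-10

print(round(correl(price, famine), 2))   # 0.98, a strong positive correlation
```

A coefficient near +1 or -1 indicates a strong association; a value near 0, as you would expect for the Health column in this scenario, indicates little or none.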
You know that there is an associative relationship between the Price and Famine variables and have some idea of how they have associated so far. You also know there is no relationship between the Health and Famine variables. This finding is surprising because the health status of a person suffering through a famine usually deteriorates. The finding suggests that either your initial deduction was partially incorrect or that you collected faulty data. Most likely, the population did not understand exactly what you meant when you queried them for health status data, or they happened to be the lucky ones who survived the famine. This highlights the need to collect data from a large population and to compare different data sets.
You now need to use regression analysis to establish how well the Price variable can help you forecast the Famine variable in the future. In geek-speak, you need to identify the predictive or forecasting value of the independent variable.
In this case, we use a priori knowledge to determine that there likely is a causal and predictive relationship between prices and famine level. Spikes in prices can noticeably increase the level of famine. By knowing how much the prices contribute to the level of famine, you can figure out how future spikes or drops in prices will affect the future levels of famine in the region. Complete the following steps to find out this information:
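The underlying least-squares computation can also be sketched in a few lines of Python, separate from the spreadsheet steps. The data points are illustrative stand-ins for the example columns, and the price of 35 is a hypothetical future value:

```python
# Simple least-squares regression: fit famine = a + b * price, then
# use the fitted line to forecast the famine level at a hypothetical
# future price. The data points are illustrative stand-ins.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

price = [10, 12, 15, 20, 26, 33]   # made-up weekly grain prices
famine = [1, 2, 2, 4, 5, 9]        # made-up famine levels, 0-10

a, b = fit_line(price, famine)
forecast = a + b * 35              # forecast at a hypothetical price of 35
print(round(forecast, 1))          # 8.9
```

The positive slope b quantifies how much each unit increase in price contributes to the famine level, which is exactly the predictive relationship the regression is meant to establish.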
You have now completed the regression analysis and can use the equation given in step 9 as a way of forecasting how changes in price will affect changes in the famine level. We only went over a simple way of doing CRA. You can use a variety of statistical tools and methods to do CRA in more complex and, in some cases, more accurate ways. No one way is necessarily better than the others—it all depends on context and the nature of the variables. Scour the Internet for more information on multiple and multivariate regressions to learn how to do more advanced CRA. Also, research more ways of testing the significance of your findings, such as t-tests. Lastly, compare different data sets and sources to ensure your results did not come about because of faulty data. The processes are basically the same, and once you understand how to do simple CRA, you can easily learn how to do the more advanced analyses.
Separating volumetric analysis (VA) from CRA is somewhat artificial. The tools and methodologies you use to do VA are the same as you would use to do CRA. The difference lies in the type of data you analyze. VA entails finding associative relationships and the strength of causal relationships in data dealing with volume of traffic. Traffic on the Internet includes the number of times someone searches a word on Twitter and the number of hits to a website. Traffic in the offline world includes the number of shipments between two countries and the number of transactions between two bank accounts. When doing VA, the focus is not on the nuances and make-up of the event or behavior. Instead, it is on how many times the event or behavior occurred and how such volume data corresponds to other events. Contrast that with LSA, where the nuances and make-up of the data and the text matter. In practice, when doing VA you will often need to use other types of data as well. Still, we separate VA out to highlight the importance of traffic volume data and how it can hold amazing insights in cases where you cannot get access to all the data you want.
In this section, we expand on the example given in the CRA section. We now consider the fact that in the real world, you do not always have access to all the data you want. Additionally, the data you do get is usually incomplete or incompatible with other data sets. We consider how you can use traffic volume data, which in many cases is easier to get than complete data sets. In the following example, we continue to consider ways you can create tools to provide early warnings for famine. This time, we use real-world data, including Twitter data.
Imagine you are tasked with creating another tool for your famine early-warning system that is distinct from the one we discussed in the CRA section. This new tool should complement your other tools and work in conjunction with them to provide more robust early warnings that take into account a number of factors. This tool should analyze different types of data and see if other behaviors can help you anticipate the onset and intensity of famine in a region.
One possibly useful data source is Twitter. Usually, when a major event happens, people take to Twitter to either report on it, as they did with the Benghazi consulate attack, or voice their opinion about it. Examining what people say on Twitter, when they say it, and how many people say it can tell you a lot about the event that is taking place. This view, however, does not always hold true, especially when dealing with events taking place in parts of the world that are not fully developed. Twitter is expanding into Latin America, Asia, and Africa aggressively and will likely be ubiquitous very soon. However, such widespread use of Twitter has not yet occurred. Information about events occurring in less developed areas such as Somalia may not appear on Twitter right away. You need to test whether you can use Twitter as a data source to analyze and provide early warnings about famines and food insecurity.
To test the viability of using Twitter as a valid data source, you need to examine how it has fared recently in anticipating famine and food insecurity events. The Somalia famine in mid-2011 serves as a good case study. It occurred relatively recently, and you can easily find data about it. Overall, you need to assess whether information traffic related to the famine appeared on Twitter when famine conditions started to take place in Somalia, and whether the levels of the traffic matched the intensity of the famine. If they did not, you need to determine the time lag between famine conditions taking place and information about them appearing on Twitter. If all else fails, you also need to find another data source that is more useful than Twitter.
We spent some time searching the Internet for relevant data. Apart from using the data sources we discussed in Chapter 4, we simply Googled the terms “data,” “Somalia,” and “famine” together and sifted through the links. We found some data that was useful and much that was incomplete, incorrect, or incompatible with other data sets.
First, you obviously need to find the Twitter data. The British newspaper the Guardian has exactly the data you need on its website (the original raw data file is available on our website for download as a spreadsheet).10 The Twitter data consists of the number of tweets per day over a 19-month period that mentioned “Somalia” and either “famine,” “drought,” or “food.” The Guardian also provides data about the number of mentions in newspapers and the total amount of aid flows by date.
You also need data to analyze the Twitter data against. This data should indicate the level of famine over the time period that the famine took place. We found some United Nations data on malnutrition and other famine indicators for Somalia, but it is not broken down by clear time periods and so is not as useful for analysis. Other data needs to serve as the famine level indicator. The market price of red sorghum in Marka in southern Somalia may do the trick. Red sorghum is a grain that is widely grown and consumed in Somalia. As discussed before, drastic increases in the market price of grains such as red sorghum can indicate food insecurity and famine conditions. It is not perfect, but a casual look at the data and comparisons to United Nations reports documenting the famine show that it indeed is a good indicator of food insecurity and famine conditions. The market price data is made available by the Food Security and Nutrition Analysis Unit (http://www.fsnau.org) and is broken up by month over several years.
Determining a correlation between the market prices and the Twitter mentions can help determine the utility of using Twitter data to anticipate famine conditions in Somalia. The correlation analysis will tell you not only whether they are correlated, but also by how much the Twitter data tends to lag behind the appearance of famine conditions.
First, you need to collapse the Twitter data into monthly periods. When doing CRA, you should compare apples to apples. If one data set gives statistics by month, the other data set should also give statistics by month. You also need to make sure you are comparing months for which data exists in both sets. Data for both price and Twitter mentions exists from August 2010 to January 2012. Then you draw two time series line charts, as depicted in Figure 6.4.
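Collapsing the daily counts into monthly totals is straightforward; the dates and counts below are illustrative, not the Guardian's actual figures:

```python
# Aggregate daily tweet counts into monthly totals so the Twitter
# series can be compared against the monthly sorghum price series.
# The dates and counts are illustrative, not the Guardian's figures.
from collections import defaultdict

daily_mentions = [
    ("2011-06-28", 40), ("2011-06-29", 55),
    ("2011-07-01", 120), ("2011-07-02", 210),
]

monthly = defaultdict(int)
for date, count in daily_mentions:
    month = date[:7]          # keep the "YYYY-MM" part of the date
    monthly[month] += count

print(dict(monthly))          # {'2011-06': 95, '2011-07': 330}
```

With both series on a monthly basis, you can chart and correlate them directly.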
The line charts indicate that there is indeed a correlation between price and mentions, but the time lag is significant. When the price spikes, the number of mentions also goes up. However, it takes a long time for the change in price to be reflected in the mentions. Notice that the price of sorghum started to rise in August 2010 at a fairly high and consistent rate until it reached a peak in May 2011. The price then started to fall after July 2011. The United Nations' assessments of the famine match this trend: it began to see signs of the famine in mid-to-late 2010 and officially declared a famine in mid-2011.10
The number of Twitter mentions of the Somalia famine remained low until June 2011. Mentions then started to spike, reaching a peak in September 2011, almost two months after the peak of the price surge and famine conditions. A closer look at the mentions data indicates that the number of mentions did start to climb slightly after September 2010. However, the numbers are very small and do not reach a significant level until June-July 2011. The correlation coefficient backs up this finding: it is only 0.39, which is not very high.
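Computing the correlation coefficient yourself is straightforward. The sketch below implements the standard Pearson correlation in plain Python, plus a small helper for checking the correlation after shifting one series by a given number of months, which is one way to measure how far the mentions lag the prices. The numbers in the two series are invented for illustration; they are not the real Guardian or FSNAU figures, so the result will not be 0.39.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def lagged_corr(xs, ys, lag):
    """Correlate xs[t] with ys[t + lag]: how well ys trails xs by `lag` periods."""
    if lag == 0:
        return pearson(xs, ys)
    return pearson(xs[:-lag], ys[lag:])

# Hypothetical monthly series for illustration only:
prices = [9000, 11000, 14000, 16000, 15000, 12000]   # sorghum price per unit
mentions = [5, 8, 12, 300, 900, 1400]                # famine-related tweets

r = pearson(prices, mentions)
print(f"correlation at lag 0: {r:.2f}")
for lag in (1, 2):
    print(f"correlation at lag {lag}: {lagged_corr(prices, mentions, lag):.2f}")
```

If the correlation rises sharply at a lag of two months, that supports the visual finding from Figure 6.4 that mentions trail famine conditions by roughly two months.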
Based on these findings, you can conclude that although Twitter mentions of the famine did go up during the famine, they are not very useful for providing early warning of famine and food insecurity. People tweeted about the famine when they heard about it from newspapers and traditional media; they were responding to the news stories. By then, however, the famine was already well under way. You should conduct the same type of analysis on other famine cases to see whether your conclusion about the limited utility of Twitter mentions as an early warning holds up.
You can also go back to the data provided by the Guardian to see if you can use other data sources as indicators of famine conditions. A casual look at the aid flow data suggests that it may actually correlate closely with the emergence of famine conditions. We leave the analysis to you.
We have touched on only a few ways you can use simple statistical tools to analyze social media data to understand and forecast security events. Entire books could be written about any one of them. However, we hope you now see how social media data can help you, and we encourage you to explore other approaches yourself. We now move on to Chapter 7, where we begin discussing how you can use social media in other ways, namely to crowdsource.
1. Directorate of Intelligence (2012) “Intelligence Cycle.” U.S. Federal Bureau of Investigation. Accessed: 11 November 2012. http://www.fbi.gov/about-us/intelligence/intelligence-cycle; (2012) “How Intelligence Works” intelligence.gov. Accessed: 14 October 2012. http://intelligence.gov/about-the-intelligence-community/how-intelligence-works/
2. Scheuer, M. (2007) Through Our Enemies' Eyes. Potomac Books, Dulles.
3. Kelly, S. (2012) “Intelligence Official Offers New Timeline for Benghazi Attack.” CNN. Accessed: 11 November 2012. http://security.blogs.cnn.com/2012/11/01/intelligence-official-offers-new-timeline-for-benghazi-attack/; (2012) “Timeline on Libya and Egypt: Attacks and Response.” Washington Post. Accessed: 11 November 2012. http://www.washingtonpost.com/politics/decision2012/timeline-on-libya-and-egypt-attacks-and-response/2012/09/12/85288638-fd03-11e1-a31e-804fccb658f9_story_1.html
4. AKNN News (2012) “Timeline Photos.” Facebook. Accessed: 11 November 2012. https://www.facebook.com/photo.php?fbid=364777023602502&set=a.246675052079367.59435.246612585418947&type=1&theater; @zaidbenjamin (2012) Twitter. “A picture believed to be for…” Accessed: 11 November 2012. https://twitter.com/zaidbenjamin/status/245808824324333568/photo/1
5. @syriancommando (2012) Twitter. “American Embassy worker in…” Accessed: 11 November 2012. https://twitter.com/syriancommando/status/245753481649078272/photo/1
6. Kiely, E. (2012) “Benghazi Timeline.” Factcheck.org. Accessed: 11 November 2012. http://factcheck.org/2012/10/benghazi-timeline/
7. Check tweets of @Sefu_Africa on 28 September 2012, @kdfifno from 28 September 2012 to 31 October 2012, and @HSMPress on 28 September 2012.
8. Lewis, M. (2011) The Big Short: Inside the Doomsday Machine. W.W. Norton, New York.
9. Hastie, T., Tibshirani, R., and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
10. Provost, C., et al. (2012) “Somalia Famine: How the World Responded - Interactive.” The Guardian. Accessed: 11 November 2012. http://www.guardian.co.uk/global-development/interactive/2012/feb/22/somalia-famine-aid-media-interactive