Chapter 3

Introduction to Social Media Analytics

Analyzing social media data improves operational planning and execution. It can help you understand social networks and what they are discussing, identify key people and relationships, and understand and forecast events. However, analysis offers benefits only if it is conducted accurately and honestly. This chapter lays the foundation for learning social media analytics by defining what analysis is, what it is not, and how it can help, providing an analysis overview, and introducing the analytical methodologies this book covers.

Defining Analysis

Conducting any type of analysis requires knowing what it is and how it differs from other ways of making sense of the world. Analysis is powerful, but it has limitations and can easily fall prey to corrupt practices. Only by understanding the powers and limitations of analysis can you start to learn how to analyze social media to solve various problem sets. The subsequent sections may seem academic and dry at times, but unless you are already comfortable with quantitative analysis, you should go through them. They will help you analyze social media and other types of data, and appreciate and critique the analyses of others.

What Is Analysis

Analysis is the systematic study of relevant data to gain insights about a topic. Analysis involves applying a variety of objective methodologies to evidence to find answers to specific problems. Each methodology is a series of steps, derived from past analytical studies and theories, that will likely lead to accurate answers if applied correctly to data. This part of the book is dedicated to learning about the different methodologies and the proper ways of applying them.

Analytical methodologies can be applied to virtually any problem set, but we are focused on solving a few security-related sets. Table 3.1 shows the relevant problem sets, their descriptions, and specific examples. Occasionally we also touch on how social media analysis can help solve other problem sets.

Table 3.1 Relevant Problems for Analysis

Problem Set | Description | Specific Problem Examples
Understand the structure of social networks | Track the development of relationships between people online and offline, and understand how people use social media to maintain online and offline relationships. | To what extent and why are human traffickers using social media to communicate with each other?
Identify key people and relationships | Determine who in social networks wields influence over people in the networks, and has the ability to affect their behavior and relationships. | Which online violent extremist recruiters are the most effective at recruiting at-risk youth?
Determine the proliferation of ideas in networks | Understand which topics and ideas individuals and groups are discussing and sharing. | What type of violent extremist literature, rhetoric, and ideals disseminate through online social media?
Understand and forecast behavior | Understand the relationship between behavior, environmental constraints, and discussions and networks on social media. Also, use the understanding and real-time data to determine how likely individuals or groups are to undertake a specific behavior in the future. | How are gangs using social media to inflame tensions with rival gangs, and is their use changing?
Understand and forecast events | Understand the relationship between events, environmental constraints, and discussions and networks on social media. Also, use the understanding and real-time data to determine the likelihood of specific events occurring in the future. | What is the likelihood that there will be a famine?

Cross-Reference
Chapter 5 details using specific analytical methodologies to study social network structures, identify key people and relationships, and determine the ideas and topics of discussions of the networks. Chapter 6 details using methodologies to understand and forecast behavior and events.

The outputs of the analyses are concrete answers to sample problems. They are the Twitter handles of the most influential people in networks, the probability that violence will break out in a certain area, the name of the social media platform that played the most pivotal role in helping rioters organize, and much more.

Limits of Analysis

As you read this chapter, keep in mind that analysis has limits. If used correctly, it offers tremendous insight into the most complicated subjects. However, analysis rarely results in certain predictions, precise rules describing human behavior regardless of environment, or perfect understanding. Analytical methodologies are tools that help you discover, describe, and forecast human behavior in the context of security. Due to the complexity of human behavior and the incomplete nature of the data in question, the relevant analytical methodologies are somewhat flawed and imprecise. The best way to push against the limits of analysis is to adopt analytical tools from other unrelated fields and to be intellectually honest at all times.


Note
We are not teaching all the nuances and complexities involved with conducting analytical studies. Our focus is on preparing the operator to quickly understand complex events and behavior, not to publish research studies in academic journals. We are knowingly sacrificing some academic rigor for practicality, but not enough to tarnish the accuracy and honesty of the analysis.

What Is Not Analysis

Analysis comes naturally to humans as the basis of critical thought, but it is often riddled with cognitive fallacies and pitfalls that lead to incomplete or incorrect answers. Also, a meaningful portion of the security and defense world is staffed with people with no formal training in applying objective analytics to complex problems. What often passes for “analysis” is the subjective opinion of a biased individual with little real-world experience and/or no grounding in the scientific method. The process for such “analysis” is usually simplistic, and consists of the following steps:

1. Based on unexamined biases or your “gut,” determine the point you want to prove or the prediction you are certain will come true.
2. Pick out data that supports your starting point.
3. Discredit conflicting data or, even better, simply ignore it.
4. Write a mostly qualitative piece full of anecdotes and quotes from self-proclaimed “experts” supporting your point.
5. Realize that you need some quantitative material, and insert the result of a survey you conducted. The survey usually involves asking three people what they think about a topic they do not know much about.
6. Focus a lot on formatting your “analysis” and ensure it contains lots of shiny graphics.

This faulty analytical process has resulted in misleading predictions about major foreign policy events and wastes of taxpayer money. The faulty analytical process is especially harmful when it comes to the business of forecasting and discovering the causal effects that produce security events. The United States government and other organizations often fund “analysts” who claim to forecast events but have no idea how to conduct analysis and routinely engage in the analysis don'ts, as described later. If you are native to the foreign policy or defense world, learn to disregard the “analysis” of pundits and so-called experts. Ignore those who proudly call themselves experts or claim they can “predict” something with 100 percent certainty. “Experts” are more likely than randomly generated results to be incorrect about major security and foreign policy issues. Chimpanzees randomly throwing darts at a board plastered with predictions are almost as likely to choose the right answer as “experts.”1 Unless you have a strong interest or background in quantitative methods, statistics, or science, forget what you know about “analysis.”

Analysis Overview

The overall analytical processes and methodologies described herein are similar to what is taught in sophisticated political science and social science courses. One process uses a set of methodologies to create theories that aid understanding of security events and relevant behavior, and the second process uses another set of methodologies to apply those and other theories to other cases. Doing so will help us answer the types of problem sets described in Table 3.1. Specifically, the theories we develop and use will help describe rules of individual and group behavior in relevant contexts, determine the probability that a future event or action will take place, and identify relationships and causal effects between people, objects, and environments.

Unlike the “hard sciences” such as physics and chemistry, few hard-and-fast universal laws or strong theories exist in political and social science. We are primarily studying humans, not large static objects (although a lot of humans are increasingly behaving as large, static objects). The complexity of our subject makes determining laws for individuals and groups very difficult. Compared to the axiomatic laws of physics that describe the behavior of objects, determining the theories that describe humans and their “rational behavior” is incredibly complex and requires accounting for a dizzying array of factors.

Therefore, focus more on the data and allow the data to tell you about how humans are behaving in specific contexts. Then use what the data has said to help understand how humans are behaving in other similar contexts. This process or type of reasoning is known as inductive reasoning. In some cases, use existing proven and tested theories about how humans or other similar beings and objects behave individually and in groups. This process or type of reasoning is known as deductive reasoning. Of the analytical methodologies we describe, some are primarily inductive and others are primarily deductive, but all involve both types of reasoning. In reality, the two processes are often mixed and intertwined. We separate the two to aid understanding.

The following sections detail the overall analysis process as follows. First, we describe the preliminary procedure to formulate the problem you are trying to solve. Second, we explore each of the two processes or types of reasoning and the method of choosing the appropriate one for the problem. Third, we list the analysis dos and don'ts. Fourth, we introduce several methodologies and describe when they are most useful.

Preliminary Procedure

The preliminary procedure helps lay the foundation of the analysis and mitigates the likelihood you will waste time and resources later. The procedure consists of four steps that we describe next and also summarize in Figure 3.1.

The very first step when conducting analysis involves figuring out what you want to analyze and why; in other words, formulating a problem of interest. In most cases involving security operations and analysis, formulating your problem of interest is not difficult. Based on your mission and role, a third party, commander, or boss has probably tasked you with discovering more about something or someone, forecasting what they might do in the future, or both. See Table 3.1 for sample problems.

Figure 3.1 The preliminary procedure for analysis

The key is ensuring that your problem of interest is narrow and specific enough to generate an analysis that is feasible and results that are accurate. The following guidelines will help you narrow the problem:

  • Examine the behavior of only specific individuals and/or groups. The fewer individuals or groups examined, the better. For example, consider that you are attempting to study the behavior of people in violent protests, and how the violence comes about. Narrow your problem to identify the groups and individuals who participate in the more violent parts of a protest.
  • Limit the size of the relevant physical and/or virtual space. If you know that an event such as a protest is taking place, remember that most protests must obtain permits that designate a location. Look for the groups operating in that location and focus on them. Additionally, look at the online social networks those groups have used or visited most frequently in the past and concentrate on a few key individuals (we discuss how to recognize who to look at later).
  • Specify the type of behavior examined. If you are looking specifically for those engaging in violence, look at which individuals or groups are using the angriest or most profane language in the spaces you have designated.
  • Identify the time period in which the behavior and/or event takes place. If you know that the protest takes place over a weekend, but preparations are being made well in advance, start collecting data a few weeks or months out depending on the size of events, and collect for a brief time afterward.
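The four guidelines above amount to a set of filters applied to collected data. The following Python sketch shows one way that might look. The Post record, its field names, the narrow function, and the sample posts are all hypothetical and invented for illustration, and the keyword match is only a crude stand-in for the "type of behavior" guideline.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    # Hypothetical record layout; real platform data will differ.
    author: str
    group: str
    location: str
    text: str
    posted_at: datetime


def narrow(posts, groups, locations, keywords, start, end):
    """Keep only posts that satisfy all four narrowing guidelines."""
    keywords = [k.lower() for k in keywords]
    return [
        p for p in posts
        if p.group in groups                               # specific groups
        and p.location in locations                        # limited space
        and any(k in p.text.lower() for k in keywords)     # type of behavior
        and start <= p.posted_at <= end                    # time period
    ]


# Invented sample data, for illustration only.
posts = [
    Post("a1", "group_x", "city_square", "We will smash the barriers", datetime(2024, 5, 4, 14, 0)),
    Post("a2", "group_y", "city_square", "Peaceful march at noon", datetime(2024, 5, 4, 12, 0)),
    Post("a3", "group_x", "suburb", "Smash everything", datetime(2024, 5, 4, 15, 0)),
    Post("a4", "group_x", "city_square", "Smash it", datetime(2024, 7, 1, 9, 0)),
]

selected = narrow(
    posts,
    groups={"group_x"},
    locations={"city_square"},
    keywords=["smash"],
    start=datetime(2024, 4, 1),
    end=datetime(2024, 5, 10),
)
print([p.author for p in selected])
```

Only the first post passes every filter. Each of the others fails exactly one guideline (group, location, or time period), which shows how quickly a narrow problem statement shrinks the data you need to examine.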

After adequately formulating and focusing the problem, check the Internet to see if others have tackled a similar problem (scholar.google.com is a good resource). Being smart and lazy is the way to go. If others have paved the way intelligently and honestly, feel free to learn from their journey. You might stumble upon a methodology or a ready solution that will save you time and resources. Of course, do not outright steal or plagiarize the work of others. Check with the corresponding author to see how much you can borrow—he or she may be flattered and happy to help.


Note
When adopting the work of others, use the lists in the section titled “Analysis Dos and Don'ts” to evaluate their analysis.

If your Internet search did not turn up the solution, begin assembling relevant data, which includes social media data and even other data such as weather information, stock market indices, and demographic information. Chapter 4 is dedicated to determining what constitutes relevant data and collecting and managing social media data. Due to the inconsistent nature of social media data, the type and amount of available data will vary considerably. Collecting and thus getting a sense of the data available this early in the analytical procedure may drive the choice of analytical process and methodology.

Formal and academic guides to analysis usually suggest that you formulate the hypothesis before you start assembling data. A hypothesis is an educated guess about the likely solution to the problem. This advice should not go unheeded because a hypothesis can help focus your study. If you can formulate a strong hypothesis based on existing studies and theories, you have a better sense of the type of data you need to collect and the types of analytical methodologies you need to apply. Do understand that this is a much more effortful and different process than formulating a hypothesis based on your unexamined biases or “gut.” Later paragraphs explain how and why.

However, first you should understand that in the emerging field of social media analytics, and given the complex and unique problems you are likely trying to solve, the advice is not always practical and in some cases is even harmful. A misguided hypothesis can introduce bias into your analysis and lead you to ignore certain data. You may then miss out on a range of solutions you could not even have imagined. Consider that you want to identify the most influential person on Twitter with regard to spreading violent neo-Nazi propaganda. You choose to study only one group of neo-Nazis, because the group is routinely mentioned in the news media as an example of a neo-Nazi group. Keep in mind that popularity in the news media is not necessarily equal to influence within the specific population. You then miss out on the influence of other groups and how individuals in those groups may be influencing the group you chose to study.

Generally, you should formulate a hypothesis if three criteria are fulfilled. One, the problem is not that unique and others have successfully attacked it or similar ones. Your hypothesis will then likely be a version of their solution. Two, established and tested theories exist that help describe the types of human behaviors in question. For instance, social networks tend to follow certain rules regardless of the context. Apply the theory to the problem to create a hypothesis that is grounded in experimental evidence. Three, the physical, virtual, and temporal factors constituting the problem's environment are so limited that the menu of likely solutions is small. For example, environmental factors may make it so that an event will certainly take place either tomorrow or next week. Based on your experience, knowledge of similar past events, and cues, you can then reasonably guess when the event may happen.

We caution against using a hypothesis in our case for several reasons. One, few people have published rigorous studies assessing the types of security problems and social media data most relevant to you. Two, social media analytics is far from formalized and few proven theories exist that adequately and reliably explain behavior involving social media. Theories that describe human interaction and behavior offline may not always translate to interaction and behavior on social media. Three, the likely relevant environmental factors are not restrictive enough to limit the menu of likely solutions. In fact, the environmental factors are probably more numerous because the data and problems likely involve the behavior of many individuals who live in many different parts of the world and use many different social media platforms.

After formulating or bypassing the hypothesis and collecting the data, the preliminary procedure winds to a close. The next step is determining the type of reasoning or process most appropriate for conducting the analysis. In reality, the line between the preliminary procedure and determining the appropriate analytical process is blurry. Well into your own analysis, you may stumble onto evidence or theories that provide you with an adequate solution quickly or further focus your problem. Or, you start out with a hypothesis and then ignore it if the analysis leads you down unimagined roads. Or, you will likely need to collect more data to complete the analysis. Be intellectually honest with yourself and flexible, and the blurry line will cease to be an issue.

The Analytical Processes

Determining which process is most adequate for your needs depends on the problem, other available work on the problem, and the amount of available data. Each process has different requirements, and provides distinct advantages and disadvantages. Understanding the concepts underlying each reasoning process will help you determine the ideal process and methodology. The most realistic option is to combine the different types of reasoning and modify the combination of processes to fit your needs.

Inductive Reasoning

Inductive reasoning or induction is a bottom-up, data-driven approach that involves identifying patterns in data and then codifying the patterns into theories that can also explain other data.

Inductive reasoning is an exploratory process that provides you with insights when you have little idea about possible solutions to your problem. When you fail to generate a hypothesis according to the aforementioned criteria, you need to use the inductive process. Analyzing social media data entails using a lot of induction. As we mentioned before, the study of social media data, and the effect of social media on behavior and vice versa is fairly new. The study of social media and its relationship with security-relevant behavior is even more nascent. You will likely find few established theories that will give you an idea about the solution to your problem. Figure 3.2 outlines the induction process.

Figure 3.2 The inductive process

First, gather a substantial sample of data or observations concerning the problem of interest. Try to collect a large enough sample of observations. However, do not become obsessed with collecting data. More data does not always give you more knowledge. Chapter 4 briefly covers how to determine the size of an adequate data set using statistical tools. However, do not worry if you cannot meet all statistical requirements. Usually, external factors will limit the size of your data set. The likely issue will be the lack of data, not its overabundance. Also, you may not have the time and resources to collect data and meet all the statistical requirements.
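As a rough illustration of what "large enough" can mean, the following Python sketch applies one common textbook rule, Cochran's formula for estimating a proportion, with an optional finite population correction. This is an assumption made for illustration and not necessarily the approach Chapter 4 takes. The function name and the default values (95 percent confidence, a 5 percent margin of error, and a worst-case proportion of 0.5) are likewise hypothetical choices.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence=0.95, margin=0.05, proportion=0.5, population=None):
    """Estimate how many observations are needed to measure a proportion.

    Uses Cochran's formula n0 = z^2 * p * (1 - p) / e^2, an assumption
    chosen for illustration. If a finite population size is given, the
    standard finite population correction is applied.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


print(required_sample_size())                 # very large or unknown population
print(required_sample_size(population=2000))  # e.g. 2,000 relevant accounts in total
```

Treat the result as a planning aid only. As the text notes, external factors will usually cap the data you can collect, and falling short of such a target does not by itself invalidate an analysis.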

After you have the sample, use a number of statistical tools to apply the methodologies. We describe this process and the relevant tools in detail later. The methodologies and tools will help you pick out patterns in the data that, if strong enough, you can develop into rules that describe past, present, and future human behavior.

Lastly, codify the rules into a theory that elegantly describes a solution to your problem and general behavioral rules that could help solve other similar problems. Test the theory on other data sets and on other problems to make sure your theory is sound and applicable. Even if the theory holds true only for your data set and problem, do not despair. You may have discovered something unique about that data, which is a theory in its own right.

There is nothing inherently bad about induction. However, misguided or overzealous use of induction can lead you down the wrong path. Incorrectly applying inductive methodologies to data will reveal incorrect patterns that lead to the development of incorrect theories and solutions. Also, induction will not work if you do not have an adequate amount of data. If your data consists of only a few samples, many of which may be outliers or unrepresentative data points, then induction will hurt more than help.


Induction Example: Social Media Use During Riots
Consider that the problem is determining how young people use location-based smartphone applications such as BlackBerry Messenger (BBM) to organize violence during riots in major cities. The solution involves detailing the extent to which young people use the application, when they use it, why they use it, and how it affects the violence. The following steps describe the ensuing overall analytical process and the use of induction:
1. Narrow the problem and limit the people, behaviors, event, and time period involved. Rephrase the problem as: How did the British rioters use BBM during the 2011 London riots to organize acts of violence at specific locations?
2. Check around for similar work and solutions. Many reports examine the London riots but few do in-depth analytics measuring the effects of BBM messages, mainly because the messages are very difficult to come by.
3. Collect relevant data including the messages (perhaps by contacting the UK government and the makers of BBM, Research in Motion), news reports detailing the locations of rioters at given moments, openly available social media data such as tweets about the riots, and police reports about damaged stores.
4. Stop and take account of what you know so far. You have a well-defined problem and data, but few hints telling you what the data might suggest. For example, the data could suggest that only certain groups of rioters used BBM to organize the targeting of very specific stores, or that messages on Twitter and not BBM were used to organize the violence.
5. Apply inductive reasoning and corresponding methodologies to see if noticeable patterns link BBM use and violence. Perhaps you discover that moments before a store was attacked, there was a spike of BBM messages mentioning the store's location. You may also uncover that the creators of those BBM messages were almost always males aged 20+. BBM messages from teenagers or females never mentioned the locations that suffered violence. Finally, you may discover that the only stores BBM users mentioned and thus targeted were sports stores.
6. Codify your discoveries into a theory. The theory could say that during riots, males in their twenties are very likely to use location-based social media platforms to target violence against stores that carry goods they find appealing, such as sports equipment and clothing.
7. Test your theory on other data sets. Examine the data from riots in the U.S., Latin American cities, and European cities to see if the behavior your theory describes is universal.
8. Use the tests' findings to refine your theory and its scope of application.


Warning
We fabricated this example. We have no idea if males aged 20+ really did use BBM to loot sports stores in London.
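Step 5 of the example describes looking for a spike of messages mentioning a location shortly before an incident there. The following Python sketch shows one simple way to quantify such a spike: compare the number of mentions in the window just before the incident with the average over several earlier windows of the same length. The message format (a timestamp and text pair), the function names, the window sizes, and the sample messages are all hypothetical and invented, in keeping with the fabricated example.

```python
from datetime import datetime, timedelta


def mentions_in_window(messages, location, end, window):
    """Count messages in [end - window, end) whose text mentions the location."""
    start = end - window
    return sum(
        1 for sent_at, text in messages
        if start <= sent_at < end and location in text.lower()
    )


def spike_ratio(messages, location, incident_time,
                window=timedelta(minutes=30), baseline_windows=4):
    """Ratio of mentions just before the incident to the earlier average."""
    before = mentions_in_window(messages, location, incident_time, window)
    earlier = [
        mentions_in_window(messages, location, incident_time - window * i, window)
        for i in range(1, baseline_windows + 1)
    ]
    baseline = sum(earlier) / baseline_windows
    if baseline == 0:
        return float("inf") if before else 0.0
    return before / baseline


# Invented messages, for illustration only.
incident = datetime(2024, 1, 1, 20, 0)
messages = [
    (datetime(2024, 1, 1, 18, 10), "anyone near high street?"),
    (datetime(2024, 1, 1, 19, 10), "high street is quiet"),
    (datetime(2024, 1, 1, 19, 40), "meet at high street"),
    (datetime(2024, 1, 1, 19, 45), "high street now"),
    (datetime(2024, 1, 1, 19, 50), "everyone to high street"),
    (datetime(2024, 1, 1, 19, 50), "market road is closed"),
    (datetime(2024, 1, 1, 19, 55), "high street, bring friends"),
]

ratio = spike_ratio(messages, "high street", incident)
print(ratio)
```

A high ratio only flags a pattern worth examining. It does not show that the messages caused the incident, and a pattern found in one data set still needs to be tested on others, as steps 7 and 8 describe.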

Deductive Reasoning

We already introduced deductive reasoning in the induction case study. Step 7 in the case study is an example of deduction.

Deductive reasoning or deduction is a top-down, theory-driven approach that involves applying established theories and well-developed hypotheses on data to test the validity of the theory and hypothesis.

Deductive reasoning is a more formal and focused process that helps you confirm if your educated guess about the likely solution to the problem is valid. When you can generate a hypothesis according to the aforementioned criteria, you should use the deductive process. You may still use induction, but deduction will save you time and effort by focusing your analysis on examining only a few possible solutions. In a few cases, deduction will be applicable. For example, many new studies confirm that theories that govern how the social networks of humans develop offline can in some specific cases explain how social networks develop on social media. Other studies confirm that how ideas spread through social networks is partly independent of the mode of communication. Assembling a comprehensive list of existing theories that apply to social media is difficult. You will have to do your own research, especially because it is heavily dependent on your problem. Figure 3.3 outlines the deductive process.

Figure 3.3 The deductive process

First, determine which existing theories, studies, and solutions to similar problems are the most relevant to your problem. Evaluate them to ensure they are analytically sound and applicable. They are applicable if they consider populations, behaviors, and environmental factors similar to those in your problem, and they suggest their rules and results are applicable to other problems and data sets. Compare competing theories if applicable, and formulate a hypothesis.

Next, gather data necessary to prove or nullify your hypothesis. In some cases, deduction requires gathering less data than induction. If substantial literature and evidence backs up your hypothesis, you can be reasonably confident that you only need to collect data about items and factors your hypothesis considers. However, if you have the time and resources, collect other data. Later, you can apply the inductive process and corresponding methodologies to the extra data to ensure you did not miss anything.

After you have enough data, use a number of statistical and other tools to apply the methodologies. We describe this process and the relevant tools in detail later. The methodologies and tools will help you evaluate whether the theories that inspired your hypothesis are applicable, and the extent to which your hypothesis is valid.

Lastly, determine the solution to your problem by refining or junking the hypothesis and applicable theories on the basis of the results of the analysis. You may discover that only parts of the hypothesis are valid as a solution. Likely, you will find that special conditions related only to the items and factors in your problem are required to validate the hypothesis. You may also discover that the hypothesis is completely wrong and what you thought was the solution is not valid, in which case you will need to either apply other hypotheses or apply the inductive process. You may also refine existing theories and make them more elegant and simple. You can then validate the refined theories through other analyses or leave that work to others.

Deduction is most appropriate when you can generate a well-informed and specific hypothesis. However, executing a deductive process with a poorly developed or weak hypothesis will lead to an invalidation of the hypothesis and frustration, and a waste of time and effort. You will then have to start all over again. As a general rule, use deduction only when your hypothesis is strong, or when you cannot collect enough data to do induction.


Deduction Example: Key Terrorism Promoters on Social Networks
Consider that the problem is determining which individuals promoting terrorism on social media networks are the most influential at spreading propaganda and appealing to at-risk youth. The solution involves detailing how terrorist recruiters and sympathizers use social media, and determining the effectiveness of recruiters and their ability to influence the behaviors of others through social media communication. The following steps describe the ensuing overall analytical process and the use of deduction:
1. Narrow the problem and limit the people, behaviors, time, and communication tools involved. Rephrase the problem as: Which known recruiters sympathetic to al-Shabaab were the most influential on Twitter during al-Qaeda's merger with al-Shabaab in terms of spreading positive propaganda about the merger to American youth?
2. Examine and evaluate other similar work and solutions. Several studies describe methods to quantitatively measure the influence of specific people in online social networks, and their ability to propagate messages through online social networks.
3. Collect relevant data including the tweets of known al-Shabaab sympathizers, al-Shabaab affiliates, al-Qaeda affiliates, and American youth that al-Shabaab routinely target for recruitment. Also collect the statements that al-Shabaab and al-Qaeda propagated about their merger in news stories.
4. Stop and take account of what you know so far. You have a well-defined problem, data, and several theories that have successfully solved problems similar to yours.
5. Compare the theories, and formulate a specific hypothesis that states a likely solution to your problem. The hypothesis could say that the known recruiters sympathetic to al-Shabaab most effective at spreading positive propaganda on Twitter are the recruiters who have at least a 5 to 1 ratio of followers to following.
6. Apply deductive reasoning and corresponding methodologies to see if your hypothesis is valid. Perhaps you discover that the hypothesis is indeed valid. You may find that at-risk American youth tend to read, retweet, and discuss (as one methodology describes influence) more messages spread by recruiters with a 5 to 1 ratio than by recruiters with a less than 5 to 1 ratio. You may also discover, perhaps by happenstance or because you applied some inductive processing, that recruiters with a 10 to 1 ratio have far more influence.
7. Refine the theory to say that Twitter accounts with a 10 to 1 ratio of followers to following are far more influential on Twitter than those with a 5 to 1 ratio, who are in turn more influential than those with a smaller ratio.
8. Test your theory on other data sets. Examine the data from the Twitter recruitment and propaganda efforts of Hezbollah, non-Jihadi groups, and even non-terrorists such as celebrities marketing a product, to see if the behavior your theory describes is universal.


Warning
We fabricated the theory about measuring influence on Twitter. Later, we discuss actual methodologies for defining and measuring influence on social networks.
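Steps 5 and 6 of the example can be sketched as a small hypothesis test. The following Python code splits accounts by the fabricated 5 to 1 follower-to-following threshold and then uses an exact permutation test to ask how often a difference in average engagement at least as large as the observed one would arise if the group labels were assigned at random. The account fields, the retweets_per_message measure, the function names, and the numbers are all hypothetical. The permutation test is one reasonable choice for tiny samples, not a method the text prescribes.

```python
from itertools import combinations
from statistics import mean


def split_by_ratio(accounts, threshold=5.0):
    """Split engagement scores by follower-to-following ratio."""
    high, low = [], []
    for acct in accounts:
        ratio = acct["followers"] / max(acct["following"], 1)
        (high if ratio >= threshold else low).append(acct["retweets_per_message"])
    return high, low


def permutation_p_value(high, low):
    """Exact one-sided permutation test on the difference in group means."""
    observed = mean(high) - mean(low)
    pooled = high + low
    indices = range(len(pooled))
    extreme = total = 0
    # Enumerate every way of relabeling len(high) accounts as "high".
    for chosen in combinations(indices, len(high)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if mean(group_a) - mean(group_b) >= observed:
            extreme += 1
    return observed, extreme / total


# Invented accounts, for illustration only.
accounts = [
    {"followers": 5000, "following": 500, "retweets_per_message": 12},
    {"followers": 3000, "following": 600, "retweets_per_message": 15},
    {"followers": 800, "following": 100, "retweets_per_message": 11},
    {"followers": 200, "following": 400, "retweets_per_message": 4},
    {"followers": 900, "following": 300, "retweets_per_message": 6},
    {"followers": 100, "following": 100, "retweets_per_message": 5},
]

high, low = split_by_ratio(accounts)
observed, p_value = permutation_p_value(high, low)
print(round(observed, 2), p_value)
```

With only six invented accounts, the smallest p-value this test can produce is 1 in 20, so the result says little by itself. Exact enumeration also becomes impractical for large samples, where random resampling or a standard statistical test would be the more usual choice.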

Combining Reasoning Processes

You probably noticed that the final steps of each reasoning process involve applying parts of the other reasoning process. Few social media and security analyses are clearly defined as requiring either induction or deduction. Most require variants of both, sometimes in the middle of the analytical process and often at the end. As you conduct more analyses, you will learn when to utilize a reasoning process. There is no right answer because it is a cyclical process. Given time and resources, use induction constantly to discover insights and deduction to test them. Your analysis will be much stronger for it and you will develop your own analytical tools that you can deploy quickly in the future or use to educate others. Figure 3.4 graphically illustrates this process.

Figure 3.4 Combining induction and deduction


Analysis Dos and Don'ts

Before learning the different methodologies, it is necessary to adopt good analysis habits. Many of you are conducting analyses to support sensitive and dangerous operations. Adopting good habits and taking care to avoid pitfalls will drastically improve the reliability and integrity of your analysis. Think of abiding by the analysis dos and don'ts as insurance against charges of incompetence or sloppiness. Also, you will help develop the field of social media analytics by creating and sharing well-done analyses.

The following is a list of analysis dos. In an ideal world, you should abide by all of them, but in the real world where resources and time are limited, we encourage you to at least try. The act of trying will put the quality of your analysis head and shoulders above much of the existing “analysis.”

  • Ensure to the best of your ability that the methodology you undertake is repeatable. Given the same data and a description of your analytical process, another person whom you have not worked with should be able to reach similar conclusions. This rule is the cornerstone of the scientific process and of real analytics. Some of you may be working in classified environments where realistically few others will look at your analysis. Still, as you go through your analysis, imagine that someone else will be looking at your work and trying to repeat what you did.
  • Abide by Occam's razor, which states that, among theories that explain a behavior, prediction, or conclusion equally well, the simpler one with fewer assumptions is preferable to a more complicated one. In other words, keep it simple, stupid. If a theory that uses social media data to forecast famine requires hundreds of variables and lots of assumptions, such as “This theory applies only if the price of wheat is exactly $5.76 per bushel in the area,” then it is not very useful for future purposes. It might still tell you something about that particular event, but your goal should always be to uncover the underlying rules governing human behavior concerning security so you can quickly analyze other similar problems.
  • Make any theory you develop falsifiable, which means that a specific data point or observation should be able to disprove your theory. For example, the statement, “It will rain tomorrow” is falsifiable. A specific data point, in this case measuring whether it rains tomorrow, can disprove or prove the conclusion. The point is to keep you from producing solutions to problems that do not hold up to analysis or scrutiny, and thus do not help your mission. Consider the induction case study about social media use in riots. In the case study, we came up with a falsifiable conclusion, that during riots only males aged 20+ use location-based social media to organize violence against stores carrying goods they like. If examination of other riots shows that it was mostly females aged 40+ organizing violence, the theory is falsified. This allows you to then limit your theory to only the 2011 London riots or to redo your analysis. Also, if others can test your conclusions, they can validate them and you can pat yourself on the back for a job well done.

As is often the case, the list of ways to do something wrong is a lot longer than the ways to do something right. The following is a list of analysis don'ts. Do your best to avoid them to ensure your analysis is intellectually honest, stands up to scrutiny, and provides you with accurate insights. Resource and data constraints may compel you to fall prey to an analysis mistake. Still, being aware of your analysis' faults will help you prepare for possible eventualities and negative fallout.

  • Do not discard or ignore data because it does not support your hypothesis or because it goes against any patterns you discover in the overall data. You are primarily analyzing humans—subjects that are complicated and rarely do what you want them to do. Finding data that goes against established theories about human behavior is the norm. Often unusual data or outliers are clues to more expansive solutions, or special exceptions for solutions. Sometimes, the outliers are not outliers but representatives of a much larger data set that will completely invalidate your analysis. Also, if you consciously leave out conflicting data, you leave yourself open to criticism. We discuss how to deal with outliers later. Humans are especially susceptible to discarding data that clashes with our beliefs due to cognitive bias. We naturally tune out information that challenges the theories we develop about how the world works. Take extra care to ensure you are not ignoring conflicting data.
  • Do not confuse anecdotes with indicators of real patterns, a mistake known as the exception fallacy. You cannot generalize based on a few qualitative examples or the thoughts of a few people because your generalizations could be completely incorrect. For example, you may be investigating whether drug dealers use certain code words on Twitter to indicate different kinds of drug dealing activity. You may then look at the tweets of a few known drug dealers and recognize that they use the word “wedding” to indicate a drug drop. You cannot then generalize that all drug dealers use the word “wedding” to indicate drug drops. If you do, you will likely miss out on other code words. Also, some drug dealers might actually be going to, and so discussing, a wedding. To conduct induction properly, you need to collect a much larger set of examples to ensure you are not only citing outliers. The exception fallacy is the basis for racial stereotyping. Seeing a few people of a race commit crimes does not mean that everyone of that race commits crimes.
  • Do not overstate the strengths of your forecasts. Predicting complex events such as humanitarian crises and drug trafficking activity with certainty is extremely difficult. Be up front or at least aware that there is a chance that your forecasts may not come true. Generally, the fewer the number of things, people, and conditions you are dealing with, the greater the accuracy of your forecast.

We provide only the most relevant dos and don'ts. Several more are available and we encourage you to read the book we cite in the notes and other books and articles on analysis so you can continue to improve your analysis.

Analysis Methodologies

Methodologies are the series of steps that allow you to apply the two reasoning processes to the data. In this section, we introduce the four methodologies that are the most pertinent for social media analysis and that we use the most. Chapters 5 and 6 describe them in greater detail and illustrate how to use them to solve security-related problem sets. To solve complex problems, you will need to combine the methodologies. Keep in mind that social media analysis is in no way limited to only these four methodologies. Therefore, we also briefly touch on other methodologies and recommend you adopt various other methodologies as you see fit. Be creative and flexible.


Warning
Again, our focus is on helping you conduct quick and accurate analysis to support operations, not to publish in academic journals. Therefore, in the name of clarity and ease of understanding we have taken liberties with definitions and the use of various concepts. If you are a professor of statistics or quantitative methods, please do not send us hate mail or hate tweets.

So far, the information we have covered, such as the preliminary procedure and dos and don'ts, applies regardless of the type of methodologies you employ. Information from now on will apply only to specific methodologies with the exception of a brief overview of variables, a key component of all methodologies. If you are familiar with conducting analyses you can skip the Variables section.

Variables

Variables are symbols that represent a variety of quantitative or qualitative values. Anything that varies can be a variable. For example, the type of illegal drug can be a variable, as can the age of the top Facebook users in Africa. Analysis involves manipulating and comparing variables at different times in different situations to tease out patterns, causal effects, and insights. Understanding variables and how they differ will help you formulate meaningful analyses. Every robust analysis has the following three types of variables:

  • Dependent—The variable that you will measure and that other factors affect.
  • Independent—The variable that varies during the analysis and directly or indirectly affects or is correlated with the dependent variable.
  • Controlled—The variable that you hold constant during the analysis so you can isolate the effects of particular independent variables and eliminate the possibility that other variables are affecting the dependent variable.

If the variable types are new or confusing, then consider the following example.


Determining Variables Example: Violent Flash Mobs
Say you want a method to tell if a violent flash mob is likely to happen at a certain location. A flash mob is when people assemble suddenly in a place at a specific time. Flash mob organizers and participants usually use social media to spread word about the flash mob and help organize it. Imagine that prior analyses indicate that a violent flash mob is likely to happen at a specific location if a significant number of tweets mention the name of the location within 20 minutes, regardless of who is tweeting. Your analysis then needs to determine the value of the significant number. Restate the problem as “What number of tweets must mention the name of the location within the 20-minute time period?” Your hypothesis could be that more than 100,000 tweets need to mention the name of the location within the 20-minute time period. Notice that to get this far you used deduction—looking at prior analyses and theories to determine your hypothesis.
For now, do not worry about the methodology you will use to verify the hypothesis. Focus on determining the variables you need to conduct the analysis. The dependent variable in this case is the appearance of a violent flash mob. The independent variable is the number of tweets. The number of tweets affects the appearance of the violent flash mob.
Unfortunately, security-related analyses are rarely this simple. Each variable can usually be measured in a number of ways. In this case, the dependent variable can be anything that indicates the appearance of the violent flash mob. Possible dependent variables include news reports about mobs, police reports about mobs, or a certain number of people committing violence in a specific area of physical space.
To make things more complicated, perhaps prior analyses are not certain about the time period within which the tweets need to appear. Now you have two independent variables. One is the number of tweets and the other is the time period. Perhaps as an indicator of a violent flash mob, the number of tweets does not matter as much as tweets with mentions of a location that appear within three minutes of each other. The analysis then needs to determine whether one or both independent variables affect the dependent variable and to what extent. In one part of the analysis, one of the independent variables will become a controlled variable. In another part, the other independent variable will become the controlled variable. In other words, hold one independent variable constant while measuring the effect of the other independent variable. In the third part of the analysis, measure the combined effect of the independent variables on one or all of the dependent variables, and the relationship between them.
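
To make the roles of the variables concrete, the following minimal Python sketch holds the tweet threshold constant (the controlled variable) while varying the time period (the independent variable). The timestamps, the threshold of 4, and the function name `max_mentions_in_window` are assumptions made for illustration. They are not values from any real analysis.

```python
from datetime import datetime, timedelta


def max_mentions_in_window(timestamps, window_minutes):
    """Return the largest number of tweets that fall inside any window of the given length."""
    window = timedelta(minutes=window_minutes)
    ordered = sorted(timestamps)
    best = 0
    start = 0
    for end in range(len(ordered)):
        # Move the left edge forward until the window spans at most `window`.
        while ordered[end] - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


base = datetime(2012, 1, 1, 12, 0)
# Hypothetical tweets that mention the location, as minute offsets from noon.
tweet_times = [base + timedelta(minutes=m) for m in (0, 1, 2, 15, 19, 45, 46)]

threshold = 4  # controlled variable: held constant in this part of the analysis
for minutes in (3, 20):  # independent variable: the time period
    count = max_mentions_in_window(tweet_times, minutes)
    print(f"window={minutes} min, max mentions={count}, exceeds threshold={count >= threshold}")
```

Swapping the roles, by fixing the window and varying the threshold, gives the second part of the analysis described above.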

Methodologies in this Book

The following sections briefly introduce the methodologies you will learn to use to analyze social media and related data.

Social Network Analysis

Social network analysis (SNA), a type of network analysis, is the study of social structures known as social networks, which are composed of individuals and their relationships. A social network can consist of the relationship between two people or of the relationships between everyone on Earth. See Figure 3.5 for an example of a social network. Because social media is all about creating and sustaining social networks and relationships between people, understanding SNA is essential to understanding social media.

Figure 3.5 Social network example


SNA emerged from the interaction of disparate fields including psychology, graph theory, and statistics, and forms part of the emerging field of network science. Several network science theories, some of which contradict each other, describe how social networks behave. They sacrifice the importance of a specific individual and the attributes that describe the individual to generalize about how typical human relationships function. The theories exist as algorithms that conduct specialized mathematics on social network data and output specific answers. Therefore, you can consider SNA to be a somewhat deductive process, yet one that also draws from induction.

SNA enables you to map, measure, and describe almost anything about a social network and its components. SNA can provide information about individuals, a few relationships, or large-scale networks. You can use SNA to understand the ideas of interest of social networks, how individuals gain influence in social networks, how individuals form relationships with others, how the relationships evolve over time, and how the relationships affect the behavior of individuals in the social network. You can also use SNA to determine which individuals are the most influential, which individuals are the most vocal, which relationships are the most influential, and which relationships are necessary to sustain the structure of the network. SNA also enables you to measure the relationships between different types of social networks. For example, with SNA you can measure whether and how an individual's social network on Badoo influences his or her social network in the physical world (friends, family, and so on) and vice versa. SNA is very useful for understanding how violent extremists use social media to develop relationships with at-risk populations, forecasting how the social networks of human traffickers and narcotics smugglers evolve over time, identifying the key individuals and relationships in drug trafficking networks, and much more.

The rise of social media and especially social networking platforms has produced gargantuan amounts of data about social networks. Now, it is easy to find sample social network data on the Internet, which researchers usually derive from Facebook, to test network science theories. The explosion of readily available social network data and sophisticated SNA software preloaded with algorithms is producing the golden age of network science. Many theories underlying SNA are now being put to the test like never before, and several companies are creating automated tools to analyze social networks exhibited on social media platforms.

Chapter 5 delineates how to conduct SNA. However, the following describes the overall SNA process:

1. Formulate the problem.
2. Collect related data about social networks, specifically information about the structure, strength, and type of relationships between individuals.
3. Input the data into SNA software to map or draw the social network.
4. Use the SNA software to run specific algorithms on the data.
5. Compare and contrast the algorithm outputs to answer the problem.
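
Steps 3 through 5 can be sketched without specialized software. The example below computes degree centrality, one widely used network measure, on a tiny invented network. The choice of measure, the names, and the edge list are assumptions for illustration only, and dedicated SNA software offers many more algorithms.

```python
from collections import defaultdict

# Hypothetical undirected relationships between five individuals.
edges = [
    ("Ann", "Bob"),
    ("Ann", "Cara"),
    ("Ann", "Dev"),
    ("Bob", "Cara"),
    ("Dev", "Eli"),
]


def degree_centrality(edge_list):
    """Fraction of the other individuals each person is directly connected to."""
    neighbors = defaultdict(set)
    for a, b in edge_list:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


centrality = degree_centrality(edges)
most_connected = max(centrality, key=centrality.get)
for person, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(person, score)
```

Degree centrality only counts direct relationships. Comparing it with the output of other algorithms, as step 5 suggests, guards against reading too much into one measure.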

Language and Sentiment Analysis

Language and sentiment analysis (LSA) is the study of patterns in linguistic content, such as Facebook status updates and text messages. It is the application of theories and tools from fields including text analytics, computational linguistics, statistics, and natural language processing (NLP). LSA is actually an umbrella term we use to address a variety of language processing tools and analyses. The various language tools and analyses help identify what individuals and groups are saying, why they are saying it, who is saying it, what they mean exactly by what they are saying, and how they feel about what they are talking about. As with SNA, the explosion of social media data consisting of billions of tweets, status updates, private messages, and texts from people globally has significantly bolstered the use of LSA. Because social media is all about communication, and much of the communication is text-based, understanding LSA is essential to understanding the meaning of communication. LSA is very useful for geolocating and understanding texts from victims of humanitarian crises, determining the identity and likely location of a suspicious blog's author, forecasting the coordination of criminal activity such as planning of violent riots, and much more.

LSA involves both inductive and deductive processes. Many existing LSA theories and algorithms process language and output answers with varying reliability. While most LSA tools focus on processing the English language, few process non-Western and less mainstream languages such as Swahili. Existing LSA tools also have difficulty processing unstructured data with lots of slang, idioms, and sarcasm such as the tweets of teenagers, although advances are being made rapidly. You will use both existing LSA tools and algorithms and create your own through deduction and induction.

Chapter 6 delineates how to conduct LSA. However, the following describes the overall LSA process:

1. Formulate the problem.
2. Collect related data consisting of massive amounts of textual content.
3. Evaluate existing LSA tools for relevance. If relevant, input data into the LSA tool (usually available as commercial or open source software).
4. If no relevant LSA tool is found, create your own LSA tool and algorithm by adapting existing LSA tools, creating entirely new LSA theories through deduction, and/or conducting correlative analyses (described later) on the textual data.
5. Compare and contrast the LSA tools' outputs to answer the problem.
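
As a rough illustration of what a very simple homemade LSA tool might look like, the sketch below scores messages against a word list. The lexicon, the messages, and the function names are all invented for this example. Real tools are far more sophisticated, and this approach cannot handle the slang, idioms, and sarcasm mentioned earlier.

```python
import re

# Hypothetical, deliberately tiny sentiment lexicon (an assumption for illustration).
LEXICON = {"safe": 1, "help": 1, "thanks": 1, "angry": -1, "riot": -1, "afraid": -1}


def sentiment_score(text, lexicon=LEXICON):
    """Sum the lexicon values of the words in the text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def label(score):
    """Turn a numeric score into a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


messages = [
    "Thanks for the help, we are safe now",
    "People are angry and afraid of a riot",
    "Meeting at the station at noon",
]
labels = [label(sentiment_score(message)) for message in messages]
for message, sentiment in zip(messages, labels):
    print(sentiment, "|", message)
```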

Correlation and Regression Analysis

Correlation and regression analysis (CRA) is the study of correlations and relationships between things or, more accurately, between a dependent variable and one or more independent variables. CRA uncovers correlations between seemingly separate things and enables you to determine whether and to what extent the two things directly or indirectly affect each other, or whether a third thing affects both at the same time. CRA is usually the first type of analysis one learns in a basic statistics course and is used in virtually every field.

To be more precise, correlation analysis and regression analysis are different types of analyses. You use correlation analysis when you simply want to find associative relationships between distinct things, regardless of whether they affect each other. For example, there is a correlation between the sale of winter coats and gloves. When there is a spike in sales of winter coats, there tends to be a spike in the sale of gloves. Note that correlation does not imply causation. It is not that the purchase of a winter coat causes a buyer to purchase gloves or vice versa. It is that a third thing, cold weather, causes the buyer to purchase both a winter coat and gloves.

You use regression analysis when you want to uncover predictive relationships between two things, regardless of whether one causes the other. By a predictive relationship, we mean that the change in value of one variable can help you forecast how the other variable will change. For example, you can do regression analyses on data about the sales of winter coats and gloves. You can then create an equation that tells you that a 15% increase in the sale of winter gloves tends to be associated with a 20% increase in the sale of coats. Regression analysis also tells you about the accuracy of your predictive equation.
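
The difference between the two analyses can be sketched with a few lines of Python. The weekly sales figures below are invented, the function names are hypothetical, and the example uses unit sales rather than percentage changes to keep the arithmetic simple. It computes a Pearson correlation coefficient (the associative relationship) and an ordinary least-squares line (the predictive equation).

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return the sums of cross-products and squared deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Correlation analysis: strength of the linear association, from -1 to 1."""
    _, _, sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Regression analysis: least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y, sxy, sxx, _ = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical weekly unit sales.
gloves = [10, 20, 30, 40, 50]
coats = [27, 44, 66, 83, 105]

r = pearson_r(gloves, coats)
slope, intercept = fit_line(gloves, coats)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
print(f"coats = {slope:.2f} * gloves + {intercept:.2f}")
```

For a single independent variable, the squared correlation coefficient is one common indication of how well the fitted equation describes the data. Neither number says anything about what causes the sales to move together.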


Warning
Use common sense when using CRA, and understand the direction of causal relationships. You have to use experience and knowledge to figure out that it is cold weather that causes humans to purchase more winter coats and gloves, and not the sale of winter coats that causes the sale of gloves.

CRA is an essential tool for understanding social media data and the insights hidden therein. Expect to regularly use CRA to support other methodologies and to uncover correlations and causal relationships between various things. In the case of CRA, think of social media primarily as the vehicle for delivering data that enables you to uncover the intuitive and counterintuitive correlations and relationships between disparate things. CRA is very useful for establishing correlative relationships between microeconomic indicators and weather data to forecast famines, establishing causal relationships between specific messages on social media and violent acts on the ground, determining the existence of causal relationships between drug smuggling activity in rural areas and the momentary sentiment of rural populations, and much more.

Chapter 6 delineates how to conduct CRA and further discusses how correlation analysis is different from regression analysis. However, the following describes the overall CRA process:

1. Formulate the problem.
2. Collect related data. Generally, the more data the better.
3. Input data into an appropriate CRA tool (tools are usually available as commercial or open source).
4. Analyze the CRA tool output and establish whether a correlation exists.
5. Conduct regression analysis using the CRA tool to determine whether the variables share a predictive relationship.
6. In the case of associative and predictive relationships, use a priori knowledge to determine the existence and direction of causality.

Volumetric Analysis

Volumetric analysis (VA) is a type of CRA that focuses only on discovering associative and predictive relationships between events or behavior and changes in the volume of traffic or activity on social media platforms. Data traffic and activity on social media platforms, hereafter known as data volume, ranges from the number of tweets in a day that mention a specific word to the number of texts from a specific location. Information about data volume is often easier to collect than content data. Social media platforms like to make such data available to showcase their popularity and success.

We separate out VA from CRA because of the powerful insights that analyzing data traffic and activity can reveal about security-related issues. Examining changes in data volume and focusing on unique spikes or drops in data volume often reveal the presence of unique events. In the security world, unique events are the most important events because civil wars, terrorist bombings, and natural disasters are not everyday occurrences. VA is very useful for identifying sharp drops in social media activity from a specific location to uncover security threats in the area, correlating spikes in communication between two countries with a spike in drug smuggling activity in the countries, and much more.

Chapter 6 delineates how to conduct VA. However, the following describes the overall VA process:

1. Formulate the problem.
2. Collect related data including data volume over numerous periods of time, including periods that do and do not include the scope of time relevant to the problem.
3. Use data volume collected from all periods of time to establish a baseline data volume that describes “normal” data volume. Factor in normal growth or decline in data volume, as needed.
4. Use VA tools to compare data volume from the period of time of interest with baseline data volume. Identify atypical changes in data volume.
5. If atypical changes exist, use a priori knowledge to correlate changes in data volume with the event or behavior.
6. Use CRA tools to determine the existence of an associative or predictive relationship between changes in data volume and the event or behavior.
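
Steps 3 and 4 can be sketched as follows. This minimal example treats the mean and standard deviation of the baseline periods as “normal” data volume and flags any period that sits three or more standard deviations away. The threshold of three, the volumes, and the function name `atypical_periods` are assumptions for illustration. The sketch also ignores the normal growth or decline in data volume that step 3 asks you to factor in.

```python
from statistics import mean, stdev


def atypical_periods(baseline, observed, threshold=3.0):
    """Return (index, volume, z-score) for observed volumes far from the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation, so z-scores are undefined")
    flagged = []
    for index, volume in enumerate(observed):
        z = (volume - mu) / sigma
        if abs(z) >= threshold:
            flagged.append((index, volume, round(z, 2)))
    return flagged


# Hypothetical daily counts of tweets from one location.
baseline_volume = [980, 1010, 1005, 995, 1020, 990, 1000]
observed_volume = [1003, 1600, 310, 1012]

flagged = atypical_periods(baseline_volume, observed_volume)
for index, volume, z in flagged:
    kind = "spike" if z > 0 else "drop"
    print(f"period {index}: volume {volume} is a {kind} (z = {z})")
```

A flagged spike or drop is only a prompt for steps 5 and 6. It does not by itself say what event or behavior lies behind the change.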

Methodologies Not in this Book

We encourage you to frequently discover and use other types of analyses on social media data. A great way to discover interesting and powerful analytical methodologies is to explore other fields such as physics, biology, and ecology. In fact, we started using VA after reading about how ecologists measured changes in population volumes to uncover unique ecological events. VA is actually a term we borrowed from chemistry. Researchers studying the presence of alien life are using VA to determine changes in the composition of chemicals to identify the presence of life.

To gain a better understanding of SNA and to come across more powerful SNA algorithms, read about the application of network science to other fields. We stumbled onto SNA after reading about neural networks and how network science helps explain the influence of relationships on neurons and vice versa. Likewise, we started to appreciate the power of CRA by reading about complexity science, an emerging field that involves examining the behavior and structure of complex systems. Geoffrey West, a prominent physicist and complexity scientist, has used variations of CRA to uncover startling and interesting rules relating population size, gross domestic product (GDP), income, patents filed, and crime rates for any city, be it in the U.S., China, Europe, or elsewhere.[2] Simply put, West found that if you double the size of a city, you get roughly a 15 percent per capita increase in measures such as income, patents filed, and crime, as well as many others. West's findings highlight the underlying strength and influence of citywide social networks.

If your time to study other fields is limited, fear not. Chapter 12 briefly explores exciting but more complex analytical methodologies and tools that you can apply to social media data. They include:

  • Agent modeling—Create virtual avatars of individuals and social networks to model their behavior in various situations and environments, and responses to changes in their environments. Social media data provides information about individuals and social networks to program the avatars.
  • Geo-spatial network analysis—Uncover how the structure of towns and cities, and the distribution of resources such as waterholes, can help forecast the likelihood of criminal or violent activity in specific areas. Social media data and crowdsourcing platforms provide information about the location and distributions of relevant items.
  • Cluster analysis—Group millions of social media users by various attributes to focus SNA and targeting efforts.

But before you start exploring more advanced methodologies, you need to learn the more basic ones. Chapter 5 continues your education in analysis by teaching you how to conduct social network analysis.

Summary

  • Analysis is a systematic process that reveals solutions to many, but not all, concrete problems.
  • This book focuses on helping you do analysis quickly to support operations, not publish in academic journals.
  • Analyzing precisely and honestly is essential to getting correct solutions. Follow the dos and don'ts.
  • All analysis starts with the preliminary procedure, which involves formulating a focused and narrow problem, exploring existing solutions and theories, collecting related data, and generating a hypothesis only if existing solutions and theories back up your hypothesis.
  • After the preliminary procedure, select a methodology based on the reasoning you want to employ.
  • Inductive reasoning is data-driven and involves identifying patterns in data and then codifying the patterns into theories that can also explain other data.
  • Deductive reasoning is theory-driven and involves applying established theories and hypotheses on data to test the validity of the theory and hypothesis.
  • Most analyses will involve a mix of the two types of reasoning processes.
  • This book focuses on applying four types of analytical methodologies:
    • Social network analysis involves studying social networks online and offline, and the relationships and individuals that make up the social networks.
    • Language and sentiment analysis involves identifying patterns in linguistic content on social media platforms that reveal insight about events and behavior.
    • Correlation and regression analysis involves determining correlative and direct and indirect causal relationships between various factors and things.
    • Volumetric analysis involves determining correlative and direct and indirect causal relationships between behavior and events, and data traffic and activity on social media platforms.

 

 

Notes

1. Tetlock, P. (2006) Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press, Princeton, NJ.

2. West, G. (2011) “The Surprising Math of Cities and Corporations.” TED. Accessed: 24 May 2012. http://www.ted.com/talks/geoffrey_west_the_surprising_math_of_cities_and_corporations.html
