Chapter 4

Collecting and Managing Social Media Data

Analyzing social media first requires collecting and managing enormous amounts of social media and other types of data. The widespread use of social media globally has produced petabytes of data, most of which is not relevant to your purpose and analysis. The most important part of conducting social media analysis is adequately and intelligently finding and manipulating relevant data without becoming overwhelmed. To this end, this chapter explains what constitutes social media and related data; details the process to determine your data needs; and describes how to collect the data, filter the data, and store and manage the data. The chapter also discusses the benefits and drawbacks of building your own data management system and buying a commercially available one. We do not expect you to have the technical acumen to actually build the data collection apparatus. However, knowing the technical concepts behind data collection will greatly inform your analysis and expectations, and help you select and use the appropriate data collection technologies.

Understanding Social Media Data

Social media data is all the user-generated content and corresponding metadata on social media platforms.

User-generated content includes the pictures on Facebook, the diaries on Qzone, the videos on YouTube, the tweets on Twitter, and much more. Most of it is unstructured, which means that for the most part it does not follow predefined rules. A 15-year-old in Malaysia can tweet something in Malay that is full of local slang and does not follow grammatical rules, and a 60-year-old in the United Kingdom can tweet the Queen's English using perfect grammar. Apart from both being text and containing no more than 140 characters, each tweet will probably look completely different. The unstructured data concept also takes into account that content on different platforms can have very little in common. A picture on Orkut is very different from an invite on LinkedIn.

Metadata is information about the user and the content they post or, in other words, data about the data.

Social media data includes not only the content you see, but also metadata that includes the location where the picture on Facebook was taken, the date and time when the diary was uploaded on Qzone, the countries where people watched the video on YouTube, the number of people who retweeted a tweet on Twitter, the number of friends a person has on hi5, and much more.

The amount of user-generated content and metadata is enormous and growing every day at a rapid rate. Consider that two years ago, Twitter was generating 8 terabytes of data per day.1 This amount grows significantly when you take into account current-day use, all the other data on social media platforms, and then the metadata. Finally, the amount grows even larger when you include any relevant data you might need to complete your analysis. Such data can include information about weather and climate, geospatial maps, terrorist attacks, maritime piracy incidents, demographic statistics, and much more. As you can infer, maritime piracy data looks very different and is thus structured very differently from data on hi5. “Big data” is the buzzword people use to describe all this complex and unstructured data that standard database solutions cannot handle.

Using and making sense of big data is necessary for analysis and a difficult task, but not an impossible one. In fact, dealing with big data will soon cease to be a problem and people will go back to calling it simply “data.” Numerous organizations are making rapid advances in data-related technologies, many of which we will explore. You should start to get a sense of how to deal with big data now before these technologies become ubiquitous and you are left behind. Overall, dealing with big data requires executing the following process:

1. Determine collection needs—Figure out what data you need and how much of it, who has it, and if it is possible to even get it.
2. Collect the data—Download publicly and commercially available data using a variety of methods.
3. Filter the data—Eliminate noise from the data you collected, and validate it to ensure its usefulness and legitimacy.
4. Store and manage the data—Keep the filtered data in a secure and flexible database that you and others can add to and access as needed.
5. Analyze the data—Either manually analyze the data or create automated algorithms and analytical tools that use the analytical methodologies we explore in Chapters 5 and 6 to output solutions to your problems.
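
To see how these steps fit together in an automated system, consider the following Python skeleton. Every function in it is a hypothetical placeholder for work you would do yourself, hand to developers, or buy as software; it is a sketch of the flow, not a working collector.

```python
# Hypothetical skeleton of the overall data process. Each function is a
# placeholder; the point is the order in which the steps feed one another.

def determine_collection_needs():
    """Step 1: decide what data you need, how much, and who has it."""
    return {"keywords": ["crop prices"], "sources": ["twitter", "rss feeds"]}

def collect(needs):
    """Step 2: download publicly and commercially available data."""
    return []  # e.g., tweets, RSS items, crawled pages

def filter_data(raw_items):
    """Step 3: eliminate noise, duplicates, and spam; validate the rest."""
    return [item for item in raw_items if item]

def store(clean_items):
    """Step 4: write the filtered data to a secure, flexible database."""
    pass

def analyze():
    """Step 5: apply the analytical methodologies from Chapters 5 and 6."""
    pass

if __name__ == "__main__":
    needs = determine_collection_needs()
    raw = collect(needs)
    clean = filter_data(raw)
    store(clean)
    analyze()
```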

The following sections further describe the steps. Figure 4.1 illustrates the overall process. Many commercially available social media data collection and analysis software packages can complete steps 2 through 5 without you lifting a finger. We go over evaluating and choosing software later. But first we want to go through the steps as if you were going to do them yourself or with the help of technical staff, so you understand the overarching technical concepts behind data collection and management. Keep in mind that this book is not for programmers and technical developers who are coding and building the systems. We expect you to be the person who will work with or manage the developers, or make the decision to purchase the relevant commercially available software. Understanding what goes into creating such systems will improve your ability to evaluate your developers and software purchases.

Figure 4.1 Overall data process


Cross-Reference
If you would like to try coding some quick and relatively painless ways to download data automatically, watch for points where the following sections refer you to the website and further instructions.

Determining Collection Needs

Before you can deal with big data, you need to establish what constitutes big data for you. In some cases, a small amount of data will do the trick, but in others you will need a lot. In some cases, one type of data will be enough, but in others you will need numerous types. It depends on the problem you are trying to solve, the analytical methodology you want to use, the time and resources you have, the accuracy you require or desire, and the environmental constraints you face including access to data. Overall, to determine your data collection needs, you need to answer the following questions:

  • What data will solve the problem?
  • How much data is enough?
  • Who has the data?
  • Will they give you the data?

This section explains how to answer each of these questions. You should not expect to answer the questions sequentially. Answer the questions in the order in which they are easiest for you to answer. The answer to one may significantly influence the answer to another, and compel you to revise your earlier answer. For example, if you know about the types of data available in or about a specific area of interest, that will greatly inform and limit your answer to the question, “What data will solve the problem?” An answer to one question may even completely disrupt your approach to the problem and force you to come up with another approach. Consider the following example.


Determining Data Collection Needs Example: Conflict in Rural Colombia
Imagine you want to use what people are saying on social media to determine if there is conflict occurring in specific areas of rural Colombia. You are aware of several non-governmental organizations (NGOs) running social media platforms for different reasons in the area, and determine that one way to answer the question is through volumetric analysis. Based on past studies, you conclude that sudden changes in the price of crops in one area compared to the price of crops in nearby areas can tell you if something is happening in that one area. You can then infer that that something is probably related to security and conflict, because the area is known for having lots of conflict and because nearby areas have the same weather, proclivity to natural disasters, and health conditions.
You first ask the question: What data will solve the problem?
You determine that you need data about the price of crops in certain areas over a long period of time.
You then ask: How much data is enough? You realize you cannot answer this until you know what data is actually available, so you move on to the next question.
You then ask: Who has the data?
You identify an existing social media platform, run by a local NGO called mPrice, that incentivizes Colombian farmers to text the price at which they sell their crops. In response, the Colombian farmers receive the average crop prices in their area, so they know what the market looks like and if their price is too high or too low. You look at mPrice's platform more closely and find out that it only launched two months ago. You then need to go back and revisit an earlier question. Again you ask, How much data is enough?
You can only get data covering the past two months. You decide to make up for the lack of data over time by collecting more daily reports. Luckily, mPrice collects 1000 daily reports of a specific crop's price from five areas. You then ask, Will they give you the data?
You politely ask mPrice, but they refuse. They say they do not like to work with your government, and besides, they make some data available publicly. You look at their public data, but it only gives you the average weekly crop price in an area. You decide that is not nearly enough data to conduct volumetric analysis. You have no choice but to start over.
You then decide to determine if security problems exist in specific areas through analysis of other types of data. You identify an NGO called mDisease that queries rural Colombians about health problems. The NGO then uses the information to provide diagnostic services to its users. You reason that by analyzing changes in health problems, you can identify whether the rural Colombians are suffering through extreme conditions such as security problems in their areas. You have already answered the first question—you know you need information about the health status of the rural Colombians. Again, you go through the process of asking the remaining three questions.

What Data Will Solve the Problem?

Expect to require several different types of data to solve one problem. Security events and related behavior are complex, and numerous factors influence them. The data that solves the problem has to take into account at least some of those factors. Use your creativity and knowledge of the problem topic to determine which data will account for the factors or variables in your problem. Additionally, thinking about the data you need reduces the chance that you unwittingly become overwhelmed with irrelevant data.

For example, consider forecasting if peaceful protests in a Middle Eastern country will erupt into violence and revolution. At first glance, the following factors could play a role in this behavior-forecasting problem: the authority and power of the government, the demographics of the protestors, the entire population's view of the government, the protestors' view of the government, the intensity of the protests, government response to protest, and many more. Fortunately, numerous studies exist that posit theories about crowd behavior and the birth of revolutions. They allow you to then reduce the number of factors that affect your problem and solution. Imagine that the theories reduce the number of factors to the demographics of the protestors and the protestors' perception of the government response to the protest. Then you only need to collect data on those two factors or variables. Data that can provide information about those factors could include census data the government routinely collects, and Twitter mentions about the government response originating from the protest location. You now know exactly what type of data to collect. If the Twitter data is not available, you can look at videos protestors post on YouTube about the protests. If they are posting many videos showing government violence against the protestors, you can infer their perception of the government response.

Over time, you will develop an intuition for quickly identifying the types of data that will help you solve your problem. Prior knowledge about the people, the location, and the issues involved will go a long way toward developing your intuition. We have developed our own intuition about what types of data are most relevant to solving security-related problem sets. See Table 4.1 for a list of data types you will most need to solve the five problem sets.

Table 4.1 Example Data Types for Problem Sets

Problem Set | Data Type
Understand the structure of social networks | Lists of user accounts, the user's followers or friends, the user's groups
Identify key people and relationships | The user's followers or friends, the user's groups, how many times a user messages someone else, who reads the user's messages
Determine the spread of ideas | Retweets, reposts, “Likes”, number of times content has been favorited or shared
Understand and forecast behavior | Past behavior of people involved, demographic statistics, geospatial constraints, political and sociological factors, real-time social media feeds of people involved
Understand and forecast events | Past event information, behavior of people involved, demographic statistics, geospatial constraints, political and sociological factors, real-time social media feeds of people involved

How Much Data Is Enough?

Determining the size and amount of adequate data is far from an exact science, especially when it comes to social media data. Frankly, statisticians are not sure what to make of social media data. Ideally, in a traditional statistical analysis, you want to assemble a random assortment or sample of data. A random sample ensures you do not have biased data, which can lead to misleading conclusions. However, social media data is not random, and hence is inherently biased. Despite its popularity, only certain types of people use social media, and those people may have distinct behavioral patterns. Acknowledge this bias and take it into account when considering the strength of conclusions or theories you infer from your analyses.

In many cases, social media data is also lacking or limited. The realities on the ground limit the amount of data you can collect and what you consider enough data. Billions of people use social media in urban areas, but only a few hundred may use social media in specific rural areas. In other cases, such as for specific events, only a certain amount of social media data may be available. Perhaps only a few hundred people tweeted about an event, and that will limit what you consider enough data. Also, you may not realize that you are missing out on enormous amounts of social media data. During the Arab Spring, the protestors used secret hashtags made up of nonsense words to share logistical information among themselves. If you did not know about the hashtags, you would have missed out on a lot of critical data needed to understand how protestors organize protests using social media.

Despite the difficulties of dealing with social media data, you can use a few guidelines to determine what constitutes enough data. The guidelines also depend on the type of analysis. Generally, collect as much data as you possibly can, given time and resources. Contemporary computers and statistical software can handle millions of data points. The more data you have, and the more diverse it is (data about tangential topics or from many sources), the less likely your analyses will suffer from bias. Collecting all the data you can is much more important for social network analysis than the other statistical analyses in this book. If you want to analyze the social network of human traffickers in Eastern Europe who use cell phones, you need to know who all those people are and how they are connected with each other. If you do not know about certain people and relationships, you may miss out on key components of the social network, leading to a junk analysis.

If collecting as much as possible is out of the question, you should collect at least a large enough sample of the population you want to analyze. Determining the exact percentage that constitutes a large enough sample is difficult and depends on the population size. However, a good rule of thumb to maintain confidence in your analysis and reduce errors is to collect at least a sizable percentage of the population you want to analyze, somewhere around 10 percent. The concept behind increasing confidence and decreasing error levels is simple. For example, consider that you want to determine what percentage of the global population is male and female. To do your analysis, you will ask people what gender they are and keep track of what they say. You ask the next 10 people you see for their gender—eight of them say male, and two say female. Based on your observations, you conclude that 80 percent of the world's population is male, and 20 percent is female. Of course, that conclusion is wrong. About 50 percent of the population is male, and the other 50 percent is female. The world's population is 6 billion people, but you only asked 10 people, or about 0.00000017 percent of the total population size. You should ask at least 600 million people, or 10 percent of the total population size. That is still a lot of data to collect. Fortunately, you do not always need to collect 10 percent of the total population size if the population size is extremely large. Statistical theories and tests show that collecting 500–1000 data points per population is usually enough. If you ask only 500–1000 people about their gender, you can be fairly confident in your analysis. Technically, with 500–1000 data points, you can work at a 95 percent confidence level with a margin of error of about +/– 5 percent, meaning your numbers may be off by up to about 5 percentage points. In other words, there may be 55 percent males and 45 percent females, or 47 percent males and 53 percent females. A margin of error of only +/– 5 percent is usually very good. Cases where you have far fewer data points, or where the total population size is tiny, such as 10, will result in high error rates. You cannot be confident in cases where the population size is so low.
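
If you want to check the arithmetic behind these rules of thumb yourself, the standard approximation for the margin of error of a proportion at 95 percent confidence is 1.96 * sqrt(p(1 - p)/n). The short Python sketch below applies it; like the discussion above, it ignores the finite-population correction and other caveats.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion at 95% confidence (z = 1.96).

    proportion=0.5 is the worst case, which is why pollsters assume it by
    default. The finite-population correction is ignored, as in the text.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (10, 500, 1000):
    print(n, "data points ->", round(margin_of_error(n) * 100, 1), "percent")
# Prints roughly: 10 -> 31.0, 500 -> 4.4, 1000 -> 3.1 (all in percent)
```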


Warning
In reality, determining sample sizes and confidence levels is more complicated than what we posit, but we assume you are not doing academic studies. Also, we would fall asleep while writing out all the caveats.

Analyzing social media data is of course more difficult than the simple polling study about genders. The following short example describes the process and challenges.


Social Media Data Analysis Example: Adequate Population Size for Analysis
Consider that you want to conduct volumetric analysis and correlate suicide bomber activity in a certain country last year with online activity. You may have a theory that right before a suicide bombing, there is a spike in activity on extremist forums, probably because sympathizers post supportive messages. You do not care about who posted what. You only want to see if a few days before a suicide bombing, there was a spike in forum postings. You have a comprehensive list of all the suicide bombings that occurred in the country last year, and detailed data about some of them (time, location, motive). You also have data from extremist forums including forum data logs (posting times), user logs (who, from where, what time, how long), and other usage statistics from last year.
You have data and now you need to determine if you have enough, and how much of it you need to do your analysis. The less you need to use to get an accurate result, the better, because that will reduce the time and resources you need to spend on acquiring more.
To see if you have enough data, you need to identify the total population size of the various variables you are considering. According to your findings, there were a total of 100 suicide bombings and 100,000 forum posts.
Now, you check to see how much detailed data you have about each population set. In this case, detail is the date and time of each bombing and posting. You find that you have detailed data for 50 bombings and 80,000 postings.
Because the total population size of suicide bombings is relatively low, the more data you have about the entire population size, the better. In this case, you have 50 percent of the total population size, resulting in 50 data points. The total number of data points is low, which reduces your confidence level. However, you have a substantial portion of the total population size, which increases the confidence level.
You have enough data about forum postings. You have 80 percent of the total population size, which is quite large. Overall, you can be reasonably confident in your analysis and should continue doing the correlation.

To conclude, for most statistical analyses, collect at least 10 percent of the total population size. If that is not possible, collect at least 500–1000 data points from each population you want to analyze. For social network analysis, collect as much as you can.


Cross-Reference
The website has calculators that will help you calculate the confidence and error levels in your analysis, based on how much data you use.

Who Has the Data?

Identifying who has the data is usually the easiest step of the data collection process. In most cases, the answer to the question is obvious. If you are after specific social media data, the owner of the social media platform will have the data. Facebook owns and has all the data users create and share on Facebook. Twitter owns and has all the data users create and share on Twitter. Some companies run platforms that do not carry the company name. For instance, Google runs Orkut. Click the About Us or Contact page on social media platform sites and, in most cases, you will find out the identity of the ultimate owner of the platform and the data. In some cases, informal groups, individuals, non-profit organizations, or governments will run social media platforms. You will likely have to contact the individuals or members of the informal groups, non-profits, or governments. In a few cases, the owner of the social media platform will route the data, and requests for it, through third-party organizations. Twitter allows a few companies such as Gnip to sell data users create and share on Twitter. Often, the data you request will exist on several different social media platforms, and numerous people or organizations will own the data. Many people in the West link together their Twitter, LinkedIn, and Facebook accounts. A posting on one platform is then available on another platform. Either platform can then be the source of the data.

Collecting related, non-social media data is somewhat more difficult. If you are working in this field, you likely already have a list of data sources. See Table 4.2 for a list of non-social media data sources that we find useful when solving security analytical challenges. Part III of the book covers collecting non-social media data using crowdsourcing.

Table 4.2 Useful Non-Social Media Data Sources

Source Name | Data Type | URL
U.S. Census | Census and demographics | www.census.gov
World Bank | Various country statistics and indicators | data.worldbank.org
World Bank Worldwide Governance Indicators | Various country governance indicators | info.worldbank.org/governance/wgi/sc_country.asp
Reddit | Miscellaneous | www.reddit.com/r/opendata
University of Maryland START GTD | Terrorist attacks | www.start.umd.edu/start/data_collections/
Harvard Institute for Quantitative Social Science | Miscellaneous social science related | dvn.iq.harvard.edu/dvn/
MIT DSpace | Miscellaneous science and social science related | dspace.mit.edu
RAND Data | Miscellaneous political and social science related | www.rand.org
National Counterterrorism Center | Terrorist incidents | www.nctc.gov/site/other/wits.html
Open Data Initiative | Miscellaneous | www.opendatainitiative.org
Jane's IHS | Miscellaneous security related | www.janes.com/products/janes/index.aspx
Google Correlate | Miscellaneous | www.google.com/trends/correlate
Google Public Data | Miscellaneous social science related | www.google.com/publicdata/directory
New America Foundation | Drone strikes in Pakistan | counterterrorism.newamerica.net/drones
UN Open Data | Various country statistics | data.un.org

When evaluating non-social media data sources, keep the following questions in mind. These questions also can apply to social media data sources.

  • Does the data source provide enough data? If you are trying to find a causal relationship between insurgent attacks in Baghdad and Iraqi conversations on social media, make sure that the source for the insurgent attacks lists an adequate number of attacks. A list consisting of only two attacks will not get you anywhere.
  • Can other sources appropriately validate or make up for missing data? Often, several organizations keep track of and make data available about the same topic. In that case, combining and cross-referencing data sources is fine, as long as the data looks similar and the organizations collect the data using similar methodologies. If one organization uses on-the-ground researchers to collect data about insurgent attacks, and the other uses what they heard on Facebook, the methodologies are dissimilar and combining them may cause problems. You can, however, still use dissimilar data sets to cross-reference and validate each other.
  • Do reputable organizations or individuals use or collect the data? University programs routinely collect data about specific topics using sound methodologies. For example, the University of Maryland's START Global Terrorism Database collects information about terrorist attacks around the world and other related activities. Other reputable organizations include think tanks and government departments (depending on the government). Increasingly, individuals are starting blogs and websites where they collect data about niche topics. These individuals are often not connected with a university or think tank, and are often located in places such as rural Africa and Latin America. However, do not let the lack of organizational affiliation dissuade you from using the data. Some unaffiliated blogs and websites are great sources for data, especially where the authors live in the region they are blogging about. See if reputable organizations and individuals use the data source. If they do, then you can trust the data. If they do not, contact the website's owners and ask questions about their intentions and how they go about data collection. You never know when you might stumble on a hidden data source gem. Generally, keep away from websites that are openly biased or partisan. They likely scrub away data that does not support their biases.
  • How much of the data is already cleaned and filtered? Before they make the data available, many organizations use their own methodologies to filter the data and eliminate incorrect and redundant information. Some go further, and structure the data so it meets a specific format. Consumers of the data then know what to expect each time they download the data, improving overall efficiency. However, sometimes organizations delete information that, although irrelevant to 99 percent of their consumers, is relevant to you. Look through the source's filtering methodologies so you know exactly what is in the data and what is not, and how many resources you will need to devote to further filtering. In some cases, especially if the source of the data is an academic organization, you can ask for and get access to the raw, unfiltered data.

Will They Give You the Data?

Identifying the data source is only half the battle. The real challenge lies in actually getting the data, and in some cases persuading the data owner to give you the data. Persuasion may take the form of asking nicely or paying money. Overall, you will face one of the following four situations when trying to get the data.

The Data Is Publicly Available

Three categories of organizations own publicly available data that you can download for free at any time. The first category involves nonprofit organizations and transnational organizations collecting the data, including the African Development Bank, the United Nations, the World Health Organization, Amnesty International, and university-sponsored initiatives. They are collecting the data as a public service, and in most cases make the data available in a variety of formats openly on the web. They often want people to consume the data and will go to great lengths to make sure their data collection methods are sound and routine, and that you can download the data from them in an easy and reliable manner, such as through monthly XLS file data sets. Getting data from them is easy. If they have data they have not made publicly available, you can usually just ask.

The second category involves nonprofit organizations running social media (most likely crowdsourcing) platforms that are free to users and focus on sharing specialized data, including Wikipedia and Ushahidi's Crowdmaps. In some rare cases, for-profit organizations run similar types of platforms, including IBM's Many Eyes (http://www-958.ibm.com/) data visualization platform, where users can upload and share data sets. As in the first category, the organizations expect and encourage the users to download the data from these sites. However, unlike in the first category, the organizations usually do not filter and validate the data.

The third category involves for-profit organizations running social media platforms that are free to users, including Twitter, Facebook, and Orkut. Organizations in the third category differ from those in the first two primarily by their intention. Organizations such as Twitter do want you to openly and freely share some data with each other, but they do not care if you can download the data in an easy and reliable manner. Thus, they do not make most of the data available through easily accessible methods such as XLS file data sets. They will usually allow you to access part of the data through their application programming interfaces (APIs). We discuss APIs later. Getting data from organizations in the third category is more difficult than getting it from organizations in the first two categories. Also, third category organizations will limit the amount and type of data you can get. Asking them nicely will usually not do the trick, but you could pay for it.

The Data Is Privately Available

As big data grows in importance, more organizations are realizing that they can sell data. Two categories of organizations own privately available data that you can purchase. The first category is relatively new and involves social media platforms like Twitter indirectly selling primarily social media data. They are starting to realize that they need to generate revenue and so are looking toward selling the data users create on their platforms. They still make some of the data publicly available through APIs, but make the majority of it only privately available. Because this category is nascent, buying social media platform data is not yet easy. You will likely have to go through third-party companies such as Gnip that manage the data sales for the social media platforms. Online marketing companies are also getting into the game and striking special deals with social media platforms to sell the data. Expect the data to be expensive.

The second category is more established and involves for-profit organizations such as Western Union and IHS Jane's selling primarily non-social media data. In some cases, they openly advertise the sale of the data, but in other cases, they will only sell data if you ask. For the latter, prices will vary, especially if the data is unique and exclusive.

The Data Is Classified

Governments, including intelligence and law enforcement agencies, have access to enormous amounts of data that is usually not available to the public. Governments may have special relationships with social media platform owners that allow them to access much of the platform data. To access government data, you likely need a clearance and you need to convince the government to sponsor your project. The rules for using sensitive but fundamentally open source information are complex and highly dependent on the project, and we wish you the best of luck figuring them out.

People Who Do Not Like You Own the Data

You will face this situation more than you expect, especially if you work on areas that are not friendly to your government. If you are working on behalf of the U.S. Department of Defense, do not expect Chinese platforms such as Qzone to provide you with data, no matter how nicely you ask. Many non-profit organizations, especially those in the developing world, are also skeptical of sharing data with people working for or with governments and militaries. The best way around this is to get someone else to ask for you. In our experience, organizations are more open and willing to work with graduate students. Much of society considers them harmless and insignificant. Partner with universities, dust off your old college e-mail, or hire students to do the data sourcing for you. When asking for data, you can then simply say you are working on a project that requires the data. You do not want to lie about your identity, project, or intentions, but you do not have to tell them the whole truth.

Collecting the Data

After determining what data you need, and where and if you can get it, you can start collecting or downloading the data. Expect to download data from numerous sources simultaneously and over a substantial period of time. Automate the collection and routinely download the data so you can focus on other tasks. You have numerous manual and automated ways to collect the data, and they depend on the type of data you are trying to collect and the format in which it is available. We explore the ways that are easiest and quickest to implement and will serve most of your needs.


Note
We wrote this section assuming you will desire some sort of automated data collection system. However, the information is also relevant if you want to only manually collect data for a one-off project.

Data Framework

The first step for creating an automated data collection system is building a data-gathering framework. The framework will house the technologies that go out into the world, download data, and bring it back to you. The framework can be as simple as a Google Document, which you can learn how to create on our website, or as complex as a robust Django-based application, which you should hire developers to create. The technical details of a robust framework are beyond the scope of this book, and generally not something you should worry about. Commission trusted developers to create the framework, and communicate to them the importance of making the framework light, adaptive, and fast.

APIs

The second step is creating a technology that collects data from web APIs and housing it in the framework. Dealing with online data requires understanding what APIs are and how they work. They are one of the most important ways to share information and processes on the Internet among different organizations. Most of the websites you regularly use either offer an API or use several APIs. In simple terms, a typical web API enables you to query an online service using a small bit of code. Essentially, you ping an API or send it some information about what you want through the Internet, and the API outputs other information or modifies your inputted information in response. See Figure 4.2 for a visual description of the API process.

Figure 4.2 API process

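In code, a ping is usually just an HTTP request. The Python sketch below sends a GET request to a REST API and parses the JSON that comes back; the endpoint, parameters, and response fields are hypothetical stand-ins, so substitute whatever the API you are actually using documents.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical REST endpoint and parameters; replace them with the URL,
# parameters, and API key that the real API's documentation specifies.
BASE_URL = "https://api.example.com/v1/search"
params = urllib.parse.urlencode({"q": "crop prices", "count": 50})

with urllib.request.urlopen(BASE_URL + "?" + params) as response:
    data = json.loads(response.read().decode("utf-8"))

# Most REST APIs return JSON: nested dictionaries and lists you can walk.
for item in data.get("results", []):
    print(item.get("created_at"), item.get("text"))
```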

Table 4.3 lists examples of a few relevant APIs. They range from APIs that tell you what your friends are tweeting to APIs that spell-check a sentence for you. Services that combine various APIs to create a whole new application are known as mashups. You are learning how to create a mashup. More and more of the Internet is populated with websites and platforms that are mashups.

Table 4.3 Examples of Relevant APIs

Name | Description | URL
Yahoo! Term Extractor | Extracts significant words or phrases from content | developer.yahoo.com/search/content/V1/termExtraction.html
Yahoo! Geo Technologies | Collection of various APIs that help identify, extract, and share locations from content | developer.yahoo.com/geo/
Facebook Graph API | Get various information about a Facebook page or profile | developers.facebook.com/docs/reference/api/
OpenStreetMap Editing API | Fetch data from the crowdsourced OpenStreetMap database about the locations of places | wiki.openstreetmap.org/wiki/API
TwitterCounter API | Retrieve information about a Twitter username such as number of followers | twittercounter.com/pages/api
Face API | Detect, recognize, and tag faces in any photo | developers.face.com/docs/
Wolfram Alpha API | Return information about anything—similar to a smart search engine | products.wolframalpha.com/api/
Cadmus API | Receive various popular social media feeds | thecadmus.com/api/docs

Several free and payment-based APIs offer specific data that you can then download and use as you please. Free APIs allow anyone to ping them, with some restrictions. Payment-based APIs, a burgeoning industry, allow paying subscribers to ping them, usually with fewer restrictions. Some charge a flat monthly fee or even a fee per ping.

We prefer collecting data through APIs. Because the platform host or data source is usually the one offering the API, you can trust the data. Most social media platforms have some sort of APIs, and they are usually technologically reliable. They are also legal and safe to use. Twitter may sue you if you use other ways to get data from Twitter. But they will not sue you if you are using their APIs and following their API rules. Using APIs, especially free ones, has a few downsides. Most free APIs have rate limits where they limit the number of times you can ping them over the course of a time period. They often change the rate limits and institute other confusing restrictions. Also, because APIs are a relatively new phenomenon, standardized rules are only now emerging about how data is shared across different APIs and which formats the APIs should accept and return. Social media platforms are leading the charge on APIs, and other data websites have only recently started to catch up. Most APIs you will run into are known as REST APIs, which you ping using HTTP requests and which output information in JSON, XML, RSS, or ATOM format. Because Twitter is a pioneer in offering data through APIs, we will explore its APIs as an example. The details of the Twitter APIs will help you understand the boundaries of data collection using APIs, and keep you from making impossible demands of your developers or software. The details will also illustrate how the technologies of data collection can change your analysis. Later, we discuss using APIs to analyze, and not only collect, data.

Twitter's APIs

Twitter offers several APIs and it encourages the public to use them to create mashups. The more mashups that use Twitter, the more ubiquitous Twitter becomes. Twitter and other social media platforms have created websites that teach people how to use their APIs. You only need a basic understanding of coding to get started.


Note
See http://dev.twitter.com/ for Twitter's API website. To get access to other platform API sites, Google the name of the platform and the word “API” and click the first link that pops up. Most API sites are labeled “[Name of social media platform] for Developers.”

By using Twitter's APIs, you can in near real-time get information about what specific people are tweeting, who is retweeting a specific tweet, what tweets contain a specific word, how many followers a person has, the profile pictures of specific people, and much more. Different APIs or portions of the APIs provide different types of information, and so you should expect to create a technology that pings several APIs simultaneously, sometimes using data queried from one API to query another API. To use some of the APIs you may need your own Twitter account; an API key that lets Twitter track who is pinging its APIs; and OAuth authorization, which is the emerging standard for authenticating API requests. Visit Twitter's website to learn how to acquire the authentications.

The Search and Streaming APIs are two of the most popular Twitter APIs. To use the Search API, you send a search query to the API; in other words, ping the API with a keyword. The API then returns the most recent tweets at that time that contain the keyword. So if you ping the API with the word “President,” it will return a list of the most recent tweets containing the word “President.” You can also send more complicated queries. You can ask for tweets containing some keywords and/or missing certain keywords, tweets created in a certain time period, tweets in specific languages, and more. You also do not need to provide authentication—anyone can ping the Search API.


Warning
Twitter and other social media companies frequently change the requirements for accessing their APIs. As of this writing, rumors are swirling that Twitter may require authentication to use any of its APIs. Always check their website to find out about their newest policies regarding APIs before you begin a data collection project.

However, the Search API has limits. It only gives you tweets going back about a week, and it limits the number of times you can ping it per hour but does not tell you the limit. To be safe, ping it only once per hour. It also does not provide all the tweets containing the keywords, but only an unknown percentage. The Streaming API, in contrast, does not have a time-based rate limit. However, it requires authentication and only provides you about 1 percent of tweets that contain your search query at the moment you ping the Streaming API. It also does not return tweets from the past.
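
As a concrete illustration, the sketch below polls the Search API once an hour for tweets containing a keyword. It assumes the unauthenticated search.json endpoint and response fields that were documented as of this writing; as the preceding Warning notes, verify the current endpoint, parameters, and authentication requirements at dev.twitter.com before building on it.

```python
import json
import time
import urllib.parse
import urllib.request

# Endpoint and response fields as documented at the time of writing; Twitter
# changes its API requirements frequently, so check dev.twitter.com first.
SEARCH_URL = "http://search.twitter.com/search.json"

def search_tweets(keyword):
    query = urllib.parse.urlencode({"q": keyword, "lang": "en"})
    with urllib.request.urlopen(SEARCH_URL + "?" + query) as response:
        return json.loads(response.read().decode("utf-8")).get("results", [])

# Ping no more than about once per hour to stay under the unpublished rate limit.
while True:
    for tweet in search_tweets("president"):
        print(tweet.get("created_at"), tweet.get("from_user"), tweet.get("text"))
    time.sleep(3600)
```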


Cross-Reference
The website teaches how to create a Google Document that pings the Search API routinely using your search criteria. It requires a basic understanding of programming to build from scratch. The website also provides a pre-constructed Search API application that you can customize without any coding knowledge.

You should start to appreciate how the boundaries of the free Twitter APIs can influence your data collection and thus your analysis efforts. If you can only get data from the free Twitter APIs, you cannot do analysis on all the tweets, but only a tiny percentage of them. You are missing out on enormous amounts of data, even if the vast majority of tweets are irrelevant to you. You can try getting around the rate limits by having numerous different accounts ping the API, and then incorporating the data later. However, if Twitter catches you, it will shut off your access to its APIs completely. Thus, we do not recommend you do so. Also, you will not be able to do analysis on tweets older than a week, unless you keep pinging the APIs for a long time and actively store the tweets.

RSS Feeds

The third step for creating a data aggregation system is creating technology that downloads data from RSS (RDF Site Summary or Really Simple Syndication) feeds. An RSS feed of a website is basically a URL where the website regularly spits out its content and associated metadata in XML (eXtensible Markup Language) format. XML is a language that allows someone to deliver data using a customized structure so humans and computers can easily make sense of it. RSS feeds are sort of like APIs, except you do not ping an RSS feed with a search query. Instead, the RSS feed spits out the data and you simply hook into it at any time and download the data to your heart's content, like a hose hooking into an open fire hydrant.

RSS feeds are commonplace, easy to use, and popular. Many news sites and blogs deliver textual data, pictures, and podcasts through both their websites and their RSS feeds, so you can trust the data. Popular applications like Google Reader and Flipboard aggregate numerous RSS feeds from sites like the New York Times, Wired, and the Atlantic, and use the data to deliver news. Downloading data from RSS feeds is ideal for us, because the analysis of many security issues requires assessing data from news sites and blogs.
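
As a quick illustration, the third-party Python library feedparser (one assumption here; any feed-parsing library will do) turns a feed URL into a list of entries you can loop over:

```python
import feedparser  # third-party library: pip install feedparser

# Substitute the RSS or ATOM URL you found via the site's "RSS" or "Feed" link.
FEED_URL = "http://example.com/feed.rss"

feed = feedparser.parse(FEED_URL)
print(feed.feed.get("title"))

for entry in feed.entries:
    # Each entry carries content plus whatever metadata the feed owner exposes.
    print(entry.get("published"), entry.get("title"), entry.get("link"))
```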


Note
To find the RSS feed of a website, click on links on the website that say “RSS,” “Feed,” or “ATOM,” which is a cousin of RSS feeds. The RSS feed's URL is the URL of that link.

Several RSS feed readers exist either as standalones or integrated with other data aggregators. You should never pay for a standalone RSS feed reader because numerous high-quality free ones exist. The only downside to RSS feeds is that the data you can get from a feed, including content, metadata, and the time span of historical data, is limited to what the owner of the feed chooses to provide.


Cross-Reference
The website teaches how to create a simple RSS feed reader and aggregator, and integrate it with the Google Doc Twitter API aggregator.

Crawlers

The fourth step is creating technologies called crawlers to download data from websites and platforms that do not offer APIs or RSS feeds. Crawlers are a relatively old technology that browse or crawl through specified parts of the Internet in a systematic fashion looking for specified data. A specific crawler can, for example, collect all URLs mentioned on a specific website, download all textual data and pictures on a specific page of a website, or download data from large offline XLS or CSV data set files. Search engines use crawlers to go through the Internet and collect information about websites that they then index and organize. The crawlers usually jump from link to link, scouring the Internet for information like spiders crawling through their spider webs hunting for mosquitoes. Crawlers are useful for collecting data from older websites that do not have APIs or RSS feeds, forums, websites that are trying to prohibit data collection, and websites that routinely provide large data sets.

Programming a crawler requires minimal coding knowledge, usually of the language Python. Several free crawlers are available that you can download and greatly modify. Crawlers come in the following five types:

  • Link—Identify, collect, and index all URLs on a website
  • Content—Identify, collect, and index all content on a website
  • Search—Identify and index the relationships between several websites and collect information about them like a search engine
  • Focused—Identify, collect, and index specific data from specific websites
  • Non-web—Identify, collect, and index specific data from existing data sets and other offline files

Regardless of the crawler you use, you have to tell it four things:

  • Which websites or pages to target
  • How to interact and communicate with other crawlers
  • When to check for changes to the targeted pages so that it knows when to crawl through them again
  • How to avoid being banned from the targeted page by, for example, crawling it only once every few days. Major websites usually publish a page specifically for crawlers (commonly robots.txt) that tells them how to behave. The sketch following this list shows these pieces in miniature.
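
The following Python sketch is a minimal focused link crawler built under those constraints. It uses only the standard library, checks robots.txt before fetching, and waits between requests; a production crawler would add politeness policies, error handling, and coordination between crawlers that are out of scope here.

```python
import time
import urllib.parse
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10, delay=5):
    """Breadth-first link crawler that obeys robots.txt and paces itself."""
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urllib.parse.urljoin(start_url, "/robots.txt"))
    robots.read()

    seen = set()
    queue = [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        # Turn relative links into absolute URLs and queue them for later visits.
        queue.extend(urllib.parse.urljoin(url, link) for link in extractor.links)
        time.sleep(delay)  # crawl slowly to avoid being banned
    return seen

# Example use with any start URL you are allowed to crawl:
# print(crawl("http://example.com/"))
```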

Cross-Reference
The website lists several open source and proprietary crawlers that you can use and modify.

The flexibility and modifiability of crawlers confer several advantages. They are easy to create and modify, can crawl through virtually any website, and can download almost any type of data. However, their strengths are also their greatest weakness, because much of the Internet dislikes crawlers and thwarts them. Many popular platforms like Twitter identify and block crawlers. They allow the crawlers of search engines like Google and Bing to crawl through their websites but block yours. Also, websites are increasingly incorporating human verification checkpoints on their pages, the most popular of which is CAPTCHA (or reCAPTCHA), which most crawlers have difficulty getting past. See Figure 4.3 for a picture of a CAPTCHA. Although most crawlers are easy to implement, the strongest ones require creative developers. Using crawlers to download information from Neo-Nazi or Jihadi forums that are often password-protected requires ingenuity that you cannot get for free. You may also need developers on hand if your targeted sites change in look and structure. You then have to reprogram your crawler and teach it to navigate the changed site, which can be tedious and frustrating.

Figure 4.3 CAPTCHA example


Filtering the Data

After downloading data through APIs, RSS feeds, and crawlers, you need to filter the data and eliminate parts of it that are irrelevant. Unless you have stumbled onto some amazing hidden relationship between the international drug trade and Justin Bieber, most of the data generated on social media platforms is irrelevant to you, and is thus noise. A significant portion of activity and information on social media platforms, especially those that are Western-centric, is about popular culture events or mundane topics, and is at best entertaining and at worst self-aggrandizing. One study found that people consider nearly a third of all tweets to be completely useless to them.2 Sifting through social media data to delete irrelevant data is difficult but necessary. You will not face a similar problem with non-social media data, because most non-social media data is already filtered. The University of Maryland's START databases will not likely start mixing lists of terrorist attacks with pictures of bunnies with pancakes on their heads (unless maybe terrorists start putting pancakes on the bunnies' heads).

We have already covered the first step in eliminating noise—determine and focus your data needs. If you are not pinging Twitter's Streaming API with the keyword “Justin Bieber,” you will likely not receive tweets containing “Justin Bieber.” Similarly, do not download the RSS feed of the gossip site TMZ, and do not crawl websites dedicated to Justin Bieber. Focus your search queries and target websites that are tangential, if not directly relevant.

We expect the furrowing of your brow and the criticism that you are about to put forth. Indeed, we have heard it numerous times before. The criticism is that by focusing your search queries, you may miss something. Maybe Mexican drug cartels are impersonating 15-year-old girls on the Internet and coordinating drug activity under the guise of talking about Justin Bieber, and you need to crawl through Bieber fan sites and tweets. That could absolutely be true. Drug cartels could also be impersonating fans of Twilight or the New York Yankees. The possibilities are endless and a black swan may pop up anywhere. In our field, the drive to collect ever more intelligence and information for fear of missing something has understandably become more intense since the attacks of September 11th, 2001. Our response is:

1. Studies on data analysis and human intelligence have shown that more information does not equal more knowledge. In fact, in some situations, more information causes us to become paralyzed or even make the wrong decision.3
2. We are not teaching forensic analytics or data mining. We are not teaching you how to discover the ultimate source of a tweet, or spot the critical text message between terrorists. Our focus is on uncovering answers by looking at aggregate data and behavior.
3. However, some of our tools will help you conduct forensic analytics, but in indirect ways. For example, through volumetric analysis, you can check whether unusual changes in the virtual traffic of Bieber fan sites correlate with increases in real-world drug activity, which can tell you something about whether Mexican drug cartels really are impersonating Bieber fans online. After identifying a relationship, you can refocus your data needs.
4. To uncover unusual relationships, and to keep your analysis innovative, it helps to periodically shift your data collection needs and modify your search queries, sometimes without an educated guess as to why. Introducing randomness into data collection may help.
5. Come to grips with the fact that you will never be able to deal with all the data the Internet and other sources generate. As database technologies proliferate and improve, so do the technologies to create and share data. Chasing after every last piece of data will leave you exhausted and seriously tax your resources. You likely will never catch up, and that is okay. It helps to focus on what you know best and let someone else chase doubtful connections.

Note
Mexican drug cartels might read this book and decide to shift their communication to Justin Bieber forums. So maybe you do want to focus on the Bieber forums.

After focusing your searches, delete duplicates and spam from your collected data. Much of the content on Twitter is simply people retweeting what others have tweeted. Unless you are interested in the popularity of certain messages in a network, you can consider retweets to be duplicates and thus noise. Spam bots, which post nonsense, pornography, and malware, are pervasive on social media sites, especially Twitter. If an event is really popular, and a hashtag becomes associated with it, expect the spam bots to quickly inundate the hashtag with spam.

Creating a noise-eliminating tool from scratch is difficult and a waste of time and resources. The easiest route is to purchase access to an API that filters social media data for you (the APIs usually do more than just eliminate noise) or to use a free collection of data aggregation and analysis tools called SwiftRiver that we again mention in the last section of this chapter. In some cases, the APIs of social media platforms will build in options to remove noise and spam. Twitter, for example, allows you to query its API in such a way that it does not return retweets.

Filtering the data often goes beyond simply eliminating noise and spam. Filtering can also involve validating the data to ensure only correct information is stored and that misinformation is removed. Currently, the technology to do so is nascent and very difficult to implement. Many companies claim their commercial APIs can validate data, but for the most part their claims are more fantasy than reality. You will have to manually validate data by looking at some of the data at random and asking:

  • Are other people saying the same thing? If only one person claims to see something while people around him do not, he may be lying.
  • Is the source trustworthy? You may have a list of certain Facebook or Twitter accounts that you know post incorrect information. You can easily program the Twitter API and other data aggregation tools so you do not receive data from sources you do not want, as the sketch below illustrates.
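
A minimal noise filter along these lines might look like the Python sketch below. It assumes each collected tweet is a simple dictionary with "user" and "text" keys (adapt the keys to whatever your collection framework actually stores), drops retweets and exact duplicates, and skips sources on your own blocklist.

```python
def filter_tweets(tweets, blocked_users=frozenset()):
    """Drop retweets, exact duplicates, and tweets from untrusted accounts.

    Assumes each tweet is a dict with at least "user" and "text" keys; the
    key names are an assumption, not a standard.
    """
    seen_texts = set()
    clean = []
    for tweet in tweets:
        text = tweet.get("text", "").strip()
        user = tweet.get("user", "")
        if user in blocked_users:
            continue  # known spammers or accounts that post misinformation
        if text.lower().startswith("rt @"):
            continue  # treat retweets as duplicates, and thus noise
        key = text.lower()
        if key in seen_texts:
            continue  # exact duplicate of something already kept
        seen_texts.add(key)
        clean.append(tweet)
    return clean

# Illustrative use with made-up tweets:
sample = [
    {"user": "analyst1", "text": "Crop prices spiking in one rural area"},
    {"user": "spambot99", "text": "Buy followers now!!!"},
    {"user": "analyst2", "text": "RT @analyst1: Crop prices spiking in one rural area"},
]
print(filter_tweets(sample, blocked_users={"spambot99"}))
```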

Storing and Managing the Data

Deciding how to store and manage the data is one of the most important decisions you will make. The type of storage you use, and its size and complexity, will affect how you integrate, access, and use your data. It will also affect the cost of storing the data, which may affect the scope of your analysis. Conversely, the type of analysis you want to do will affect the type of storage you use. Conducting a one-time analysis with a small amount of data does not require as complex or as much storage as repeat analyses that involve ingesting massive amounts of social media data over a long period of time. We will cover the span of data storage requirements from the most minimalist project to the most multifaceted. Unless you are doing a simple analysis, you will need technical database developers to help you formulate and manage your database. Still, understanding the two relevant database management system types will help you appreciate how you can use the data for analysis and choose an appropriate database technology. You may use each of the two types or only one. Some projects will be more appropriate for one and some for the other. Other types of database management systems exist, but they are not as relevant and are more obscure and thus more difficult to implement.

Relational Databases

The most popular database management system type is known as relational databases. Currently, they are the most widely used, although their popularity is waning because they are not appropriate for dealing with lots of complex data.

What Are They?

Simply put, relational databases organize data into linked tables of rows and columns. In a table, each row represents an object (known as a tuple) and the columns represent attributes of the object. Anytime you use Microsoft Excel to create a spreadsheet, you are in essence creating a relational database. Programming languages such as Structured Query Language (SQL) enable users to conduct queries on the tables. Relational databases are colloquially known as SQL databases because programmers prefer using SQL to access them. Using SQL you can, for example, access all the objects that have a certain attribute or access certain objects but not others. In the context of social media data, relational databases enable you to store each piece of content created on social media platforms as an object and the metadata about the content as its attributes. See Figure 4.4 for an illustrated example of a relational database. Relational databases are the most established and traditional way of storing and managing data.

Figure 4.4 Relational Database Example
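A minimal sketch of the idea, using Python's built-in sqlite3 module: each row is one tweet (the object) and each column one of its attributes. The schema, sample record, and query are our own illustration, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect("social_media.db")
cur = conn.cursor()

# One table: each row is a tweet (the object), each column an attribute.
cur.execute("""
    CREATE TABLE IF NOT EXISTS tweets (
        tweet_id      TEXT PRIMARY KEY,
        author        TEXT,
        created_at    TEXT,
        text          TEXT,
        retweet_count INTEGER
    )
""")

cur.execute(
    "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?)",
    ("1001", "example_user", "2012-06-14", "Flooding reported downtown", 12),
)
conn.commit()

# A SQL query: all objects (rows) whose attribute meets a condition.
cur.execute("SELECT author, text FROM tweets WHERE retweet_count > 10")
print(cur.fetchall())
conn.close()
```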


Advantages and Disadvantages

The advantages of relational databases derive from their ubiquity. Governments and corporations prefer using them because of their longevity and reliability. Due to this widespread use, you can easily find personnel to implement and manage relational databases. Also, they have gone through numerous iterations and are user-friendly and secure. Major companies such as Oracle and Microsoft support relational databases, and numerous free, open source versions also exist, the most famous of which is MySQL.

In terms of managing and representing social media data, relational databases have numerous disadvantages. The most significant is that they cannot handle enormous amounts of data and do not scale easily. Imagine storing millions of tweets, Facebook updates, and text messages in countless tables, and then add video, pictures, and audio. Every time you query the database for a particular piece of data, the computer has to search through an enormous amount of data, which significantly slows the database down. Relational databases were meant to keep track of store inventories, not what most people on the planet are saying. Programmers have come up with clever ways to quickly query data from large relational databases, but they are reaching their limits. Also, the tables in the database are difficult to modify on the fly; in other words, it is difficult to add attributes (columns) after you have already defined the tables. If you are running a store and need to keep track of objects, you will have a set number of attributes for each object, such as quantity and price, and you will rarely need to add new attributes to all or some of the objects. However, because social media data is so unstructured and differs wildly, you may need the flexibility to add or subtract attributes depending on the content. One tweet may have five attributes, but a text message may have only two.

The second most significant disadvantage is that, despite their name, relational databases do not tell you much about relationships. Social media data is all about how people use different media to communicate and create relationships with each other, so knowing how one piece of content is related to another is essential. Relational databases cannot easily tell you how, for example, one Twitter account is connected to another Twitter account. You cannot easily link seemingly disparate content together, which causes you to miss out on querying information about relationships and connections between content.

Due to these crippling disadvantages, do not use relational databases if you expect to use massive amounts of social media data. The only time you should consider using relational databases is when you have a small or one-time analysis. For example, if you want to use regression analysis to find the strength of the relationship between an increase in Craigslist ads in the adult section and police reports of kidnapping in one city, you can most likely use a Microsoft Excel–based relational database. The rows could be something as simple as each instance of kidnapping. One column could be whether an adult listing appeared within five days of the kidnapping, and a second column could be how many listings appeared.
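If you prefer code to a spreadsheet, the same small analysis can be sketched in a few lines of Python; the numbers below are invented purely to show the mechanics and are not real data.

```python
from scipy.stats import linregress

# Hypothetical, invented values: for each observed period in one city,
# the number of adult-section ads posted and the number of kidnapping reports.
adult_section_ads  = [3, 7, 2, 9, 5, 11, 4]
kidnapping_reports = [1, 2, 1, 3, 2, 4, 1]

# Simple linear regression between the two series.
slope, intercept, r_value, p_value, std_err = linregress(
    adult_section_ads, kidnapping_reports
)
print("slope:", slope, "correlation (r):", r_value)
```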

Non-Relational Databases

The second most popular database management system type is known as non-relational databases. They are gaining in popularity and power because they are ideally suited for handling social media data.

What Are They?

Non-relational databases are, broadly, databases that do not organize data into the fixed tables of the relational model. Colloquially they are called NoSQL databases because they either do not use SQL or use another query language alongside SQL. Numerous types of non-relational databases exist, but the ones most relevant for us are document-oriented and graph databases. Not all document-oriented databases are graph databases, but some can more or less function as graph databases. Cutting-edge technology companies, including Facebook, Google, Amazon, and Twitter, are leading the charge in adopting non-relational databases.

Document-oriented databases store information about an object as a document, similar to a Microsoft Word document. The document can hold numerous attributes, other types of media, and any other data associated with the object. Because each document is standalone, you can easily add attributes later to specific documents. The documents are typically stored in a format such as JSON, and the database's query tools then let you retrieve the document, or part of it, however you wish. If you need to add more information to the document, you can always replace the document with an updated version. In most document-oriented databases, the documents are not linked together in any meaningful way; they look like a stack of documents in a file cabinet, so when you query a document, the computer has to rummage through the cabinet. Programmers have come up with ingenious ways to speed up this process. Due to their flexibility, document-oriented databases such as MongoDB are fast becoming popular. See Figure 4.5 for an illustrated example of a document-oriented database.

Figure 4.5 Document-oriented database example
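A minimal sketch of the document model using MongoDB's Python driver (pymongo), assuming a MongoDB server is running locally; the field names and values are our own illustration.

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["social_media"]["messages"]

# Each document stands alone, so different documents can carry different attributes.
collection.insert_one({
    "platform": "twitter",
    "author": "example_user",
    "text": "Flooding reported downtown",
    "retweet_count": 12,
})
collection.insert_one({
    "platform": "sms",
    "sender": "+000000000",
    "text": "Michael is going to Malaysia",
})

# Query only the documents that have a particular attribute value.
for doc in collection.find({"platform": "twitter"}):
    print(doc["author"], doc["text"])
```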


A graph database goes a step further than document-oriented databases and stores information about the objects and their relationships with each other in a network. As in social networks, each object is a node and it can have as many attributes as desired, similar to documents. Each object is then connected to other objects. You can then link different content or objects together, or in more advanced cases allow the objects to find and create links themselves. Queries can then find information about each object and its attributes, and/or information about how they are connected to other objects. Graph databases such as neo4j are a new and emerging technology. See Figure 4.6 for an illustrated example of a graph database.

Figure 4.6 Graph database example
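A minimal sketch of the graph model, assuming the official neo4j Python driver and a locally running neo4j server with placeholder credentials; the node labels and relationship types are our own illustration.

```python
from neo4j import GraphDatabase

# Placeholder connection details for a locally running neo4j server.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Each object is a node with attributes; relationships link the nodes.
    session.run(
        "MERGE (a:Account {name: $a}) "
        "MERGE (b:Account {name: $b}) "
        "MERGE (a)-[:MENTIONS]->(b)",
        a="example_user", b="another_user",
    )
    # Query how one account is connected to other objects in the graph.
    result = session.run(
        "MATCH (a:Account {name: $a})-[r]->(b) RETURN type(r), b.name",
        a="example_user",
    )
    for record in result:
        print(record)

driver.close()
```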


Before discussing the advantages and disadvantages of non-relational databases, we need to discuss what we consider the object in those databases. In relational databases, the object is usually the content: the tweet, the text message, or the YouTube video. The content can also be the object in non-relational databases; however, it is often better for an entity to be the object.

Entity Extraction

An entity is a name, place, organization, or date that a piece of content refers to or mentions. For example, in the text message, “Michael is going to Malaysia,” the entities are Michael and Malaysia. By extracting entities and making them the objects in non-relational databases, you go beyond simply storing tweets and text messages. Instead, you are virtually representing all the people, places, things, and times mentioned on social media, and storing information about them and how they relate to each other. Thus, when you query the database, you are querying information about specific things, which allows you to analyze those things and how they relate to each other. Entities can also be the objects in relational databases, although it is difficult to update information about them later because of the static nature of relational databases.

You can still store the content, or the actual text message about the entity, if you desire. In the preceding example, the entity Michael would have the text message “Michael is going to Malaysia” as an attribute. If someone else then texts “Michael is married to Mandy,” you can update the Michael entity with that additional information. You not only associate Michael with the new text message, but also derive a new entity from it and relate that entity to Michael. So the entity Michael is now connected to the entity Malaysia (because that is where he is going) and the entity Mandy (because that is his wife). Representing entities as objects in a graph database thus allows you to build an impressive dossier on all things and how they are related, and it works much like the human brain, which is phenomenal at storing lots of linked information and manipulating it quickly.
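To make the Michael example concrete, here is a minimal sketch that holds entities as nodes and relationships as edges using the open source networkx library (a lightweight stand-in for a true graph database); the relationship labels are our own.

```python
import networkx as nx

graph = nx.Graph()

# First message: "Michael is going to Malaysia"
graph.add_node("Michael", type="person")
graph.add_node("Malaysia", type="place")
graph.add_edge("Michael", "Malaysia",
               relation="traveling_to",
               source="Michael is going to Malaysia")

# Second message adds new information to the existing Michael entity.
graph.add_node("Mandy", type="person")
graph.add_edge("Michael", "Mandy",
               relation="married_to",
               source="Michael is married to Mandy")

# Query the growing dossier: everything connected to Michael.
for neighbor in graph.neighbors("Michael"):
    print(neighbor, graph["Michael"][neighbor]["relation"])
```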

To do entity extraction, you usually take the content and send it to an entity extractor API. The API then returns the entities. A few open source entity extractors exist, although they are limited in how many times you can use them, how well they can detect entities given misspellings and slang, and how many entities they can detect. A commercial entity extractor such as Basis Technology's is more robust and smarter. A free entity extractor will likely not realize that “Ravi” is a name, but a commercial one probably will. If you are serious about collecting social media data long-term to tease out the relationships between disparate things, you must integrate entity extraction technology into your data system.
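A minimal sketch of entity extraction using the free, open source NLTK toolkit rather than a commercial API; it assumes NLTK and its tagging and chunking models have been installed, and it will show exactly the limitations described above on slang and unusual names.

```python
import nltk

# One-time downloads of the models NLTK needs for tokenizing, tagging, and chunking:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

def extract_entities(text):
    """Return the named entities NLTK can find in a piece of content."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    entities = []
    for subtree in tree:
        if hasattr(subtree, "label"):            # chunked entities are subtrees
            name = " ".join(word for word, tag in subtree.leaves())
            entities.append((name, subtree.label()))
    return entities

print(extract_entities("Michael is going to Malaysia"))
# A free extractor may still miss informal names such as "Ravi", as noted above.
```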

Advantages and Disadvantages

Non-relational databases confer numerous advantages when it comes to dealing with social media data. As mentioned before, the hottest technology companies are investing in and adopting non-relational databases, which makes it likely that they will grow in popularity, power, and appeal. One reason why companies such as Facebook like non-relational databases is because they can scale and handle enormous amounts of data easily. You can add as many documents and nodes to the network as you want, whereas modifying existing relational tables can be difficult. Many non-relational databases exist on cloud or localized cloud servers that can easily take on more data without you having to go out and purchase physical servers. Cloud servers, which are becoming more secure, also allow users who are not at a specific location or on a specific device to access the data through the Internet at any time. Also, other technologies such as Hadoop make it easier to query non-relational databases and make them more resilient and reliable. Such technologies also allow you to run algorithms through the data and query the results of those algorithms in ways that you cannot do with tables and relational databases. Another big advantage is that non-relational databases model social networks much better than relational databases. They are designed to manage social media data because they focus on managing information about relationships and objects whose attributes can change frequently. If need be, after extracting entities and their attributes, you can delete irrelevant information, thereby saving on data costs. Hence, major social media companies like Facebook and Twitter are funding their development.

The disadvantages stem from the novelty of the technologies supporting non-relational databases. Non-relational databases have only recently become popular and are still very much in development. Kinks and security problems exist and need to be worked out. Also, programmers who can work with non-relational databases are scarce and expensive. However, as the technology flourishes, the disadvantages will dissipate.

Due to the overwhelming advantages and the diminishing disadvantages, we highly recommend implementing some sort of non-relational database, preferably a graph database, to store and manage social media data. They are suited to social media data and to linking pieces of content together as they are naturally linked. If you expect to ingest a large amount of data or to run numerous or long-term projects, you should use a non-relational database.

Third-Party Solutions

Some of you may have headaches by now, or may have thrown your hands up in exasperation. Collecting, filtering, and storing social media and other data requires a lot of effort and resources. Implementing it from scratch for large or long-term projects necessitates employing technical developers, program managers, and subject matter experts. The process becomes even more cumbersome and expensive if you desire automated analytics that can further filter the data for you and deliver it in fancy visualizations. Creating your own automated analytics is beyond the scope of this book, although Chapters 5 and 6 briefly cover how you can turn manual analysis into automated analysis.


Cross-Reference
Chapter 5 describes how to use a free software tool known as NodeXL to download various types of social media data quickly and without much human involvement.

Fortunately, many third-party organizations offer all of the above and more. By our count, nearly 240 software solutions cover part or all of the overarching collecting, filtering, storing, and analyzing process. Some only do data collection, some only do a type of filtering, and some do everything and more. They vary in their reliability, usefulness, and price. Some are absolutely awful, some are great at one thing and terrible at others, some are reasonable at everything, and some are even free. Some offer great customer service and even allow you to customize part of their software offering. Some offer parts of their software as commercial APIs that can help you collect data or do entity extraction.

Choosing the right software solution is difficult, and you will have to make trade-offs. Recognize that few software solutions are appropriate for security purposes. They are more focused on helping companies and ad agencies do brand management and measure popularity, not identify sex traffickers or detect violence. Many major companies that traditionally offer social media solutions to commercial clients, such as Radian6 and IBM, are increasingly focusing on the government and security sector, but they have a long way to go. Unfortunately, social media literacy is low in government and military sectors, and commercial entities exploit this fact to sell subpar and outdated products. We partly wrote this book so that our government officials do not waste money on subpar social media analytical products. Use what you have gained from the book so far and the following questions to evaluate potential third-party products:

  • Can it deal with various types of unstructured data? Twitter is popular now, but another social media platform may replace it tomorrow. Any software that collects data must be able to deal with all types of unstructured data, not just Twitter or Facebook.
  • How much data does it download? Systems that only ping free APIs for data will not have nearly as much data as systems that purchase the firehose, meaning all of the data. Based on your requirements, you may need some or all of the data. Ask how it collects the data.
  • How does it filter the data? The system should eliminate duplicates and spam, but you should be suspicious if it goes much further. Some use machine translation to translate foreign-language messages; however, machine translators do not pick up slang, code words, and sarcasm, so key information may be lost in translation. Also, you may need to know how many times a tweet has been retweeted, so make sure deduplication does not discard that information. Inquire exactly how the filtering algorithms work and what they do.
  • What exactly do the analyses do? Many companies claim their systems can do automated analyses on the data they collect. Most of the time, the analysis is very simplistic and results in aesthetically pleasing graphics that do not help your mission at all. Do not get swayed by shiny things. Insist on knowing exactly how the system does the analyses and what assumptions it makes when doing them. Very few companies actually do social network analysis using social media data, but many claim they can. Many also claim they can identify influencers, but their algorithms to do so are rubbish. Some companies create analytical tools based on absolute nonsense and you should call them out on it. We explore some in Chapters 5 and 6. We are not saying that companies that offer sophisticated automated analytics do not exist. It is just that they are usually smaller, focused on a niche area, and harder to find.
  • How do they store and share the data? Most host the data and the software on a cloud server and allow you to access it as a web application. However, for security reasons, many government customers need software that sits behind a firewall on local servers. Ask if this sort of enterprise solution is possible, and who else gets to access the data you are accessing.
  • Do they provide support? The major social media solution companies are great at providing customer service. However, small or open source organizations are doing some of the most cutting-edge work in this field and cannot afford to provide service. Determine what level of service you need. If you have the resources, skill, time, and audacity to take on more for yourself, then explore open source products. The SwiftRiver platform is a free yet powerful social media aggregation and analysis system that requires only some technical skill to set up and deploy. Check it out at http://ushahidi.com/products/swiftriver-platform.
  • How much does it cost? Because the social media solution market is only now emerging, prices vary considerably. Compare prices and know that most of what you want is probably available for free. Government clients get ripped off a lot. Be aggressive on the price. Oftentimes, hiring developers to build a system for you that is made up of free and commercial APIs will be cheaper.
  • Does it claim that it will solve all your problems? Ignore anyone who says they have solved it all. Social media data collection, management, and analysis are at their primitive, fetal stage. The solutions that exist today will be embarrassing compared to what will emerge five years from now.
  • Can you access it on mobile devices? You probably own a tablet or smartphone, and there is no reason why you should not be able to access software using those devices. Ask if you can access the system using mobile devices, and if the company says no, ask why not. Most likely, they will say they are working on it. The reality is often that they are older companies desperately trying to stay relevant by retrofitting their software to become mobile-accessible. Stay away from such outdated technology.
  • What kind of database does it use? If it does not use non-relational databases, be wary. As explained before, solutions that use relational databases will not deliver the data and flexibility you need quickly and reliably, especially as social media use grows.

Now that you know how to collect data, you can begin conducting analysis. Chapter 5 will describe the process for conducting social network analysis.

Summary

  • Social media analysis consists of intelligently finding and manipulating social media content, associated metadata, and related non-social media data.
  • To determine collection needs, you need to ask:
    • What data will solve the problem? You will likely need a diverse set of data including social media, metadata, and sociological data.
    • How much data is enough? A ten percent sample of the population, or 500–1,000 data points per population, is usually enough to give you a reasonably error-free answer.
    • Who has the data? Social media companies, non-profit organizations, governmental departments, universities, think tanks, unaffiliated but trusted bloggers, and private companies will likely have the data you need.
    • Will they give the data? Public data is easy to get, and you can buy private data but it is usually expensive. Commission a graduate student to get data from unfriendly sources.
  • Data collection will likely require incorporating API queries, RSS feed readers, and customized crawlers together in a framework.
  • Filter the data using focused search queries and commercial and open source APIs to eliminate noise, spam, and duplicates.
  • Use relational databases to store and manage your data for one-time analyses or small projects. For large or long-term projects that involve lots of data, use non-relational databases such as graph databases. They are better suited to social media data and can handle lots of data easily.
  • Entity extraction allows you to keep track of all the people, places, things, and times mentioned in social media content and how they relate to each other, and reduce the amount of data you need to store.
  • Use what you have read to evaluate third-party software applications, whose solutions across the social media data process range from impressive to awful.

Notes

1. Rao, L. (2010) “Twitter Seeing 6 Billion API Calls Per Day, 70K Per Second.” Accessed: 14 June 2012. http://techcrunch.com/2010/09/17/twitter-seeing-6-billion-api-calls-per-day-70k-per-second/

2. The Telegraph (2012) “130 Million Tweets Everyday Are Not Worth Reading, Researchers Find.” Accessed: 14 June 2012. http://www.telegraph.co.uk/technology/twitter/9057314/130-million-Tweets-everyday-are-not-worth-reading-researchers-find.html

3. Hall, C., Ariss, L., and Todorov, A. (2007) “The Illusion of Knowledge: When More Information Reduces Accuracy and Increases Confidence.” Accessed: 14 June 2012. http://webscript.princeton.edu/∼tlab/wp-content/publications/Hall_Ariss_Todorov_OBHDP2007.pdf
