Chapter 7. Social Media Analysis – Analyzing Twitter Data

Connected is the word that best describes life in the 21st century. Though various factors contribute to this connectedness, one has played a pivotal role: the Web. The Web, which has made distance an irrelevant metric and blurred socio-economic boundaries, is a world in itself, and we are all a part of it. The Web, or the Internet (we use the two terms interchangeably from here on), has been a central entity in this data-driven revolution. As we have seen in the previous chapters, the Internet acts as the source of data for most modern-day problems. Be it e-commerce platforms or the financial domain, the Internet provides us with huge amounts of data every second. There is another ocean of data within this virtual world, one which touches our lives at a very personal level. Social networks, or social media, are a behemoth of information and the topic of this chapter.

In the previous chapter, we covered the financial domain, where we analyzed and predicted credit risk for customers of a certain bank. We now shift gears and move into the realm of social media and see how machine learning and R empower us to uncover insights from this ocean of data.

In this chapter, we will cover the following topics:

  • Data mining specifics for social networks
  • The importance and use of different data visualizations
  • An overview of how to connect to Twitter and collect its data
  • Utilizing Twitter data to uncover amazing insights
  • Seeing how social networks pose new challenges to the data mining process

Social networks (Twitter)

We all use social networks day in and day out. There are numerous social networks catering to all sorts of ideologies and philosophies, but Facebook and Twitter (barring a couple more) have become synonymous with the term social network itself. These two social networks enjoy popularity not only because of their uniqueness and quality of service, but also because they let us interact in a very intuitive way. As we saw with recommendation engines used in e-commerce websites (see Chapter 4, Building a Product Recommendation System), social networks existed long before Facebook, Twitter, or even the Internet.

Social networks have interested scientists and mathematicians alike. They are an interdisciplinary topic that spans, but is not limited to, sociology, psychology, biology, economics, communication studies, and information science. Various theories have been developed to analyze social networks and their impact on human lives, in the form of factors influencing economics, demographics, health, language, literacy, crime, and more.

Studies done as early as the late 1800s form the basis of what we today refer to as social networks. A social network, as the name itself says, is a network of connections between nodes or entities, represented by humans and the elements affecting social life. More formally, it is a network depicting relationships and interactions. Hence, it is not surprising to see graph theory and graph algorithms being employed to understand social networks. Where the 19th and 20th centuries were limited to theoretical models and painstaking social experiments, the 21st century's technology has opened the doors for these theories to be tested, fine-tuned, and modeled to help understand the dynamics of social interactions. Though the testing of these theories by some social networks (in what are called social experiments) has been caught up in controversy, such topics are beyond the scope of this book. We shall limit ourselves to the algorithmic/data science space and leave the controversies for the experts to discuss.

Note

The Milgram experiment, or the small world experiment, was conducted in the late 1960s to examine the average path length between people in the United States. As part of this experiment, randomly selected people served as the starting points of a mail chain. Each was asked to send the mail on to a person who would bring it one step closer to its destination (somewhere in Boston), and so on. An average of six hops to the destination is the documented result of this famous experiment. Urban folklore suggests that the phrase six degrees of separation originated from this experiment, even though Dr. Milgram never used the term himself! He conducted many more experiments; search and be amazed.

Source:

http://www.simplypsychology.org/milgram.html
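To make the idea of average path length from the note above concrete, here is a minimal sketch in Python (not the R used in the rest of this book). The five-person network and the function names hops_from and average_path_length are hypothetical, chosen purely for illustration. The sketch counts the fewest hops between every pair of people using a breadth-first search and then averages them.

```python
from collections import deque

# Hypothetical toy acquaintance network (undirected); all names are made up.
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}


def hops_from(graph, start):
    """Breadth-first search: fewest hops from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_path_length(graph):
    """Mean shortest-path length over all ordered pairs of distinct, connected nodes."""
    total = 0
    pairs = 0
    for start in graph:
        for target, hops in hops_from(graph, start).items():
            if target != start:
                total += hops
                pairs += 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    print(hops_from(FRIENDS, "ann"))
    print(average_path_length(FRIENDS))
```

In this toy network, a message from ann needs three hops to reach eve. A real social graph has millions of nodes, so in practice the average is usually estimated from a sample of starting points rather than computed over every pair.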

Before we jump into the specifics, let us try to understand the reason behind choosing Twitter as our point of analysis for this chapter and the next. Let us begin by understanding what Twitter is and why it is so popular with end users and data scientists alike.

Twitter, as we all know, is a social network/micro-blogging service that enables its users to send and receive tweets of a maximum of 140 characters (the limit at the time of writing). But what makes Twitter so popular is the way it caters to basic human instincts. We humans are curious creatures with an incessant need to be heard. It is important for us to have someone or some place to voice our opinions. We love to share our experiences, feats, failures, and ideas. At some level or other, we also want to know what our peers are up to, what's keeping celebrities busy, or simply what's in the news. Twitter addresses just that.

Multiple social networks existed long before Twitter came into existence, so Twitter did not succeed by replacing some other service. In our view, it was the way Twitter organized its information and its users that clicked. Its unique Follow model of relationships feeds our curiosity, while its short, free, and high-speed communication platform enables users to speak out and be heard globally. By allowing us to follow a person or an entity of interest, it lets us keep up with their latest happenings without requiring them to follow us back. The Follow model tips Twitter's relationships towards an interest graph rather than the friendship model usually found in social networks such as Facebook.
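The asymmetry of the Follow model can be illustrated with a small directed graph. The following Python sketch is only an illustration under assumed data: the accounts, the FOLLOWS dictionary, and the helper names followers_of and mutual_follows are all hypothetical and are not part of any Twitter API.

```python
# Hypothetical follow relationships: each key follows the accounts in its set.
FOLLOWS = {
    "alice": {"nasa", "bob"},
    "bob": {"alice", "nasa"},
    "carol": {"nasa"},
    "nasa": set(),  # followed by many, follows no one back
}


def followers_of(follows, account):
    """Accounts that follow `account` (the incoming edges of the directed graph)."""
    return {user for user, followed in follows.items() if account in followed}


def mutual_follows(follows):
    """Pairs that follow each other, the only ties a friendship model would record."""
    return {
        tuple(sorted((a, b)))
        for a, followed in follows.items()
        for b in followed
        if a in follows.get(b, set())
    }


if __name__ == "__main__":
    print(sorted(followers_of(FOLLOWS, "nasa")))
    print(mutual_follows(FOLLOWS))
```

Here the nasa account has three followers but follows no one, which a symmetric friendship model cannot express. Only the alice and bob pair would appear as a tie in such a model.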

Twitter is known and used across the globe for the super-fast spread of information (and rumors). It has been used innovatively in circumstances that were unimaginable before, such as finding people in the aftermath of natural calamities like earthquakes or typhoons. It has been used to spread information so far and so deep that it takes on viral proportions. The asymmetric relationships and high-speed information exchange help make Twitter such a dynamic entity. If we closely analyze and study the data and dynamics of this social network, we can uncover many insights. Hence, it is the topic of this chapter.

Let's apply some data science to tweets using #RMachineLearningByExample!
