Every application has its own unique structure, or architecture, which provides the overarching framework for the application. For this application, we combine the three classes using a Java 8 stream in the ApplicationDriver class. This class consists of three methods:
ApplicationDriver: The constructor, which handles the application's user input
performAnalysis: Performs the analysis
main: Creates the ApplicationDriver instance
The class structure is shown next. The three instance variables are used to control the processing:
public class ApplicationDriver {
    private String topic;
    private String subTopic;
    private int numberOfTweets;
    public ApplicationDriver() { ... }
    public void performAnalysis() { ... }
    public static void main(String[] args) {
        new ApplicationDriver();
    }
}
The ApplicationDriver constructor follows. A Scanner instance is created and the sentiment analysis model is built:
public ApplicationDriver() {
    Scanner scanner = new Scanner(System.in);
    TweetHandler swt = new TweetHandler();
    swt.buildSentimentAnalysisModel();
    ...
}
The remainder of the constructor prompts the user for input and then calls the performAnalysis method:
out.println("Welcome to the Tweet Analysis Application");
out.print("Enter a topic: ");
this.topic = scanner.nextLine();
out.print("Enter a sub-topic: ");
this.subTopic = scanner.nextLine().toLowerCase();
out.print("Enter number of tweets: ");
this.numberOfTweets = scanner.nextInt();
performAnalysis();
The performAnalysis method uses a Java 8 Stream instance obtained from the TwitterStream instance. The TwitterStream class constructor takes the number of tweets and the topic as input. This class is discussed in the Data acquisition using Twitter section:
public void performAnalysis() {
    Stream<TweetHandler> stream = new TwitterStream(
        this.numberOfTweets, this.topic).stream();
    ...
}
The stream uses a series of map and filter methods, followed by a forEach method, to perform the processing. The map method transforms the stream's elements. The filter method removes elements from the stream. The forEach method is a terminal operation: it ends the stream and generates the output.
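As a minimal sketch of how these three operations fit together, the following example applies map, filter, and forEach to a stream of plain strings. The class name StreamPipelineSketch, the pipeline method, and the sample messages are assumptions made for illustration; they are not part of the application.

```java
import java.util.stream.Stream;

public class StreamPipelineSketch {

    // Builds a lazy pipeline: nothing runs until a terminal operation is called.
    public static Stream<String> pipeline(Stream<String> messages, String keyword) {
        return messages
                .map(String::toLowerCase)           // map: transform each element
                .filter(m -> m.contains(keyword));  // filter: drop elements that do not match
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for tweet text.
        Stream<String> messages = Stream.of(
                "Great FOOTBALL game", "Nice weather today", "football tonight");
        // forEach is the terminal operation that triggers the processing.
        pipeline(messages, "football").forEach(System.out::println);
    }
}
```

The intermediate operations only describe the work to be done. The elements are actually processed when the terminal forEach call is reached.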
The individual methods of the stream are executed in order. When acquired from a public Twitter stream, the Twitter information arrives as a JSON document, which we process first. This allows us to extract the relevant tweet information and store it in the fields of the TweetHandler instance. Next, the text of the tweet is converted to lowercase. Only English tweets are kept, stop words are removed, and only those tweets that contain the sub-topic are processed further. Sentiment analysis is then performed on each remaining tweet. The last step computes the statistics:
stream
    .map(s -> s.processJSON())
    .map(s -> s.toLowerCase())
    .filter(s -> s.isEnglish())
    .map(s -> s.removeStopWords())
    .filter(s -> s.containsCharacter(this.subTopic))
    .map(s -> s.performSentimentAnalysis())
    .forEach((TweetHandler s) -> {
        s.computeStats();
        out.println(s);
    });
The results of the processing are then displayed:
out.println();
out.println("Positive Reviews: "
    + TweetHandler.getNumberOfPositiveReviews());
out.println("Negative Reviews: "
    + TweetHandler.getNumberOfNegativeReviews());
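The pipeline and the class-level counters suggest a particular shape for the handler: transforming steps return the handler itself so that map keeps the element type, test steps return a boolean so that they can be passed to filter, and statistics are accumulated in static fields. The following is a rough sketch of that idea under those assumptions. MiniTweet, resetStats, getProcessed, and run are hypothetical names, and the method bodies are illustrative rather than the TweetHandler implementation.

```java
import java.util.stream.Stream;

public class FluentHandlerSketch {

    // Hypothetical stand-in for TweetHandler; the bodies are illustrative assumptions.
    public static class MiniTweet {
        private static int processed = 0;  // class-level statistic shared by all instances
        private String text;

        public MiniTweet(String text) { this.text = text; }

        // A transforming step returns this, so map() keeps the element type.
        public MiniTweet toLowerCase() {
            text = text.toLowerCase();
            return this;
        }

        // A test step returns boolean, so it can be passed to filter().
        public boolean containsCharacter(String s) { return text.contains(s); }

        public void computeStats() { processed++; }

        public static int getProcessed() { return processed; }

        public static void resetStats() { processed = 0; }

        @Override
        public String toString() { return "Text: " + text; }
    }

    // Runs the pipeline and returns how many tweets reached the final step.
    public static int run(String subTopic, String... texts) {
        MiniTweet.resetStats();
        Stream.of(texts)
                .map(MiniTweet::new)
                .map(t -> t.toLowerCase())
                .filter(t -> t.containsCharacter(subTopic))
                .forEach(t -> {
                    t.computeStats();
                    System.out.println(t);
                });
        return MiniTweet.getProcessed();
    }

    public static void main(String[] args) {
        int count = run("football", "FOOTBALL tonight", "Nice weather", "More Football");
        System.out.println("Processed: " + count);
    }
}
```

Mutating elements inside map and incrementing a static counter inside forEach is workable for a sequential stream like this one. A parallel stream would need thread-safe counters.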
We tested our application on a Monday night during a Monday-night football game and used the topic #MNF. A topic prefixed with the # symbol is called a hashtag and is used to categorize tweets. By selecting a popular category of tweets, we ensured that we would have plenty of Twitter data to work with. For simplicity, we chose football as the sub-topic. We also chose to analyze only 50 tweets for this example. The following is an abbreviated sample of our prompts, input, and output:
Building Sentiment Model
Welcome to the Tweet Analysis Application
Enter a topic: #MNF
Enter a sub-topic: football
Enter number of tweets: 50
Creating Twitter Stream
51 messages processed!
Text: rt @ bleacherreport : touchdown , broncos ! c . j . anderson punches ! lead , 7 - 6 # mnf # denvshou
Date: Mon Oct 24 20:28:20 CDT 2016
Category: neg
...
Text: i cannot emphasize enough how big td drive . @ broncos offense . needed confidence booster & amp ; just got . # mnf # denvshou
Date: Mon Oct 24 20:28:52 CDT 2016
Category: pos
Text: least touchdown game . # mnf
Date: Mon Oct 24 20:28:52 CDT 2016
Category: neg
Positive Reviews: 13
Negative Reviews: 27
We print out the text of each tweet, along with a timestamp and category. Notice that the text of the tweet does not always make sense. This may be partly due to the abbreviated nature of Twitter data, but it is also because the text has been cleaned and stop words have been removed. We should still see our topic, #MNF, although it will be lowercase due to our text cleaning. At the end, we print out the total number of tweets classified as positive and negative.
The classification of tweets is done by the performSentimentAnalysis method. Note that classification using sentiment analysis is not always precise. The following tweet mentions a touchdown by a Denver Broncos player. This tweet could be construed as positive or negative depending on an individual's personal feelings about that team, but our model classified it as positive:
Text: cj anderson td run @ broncos . broncos now lead 7 - 6 . # mnf
Date: Mon Oct 24 20:28:42 CDT 2016
Category: pos
Additionally, some tweets may have a neutral tone, such as the one shown next, but still be classified as either positive or negative. The following tweet is a retweet from a popular sports news Twitter account, @bleacherreport:
Text: rt @ bleacherreport : touchdown , broncos ! c . j . anderson punches ! lead , 7 - 6 # mnf # denvshou
Date: Mon Oct 24 20:28:37 CDT 2016
Category: neg
This tweet has been classified as negative but could arguably be considered neutral. The contents of the tweet simply provide information about a score in a football game. Whether this is a positive or negative event depends on which team a person is rooting for. When we examine the entire set of tweet data analyzed, we notice that this same @bleacherreport tweet has been retweeted a number of times and classified as negative each time. This could skew our analysis, because we may end up with a large number of improperly classified tweets. Using incorrect data decreases the accuracy of the results.
One option, depending on the purpose of the analysis, may be to exclude tweets by news outlets or other popular Twitter users. Additionally, we could exclude tweets that contain RT, an abbreviation denoting that the tweet is a retweet of another user's tweet.
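One way to sketch the retweet exclusion is a filter that drops any tweet whose text begins with the token rt. The class and method names below are hypothetical, and the check assumes the text looks like the cleaned output shown earlier, where a retweet starts with rt followed by a space.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RetweetFilterSketch {

    // Treats a tweet as a retweet when its text starts with the token "rt".
    // This is a heuristic on the text only; it does not inspect tweet metadata.
    public static boolean isRetweet(String text) {
        String t = text.trim().toLowerCase();
        return t.equals("rt") || t.startsWith("rt ");
    }

    // Keeps only the tweets that are not recognized as retweets.
    public static List<String> withoutRetweets(List<String> tweets) {
        return tweets.stream()
                .filter(t -> !isRetweet(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "rt @ bleacherreport : touchdown , broncos !",
                "cj anderson td run @ broncos .");
        withoutRetweets(tweets).forEach(System.out::println);
    }
}
```

In the application's pipeline, a check of this kind could be added as one more filter step before the sentiment analysis is performed.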
There are additional issues to consider when performing this type of analysis, including the choice of sub-topic. If we were to analyze the popularity of a Star Wars character, we would need to be careful about which names we use. For example, for a character such as Han Solo, a tweet may use an alias. Aliases for Han Solo include Vykk Draygo, Rysto, Jenos Idanian, Solo Jaxal, Master Marksman, and Jobekk Jonn, to mention a few (http://starwars.wikia.com/wiki/Category:Han_Solo_aliases). The name of the actor, Harrison Ford in the case of Han Solo, may be used instead of the character's name. We may also want to consider the actor's nickname, such as Harry for Harrison.
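A simple way to account for aliases is to match a tweet against a list of names rather than a single sub-topic. The sketch below assumes plain substring matching on lowercased text. The class name, the mentionsAny method, and the particular selection of aliases are illustrative choices, not part of the application.

```java
import java.util.Arrays;
import java.util.List;

public class AliasMatchSketch {

    // Illustrative subset of names, stored in lowercase to match cleaned tweet text.
    public static final List<String> HAN_SOLO_ALIASES = Arrays.asList(
            "han solo", "vykk draygo", "rysto", "jenos idanian", "harrison ford");

    // Returns true if the text contains at least one of the aliases.
    public static boolean mentionsAny(String text, List<String> aliases) {
        String lower = text.toLowerCase();
        return aliases.stream().anyMatch(lower::contains);
    }

    public static void main(String[] args) {
        System.out.println(mentionsAny("Harrison Ford was great", HAN_SOLO_ALIASES));
        System.out.println(mentionsAny("luke skywalker was great", HAN_SOLO_ALIASES));
    }
}
```

Substring matching is deliberately crude. Short or common names, such as a nickname like Harry, would match many unrelated tweets, so a real analysis would likely need stricter rules.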