Fetching the Twitter data

Naturally, we need tweets and labels describing their sentiment. In this chapter, we will use the corpus from Niek Sanders, who did an awesome job of manually labeling more than 5,000 tweets as positive, negative, or neutral, and who has granted us permission to use his corpus here.

To comply with Twitter's terms of service, we will not provide any data from Twitter nor show any real tweets in this chapter. Instead, we can use Sanders' hand-labeled data, which contains the tweet IDs and their sentiment labels, and use Twitter's API to fetch the corresponding tweets one by one. To not bore you too much, just execute the first part of the corresponding Jupyter notebook, which will kick off the downloading process. Since the downloader throttles its requests to play nicely with Twitter's servers, it will take quite some time to fetch the data for more than 5,000 tweets, which means it is a good idea to start it right away.
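If you are curious about what the download step does under the hood, the following is a minimal sketch using the tweepy library (version 3.x). The credentials are placeholders you would fill in from your own Twitter developer account, and fetch_tweets() is a hypothetical helper for illustration, not the notebook's actual code:

import tweepy

# Placeholder credentials -- create these in your Twitter developer account
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# wait_on_rate_limit tells tweepy to sleep whenever Twitter's rate
# limit is reached, which is what makes the full download take a while
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_tweets(tweet_ids):
    """Fetch the text for each tweet ID, skipping deleted or private tweets."""
    tweets = {}
    for tweet_id in tweet_ids:
        try:
            tweets[tweet_id] = api.get_status(tweet_id).text
        except tweepy.TweepError:
            pass  # tweet was deleted or the account is now private
    return tweets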

The data comes with four sentiment labels, which load_sanders_data() returns together with the tweets:

>>> X_orig, Y_orig = load_sanders_data()
>>> classes = np.unique(Y_orig)
>>> for c in classes: print("#%s: %i" % (c, sum(Y_orig == c)))
#irrelevant: 437
#negative: 448
#neutral: 1801
#positive: 391

Inside load_sanders_data(), we drop all non-English tweets, which leaves 3,077 tweets. Later on, we will treat the irrelevant and neutral labels together as neutral.
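The actual implementation of load_sanders_data() ships with the book's code, but a simplified sketch of the English-only filtering might look like the following. The file name and the column layout are assumptions made purely for illustration:

import csv
import numpy as np

def load_sanders_data(filename="sanders_tweets.csv"):
    # Hypothetical file layout: one row per downloaded tweet with
    # the columns (tweet_id, language, sentiment, text)
    texts, labels = [], []
    with open(filename, newline="", encoding="utf-8") as f:
        for tweet_id, language, sentiment, text in csv.reader(f):
            if language != "en":  # drop all non-English tweets
                continue
            texts.append(text)
            labels.append(sentiment)
    return np.asarray(texts), np.asarray(labels)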

In case you get different counts here, it is because tweets might have been deleted or set to private in the meantime. If so, the numbers and graphs in the upcoming sections may also differ slightly from yours.
