Data for text classification

Before diving into the machine learning (ML) problems in text classification, we will take a look at some of the open datasets available on the internet. Many classification tasks require large amounts of labeled text data. These datasets can be broadly grouped by label type: binary (each document belongs to one of two classes), multiclass (each document belongs to one of several classes), and multi-label (each document can carry several labels at once). The following are some of the popular datasets used for benchmarking in research and in competitions such as those hosted on Kaggle:

| No. | Dataset name | Class type | Source |
|---|---|---|---|
| 1 | IMDb movie dataset | Binary classes | http://ai.stanford.edu/~amaas/data/sentiment/ |
| 2 | Twitter Sentiment Analysis Dataset | Binary classes | http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/ |
| 3 | YouTube Spam Collection Dataset | Binary classes | https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection |
| 4 | News Aggregator Dataset | Multiclass | https://archive.ics.uci.edu/ml/datasets/News+Aggregator |
| 5 | Yelp reviews | Multi-label | https://www.yelp.com/dataset |
| 6 | Amazon reviews dataset | Multiclass | http://jmcauley.ucsd.edu/data/amazon/ |
| 7 | Reuters Corpora | Multi-label/Multiclass | http://trec.nist.gov/data/reuters/reuters.html |

Some of the preceding datasets will also be used as examples in this chapter. The list is not exhaustive; it is meant to give readers a starting point for experimenting with text classification and topic categorization tasks.

A comprehensive list of open datasets can also be found in the UCI repository at https://archive.ics.uci.edu/ml/datasets.html and on Kaggle at https://www.kaggle.com/datasets.
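
The three class types listed for the datasets differ mainly in the shape of the target labels. The following is a minimal sketch of that difference, assuming made-up toy labels that are not drawn from any of the datasets above; the helper names encode_single_label and encode_multi_label are hypothetical and exist only for this illustration.

```python
# Toy labels, invented for illustration only (not from any dataset above).
binary_labels = ["pos", "neg", "pos"]                # exactly one of two classes
multiclass_labels = ["business", "tech", "health"]   # exactly one of several classes
multilabel_labels = [{"grain", "wheat"}, {"trade"}, set()]  # zero or more tags


def encode_single_label(labels):
    """Map each distinct class to an integer index (binary or multiclass)."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[label] for label in labels]


def encode_multi_label(label_sets):
    """Build one multi-hot indicator row per document (multi-label)."""
    classes = sorted(set().union(*label_sets))
    rows = [[int(name in tags) for name in classes] for tags in label_sets]
    return classes, rows


if __name__ == "__main__":
    print(encode_single_label(binary_labels))
    print(encode_single_label(multiclass_labels))
    print(encode_multi_label(multilabel_labels))
```

In the binary and multiclass cases each document maps to a single integer, whereas in the multi-label case each document maps to a row of 0/1 indicators, one per possible tag. Libraries such as scikit-learn provide equivalent encoders; the plain-Python version here is only meant to make the label shapes concrete.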