Data for text classification

Before diving into the machine learning (ML) problems in text classification, we will take a look at the open datasets that are available on the internet. Many classification tasks require large amounts of labeled text data. These datasets can be broadly grouped by label type: binary (two classes), multiclass (more than two mutually exclusive classes), and multi-label (each document can carry several labels at once). The following are some of the popular datasets used for benchmarking in research and in competitions such as those hosted on Kaggle:

	Dataset name	Class type	Source
1	`IMDb movie Dataset`	Binary classes	http://ai.stanford.edu/~amaas/data/sentiment/
2	`Twitter Sentiment Analysis Dataset`	Binary classes	http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
3	`YouTube Spam Collection Dataset`	Binary classes	https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection
4	`News Aggregator Dataset`	Multiclass	https://archive.ics.uci.edu/ml/datasets/News+Aggregator
5	`Yelp reviews`	Multi-label	https://www.yelp.com/dataset
6	`Amazon reviews dataset`	Multiclass	http://jmcauley.ucsd.edu/data/amazon/
7	`Reuters Corpora`	Multi-label/Multiclass	http://trec.nist.gov/data/reuters/reuters.html

Some of the preceding datasets will also be used as examples in this chapter. The list is not exhaustive; it is provided so that readers can start experimenting with text classification and topic categorization tasks.
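
To make the three class types in the table concrete, the following minimal sketch shows one common way to represent the targets for each of them. The class names and labels are hypothetical toy values chosen for illustration; they are not taken from any of the datasets listed above, and the helper functions `one_hot` and `multi_hot` are likewise illustrative names rather than part of any library.

```python
# Illustrative only: toy labels, not drawn from the datasets in the table.

def one_hot(label, classes):
    """Encode a single multiclass label as a one-hot vector."""
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels, classes):
    """Encode a set of labels (multi-label case) as a multi-hot vector."""
    return [1 if c in labels else 0 for c in classes]


# Binary: every document belongs to exactly one of two classes,
# so a single 0/1 value per document is enough.
binary_targets = [1, 0, 1]  # assumed convention: 1 = positive, 0 = negative

# Multiclass: every document belongs to exactly one of several classes.
news_classes = ["business", "entertainment", "health", "technology"]
multiclass_target = one_hot("health", news_classes)

# Multi-label: a document may carry any number of labels at once.
topic_classes = ["food", "service", "price"]
multilabel_target = multi_hot({"food", "price"}, topic_classes)

print(multiclass_target)
print(multilabel_target)
```

Note that a one-hot vector always contains exactly one `1`, whereas a multi-hot vector may contain none, one, or several, which is the practical difference between the multiclass and multi-label settings.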

A comprehensive list of open datasets can also be found in the UCI repository at https://archive.ics.uci.edu/ml/datasets.html and on Kaggle at https://www.kaggle.com/datasets.
