Benchmark datasets

The following benchmark datasets are used in much of the research on text categorization:

IMDB Movie Review dataset: This is a dataset for binary sentiment classification. It contains 25,000 movie reviews for training and 25,000 for testing. Additional unlabeled data is also provided. The dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/.
Reuters dataset: This dataset has 90 classes, 9,584 training documents and 3,744 testing documents. It is available through the nltk.corpus package. The class distribution in this dataset is very skewed, with the two most frequent classes containing approximately 70% of all the documents. Even if we consider only the 10 most frequent classes, the two most frequent classes still account for approximately 80% of the documents in that subset. For this reason, most classification results are reported on subsets of the most frequent classes. These subsets are named R8, R10, and R52, for the 8, 10, and 52 most frequent classes in the training set.
20 Newsgroup dataset: This data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (for example, comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (for example, misc.forsale and soc.religion.christian). This dataset is available in sklearn.datasets. Here is a list of the 20 newsgroups, partitioned into six major categories according to subject matter:

comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x

rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey

sci.crypt
sci.electronics
sci.med
sci.space

misc.forsale

talk.politics.misc
talk.politics.guns
talk.politics.mideast

talk.religion.misc
alt.atheism
soc.religion.christian
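
The six-way partition mentioned above can be recovered from the newsgroup names themselves. The following is a minimal sketch, not part of the dataset or of sklearn: the helper subject_group and the group labels ("politics", "religion", "forsale", and so on) are hypothetical names chosen for illustration, and the mapping rules are an assumption based on the list above.

```python
from collections import defaultdict

# The 20 newsgroup names, exactly as listed in the text
NEWSGROUPS = [
    "comp.graphics", "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware", "comp.windows.x",
    "rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey",
    "sci.crypt", "sci.electronics", "sci.med", "sci.space",
    "misc.forsale",
    "talk.politics.misc", "talk.politics.guns", "talk.politics.mideast",
    "talk.religion.misc", "alt.atheism", "soc.religion.christian",
]


def subject_group(name):
    """Map a newsgroup name to one of six subject groups (labels are illustrative)."""
    if name.startswith("talk.politics."):
        return "politics"
    if "religion" in name or name == "alt.atheism":
        return "religion"
    if name == "misc.forsale":
        return "forsale"
    # comp.*, rec.*, and sci.* are identified by their top-level prefix
    return name.split(".", 1)[0]


# Collect the newsgroups under their subject group, preserving list order
groups = defaultdict(list)
for name in NEWSGROUPS:
    groups[subject_group(name)].append(name)

for label, members in groups.items():
    print(f"{label} ({len(members)}): {', '.join(members)}")
```

Grouping by subject in this way makes it easy to see which classes a classifier is most likely to confuse, since newsgroups within one group (such as the two hardware groups) share much of their vocabulary.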

We will discuss how to load this dataset for further analysis later.
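
Returning to the Reuters dataset, the class skew described earlier can be expressed as the share of documents that fall in the k most frequent classes. The following is a minimal sketch under stated assumptions: the helper top_k_share is a hypothetical name, and the labels are a small made-up sample, not real Reuters counts.

```python
from collections import Counter


def top_k_share(labels, k):
    """Return the fraction of label occurrences in the k most frequent classes."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Toy labels, made up for illustration (NOT real Reuters counts)
toy_labels = ["earn"] * 5 + ["acq"] * 3 + ["crude"] + ["trade"]

share = top_k_share(toy_labels, 2)
print(f"Share of the two most frequent classes: {share:.0%}")
```

With NLTK installed and the Reuters corpus downloaded, the real labels could be collected by calling nltk.corpus.reuters.categories(fileid) for each file ID returned by nltk.corpus.reuters.fileids(). Because a Reuters document can carry more than one category, the function above would then measure the share of label occurrences rather than of distinct documents.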
