The following benchmark datasets are used in most text categorization research:
- IMDB Movie Review dataset: A dataset for binary sentiment classification. It contains 25,000 movie reviews for training and 25,000 for testing, plus additional unlabeled data. It can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/.
- Reuters dataset: This dataset has 90 classes, 9,584 training documents, and 3,744 testing documents. It is available through the nltk.corpus package. The class distribution is highly skewed: the two most frequent classes contain approximately 70% of all the documents. Even when only the 10 most frequent classes are considered, the two most frequent classes still account for approximately 80% of the documents. For this reason, most classification results are reported on subsets of the most frequent classes, known as R8, R10, and R52, which cover 8, 10, and 52 of the most frequent classes in the training set.
- 20 Newsgroups dataset: This dataset is organized into 20 newsgroups, each corresponding to a different topic. Some newsgroups are closely related (for example, comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are unrelated (for example, misc.forsale and soc.religion.christian). The dataset is available in sklearn.datasets. The 20 newsgroups are listed here, partitioned into six major categories according to subject matter:
  - comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
  - rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
  - sci.crypt, sci.electronics, sci.med, sci.space
  - misc.forsale
  - talk.politics.misc, talk.politics.guns, talk.politics.mideast
  - talk.religion.misc, alt.atheism, soc.religion.christian
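
The six-way grouping of the newsgroups can also be written down as a small lookup table, which is handy when a coarser, six-class version of the task is wanted. The following is a minimal sketch: the group labels ("computers", "recreation", and so on) and the names NEWSGROUP_GROUPS, GROUP_OF, and coarse_label are illustrative assumptions, not part of the dataset or of any library.

```python
# Six subject-matter groups of the 20 Newsgroups dataset.
# The group labels (the dictionary keys) are illustrative names chosen
# for this sketch; the dataset itself only defines the newsgroup names.
NEWSGROUP_GROUPS = {
    "computers": [
        "comp.graphics",
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "comp.sys.mac.hardware",
        "comp.windows.x",
    ],
    "recreation": [
        "rec.autos",
        "rec.motorcycles",
        "rec.sport.baseball",
        "rec.sport.hockey",
    ],
    "science": ["sci.crypt", "sci.electronics", "sci.med", "sci.space"],
    "for sale": ["misc.forsale"],
    "politics": [
        "talk.politics.misc",
        "talk.politics.guns",
        "talk.politics.mideast",
    ],
    "religion": ["talk.religion.misc", "alt.atheism", "soc.religion.christian"],
}

# Invert the mapping: newsgroup name -> coarse group label.
GROUP_OF = {
    newsgroup: group
    for group, newsgroups in NEWSGROUP_GROUPS.items()
    for newsgroup in newsgroups
}


def coarse_label(newsgroup: str) -> str:
    """Return the coarse group of a newsgroup name (KeyError if unknown)."""
    return GROUP_OF[newsgroup]


if __name__ == "__main__":
    print(len(GROUP_OF))                # number of newsgroups covered
    print(coarse_label("alt.atheism"))  # group of one example newsgroup
```

Any list taken from this table, such as NEWSGROUP_GROUPS["science"], could be used to restrict an experiment to a set of closely related newsgroups.
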
We will discuss later how to load the 20 Newsgroups dataset for further analysis.
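
For a skewed dataset such as Reuters, it can help to measure how much of the data the most frequent classes cover before interpreting any accuracy figure. The following is a minimal sketch of that check, assuming one label per document; the helper name top_k_share and the toy label counts are hypothetical and are not taken from the real dataset.

```python
from collections import Counter
from typing import Hashable, Iterable


def top_k_share(labels: Iterable[Hashable], k: int) -> float:
    """Fraction of documents whose label is among the k most frequent labels.

    Assumes each document contributes exactly one label.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    # most_common(k) returns the k (label, count) pairs with the highest counts.
    top = sum(count for _, count in counts.most_common(k))
    return top / total


if __name__ == "__main__":
    # Toy labels for illustration only; these counts are made up.
    toy_labels = ["earn"] * 5 + ["acq"] * 3 + ["crude"] + ["trade"]
    print(f"{top_k_share(toy_labels, 2):.2f}")  # prints 0.80
```

With NLTK, a label list could be gathered by calling reuters.categories(fileid) for each fileid in reuters.fileids(). Because a Reuters document can carry several labels, the result would then be a share of label assignments rather than of documents.
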