7.6. Computational Classification

In §6.5.2, “Implementing Categories That Do Not Conform to the Classical Theory,” we briefly discussed the use of the machine learning technique known as clustering to create a system of categories for classifying a set of resources or documents for which measures of inter-item similarity can be calculated. Clustering programs do not start with a set of resources that are already classified, making them unsupervised techniques. The categories they create maximize the similarity of resources within a category and the differences between categories, but these statistically derived categories are not always meaningful ones that can be named and used by people. We ended Chapter 6 by suggesting that it is often better to start with a designed classification scheme and then train computers with supervised learning techniques to assign new resources to the categories.

Because of its importance, ubiquity, and ease of processing by computers, it should not be surprising that a great many computational classification problems involve text. Some of these problems are relatively simple, like identifying the language in which a text is written, which can be solved by comparing the frequencies of contiguous strings of one, two, and three characters in the text against their known frequencies in different languages. For example, in English the most likely strings are “the”, “and”, “to”, “of”, “a”, “in”, and so on. But if the most likely strings are “der”, “die”, “und”, and “den”, the text is probably German, and if they are “de”, “la”, “que”, “el”, and “en”, the text is probably Spanish.
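
To make the idea concrete, here is a minimal sketch in Python of trigram-based language identification. The short trigram lists are illustrative stand-ins for the much larger frequency tables a real language identifier would learn from text corpora.

```python
from collections import Counter

# Illustrative top character trigrams per language; a real identifier
# would use much larger frequency tables learned from text corpora.
LANGUAGE_PROFILES = {
    "English": ["the", "and", " of", " to", "ing"],
    "German":  ["der", "die", "und", "den", "ein"],
    "Spanish": [" de", "que", " la", " el", " en"],
}

def trigrams(text):
    """Count all contiguous three-character strings in the text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def guess_language(text):
    """Score each language by how often its typical trigrams occur."""
    counts = trigrams(text)
    scores = {lang: sum(counts[t] for t in profile)
              for lang, profile in LANGUAGE_PROFILES.items()}
    return max(scores, key=scores.get)

print(guess_language("Der Hund und die Katze spielen in dem Garten."))  # German
```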

More challenging text classification problems arise when more features are required to describe each instance being classified and where the features are less predictable. The unknown author of a document can sometimes be identified by analyzing other documents known to be written by a suspected author, extracting a set of features such as word frequencies, phrase structure, and sentence length that together form a “writeprint” analogous to a fingerprint that uniquely identifies the writer. This kind of analysis was used in 2013 to determine that Harry Potter author J. K. Rowling had written a crime fiction novel entitled The Cuckoo's Calling under the pseudonym Robert Galbraith.434a[Com]
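
The following sketch shows the kind of feature extraction involved, assuming a handful of simple surface features; real authorship attribution systems compute hundreds of stylistic features and compare them statistically against documents by candidate authors.

```python
import re
from collections import Counter

FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "it")

def writeprint_features(text):
    """Compute a few simple stylometric features; real systems use hundreds."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    features = {
        "avg_sentence_length": total / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / total,
    }
    # Function-word frequencies are popular features because they are
    # largely topic-independent and hard for a writer to disguise.
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = counts[fw] / total
    return features

print(writeprint_features("It was the best of times. It was the worst of times."))
```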

Another challenging text classification problem is sentiment analysis, determining whether a text expresses a positive or negative opinion about some topic. Much academic and commercial research has been conducted to understand the sentiment of Twitter tweets, Facebook posts, email sent to customer support applications, and other similar texts. Sentiment analysis is hard because messages are often short, so there is not much to analyze, and because sarcasm, slang, clichés, and cultural norms obscure the content needed to make the classification.
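
A minimal, purely lexicon-based sketch of sentiment scoring is shown below; the word lists and scoring rule are illustrative only, and a practical system would learn weights from labeled data and handle negation, sarcasm, and slang.

```python
# Illustrative word lists; a practical system learns weights from labeled
# data and must cope with negation, sarcasm, slang, and emoji.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "angry"}

def sentiment(text):
    """Classify a text as positive, negative, or neutral by counting
    occurrences of words from each list."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this phone, the camera is great"))  # positive
print(sentiment("terrible battery, I hate it"))             # negative
```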

A crucial consideration whenever supervised learning is used to train a classifier is ensuring that the training set is appropriate. If we were training a classifier to detect spam messages using email from the year 2000, the topics of the emails, the words they contain, and perhaps even the language they are written in would be substantially different from messages sent this year. Up-to-date training data is especially important for the classification algorithms used by Twitter, Facebook, YouTube, and similar social sites that classify and recommend content based on popularity trends.

When the relevant training data is constantly changing and there is a great deal of it, there is a risk that by the time a model can learn to classify correctly it is already out of date. This challenge has led to the development of streaming algorithms that operate on data as it comes in, using it as a live data source rather than as a static training set. Streaming algorithms are essential for tackling datasets that are too large to store or for models that must operate under intense time pressure. Streaming approaches complement rather than replace those that work with historical datasets because they make different tradeoffs between accuracy and speed. The streaming system might provide real-time alerting and recommendations, while historical analyses are made on the batch-oriented system that works with the entire data collection.434b[Com]
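
One way to sketch a streaming classifier, assuming scikit-learn is available, is to combine a hashing vectorizer (which needs no fitted vocabulary) with a linear model that supports incremental updates via partial_fit; the mini_batches generator below is a hypothetical stand-in for a live data source such as a message queue.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# A hashing vectorizer needs no fitted vocabulary, so it suits an
# unbounded stream of never-before-seen text.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = SGDClassifier()  # a linear classifier trained by online gradient steps

def mini_batches():
    """Hypothetical stand-in for a live source such as a message queue."""
    yield ["cheap pills buy now", "meeting moved to 3pm"], ["spam", "ham"]
    yield ["you won a lottery prize", "draft report attached"], ["spam", "ham"]

for texts, labels in mini_batches():
    X = vectorizer.transform(texts)
    # partial_fit updates the model incrementally, one mini-batch at a
    # time; the full set of classes must be declared on the first call.
    model.partial_fit(X, labels, classes=["ham", "spam"])

print(model.predict(vectorizer.transform(["win a cheap prize now"])))
```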

[434b][Com] (Ellis 2014). A compelling demonstration of the need to sample big data streams carefully to guard against bias is (Morstatter et al. 2013).

The best way to explain how computational classification works is with a detailed example. A familiar application of supervised machine learning to classifying resources is email spam filtering. Messages are classified as SPAM or HAM (i.e., non-SPAM); the former are sent to a SPAM folder, while the latter head to your inbox. We can describe this technique generically as follows (a short code sketch after the steps shows how they fit together):

  1. First, the algorithm is provided with emails that have been assigned to the SPAM and HAM categories. These labeled instances make up the training set. The training set can be created explicitly for training the classification algorithm, or extracted from regular email activity when people sort their email or select a “report spam” button in their email program.

  2. The algorithm is given a set of pre-defined “features” of the email messages, some from the message metadata like the sender’s email address or the number of recipients, and some from the message content, like the presence of words that appear in SPAM messages with higher frequency than in HAM messages (like “pharmaceutical” or “beneficiary”). Different algorithms use different features, so feature selection is a prerequisite step.435[Com]

    [435][Com] A straightforward method is to run the machine learning algorithm using different sets of features and select the set that yields the best result. Because this method “wraps” around the machine learning algorithm itself, it is called a “wrapper method.” However, it can be very computationally expensive to run machine learning algorithms multiple times, especially when the number of features is large. A faster alternative is the so-called “filter method,” in which one computes how “informative” or “predictive” a feature is for a category. For instance, an email containing the word “lottery” is more likely to be SPAM than one containing “metadata.” The numeric measures frequently used for this purpose are correlation and mutual information. Filter methods are less computationally expensive than wrapper methods, but in general they do not perform as well.

  3. The algorithm examines each message and records the presence or absence of each of the pre-defined features.

  4. The algorithm modifies itself (i.e., learns) based on the training set, adjusting the weight given to each feature until it can correctly assign (most of) the messages to the categories they belonged to in the training set.

  5. The algorithm is now ready to classify new uncategorized messages to the SPAM or HAM categories.

  6. The algorithm can continue to improve its classification accuracy if the user gives it feedback by reclassifying SPAM messages as HAM ones or vice versa. The most efficient learning takes place when the algorithm uses “active learning” techniques to choose its own training data by soliciting user feedback only when it is uncertain about how to classify a message. For example, the algorithm might be confident that a message with “Cheap drugs” in the subject line is SPAM, but if the message comes from a longtime correspondent, the algorithm might ask the user to confirm that this classification was correct.436[Com]

    [436][Com] See (Blanzieri and Bryl 2009) for a review of the spam problem and the policy and technology methods for fighting it. (Upsana and Chakravarty 2010) is somewhat more recent and more narrowly focused on text classification techniques.

    A very thorough yet highly readable introduction to Active Learning is (Settles 2012).
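
The six numbered steps can be sketched in a few lines of Python, assuming scikit-learn is available; the tiny training set, the word-count features, and the Naïve Bayes learner are illustrative choices, not the only possible ones.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: a tiny labeled training set (real filters use many thousands).
messages = [
    "cheap pharmaceutical offer, act now",
    "you are the beneficiary of a large unclaimed fund",
    "agenda for tomorrow's project meeting",
    "can you review the attached draft?",
]
labels = ["spam", "spam", "ham", "ham"]

# Steps 2-3: represent each message as a vector of word-count features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Step 4: learn the weight of each feature from the training set.
classifier = MultinomialNB()
classifier.fit(X, labels)

# Step 5: classify a new, unlabeled message.
new_message = vectorizer.transform(["claim your pharmaceutical prize"])
print(classifier.predict(new_message))  # likely ['spam']

# Step 6: when a user reclassifies a message, fold it back into the
# training data and refit (or update incrementally with partial_fit).
messages.append("pharmaceutical conference agenda attached")
labels.append("ham")
classifier.fit(vectorizer.fit_transform(messages), labels)
```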

How the algorithm “learns” depends on the specific machine learning algorithm. Options include the “Naïve Bayes” model, decision trees, and support vector machines.437[Com] The choice of algorithm is called “model selection.”438[Com]

[437][Com] The “Naïve Bayes” model is based on the assumption that the features are conditionally independent of each other. Based on the training set, the probability of each feature conditional on the categories is estimated; then, using Bayes’ Theorem, the probability of the categories conditional on the observed features is inferred. Although the assumption of conditional independence is rarely met in reality, Naïve Bayes works surprisingly well in practice.
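
A bare-bones sketch of the estimation and inference just described, using word-presence features and add-one smoothing (an illustrative choice, not part of the model's definition):

```python
import math
from collections import Counter, defaultdict

def train(documents, labels):
    """Estimate P(category) and per-category word counts from labeled text."""
    priors = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return priors, word_counts, vocabulary

def classify(text, priors, word_counts, vocabulary):
    """Pick the category maximizing P(category) * prod P(word | category),
    computed in log space, with add-one smoothing for unseen words."""
    total_docs = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        denom = sum(word_counts[label].values()) + len(vocabulary)
        score = math.log(prior / total_docs)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

priors, counts, vocab = train(
    ["win a lottery prize now", "meeting notes attached"], ["spam", "ham"])
print(classify("lottery prize inside", priors, counts, vocab))  # spam
```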

Decision trees are another popular class of algorithms for classification. In the ID3 algorithm, during the learning process the training set is repeatedly split into subsets based on the feature that yields the highest information gain. Information gain is a concept borrowed from information theory. Intuitively, the higher the information gain, the more a split reduces uncertainty about which category an instance belongs to, producing purer subsets.
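
A short sketch of how information gain can be computed for a single candidate split, using Shannon entropy; the four-email data set is illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a distribution over categories."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """Reduction in entropy obtained by splitting the examples on one feature."""
    total = len(labels)
    remainder = 0.0
    for value, count in Counter(feature_values).items():
        subset = [l for l, v in zip(labels, feature_values) if v == value]
        remainder += (count / total) * entropy(subset)
    return entropy(labels) - remainder

# Four emails: does the presence of the word "lottery" predict the category?
labels = ["spam", "spam", "ham", "ham"]
contains_lottery = [True, True, False, False]
print(information_gain(labels, contains_lottery))  # 1.0 bit: a perfect split
```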

Support vector machine (SVM) approaches work by constructing a hyperplane that maximally separates the positive and negative training samples. The SVM is commonly regarded as one of the most effective off-the-shelf classification algorithms.
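
A minimal sketch, assuming scikit-learn is available, of fitting a linear SVM to toy two-feature examples; LinearSVC finds an approximately maximum-margin separating hyperplane.

```python
from sklearn.svm import LinearSVC

# Toy two-feature examples: [count of suspicious words, known-sender flag].
X = [[5, 0], [4, 0], [3, 0], [0, 1], [1, 1], [0, 0]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

# LinearSVC fits an (approximately) maximum-margin separating hyperplane.
model = LinearSVC()
model.fit(X, y)
print(model.predict([[4, 1], [0, 1]]))
```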

[438][Com] Cross validation is the method commonly used for model selection. In its simplest (holdout) form it works as follows: the entire labeled data set is divided into a training subset and a holdout validation subset; a commonly used split is 70% for training and 30% for validation. To compare different machine learning algorithms, one trains each algorithm on the training subset and evaluates the result on the holdout validation subset. The algorithm that yields the best performance on the validation subset is selected. Full k-fold cross validation repeats this split several times and averages the results.
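
A sketch of holdout-based model selection, assuming scikit-learn is available; the example messages are hypothetical, the 70/30 split mirrors the procedure described above, and the commented lines show the k-fold variant.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical labeled messages; a real comparison needs far more data.
texts = ["cheap pills now", "win a big prize", "free lottery entry",
         "meeting at noon", "report attached", "lunch tomorrow?"]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
X = CountVectorizer().fit_transform(texts)

# Holdout validation: 70% of the labeled data for training, 30% held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0)

for model in (MultinomialNB(), LinearSVC()):
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print(type(model).__name__, score)

# k-fold cross validation repeats the split and averages the scores, e.g.:
# from sklearn.model_selection import cross_val_score
# cross_val_score(MultinomialNB(), X, labels, cv=3)
```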
