Chapter 6. Text Classification

In the last chapter, we covered some of the most common NLP tools and preprocessing steps. In this chapter, we will use most of what we learned in the previous chapters to build one of the most sophisticated NLP applications. We will give you a generic approach to text classification and show how you can build a text classifier from scratch with very few lines of code. We will also give you a cheat sheet of classification algorithms in the context of text classification.

While we will talk about some of the most common text classification algorithms, this is just a brief introduction. For a detailed understanding and the mathematical background, there are many online resources and books that you can refer to. We will try to give you all you need to get started, with some working code snippets. Text classification is a great use case of NLP, but in this chapter, instead of using NLTK, we will use scikit-learn, which has a wider range of classification algorithms and is much more memory efficient for text mining.
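To give a feel for what "very few lines of code" can look like, here is a minimal sketch of a text classifier in scikit-learn. The training sentences and the spam/not-spam labels are made up purely for illustration, and the choice of a TF-IDF vectorizer with a Naive Bayes classifier is just one reasonable starting point, not necessarily the exact approach developed later in the chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up training data (an assumption for this sketch):
# 1 = spam, 0 = not spam.
train_texts = [
    "win a free prize now",
    "free cash offer, claim your prize",
    "meeting rescheduled to monday",
    "please review the attached report",
]
train_labels = [1, 1, 0, 0]

# Chain the two steps: turn raw text into TF-IDF vectors,
# then train a multinomial Naive Bayes classifier on those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Classify two unseen sentences.
test_texts = ["claim your free prize", "report for the monday meeting"]
predictions = model.predict(test_texts)
print(predictions)
```

With real data you would hold out a separate test set and measure accuracy rather than eyeballing a couple of predictions; four sentences are far too few to train anything useful.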

By the end of this chapter:

  • You will learn and understand the most common text classification algorithms
  • You will learn the end-to-end pipeline to build a text classifier and how to implement it with scikit-learn and NLTK

The following is the scikit-learn cheat sheet for machine learning:

[Figure: scikit-learn machine learning cheat sheet. Credit: scikit-learn]

As you travel along the process shown in the cheat sheet, you get a clear guideline about what kind of algorithm is required for which problem, and when to move from one classifier to another depending on the size of the tagged sample. It's a good place to start when building a practical application, and in most cases this will work. We will focus mostly on text data, although scikit-learn can work with other types of data as well. In this chapter, we will explore text classification, text clustering, and topic detection in text (dimensionality reduction) with examples, and build some cool NLP applications. I will not go into much detail about the general concepts of machine learning, classification, and clustering, as there are enough resources available on the Web for you. Instead, we will cover these concepts in the context of a text corpus. Still, let me give you a refresher.

Machine learning

There are two main types of machine learning techniques, supervised learning and unsupervised learning, along with the related semi-supervised and reinforcement learning:

  • Supervised learning: Based on some historic prelabeled samples, the machine learns how to predict the label or value of a future test sample. It covers the following categories:
    • Classification: This is used when we need to predict which of a set of classes a test sample belongs to. If there are only two classes, it's a binary classification problem; otherwise, it's a multiclass classification problem.
    • Regression: This is used when we need to predict a continuous variable, such as a house price or a stock index.
  • Unsupervised learning: When we don't have any labeled data and we still need to find structure in the data, this kind of learning is called unsupervised learning. When we need to group items based on the similarity between them, it is a clustering problem. If we need to represent high-dimensional data in lower dimensions, it is a dimensionality reduction problem.
  • Semi-supervised learning: This is a class of learning tasks and techniques that also make use of unlabeled data for training. As the name suggests, it's a middle ground between supervised and unsupervised learning, where we use a small amount of labeled data and a large amount of unlabeled data to build a predictive machine learning model.
  • Reinforcement learning: This is a form of machine learning where an agent learns through reward and punishment, without being told how the task is to be achieved.
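The supervised and unsupervised categories above can be made concrete with a short scikit-learn sketch. Everything here is an illustrative assumption: the four sentences, the topic labels, and the numeric "scores" are invented, and the specific estimators (LogisticRegression, LinearRegression, KMeans, TruncatedSVD) are just common choices for each kind of task. The point is only how the same text features feed different kinds of learning problems.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up mini corpus (an assumption for this sketch).
texts = [
    "the striker scored a late goal",
    "the keeper saved the penalty",
    "the senate passed the budget bill",
    "the minister proposed a new tax bill",
]

# Turn each sentence into a bag-of-words count vector.
X = CountVectorizer().fit_transform(texts)

# Supervised, classification: labels are discrete classes.
labels = ["sports", "sports", "politics", "politics"]
classifier = LogisticRegression().fit(X, labels)
predicted_labels = classifier.predict(X)

# Supervised, regression: the target is a continuous value
# (hypothetical scores, invented for illustration).
scores = np.array([0.9, 0.7, 0.2, 0.4])
regressor = LinearRegression().fit(X, scores)
predicted_scores = regressor.predict(X)

# Unsupervised, clustering: no labels, group similar sentences.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

# Unsupervised, dimensionality reduction: project the
# high-dimensional word counts down to two dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)

print(predicted_labels, cluster_ids, X_2d.shape)
```

Note that the classifier and regressor both receive a target (`labels` or `scores`) in `fit`, while the clustering and dimensionality reduction steps see only `X`. That difference in what is passed to `fit` is the practical signature of supervised versus unsupervised learning.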

If you understood the different types of machine learning, I want you to guess what kind of machine learning problem each of the following is:

  • Predicting weather values for the next month
  • Detecting fraud in millions of transactions
  • Google's priority inbox
  • Amazon's recommendations
  • Google News
  • Self-driving cars