Introducing NLP

Huge volumes of unstructured textual data are generated every day. Social media websites such as Twitter and Facebook, and communication apps such as WhatsApp, produce an enormous amount of this unstructured data daily, not to mention the volume created by blogs, news articles, product reviews, service reviews, advertisements, emails, and SMS messages. In short, there is a huge amount of data, measured in terabytes.

However, a computer cannot derive insights from this data, or carry out specific actions based on those insights, directly from its raw form, for the following reasons:

  • The data is unstructured
  • The data cannot be understood directly without preprocessing
  • The data cannot be fed into ML algorithms in its unprocessed form

To make this data more meaningful and to derive information from it, we use NLP. NLP is the field of study that focuses on the interactions between computers and human language. It is a branch of data science, closely related to computational linguistics, that deals with how computers analyze, understand, and derive information from natural language data, which is usually unstructured, such as text or speech.

Through NLP, computers can analyze and derive meaning from human language and perform many useful tasks. Using NLP, complex tasks such as automatic summarization of large documents, machine translation, relationship extraction across disparate unstructured data, sentiment analysis, and speech recognition can be accomplished.

For computers to understand and analyze human language, we need to analyze each sentence in a structured manner and capture its core meaning. In any sentence, we need to understand three core kinds of information:

  • Semantic information: This relates to the meaning of the sentence, that is, the specific meaning of the words in it. For example, in The kite flies, we don't know whether the kite is man-made or a bird.
  • Syntactic information: This relates to the structure of the sentence, that is, the specific syntactic role of the words in it. For example, in Sreeja saw Geetha with candy, we are not sure who has the candy: Sreeja or Geetha.
  • Pragmatic information (context): This relates to the context (linguistic or non-linguistic) in which the words of the sentence are used. For example, He is out means something different in the context of baseball than in the context of healthcare.
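The syntactic ambiguity in the Sreeja saw Geetha with candy example can be made concrete with a toy grammar. The following sketch (the grammar rules and the CYK-style parse counter are illustrative, not taken from any particular NLP library) counts how many distinct parse trees the sentence has, one where Sreeja uses the candy and one where Geetha has it:

```python
# Toy grammar in Chomsky normal form; rule names are illustrative.
BINARY = {
    ("NP", "VP"): ["S"],    # sentence = subject + verb phrase
    ("V", "NP"): ["VP"],    # "saw Geetha"
    ("VP", "PP"): ["VP"],   # "saw Geetha" + "with candy" (Sreeja has it)
    ("NP", "PP"): ["NP"],   # "Geetha" + "with candy" (Geetha has it)
    ("P", "NP"): ["PP"],    # "with candy"
}
LEXICON = {"Sreeja": "NP", "Geetha": "NP", "candy": "NP",
           "saw": "V", "with": "P"}

def count_parses(tokens, goal="S"):
    """CYK-style dynamic program that counts derivations per span."""
    n = len(tokens)
    # chart[i][j] maps nonterminal -> number of derivations of tokens[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        chart[i][i + 1][LEXICON[tok]] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left, lc in chart[i][k].items():
                    for right, rc in chart[k][j].items():
                        for parent in BINARY.get((left, right), []):
                            cell = chart[i][j]
                            cell[parent] = cell.get(parent, 0) + lc * rc
    return chart[0][n].get(goal, 0)

print(count_parses("Sreeja saw Geetha with candy".split()))  # → 2
```

The two counted parses correspond exactly to the two readings: the prepositional phrase with candy attaching to the verb phrase or to Geetha.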

However, computers cannot analyze and recognize sentences as humans do. Therefore, there is a well-defined way to enable computers to perform text processing. Here are the main steps involved in that exercise:

  1. Preprocessing: This step deals with removing all the noise from the sentence, so that only the information critical to the context of the sentence is retained for the next step. For example, language stop words ("noise"), such as is, the, or an, can be removed from the sentence for further processing. The human brain ignores such noise when processing a sentence; similarly, the computer can be fed noiseless text for further processing.
  2. Feature engineering: For the computer to process the preprocessed text, it needs to know the key features of the sentence. This is what is accomplished through the feature engineering step.
  3. NLP processing: With the human language converted into a feature matrix, the computer can perform NLP processing, such as classification, sentiment analysis, or text matching.
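The three steps above can be sketched end to end in plain Python. This is a minimal illustration, not a production pipeline: the stop-word list is a tiny stand-in for a real stop-word corpus, and the lexicon-based sentiment rule stands in for a trained classifier:

```python
from collections import Counter

# Illustrative word lists; real systems would use a stop-word corpus
# and a trained model instead.
STOP_WORDS = {"is", "the", "an", "a", "of", "to", "in", "and", "it", "i"}
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def preprocess(sentence):
    """Step 1: lowercase, strip punctuation, and drop stop words (noise)."""
    tokens = [t.strip(".,!?") for t in sentence.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

def bag_of_words(docs):
    """Step 2: turn token lists into a term-count feature matrix."""
    vocab = sorted({tok for doc in docs for tok in doc})
    matrix = [[Counter(doc)[term] for term in vocab] for doc in docs]
    return vocab, matrix

def sentiment(tokens):
    """Step 3: toy lexicon-based sentiment classification."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

docs = [preprocess("The product is great, I love it!"),
        preprocess("Terrible service and poor quality.")]
vocab, matrix = bag_of_words(docs)
print(vocab)                        # sorted vocabulary over both documents
print(matrix)                       # one count row per document
print([sentiment(d) for d in docs])  # → ['positive', 'negative']
```

Each row of the matrix is a fixed-length numeric representation of one document, which is the form an ML algorithm can consume.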

Now, let's try to understand the high-level activities that would be performed in each of these steps.
