CHAPTER 3
Handling Unstructured Data

In this chapter, we look in more detail at the differences between structured and unstructured data. This difference in the type of data often drives the selection of certain classes of ML algorithms. We see what makes unstructured data different and why it needs particular attention to be handled properly. We explore common types of unstructured data like images, videos, and text, along with the techniques and tools available to analyze this data and extract knowledge from it. Finally, we see examples of converting unstructured data into features that can be used for training Machine Learning models.

Structured vs. Unstructured Data

As we saw in the previous chapter, the key to ML is providing good data that the model can learn patterns from and then use to make its own predictions on unseen data. We need to provide good, clean data to the model in a way that it can learn from. Structured data is data in a state that can be easily consumed by a model: there is a fixed structure to the data you receive and feed to your model. Over time, or over multiple data points, this structure does not change. Hence, you can map your features to this structure. Each data point can be thought of as a fixed-size vector, with each dimension or row of the vector representing a feature.

Figure 3.1 shows two examples of structured data. The first is timeseries data obtained as sensor readings. Here you get data points with the same structure at successive intervals of time. The timestamp in this case is the key or index field (column) that acts as the unique identifier. We will not have two data points with the exact same timestamp (unless our data collection system has an error).


Figure 3.1: Structured data examples—timeseries and tabular data

The second example in Figure 3.1 is tabular or columnar data that shows the history of loans given by a financial institution. It is usually recommended to have a unique key, a customer ID in this case, so we can run fast searches based on that key. However, the same customer may have two loans, and you'll end up with two entries for the same customer ID. In that case, you need a truly unique key like a loan ID.

Now you can see that each of the data points is a finite-length vector of numbers that can be fed into the ML model for training. Similarly, after the model is developed, data in the same structure can be fed to the model for prediction or inference. The features that are used for training map directly to the columns in the structured data. Of course, you may still need to cleanse the data.

For example, the timeseries data always comes with a quality value set by the data acquisition system (DAQ). If the data acquisition system gets the sensor data correctly, it will assign a quality flag of good—in this case 1. An example could be a DAQ with sensor wires connected at different input/output (I/O) points. If a wire is loose and the signal does not come from the sensor to the DAQ box, it will set the flag as bad. One data cleansing step will be to get rid of all the bad‐quality data points.
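
As a minimal sketch, assuming the readings are loaded into a pandas DataFrame with an illustrative quality column, this cleansing step is a simple filter:

import pandas as pd

# Illustrative sensor readings; quality = 1 means good, 0 means bad
readings = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2019-01-01 00:00", "2019-01-01 00:01", "2019-01-01 00:02"]),
    "temperature": [72.5, None, 73.1],
    "quality": [1, 0, 1],
})

# Keep only the rows the DAQ flagged as good quality
clean = readings[readings["quality"] == 1].set_index("timestamp")
print(clean)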

Other examples of structured data include clickstreams, which are collected whenever users click website links; weblogs, which are logs of website statistics collected by web servers; and of course, gaming data, which captures every step you take and every bullet you fire in Call of Duty!

Now let's talk about unstructured data. This could be images or videos collected from cameras. A video stream may be obtained from cameras and stored into common video files like MP4s and AVIs. Text data may be collected from email messages, web searches, product reviews, tweets, social media postings, and more. Audio data may be collected just through sound recorders on cell phones or by placing acoustic sensors at strategic locations to get the maximum sound signal.

Unstructured data is so called because the data points do not follow a fixed structure. An image may come in as an array of pixel intensity values. Text may be encoded as a sequence of characters in special encodings like ASCII (American Standard Code for Information Interchange). Sound may come in as a sequence of pressure readings. There is no fixed structure to this data. You cannot read data from the pixel arrays and directly say that the image has a person in it, for example.

There are usually two popular ways to handle unstructured data, as shown in Figure 3.2.


Figure 3.2: Two paths to handling unstructured data

  • The first approach is to extract features from the unstructured data. This involves cleansing the data, removing noise, and finding the key features. In Figure 3.2, we see unstructured data as a big blob. After cleansing, you can extract structured features—analogous to LEGO blocks. These LEGO blocks can then be assembled to build the result, such as a house.
  • The second approach is to use a method called end‐to‐end learning. This is analogous to a ready‐made mold of a house in which you fit the unstructured data. You don't have to do any cleansing or preparation—you just get the right mold and fit the data into it to get your desired shape. Of course, you need the right mold for the particular result you are trying to get. End‐to‐end models are where Deep Learning really shines. Here the mold is analogous to the appropriate DL architecture used to build your model. These architectures are getting standardized very fast. A DL architecture called the Convolutional Neural Network (CNN) is widely accepted as the standard for most image and video tasks. Similarly, for text and speech data, which arrive as sequences of inputs, the most widely used architecture is the Recurrent Neural Network (RNN). We will cover these DL techniques in detail in Chapters 4 and 5.

In reality, you may not find a silver bullet using either approach. The end‐to‐end approach looks good but will not work in all cases. You will have to use trial and error to see what best fits your needs and datatypes. Sometimes you may have to use a hybrid approach. You may have to cleanse the data to some level and then feed it into a DL model. Although RNNs are best for sequence data, you may find CNN used for sequence data after some preprocessing. The method or combination of methods usually depends on the problem domain and this is where the experience of data scientists comes into play. For now, let's explore each type of unstructured data and the common methods of handling it.

Making Sense of Images

When a computer reads an image, it is usually captured from a digital camera or a scanner and stored in digital form in computer memory. When we take a photo with a digital camera, an optical sensor captures light from the scene and saves the image as a series of numbers—basically a large sequence of 0s and 1s. In raw form, a two‐dimensional image is basically a matrix or array of pixel values, where each pixel value represents the intensity of a particular color. However, an individual pixel does not have a human‐readable meaning like a wine's alcohol percentage or quality rating. This data is usually referred to as unstructured. The individual values carry little significance on their own, but together they complement each other and form the bigger domain object, in this case an image.

First, let's look at an example of how a computer captures and stores unstructured data. Say we have an image of a handwritten digit, as shown in Figure 3.3. This is an image from the open handwriting image dataset known as MNIST, considered the “Hello World” of Deep Learning problems. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size‐normalized and centered in a fixed‐size image. This dataset is made available by Yann LeCun at the website http://yann.lecun.com/exdb/mnist.


Figure 3.3: An image of a handwritten digit 5 in 28×28 resolution

The image we have in Figure 3.3 is a 28×28 pixel image. That means this image is represented in digital format—in the computer's memory—as a two‐dimensional array of pixels with 28 rows and 28 columns. The value of each element of the array is a number from 0 to 255 representing the intensity from black to white, with 255 being white and 0 being black; a value like 150 is a shade of gray. Figure 3.4 expands this image to show exactly how these pixel intensity values look.


Figure 3.4: The image expanded to show the 28×28 pixel array in detail

We see in the expanded image in Figure 3.4 the details of the color values for each of the 28×28 pixels in the array. The border is shown to differentiate the pixels. The white, black, or shade of gray value for each pixel is represented by a number between 0 and 255. Figure 3.5 shows the raw data.


Figure 3.5: Image array as raw data with pixel intensity values

Figure 3.5 is how a computer sees this image. You can see most pixels have a 0 value, representing the color black. The white and gray values form the pattern of the digit 5. Also keep in mind that since this is a grayscale image, each value in the pixel array is a single integer. If we had a color image, each cell would instead be an array of RGB values, with separate values for the red, green, and blue color intensities.

Also, a computer understands only 0s and 1s. When this image is stored in computer memory, the pixel values are not stored as decimal numbers like 139 or 253. Using the computer's number encoding, each integer is stored as a sequence of bits (0s and 1s)—usually eight bits—which can capture values from 0 to 255. Hence 255 is the highest value, and it is assigned to the color white.

You can actually see this in the array in Figure 3.5. Our brain is so amazing that it finds the pattern even in this huge array of values. But how does the computer extract this knowledge from this pixel array? For that, it needs a human‐like intelligence, which is delivered using Machine Learning algorithms.

The features for this dataset are the pixel values themselves, a total of 28 × 28 = 784 features per image. It is extremely difficult for a traditional Machine Learning model to learn a direct correlation between individual pixel values and the digit we want to predict.
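
As a minimal illustration, here is how a 28×28 pixel grid flattens into a 784-element feature vector. A synthetic array stands in for a real MNIST digit:

import numpy as np

# A stand-in for one MNIST image: a 28x28 array of pixel intensities (0-255)
image = np.zeros((28, 28), dtype=np.uint8)
image[5:23, 12:16] = 255   # a crude vertical stroke, just for illustration

# Flatten the 2D pixel array into a single 784-element feature vector
features = image.reshape(-1)
print(features.shape)      # (784,)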

This is how the image is processed by the computer; however, it does not make sense to store such a large array for each image. Practically, we compress the large array into a format optimized for storage. We know these compressed storage formats by their file extensions—GIF (Graphics Interchange Format), JPG/JPEG (Joint Photographic Experts Group), and PNG (Portable Network Graphics). Each format has its own way of compressing data and saving images. You can use a computer vision or image processing library like OpenCV or PIL (Python Imaging Library) to read files in these formats and convert them into arrays for processing. Let's look at some examples.

Computer Vision

Computer vision is all about seeing things in images. We process images and extract knowledge from them. We can do things like find geometrical objects such as lines, rectangles, circles, etc., in images. We can look at colors of different objects and try to separate them. The knowledge extracted, which may be geometry or colors, can be used to prepare features that will be used to train our ML model. Hence, computer vision helps us in feature engineering to extract important knowledge from the large image array. Let's look at this through some examples.

We will use one of the most popular image‐processing libraries called OpenCV. This was developed at Intel and then open sourced. Currently this is maintained as an open source solution at opencv.org. OpenCV is written in C++, but has APIs available in other languages like Python and Java. We will of course use Python as before. You can install it from the website. Incidentally, OpenCV comes preinstalled when you start a Notebook at Google Colaboratory.

Here we will cover some of the basic CV steps that will help you do some preprocessing on images. You can find a whole lot more examples at the OpenCV website at https://docs.opencv.org/4.0.0/d6/d00/tutorial_py_root.html.

Now we will look at some key computer vision tasks that are done to load and process images. We will do these in OpenCV. We will first load an image from disk, display it, and manipulate the pixels to show how it changes (see Listing 3.1). We will use a free and openly available image from Wikipedia—the Mona Lisa. The Mona Lisa is a painting by the Italian Renaissance artist Leonardo da Vinci. It's been described as “the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world.” It is worth almost $800 million today. The image of Mona Lisa we will use is available at https://en.wikipedia.org/wiki/Mona_Lisa. You can save this as monalisa.jpg on your local drive. See Figure 3.6.
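
A minimal sketch of what Listing 3.1 does, assuming the painting is saved as monalisa.jpg in your working directory, might look like this:

import cv2
import matplotlib.pyplot as plt

print("OpenCV Version:", cv2.__version__)

# Read the image from disk; OpenCV returns a NumPy array in BGR order
image = cv2.imread("monalisa.jpg")
print("Original image array shape:", image.shape)
print("Pixel (100,100) values:", image[100, 100])

# Resize to 400 pixels wide by 600 pixels high
resized = cv2.resize(image, (400, 600))
print("Resized image array shape:", resized.shape)

# Convert to grayscale and display with Matplotlib
gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
print("Image converted to grayscale.")
plt.imshow(gray, cmap="gray")
plt.show()

# Save the resized (color) image as a new JPG file
cv2.imwrite("new_monalisa.jpg", resized)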

Here are the results:

OpenCV Version: 3.4.2
Original image array shape: (1024, 687, 3)
Pixel (100,100) values: [145 152 95]
 
Resized image array shape: (600, 400, 3)
 
Image converted to grayscale. 

Figure 3.6: Load image using OpenCV and convert it to grayscale

We have loaded the image using the OpenCV library (CV2) and we have it as an array. We resized the image to a 400‐pixel width and 600‐pixel height image and displayed it using the Matplotlib charting library.

Finally, we save the modified image as a new JPG file called new_monalisa.jpg. This new image has 400×600 pixels—that is, 240,000 pixels. Each pixel has three values, one per color channel, and each color value representing red, green, and blue is an 8‐bit integer between 0 and 255. So the raw size of the image should be 240,000 × 3 bytes, which is 720,000 bytes or roughly 720 kilobytes (KB). If you look at the new file generated (new_monalisa.jpg), it's only about 124KB. That's the level of compression JPG encoding provides.

One thing you will notice in this code is that we changed the color spaces back and forth. Color spaces determine how the information about colors is encoded in a digital image. The most popular way of representing color is using three values, one each for the red, green, and blue (RGB) elements. Any color can be represented as a combination of these three colors. The RGB color model is an additive color model in which red, green, and blue values are added together in various ways to reproduce a broad array of colors. So red is represented as (255,0,0), green as (0,255,0), and blue as (0,0,255). As you see in Figure 3.7, a combination of red and green gives us yellow, green and blue gives cyan, and blue and red gives magenta (a pinkish purple).


Figure 3.7: RGB color space (source: Wikipedia)

(Source: SharkD)

Listing 3.2 shows some examples of how the additive nature of the RGB color space works. You see that you can mix colors and get new colors. Black and white are the extremes, with all 0 or all 255 values for the RGB color channels. You can try several combinations and see what you get. Keep in mind that here the resolution or granularity of the digital colors is 8 bits per channel, so the maximum number we can use to represent a color value is 255. This is the most common resolution. However, systems with higher color depth use 16 bits per channel or more and can represent even finer variations in color.
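
A quick sketch of this additive mixing, building solid color patches with NumPy (the particular combinations shown are illustrative, in the spirit of Listing 3.2):

import numpy as np
import matplotlib.pyplot as plt

# A few additive RGB combinations, each drawn as a 100x100 solid patch
colors = {
    "red (255,0,0)":      (255, 0, 0),
    "green (0,255,0)":    (0, 255, 0),
    "blue (0,0,255)":     (0, 0, 255),
    "red+green=yellow":   (255, 255, 0),
    "green+blue=cyan":    (0, 255, 255),
    "red+blue=magenta":   (255, 0, 255),
}

fig, axes = plt.subplots(1, len(colors), figsize=(12, 2))
for ax, (name, rgb) in zip(axes, colors.items()):
    patch = np.zeros((100, 100, 3), dtype=np.uint8)
    patch[:, :] = rgb                 # broadcast the RGB triple to every pixel
    ax.imshow(patch)
    ax.set_title(name, fontsize=7)
    ax.axis("off")
plt.show()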

There are other color spaces used by different systems. For example, OpenCV loads and saves images in the BGR color space instead of RGB. Hence, we need to convert the color space after reading an image or before storing it to save it in the correct format. Some other popular color spaces are YPbPr and HSV. YPbPr is a color space used in video electronics, particularly with component video cables. HSV (Hue, Saturation, Value) is another popular color space; it describes colors by their hue, saturation, and brightness rather than as an additive mix like RGB.

Now, let's do some processing on this image, as shown in Listing 3.3. We first convert the image into grayscale or black‐and‐white. Then we fill a portion of the image as a black rectangle. Then we crop a portion of the image and fill it elsewhere. We do these as array operations. Figure 3.8 shows the results.
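
A minimal sketch of these array operations, assuming the resized new_monalisa.jpg from the earlier step and using illustrative coordinates:

import cv2
import matplotlib.pyplot as plt

# Load the resized image saved earlier (600 rows x 400 columns)
image = cv2.imread("new_monalisa.jpg")

# 1. Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# 2. Fill a portion of the image with a black rectangle via array slicing
blocked = gray.copy()
blocked[50:150, 100:300] = 0

# 3. Crop a region of interest (the face area; coordinates are illustrative)
#    and paste it lower down in the image
face_roi = gray[80:220, 130:270]
pasted = gray.copy()
pasted[400:540, 130:270] = face_roi

for i, (title, img) in enumerate([("Grayscale", gray),
                                  ("Black rectangle", blocked),
                                  ("ROI copied", pasted)]):
    plt.subplot(1, 3, i + 1)
    plt.imshow(img, cmap="gray")
    plt.title(title)
    plt.axis("off")
plt.show()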


Figure 3.8: Results of array operations on the image

Now we will use OpenCV's built‐in functions for drawing some geometry and text on the image. We will first make a copy of the original image in memory, which we call temp_image, and then process this copy to show the results. For showing the results, we define a dedicated function that hides the axes and sets the image size when the image is displayed. Let's see this in action in Listing 3.4. Figure 3.9 shows the results.
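
A rough sketch of these drawing operations (coordinates, colors, and text are illustrative):

import cv2
import matplotlib.pyplot as plt

def show_image(img, size=(6, 8)):
    # Display an OpenCV (BGR) image with Matplotlib, without axes
    plt.figure(figsize=size)
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    plt.axis("off")
    plt.show()

image = cv2.imread("new_monalisa.jpg")   # file saved in the earlier step
temp_image = image.copy()                # work on a copy, keep the original

# Draw a rectangle, a circle, a line, and some text on the copy
cv2.rectangle(temp_image, (130, 80), (270, 220), (0, 255, 0), 2)
cv2.circle(temp_image, (200, 150), 40, (255, 0, 0), 2)
cv2.line(temp_image, (0, 300), (400, 300), (0, 0, 255), 2)
cv2.putText(temp_image, "Mona Lisa", (20, 580),
            cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)

show_image(temp_image)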


Figure 3.9: Results of the OpenCV operations on the image

Now we will use OpenCV's functions for doing some image‐cleansing operations. These can be pretty handy when you're dealing with noisy images, which is often the case when you get field images. Many times, the color may not store important information about the image. You may be interested in understanding the geometry, and in that case, a grayscale image is fine. So first we convert our image to grayscale and then perform a thresholding operation on it.

Thresholding is a very important operation in computer vision. It is basically a filtering operation that compares each pixel's intensity against a particular cutoff value. Pixels below that value are suppressed. This way, we keep only specific details, like the bright areas of the image.

Let's see this in action in Listing 3.5. Figure 3.10 shows the result.
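
A minimal sketch of such thresholding, applying several of OpenCV's threshold modes with an illustrative cutoff of 127:

import cv2
import matplotlib.pyplot as plt

gray = cv2.cvtColor(cv2.imread("new_monalisa.jpg"), cv2.COLOR_BGR2GRAY)

# Different thresholding modes, all using a cutoff intensity of 127
modes = [("Original", None),
         ("BINARY", cv2.THRESH_BINARY),
         ("BINARY_INV", cv2.THRESH_BINARY_INV),
         ("TRUNC", cv2.THRESH_TRUNC),
         ("TOZERO", cv2.THRESH_TOZERO),
         ("TOZERO_INV", cv2.THRESH_TOZERO_INV)]

plt.figure(figsize=(10, 8))
for i, (title, mode) in enumerate(modes):
    result = gray if mode is None else cv2.threshold(gray, 127, 255, mode)[1]
    plt.subplot(2, 3, i + 1)
    plt.imshow(result, cmap="gray")
    plt.title(title)
    plt.axis("off")
plt.show()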


Figure 3.10: Results of thresholding operations on the image

Now we will perform two operations that can greatly help you make images smooth and remove noise. We will use a process called convolution to run a filter or kernel over the image. The filter will have a particular structure that will help process the image and transform it. Using special kinds of filters, we can do operations like smooth or blur the image or sharpen it. These are the operations often done by image processing software like Photoshop and mobile photo editors.

We will use two filters/kernels of the following type. These will be uniformly applied over the entire image array and we will see how the results transform the image:

Kernel_1 = 1/9 * [[ 1,  1,  1],
                  [ 1,  1,  1],
                  [ 1,  1,  1]]
 
Kernel_2 =       [[-1, -1, -1],
                  [-1, +9, -1],
                  [-1, -1, -1]]

Let's see this in action in Listing 3.6. Figure 3.11 shows the results.
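
A minimal sketch of applying these two kernels with OpenCV's filter2D function, reusing the image file from the earlier steps:

import cv2
import numpy as np
import matplotlib.pyplot as plt

image = cv2.imread("new_monalisa.jpg")

# Kernel 1: averaging (box) filter - smooths or blurs the image
kernel_1 = np.ones((3, 3), np.float32) / 9.0

# Kernel 2: sharpening filter - emphasizes each pixel against its neighbors
kernel_2 = np.array([[-1, -1, -1],
                     [-1,  9, -1],
                     [-1, -1, -1]], np.float32)

smoothed = cv2.filter2D(image, -1, kernel_1)
sharpened = cv2.filter2D(image, -1, kernel_2)

for i, (title, img) in enumerate([("Smoothed", smoothed),
                                  ("Sharpened", sharpened)]):
    plt.subplot(1, 2, i + 1)
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    plt.title(title)
    plt.axis("off")
plt.show()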


Figure 3.11: Results of applying 2D filters to the image

You can use these techniques to cleanse the images you collect of noise. Smoothing helps get rid of unwanted noise in images. In some cases, if the images are too blurry, you can use a sharpening filter to make the features look more prominent.

Another very useful technique that is often used is to extract geometry information from images. You can take a grayscale image and extract the edges from it. This helps remove unwanted details like colors, shading, etc., and focuses only on the prominent edges. Listing 3.7 shows the code and Figure 3.12 shows the result.
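
A minimal sketch using OpenCV's Canny function (the two hysteresis thresholds, 100 and 200, are illustrative):

import cv2
import matplotlib.pyplot as plt

gray = cv2.cvtColor(cv2.imread("new_monalisa.jpg"), cv2.COLOR_BGR2GRAY)

# Canny edge detection with lower and upper hysteresis thresholds
edges = cv2.Canny(gray, 100, 200)

plt.imshow(edges, cmap="gray")
plt.axis("off")
plt.show()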


Figure 3.12: Results of applying Canny edge detection

We will see one last example that may be helpful when you handle image data. We earlier saw an example where we took a small region of interest (ROI) from a bigger image. However, in that case, we knew the exact coordinates that corresponded to the face of Mona Lisa. Now we will see a technique to detect faces directly. This is an ML technique that is included with the OpenCV library. We will cover details of the ML methods in the next chapter; however, let's talk a little about this method.

OpenCV comes with an algorithm that can look at images and automatically detect faces in them. This algorithm is called Haar Cascades. The idea here is that it tries to use some knowledge of how a face looks in a big array of pixels. It tries to capture knowledge like the fact that our eyes are usually darker than the rest of our face, the region between the eyes is bright, etc. Then, using a cascade of learning units or classifiers, it identifies the coordinates of a face in an image. These classifiers for detecting faces, eyes, ears, etc. are already trained for you and made available on the OpenCV GitHub at https://github.com/opencv/opencv/tree/master/data/haarcascades.

Take a look at the face detection in action in Listing 3.8. Figure 3.13 shows the result.
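
A minimal sketch of this face detection, assuming the pretrained cascade XML is available (recent pip installs of opencv-python expose its path via cv2.data.haarcascades; otherwise you can download it from the GitHub link above):

import cv2
import matplotlib.pyplot as plt

image = cv2.imread("new_monalisa.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Load the pretrained frontal-face Haar Cascade shipped with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Detect faces; scaleFactor and minNeighbors may need tuning for your images
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around each detected face
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.axis("off")
plt.show()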


Figure 3.13: Results of detecting a face using the Haar Cascade Classifier

These preprocessing steps can greatly cleanse your noisy images and help you extract valuable information, which can then be used to train the ML model. Using smoothing and edge detection, you can get rid of the background and only give the model relevant information to work with. Similarly, say you are building a face detection analytic—like the one that iPhone uses to unlock with face identification. The first step would be to reduce the large image into a much smaller and more manageable region of interest, which can be processed much faster by your face recognition model.

There are lots more algorithms and methods that computer vision libraries like OpenCV provide. If your data involves images, then you can look at details of some of the other methods like extracting Hough lines, circles, matching image templates, etc. They are available at https://docs.opencv.org/4.0.0/d6/d00/tutorial_py_root.html.

Next, we'll look at how we can handle video data. Again, we will use computer vision methods to do so.

Dealing with Videos

Videos are basically sequences of images over time; they can be thought of as a timeseries of image data. Typically, you extract frames at specific times from a video and process them using regular computer vision or ML algorithms. Now you may feel that storing all these images in sequence would make the video files extremely huge. A typical video has around 24 or 30 frames per second (fps), which means there are 24 or 30 images for every second of video. You can see how the file sizes would grow enormous. That's where video formats come into play.

Just like image storage formats such as JPG, GIF, and PNG compress pixel arrays into binary formats, a video compressor/decompressor (codec) compresses the sequence of images that makes up a video. Common video codecs are Xvid, DivX, and the currently most popular, H.264. These codecs define how the frames are encoded to maximize storage efficiency and minimize loss.

Along with a codec, a video also has a container type, also known as the format. The container stores the contents of the video file encoded by the respective codec. Popular container formats are AVI, MOV, and MP4. Not all MP4 files are encoded with the same codec, so your video player may need to download an additional codec even though the extension is the same—.MP4. Sometimes the video content is delivered as a stream rather than as a container file. A similar codec is used, only the content is streaming. That's how you get content delivered over YouTube and Netflix.

Computer vision libraries like OpenCV provide codec support to decode these video files and extract frames. OpenCV can also connect to a live stream from a source like a camera and extract video. Check out the example code in Listing 3.9. It is difficult to show the actual results in a book, but you can run the example on your machine.

This code will read a video file, extract the frames (images), convert the frames to grayscale, and write every 30th frame out. Assuming 30 frames per second, you should get a frame per second. After you have the images or frames, you can run the same computer vision algorithms to extract valuable information.
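
A minimal sketch of this frame-extraction loop; sample.mp4 is a placeholder filename, and passing a device index like 0 instead of a filename would read from a live camera:

import cv2

# Open a video file (or a camera device index for a live stream)
cap = cv2.VideoCapture("sample.mp4")

frame_count = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:                      # no more frames
        break
    frame_count += 1
    if frame_count % 30 == 0:        # roughly one frame per second at 30 fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        cv2.imwrite("frame_%05d.jpg" % frame_count, gray)

cap.release()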

Next, we cover handling another interesting datatype—text.

Handling Textual Data

Data in text format is one of the most common forms of unstructured data around us. We don't often consider text as a data source; however, analyzing text can give us rich insights into several aspects, particularly human behavior.

You have probably had this experience yourself. The other day, I searched for reviews of a new PlayStation game on Google. The next thing I knew, I started getting bombarded with advertisements of games in the same genre. I also got an email from Amazon recommending more games. When I entered my search query, Google had an algorithm that extracted the meaning of my search query and learned that I am interested in that product. Then it passed that information to other algorithms that found similar products and provided me with recommendations. That is the magic of modern advertising. Companies like Google, Facebook, and Twitter have advertisements as one of their major revenue sources. They continuously analyze volumes of text content generated from product reviews, social media postings, and tweets to extract valuable information about the lifestyles of their customers. Many times, this information is sold to third parties, who can mine this data and extract valuable insights. Text mining is a major activity where companies try to extract value from text content using advanced Natural Language Processing (NLP) algorithms.

Another example of analyzing text data is the chatbot, which understands text messages sent by customers and responds appropriately by searching through huge databases of text. Here the chatbot needs to be smart enough to understand what the customer asked for and respond correctly. Many online support services employ chatbots and you may not even know that you are not talking to a human on the other end. Text analysis is also extensively used for filtering emails and identifying spam content. This is a classification problem where, based on the content of the message, we give it a label of spam or not.

What makes text data unique is that it comes in as a sequence of characters, unlike an image, which is one big blob or array of data. Text content comes in as a sequence and has to be processed so that the meaning or context can be derived. In the computer memory, text data is encoded using several types of encoding. It could be a proprietary encoding like a Microsoft Word file or an open encoding specified by American Standard Code for Information Interchange (ASCII). Now this sequence of text data has to be analyzed for meaning.

As we saw in Figure 3.2 for text data, you can follow one of the same two approaches. You can denoise the data and extract features using specialized text processing techniques like NLP. Or you can feed the text as a vector to Deep Learning models that learn to extract this information.

For NLP, one of the most popular libraries is NLTK (Natural Language Tool Kit). NLTK is written in the Python programming language. It was developed by Steven Bird and Edward Loper from the Department of Computer and Information Science at the University of Pennsylvania. Details about this library are available at https://www.nltk.org.

Let's look at some examples of processing text data to cleanse it and extract features from it. We will look at an example of using an end‐to‐end DL approach, a Recurrent Neural Network (RNN), in the next chapter.

Natural Language Processing (NLP)

NLP is about processing text data to cleanse it and extract valuable information from it. If we can understand the meaning of the text and act on it, that is termed a different activity called Natural Language Understanding (NLU). NLP usually deals with lower‐level actions and NLU with higher‐level ones. The chatbot case we discussed earlier is an example of NLU. However, we often generalize and use the term NLP for all text analysis.

Let's look at some basic concepts of NLP. Text is stored in groups called documents. Documents contain words, which are called tokens. Tokens in a document are grouped into sentences, which are separated by full stops. A sentence is usually a sequence of tokens that carries some meaning and should be processed together and in order. A group of similar documents is called a corpus. Many corpora are available online for free to test our NLP skills. NLTK itself comes with corpora like Reuters (news), Gutenberg (books), and WordNet (word meanings) that have specific content.

Let's look at some quick and simple examples with sample NLTK code, which you can easily apply to your data to analyze text.

First, we will work on cleansing the data. We will convert the text to lowercase and then tokenize it to extract words and sentences. Then we will remove some commonly occurring stop words. Stop words like the, a, and and usually don't add value to the overall context or meaning of a sentence. Finally, we will create a frequency plot to identify the most common words. This can easily give us the gist of the words of importance and help in summarizing the content. You can look at this effort in Listing 3.10.
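
A minimal sketch of this cleansing pipeline with NLTK. The small stop word list is hand-picked for this example, and the stemming and lemmatization steps discussed after the results are included here as well:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

text = ("We are studying Machine Learning. Our Model learns patterns in "
        "data. This learning helps it to predict on new data.")
print("ORIGINAL TEXT =", text)

text = text.lower()
word_tokens = nltk.word_tokenize(text)
sentence_tokens = nltk.sent_tokenize(text)
print("WORD TOKENS =", word_tokens)
print("SENTENCE TOKENS =", sentence_tokens)

# A small hand-picked stop word list for this example
stop_words = ['is', 'a', 'our', 'on', '.', '!', 'we', 'are', 'this',
              'of', 'and', 'from', 'to', 'it', 'in']
print("STOP WORDS =", stop_words)
cleaned_tokens = [w for w in word_tokens if w not in stop_words]
print("CLEANED WORD TOKENS =", cleaned_tokens)

# Normalize tokens two ways: stemming and lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stemmed = [stemmer.stem(w) for w in cleaned_tokens]
lemmatized = [lemmatizer.lemmatize(w) for w in cleaned_tokens]
print("CLEANED STEMMED TOKENS =", stemmed)
print("CLEANED LEMMATIZED TOKENS =", lemmatized)

# Frequency distribution of the most common stemmed tokens
nltk.FreqDist(stemmed).plot(10)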

Here are the results:

ORIGINAL TEXT = We are studying Machine Learning. Our Model learns
patterns in data. This learning helps it to predict on new data.
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
WORD TOKENS = ['we', 'are', 'studying', 'machine', 'learning', '.',
'our', 'model', 'learns', 'patterns', 'in', 'data', '.', 'this',
'learning', 'helps', 'it', 'to', 'predict', 'on', 'new', 'data', '.']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
SENTENCE TOKENS = ['we are studying machine learning.', 'our model
learns patterns in data.', 'this learning helps it to predict on new
data.']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
STOP WORDS = ['is', 'a', 'our', 'on', '.', '!', 'we', 'are', 'this',
'of', 'and', 'from', 'to', 'it', 'in']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
CLEANED WORD TOKENS = ['studying', 'machine', 'learning', 'model',
'learns', 'patterns', 'data', 'learning', 'helps', 'predict', 'new',
'data']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
CLEANED STEMMED TOKENS = ['studi', 'machin', 'learn', 'model', 'learn',
'pattern', 'data', 'learn', 'help', 'predict', 'new', 'data']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
CLEANED LEMMATIZED TOKENS = ['studying', 'machine', 'learning',
'model', 'learns', 'pattern', 'data', 'learning', 'help', 'predict',
'new', 'data']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐ 

If you follow the code and results in Listing 3.10, we take a set of sentences through a series of cleansing steps. We make the text lowercase, tokenize it into words, and remove any stop words. Then for each token we apply two normalization techniques in parallel—stemming (see Figure 3.14) and lemmatization (see Figure 3.15). Both techniques try to collapse different versions of the same base word, such as learns, learning, and learned for the base word learn, to make the text simpler.


Figure 3.14: Frequency chart of common words—stemmed


Figure 3.15: Frequency chart of common words—lemmatized

Stemming is a more heuristic technique, where common suffixes are chopped off, like s, es, and ing. However, in doing so, sometimes the true meaning of the word is lost. In the results for stemming, you see some non‐words like machin and studi. Lemmatization, on the other hand, tries to derive the actual root word and keeps the results as valid words. Hence, we see valid words as a result of lemmatization. This is usually preferred when you process text.

Finally, we get a frequency of most common words and plot it—both stemmed and lemmatized. This gives us a high‐level summary of the most frequently occurring words and helps us get a gist of the text. We have a very small amount of text here, but when you apply this approach to a large document or corpus, you can clearly see the key terms popping up with high frequency.

After cleansing the text data, we will explore how to extract some useful information. We will look at two very useful text processing concepts called parts of speech (POS) tagging and Named Entity Recognition (NER). Here, we are extracting contextual information about the text, so the sequence of words is very important. The sequence in which words are arranged helps the algorithm understand what part of speech each word represents.

POS tagging takes a word‐tokenized sentence and identifies the parts of speech, like nouns, verbs, adverbs, etc. A detailed list of tag names added by NLTK to words and their meanings is shown in Listing 3.11.

Named Entity Recognition takes POS one step further by identifying real‐world entities like person, organization, event, etc., from words. Take a look at the quick example in Listing 3.12.
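
A minimal sketch of POS tagging and NER with NLTK (the tagger and chunker models are downloaded on first use):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("maxent_ne_chunker", quiet=True)
nltk.download("words", quiet=True)

sentence = "Mark is working at GE"
print("SENTENCE TO ANALYZE =", sentence)

# Part-of-speech tagging on the word tokens
pos_tags = nltk.pos_tag(nltk.word_tokenize(sentence))
print("PARTS OF SPEECH FOR SENTENCE =", pos_tags)

# Named Entity Recognition on top of the POS tags
named_entities = nltk.ne_chunk(pos_tags)
print("NAMED ENTITIES FOR SENTENCE =", named_entities)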

Here are the results:

SENTENCE TO ANALYZE = Mark is working at GE
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
PARTS OF SPEECH FOR SENTENCE = [('Mark', 'NNP'), ('is', 'VBZ'),
('working', 'VBG'), ('at', 'IN'), ('GE', 'NNP')]
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
NAMED ENTITIES FOR SENTENCE = (S (PERSON Mark/NNP) is/VBZ working/VBG
at/IN (ORGANIZATION GE/NNP))
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐ 

Here you can see that Mark and GE were tagged as proper nouns and is and working were tagged as verbs. When we do NER, it identifies Mark as a person and GE as an organization. As you analyze bigger volumes of text, this technique can be invaluable to extracting key named entities.

Word Embeddings

So far, we have kept the text as is and applied some NLP techniques to cleanse the data, find word frequencies, and extract information like parts of speech and named entities. However, for more complex processing, we need to convert the text into vectors or arrays that can help us extract more value. This is just like the case of images, which we converted into arrays of pixel intensity values for processing. Now we will see how to convert text into arrays. The key thing with text data is that to extract value from it, we need to treat it as a sequence. We need to process the words in order so that the contextual information is captured correctly.

One of the most basic ways to create a word vector is one‐hot encoding. One‐hot encoding is often used to represent categorical data, where each data point belongs to a particular category. Here we have a large binary array with one element for each possible category. For any data point, all the elements have zero values except for the one that represents the category of that data point, which has a value of 1. Listing 3.13 shows an example. We first create a vocabulary of all the words that are relevant, obtained by analyzing all the words in our corpus—here, just a small amount of text. Then, using this vocabulary, we can build one‐hot encoded vectors.
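
A minimal sketch of building this vocabulary and the one-hot vectors, along the lines of Listing 3.13:

import nltk

nltk.download("punkt", quiet=True)

text = ("AI is the new electricity. AI is poised to start a large "
        "transformation on many industries.")

# Build a sorted vocabulary of the unique tokens in the corpus
tokens = nltk.word_tokenize(text.lower())
vocabulary = sorted(set(tokens))
print("VOCABULARY =", vocabulary)

# One-hot vector: all zeros except a single 1 at the word's vocabulary index
def one_hot(word, vocab):
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

for word in vocabulary:
    print("ONE HOT VECTOR FOR '%s' =" % word, one_hot(word, vocabulary))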

Here are the results:

VOCABULARY = ['.', 'a', 'ai', 'electricity', 'industries', 'is',
'large', 'many', 'new', 'on', 'poised', 'start', 'the', 'to',
'transformation']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
ONE HOT VECTOR FOR '.' = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'a' = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'ai' = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'electricity' = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'industries' = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'is' = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'large' = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'many' = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'new' = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'on' = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'poised' = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
ONE HOT VECTOR FOR 'start' = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
ONE HOT VECTOR FOR 'the' = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
ONE HOT VECTOR FOR 'to' = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
ONE HOT VECTOR FOR 'transformation' = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] 

As you can see, for a very small text set like this with a couple sentences, we get pretty big vectors. As we look at corpora with vocabularies of thousands or millions of words, these vectors can get extremely large. Hence this method is not recommended.

Another way of representing text is using word frequencies for full sentences or documents. We first define a vocabulary for the corpus, and then for each sentence or document we count the frequency of each word. Now we can represent each sentence or document as an array with the count of each occurring word. We could convert the counts into percentages to show the relative importance of words. The problem with this approach is that many stop words like and, the, and to will have very high frequencies.

An alternative approach that is popular is called term frequency–inverse document frequency (TF‐IDF). This is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. This method assigns frequency terms for words but also compares them against the words occurring in the other documents of the corpus. So, if many documents in the corpus contain a word, it is more likely to be a common word like a stop word and is given a smaller weight. On the other hand, if a term is frequent in a particular document but not in other documents, then most likely it reflects a subject area of that document. That's the concept of TF‐IDF. The problem with TF‐IDF is that, again, the vectors can get pretty big due to the large vocabulary size. Also, it does not consider the sequence of words and hence cannot capture the context in which a word is used.
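
As a minimal sketch, here is TF-IDF computed with scikit-learn's TfidfVectorizer (scikit-learn is an assumption here; the tiny corpus is made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "AI is the new electricity.",
    "AI is poised to start a large transformation on many industries.",
    "The electricity grid is a large industrial system.",
]

# Fit a TF-IDF model over the small corpus and vectorize each document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(sorted(vectorizer.vocabulary_))     # the learned vocabulary terms
print(tfidf_matrix.toarray().round(2))    # one TF-IDF vector per document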

Modern systems use a method called word embeddings to convert words into vectors. Here the embedding values are assigned so that words appearing in similar contexts end up with similar vectors. We will use a popular open source library called Gensim, which focuses on topic modeling and word vector models. Gensim was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies. Details are available at https://radimrehurek.com/gensim/index.html.

Gensim can be installed using the Python pip installer, as follows:

pip install --upgrade gensim 

We will now look at a very popular algorithm for learning word embeddings, called Word2Vec. Word2Vec is a neural network model that learns the context of words and builds dense vectors that represent each word along with its context. First you train this model on your data and then start using it to get word embeddings. You can also download pretrained embedding models built on general corpora and use them directly. We will see an example of building an embedding on our own dataset. Unlike the one‐hot encoded vectors, which were sparse, the vectors we get here are dense and of fixed length. Hence, they can represent words with limited storage and can be processed very fast. Internally, Word2Vec uses a combination of two learning models—continuous bag of words (CBOW) and skip‐grams. The details of how these algorithms work can be found in this wonderful research paper: https://arxiv.org/pdf/1301.3781.pdf.

For now, we will look at the implementation of creating word embeddings from our text. Take a look at the example in Listing 3.14.
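
A minimal sketch along the lines of Listing 3.14, using Gensim's Word2Vec. Recent Gensim (4.x) names the dimensionality parameter vector_size and exposes the vocabulary as wv.index_to_key; older 3.x releases use size and wv.vocab instead:

import nltk
from gensim.models import Word2Vec

nltk.download("punkt", quiet=True)

text = ("AI is the new electricity. AI is poised to start a large "
        "transformation on many industries.")
print("ORIGINAL TEXT =", text)

# Tokenize into sentences, then words, dropping a few hand-picked stop words
stop_words = ['is', 'a', 'our', 'on', '.', '!', 'we', 'are', 'this',
              'of', 'and', 'from', 'to', 'it', 'in']
sentences = nltk.sent_tokenize(text.lower())
print("SENTENCE TOKENS =", sentences)
training_data = [[w for w in nltk.word_tokenize(s) if w not in stop_words]
                 for s in sentences]
print("TRAINING DATA =", training_data)

# Train a small Word2Vec model with 20-dimensional embeddings
model = Word2Vec(training_data, vector_size=20, min_count=1, seed=1)

print("VOCABULARY OF MODEL =", list(model.wv.index_to_key))
print("EMBEDDINGS VECTOR FOR THE WORD 'ai' =", model.wv["ai"])
print("EMBEDDINGS VECTOR FOR THE WORD 'electricity' =", model.wv["electricity"])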

Here are the results:

ORIGINAL TEXT = AI is the new electricity. AI is poised to start a
large transformation on many industries.
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
SENTENCE TOKENS = ['ai is the new electricity.', 'ai is poised to start
a large transformation on many industries.']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
TRAINING DATA = [['ai', 'the', 'new', 'electricity'], ['ai', 'poised',
'start', 'large', 'transformation', 'many', 'industries']]
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
VOCABULARY OF MODEL = ['ai', 'the', 'new', 'electricity', 'poised',
'start', 'large', 'transformation', 'many', 'industries']
‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
 
EMBEDDINGS VECTOR FOR THE WORD 'ai' = [ 2.3302788e-02  9.8732607e-03
  4.6109618e-03  5.3516342e-03 -2.4620935e-02 -5.2335849e-03
 -8.8206278e-03  1.3721633e-02 -1.8686499e-04 -2.2845879e-02
  3.5632821e-03 -6.0331034e-03 -2.2344168e-03 -2.3627717e-02
 -2.3793013e-05 -1.3868282e-02 -3.0636601e-03  1.0795521e-02
  1.2196368e-02 -1.4501591e-02]
 
EMBEDDINGS VECTOR FOR THE WORD 'electricity' = [-0.00058223 -0.00180565
 -0.01293694  0.00430049 -0.01047355 -0.00786022 -0.02434015  0.00157354
  0.01820784 -0.00192494  0.02023665  0.01888743 -0.02475209  0.01260937
  0.00428402  0.01423089 -0.02299204 -0.02264629  0.02108614  0.01222904]

The Word2Vec model has learned some vocabulary from the current small amount of text we provided. It trained itself on this data and now can provide us with embeddings for specific words. The embeddings vector does not mean anything to us. However, it has been built by observing patterns among the words and the order or sequence in which they appear. These embeddings can be used to analyze words mathematically, show similarities, and apply Deep Learning analysis.

The embeddings vector here has 20 dimensions, so when we display the vector, it has 20 rows. Word embedding in 20 dimensions is difficult for us to visualize. We can make sense of vectors in two dimensions and plot them on a chart. Let's try to do this.

We will use an unsupervised learning technique called Principal Component Analysis (PCA) to reduce the 20‐dimensional vectors to two‐dimensional vectors. Although there is a loss of information when we do this, the two‐dimensional vectors try to capture the maximum variation present in the original 20 dimensions. PCA is an unsupervised ML technique for dimensionality reduction, as we discussed in Chapter 2. Let's look at an example of applying PCA to word embeddings to plot the words on a chart, as shown in Listing 3.15. The actual plot of words is shown in Figure 3.16.
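
A minimal sketch of this projection, using scikit-learn's PCA (an assumption here) and the model trained in the previous sketch:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assumes `model` is the Word2Vec model trained in the previous sketch
words = list(model.wv.index_to_key)
vectors = model.wv[words]            # shape: (vocabulary size, 20)

# Reduce the 20-dimensional embeddings to 2 dimensions for plotting
pca = PCA(n_components=2)
points = pca.fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()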


Figure 3.16: PCA to reduce dimensions and plot word embeddings

We don't get much insight from these word embeddings because we have a very small amount of text. However, if you have a large text corpus on which to train the Word2Vec model, you will start seeing relationships between similar words. A pretrained model with around 3 million words from the Google News dataset is available from Google for free. You can download this model and use its embeddings to establish relationships between words. You can also perform word math using these words converted to 300‐dimensional vectors.

For example, a very popular example cited in many books on word embeddings is getting embeddings for the words king, man, and woman. You can then use vector math to solve this equation:

(king - man) + woman 

The result of this vector math is closest to the embedding for the word queen. So, you are able to extract meaning or context from these words and use it to show relationships.
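
With Gensim, this word math is a short query once the pretrained vectors are loaded. A rough sketch, assuming you have downloaded the Google News vectors (the filename below is the conventional one for that download, and the file is several gigabytes):

from gensim.models import KeyedVectors

# Load the pretrained Google News word vectors (300 dimensions per word)
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Vector arithmetic: (king - man) + woman is closest to queen
result = model.most_similar(positive=["king", "woman"],
                            negative=["man"], topn=1)
print(result)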

We will see an example of using a word embedding to get vectors and feed it to a sentiment analysis Deep Learning model in the next chapter. For now, let's get back to the last unstructured datatype we will look at—audio.

Listening to Sound

Audio data is all around us and it can provide valuable insights. We have the obvious audio data in the form of speech that humans use to communicate. If we can process sound and extract knowledge stored in it, that can drive some amazing outcomes. Our ears are pretty good at analyzing sound waves, recognizing different tones, and extracting information. Modern AI systems try to replicate this power of humans to process and understand sound. Amazon Alexa and Google Home are prime examples of systems that process sound waves and decode the information present in them. So, if we ask Alexa, “What's the capital of India?,” it will process this audio signal received using its built‐in microphone, extract information from this signal to understand the question as text, then send this question as text to a remote Cloud service hosted on Amazon Web Services.

This service does the NLP processing we saw in the previous section to understand what the user has asked. It searches its rich knowledge base, which contains structured data that can be easily queried. Once an answer is found, it's encoded as text and sent to your Alexa device. This text is then converted into sound and Alexa responds to you in a few seconds. This flow is shown at a high level in Figure 3.17.


Figure 3.17: High‐level flow of Alexa answering a question

Systems that process sound or audio data need to extract information from this data—particularly for outcomes like speech to text and text to speech. These are usually special types of models called sequence‐to‐sequence models that convert a sequence of data (speech or text) into another sequence. These models are also used in translation from one language to another. This is an active area of research and many companies and startups have invested top dollars in solving this problem. However, to start building models, the sound signal first needs to be converted into a vector that can be analyzed by the computer—just like we did with the text data. Let's see how to do that.

Sound waves are basically pressure waves that are generated by vibration and these pressure waves travel through a medium, which could be solid, liquid, or gas. As shown in Figure 3.18, a wave in a time domain will have different pressure values over time. However, this complex signal is composed of many smaller constituent signals of constant frequency—basically sine waves. If we analyze these pressure waves in a frequency domain, we can find the frequency constituents in the signal and these components carry information in the wave.


Figure 3.18: Frequency domain reveals the hidden information inside waves

To extract information from a sound wave, we use microphones or acoustic sensors that sample these pressure waves. The waves are sampled at very high frequencies, like 44.1 kilohertz (kHz), to capture all the frequency components in the wave. You have probably seen this sampling frequency mentioned in streaming applications like online radio stations. Converting sound waves into the frequency domain also helps us vectorize the sound sequences and use them for further analysis in ML and DL models. Let's see an example of converting sound into a vector of numbers.

We will take a sound sample from a car engine and analyze it. This sample was taken using a simple microphone on a cell phone—no complex acoustic sensor. We will first read the signal from the sound file and see how the time domain signal is noisy and does not provide any insights (see Figure 3.19). Then we will convert it to the frequency domain using an algorithm called the Fast Fourier Transform (FFT)—see Figure 3.20. We won't cover the details of the FFT algorithm, but the underlying concept is that it converts a signal from the time domain to the frequency domain. You can see the example code in Listing 3.16.
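
A minimal sketch of this analysis with SciPy and NumPy; engine_sound.wav is a placeholder filename for the recording:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

# Read the engine recording; returns the sampling rate and the sample array
sampling_freq, data = wavfile.read("engine_sound.wav")
print("Sampling frequency =", sampling_freq)
print("Shape of data array =", data.shape)

# Use one channel of the stereo signal and plot it in the time domain
signal = data[:, 0].astype(np.float32)
time = np.arange(len(signal)) / sampling_freq
plt.plot(time, signal)
plt.xlabel("Time (s)")
plt.show()

# Fast Fourier Transform: move from the time domain to the frequency domain
spectrum = np.abs(np.fft.rfft(signal))
freqs_hz = np.fft.rfftfreq(len(signal), d=1.0 / sampling_freq)

# Convert hertz to rotations per minute (1 Hz = 60 RPM) and plot the spectrum
plt.plot(freqs_hz * 60, spectrum)
plt.xlabel("Frequency (RPM)")
plt.xlim(0, 5000)       # zoom in on the low-frequency range for readability
plt.show()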


Figure 3.19: Time domain plot of sound from car engine


Figure 3.20: Frequency domain plot for car engine sound signal

Here are the results:

Sampling frequency = 44100
 
Shape of data array = (672768, 2) 

We read the sound sample (around 15 seconds) from a WAV file. WAV is a common and simple format for audio data. Modern files are often compressed into formats like MP3, but those need additional decoders to be read; WAV can be read directly by our sound analysis library, SciPy. We see that the sampling rate for the audio is 44,100 hertz, or 44.1kHz, which is pretty common. We first create a plot of the time domain signal—that is, the pressure amplitude variation over time. We see that the blue plot is pretty noisy and we don't really get much from it.

Now we use the FFT library from NumPy and build an FFT plot. When we decompose the signal into the frequency domain, we see some frequencies standing out. We show the plot by converting the frequency from hertz to rotations per minute (RPM). We see that the audio signal has a significant spike at a frequency around 2000 RPM. This corresponds to the frequency at which the engine was rotating when the signal was collected. This is just one value we decode from the audio signal. Without knowing anything about the engine, we can analyze the sound and find its rotating frequency. Similarly, we can use the frequency data encoded in the sound signal to vectorize sound values and use them for training our ML and DL models.

Summary

In this chapter, we looked at the differences between structured and unstructured data. We went into details of specific types of unstructured data and how to convert this data into vectors and arrays for processing. We saw how images are represented as pixel intensity arrays and how, using computer vision techniques, we can cleanse the data and extract information. We saw how the same methods can be extended to video, which is a timeseries of images. We saw how to handle text data using natural language processing (NLP) and extract information. Finally, we saw an example of analyzing audio data using frequency analysis. These methods can be used on their own to extract valuable information from unstructured data. They also serve as good preprocessing techniques to make data ready for processing by advanced ML and DL algorithms.
