Deep learning (also known as deep structured learning or hierarchical learning) is part of a wider family of machine learning methods, as mentioned earlier. These methods are based on learning representations (that is, the model discovers from the data the representations, patterns, or rules needed to carry out a desired task or meet an objective), as opposed to task-specific algorithms (that is, detailed, predefined rules describing how to perform a specific task).
As an alternative to the process of manually creating rules, instructions, or equations deemed essential to solving a problem and then organizing data to be run through them, the process of deep learning simply sets up fundamental parameters about the problem to be solved and then trains the computer to learn on its own by recognizing patterns within that data.
This is accomplished by using multiple layers of processing. For example, the first layer may establish the most basic features by finding simple patterns. The next layer is fed this information, extracts the next level of detail, and passes it on to another layer, and so on, until the final layer can determine an outcome or make a prediction.
This process is typically illustrated using the tree-like flow of a decision tree or decision flow diagram. This graphical representation can visually show decisions and their possible consequences, including chance event outcomes, and so on.
If we again exploit our previously mentioned physical height and body weight example, using machine learning, one would have to define features, instructions, or rules based upon whether an individual was male or female, their age and ethnicity, and perhaps their BMI or body mass index. In short, you would outline the physical attributes to be used to meet the objective (guess the correct body weight) and then let the system use the more important features to determine a subject's suspected body weight.
So, deep learning automatically discovers or finds out the features that are important to be used for making the prediction. This finding out process might be described as following the steps listed as follows (again, if we use the body height and weight use case example):
To summarize, classical machine learning requires features or rules to be extracted and established from the data, followed by preprocessing or organizing of the data, before the model can be used to make predictions (and these steps are typically 85 to 90 percent human effort). Deep learning, by contrast, performs its own feature learning and is then able to make its predictions.
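At the time of writing, deep learning is typically assumed to comprise four fundamental architectures.
These are unsupervised pre-trained networks, convolutional neural networks, recurrent neural networks, and recursive neural networks.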
These deep learning architectures have been successfully applied to various fields and have produced results comparable to (and in some cases superior to) appropriately skilled human subject matter experts (SMEs):
Today, deep learning has been established as a key instrument for practical machine learning use cases. Because computers are ever more powerful, we can use deep learning techniques to learn from ever-growing data sources (even big data) and expect to process and predict more quickly and more accurately than ever before.
Furthermore, deep learning has been described many times in the media as more than a method or practice of machine learning (as we mentioned earlier in this chapter): it is portrayed as a ground-breaking attitude to learning, using cognitive skills such as the ability to analyze, produce, solve problems, and think meta-cognitively in order to construct long-term understanding.
The use of deep learning techniques promotes understanding and application for life at a much more advanced, more effective, and quicker pace than other forms of learning. It is therefore an area with extremely high potential to impact the world as we know it.
Pretty much everyone, everywhere has heard of the term big data. Although there may still be some debate or disagreement as to what the term actually means, the bottom line is that there is a lot more data available today than there was yesterday (and there will be even more tomorrow!).
What this means is that this data is available to build more neural networks with many deeper layers, providing even more accurate (or at least perhaps more interesting) outcomes.
Also, new and exciting, is the fascinating world of the internet of things (IoT). The acronym IoT describes the way devices, vehicles, buildings, and many, many other items speak or communicate with each other. Almost all devices these days and into the future have or will have the ability to be smart devices or become connected devices, capturing information about their usage and surrounding environments and conditions, and then connecting and sharing the information and events they collect.
Machine and deep learning models and algorithms will play a significant role in the IoT analytics. Data from IoT devices is sparse and/or has a temporal element in it, and deep learning algorithms can be trained with this information to yield significant insights.
The many recent advances in distributed cloud computing and graphics processing units (GPUs) have made incredible computing power available, which in turn increases the effectiveness of deep learning applications.
Many real-life use cases exist today for applying deep learning algorithms, including (just to name a few):
Now becoming more mainstream, the growing field of predictive analysis and predictive analytics is using deep learning in the areas of finance, accounting, government, security, hardware manufacturing, search engines, e-commerce, and medicine.
One newer, very exciting, and perhaps growing ever more important use case for deep learning is with motion detection for situation evaluation, security, and defense.
Natural language processing (NLP) is an area of computer science (or more specifically, computational linguistics) that focuses on the interactions between computers and the human language.
In a natural language application, there is an attempt to process a very large amount of real-world text, formally called a natural language corpus.
Speech recognition is one of the most well-known and perhaps most developed applications of NLP. Even so, the challenges are many and typically include:
Word embedding is a very popular set of language modeling and feature learning techniques used in many natural language processing applications.
This is the practice of using words or phrases from a vocabulary and mapping them to vectors of real numbers. Simply speaking, word embedding is the process of turning text into numbers and this text-to-numeric transformation is required because most deep learning algorithms require their input to be vectors of continuous numeric values (they don't work on strings of plain text) and, well, computers just unsurprisingly process numbers better.
So, with the preceding definition in mind, word embedding is used to map words or phrases from a vocabulary to a corresponding vector of real numbers that also provides the following benefits:
"Contextual Word Similarity is nothing but identifying different types of similarities between words. It is one of the goals of NLP. Statistical approaches are used for computing the degree of similarity between words."
– Robin, December 10th, 2012
For a statistical language model to be able to predict the meaning of some text, it needs to be conscious of the contextual similarity of words.
For example, you would probably agree that you would expect to find words such as martini or cosmopolitan within sentences alongside words such as dry, shaken, stirred, and chilled, but would not expect to find those same words in such close proximity to, say, the word automobile.
The word vectors (actually they are numeric vectors) that are produced by applying the logic and reason of word embedding expose these similarities, so words that regularly occur nearby in text will also be in close proximity within a vector space.
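The idea of proximity within a vector space can be sketched with a few lines of Python. This is only an illustration: the four-element vectors below are hand-picked, hypothetical values (real embeddings have hundreds of elements and are learned from a corpus), and cosine similarity is assumed here as the closeness measure.

```python
import math

# Hypothetical, hand-picked 4-element vectors for illustration only.
# Real word vectors are learned from text and are much longer.
toy_vectors = {
    "martini":      [0.9, 0.8, 0.1, 0.0],
    "cosmopolitan": [0.8, 0.9, 0.2, 0.1],
    "automobile":   [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

drinks = cosine_similarity(toy_vectors["martini"], toy_vectors["cosmopolitan"])
mixed = cosine_similarity(toy_vectors["martini"], toy_vectors["automobile"])

print(f"martini vs cosmopolitan: {drinks:.3f}")
print(f"martini vs automobile:   {mixed:.3f}")
```

With these made-up values, the two drink words score much closer to each other than either does to automobile, which is the behavior a trained embedding is expected to show for words that share contexts.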
It is very important to understand how these word (or numeric) vectors work, so let's go over a short (and hopefully simple) explanation of this notion.
If a word vector is divided into several hundred elements, each word in a vocabulary is represented by a distribution of weights across those elements (in that vector). So instead of a one-to-one mapping between an element in the vector and a word, the representation of that word is spread across all of the elements in that vector, and each element in the vector contributes to the definition of many words. Such a vector comes to represent in some abstract way the meaning of a word.
A really easy to understand tutorial along with some nice illustrations on word or numeric vectors can be found online at: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/.
So, again, let's answer the question of what is word embedding?
"…Word Embedding is a means of creating a low-dimensional vector representation from corpus of text, which preserves the contextual similarity of words…"
An additional bonus of implementing word vectors is that they can be manipulated arithmetically (just like any other numeric vector can). Since words in a vocabulary are translated into numerical vectors, and there are semantic relationships in the position of those vectors, one can use or apply simple arithmetic on the vectors to find additional meanings and insights.
Many examples exist to illustrate this concept, including the well-known operation of moving across the embedding space from King to Queen by subtracting Man and adding Woman.
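The arithmetic behind the well-known analogy king - man + woman ≈ queen can be sketched as follows. The three-element vectors are hypothetical and hand-built so that the elements loosely stand for royalty, maleness, and femaleness; a trained model would not have such neatly interpretable elements, and the nearest-word search shown here is only one simple way to read off the result.

```python
import math

# Hypothetical vectors; elements loosely mean (royalty, male, female).
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_word(target, exclude):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Element-wise arithmetic: king - man + woman
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# The words used in the query are excluded, as is common for analogy lookups.
result = nearest_word(target, exclude={"king", "man", "woman"})
print(result)
```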
By exploiting this technique, groupings of words are not simply close variations or synonyms, but rather unique words that make up a contextual collection or just belong together.
One of my favorite machine learning use case examples is Netflix (a website that specializes in and provides streaming media and video-on-demand online).
A typical view of Netflix services (movies and videos available for streaming) provides over 40 rows of possible selections. As with any other business, a consumer loses interest after about two minutes of window shopping for a video to watch, so Netflix has very little time to catch the customer's attention.
Rather than rely on customer ratings and surveys, Netflix leverages a very broad set of data assets: what each member watches, when they watch, the place on the Netflix screen the customer found the video, recommendations the customer didn't pick, and the popularity of videos in the catalogue.
"All of this data is read by numerous algorithms powered by machine-learning techniques. Approaches use both supervised (classification, regression) and unsupervised (dimensionality reduction through clustering or compression) approaches…,"
- C. Raphel.
The report mentioned is available online here: https://www.rtinsights.com/netflix-recommendations-machine-learning-algorithms.
A video-to-video similarity algorithm, or Sims, makes recommendations in the "Because You Watched" row
- C. Raphel.
One might assume that selections are made by genre alone, but the idea of contextual similarities surely plays a role in mining selections that fit the consumer's or viewer's mindset. Words that fit together can spawn ideas for films that might be enjoyed by the viewer. Manipulating word vectors can produce an almost endless list of ideas.
As the following paragraph reports, results from the Netflix algorithms actually have a better success rate in making recommendations than what is intuitively believed:
"…as an example, the authors describe recommendations for shows similar to "House of Cards." While one might think that political or business dramas such as "The West Wing" or "Mad Men" would increase customer engagement, it turns out that popular but outside-of-genre titles such as "Parks and Recreation" and "Orange Is the New Black" fared better. The authors call this a case of "intuition failure..."
– C. Raphel
So how do we implement word or numeric vectors in a typical word embedding application?
One of the most popular algorithms available for producing word embedding models is word2vec, created by Google in 2013. Word2vec was originally written in C and has also been implemented in Java/Scala and Python. It accepts a text corpus as input (informally, a sequence of sentences, each a list of words) and produces word vectors as output.
Another note about the input to word2vec: it only requires that your input data be provided as sequential sentences, so you do not have to worry about storing it all in memory at one time to process it. This means that you can:
This means that large data sources, such as those that qualify as big data (discussed in Chapter 11, Topic Modeling of this book), which may consist of data spread over several files in multiple locations, can be processed file by file, line by line, with one sentence per line, instead of loading everything into an in-memory list. This kind of architecture also allows preprocessing such as converting to Unicode, lowercasing, removing numbers, extracting named entities, and so on, to occur without word2vec even being aware of it.
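A minimal sketch of this streaming style of input is shown below. The class name SentenceStream and the sample file names are hypothetical, and the preprocessing is reduced to lowercasing and whitespace splitting. It assumes the corpus is stored as plain-text files in one directory with one sentence per line.

```python
import os
import tempfile

class SentenceStream:
    """Yield one tokenized sentence at a time from every file in a directory.

    Only the current line is held in memory, never the whole corpus.
    """

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    # Preprocessing happens here, before any model sees the text.
                    tokens = line.lower().split()
                    if tokens:
                        yield tokens

# Build a tiny two-file corpus in a temporary directory (illustrative data only).
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "part1.txt"), "w", encoding="utf-8") as f:
    f.write("The martini was shaken\nNot stirred\n")
with open(os.path.join(corpus_dir, "part2.txt"), "w", encoding="utf-8") as f:
    f.write("The automobile was parked\n")

sentences = SentenceStream(corpus_dir)
for sentence in sentences:
    print(sentence)
```

A class with an __iter__ method is used rather than a one-shot generator so that the corpus can be read more than once, which matters if a training procedure makes several passes over the data.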
Word2vec is also set up to accept some parameters such as min_count.
This parameter is very effective for setting the lower limit for words to appear in the data. For example, any words that appear only a few times in a million-word data source are probably typos and garbage and should be ignored in word vector creation. This parameter allows you to automatically drop uninteresting or unimportant words. The default is set to 5.
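The effect of a minimum-count threshold can be sketched in plain Python. This is not the word2vec source code: build_vocabulary is a hypothetical helper that simply counts words and keeps those that appear at least min_count times, and the sample sentences (including the deliberate typo) are made up.

```python
from collections import Counter

def build_vocabulary(sentences, min_count=5):
    """Count every word and keep only those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Illustrative data: "martnii" is a deliberate typo that occurs only once.
sample_sentences = [
    ["dry", "martini"],
    ["dry", "martini", "shaken"],
    ["dry", "martnii"],
]

vocab = build_vocabulary(sample_sentences, min_count=2)
print(vocab)
```

On a corpus this small, the default threshold of 5 would discard every word, which is why the example lowers it to 2.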
Word2vec first constructs a vocabulary from the text data provided as input and then learns vector representations of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.
The following is a partial word vector image created by word2vec:
Even though word2vec is a powerful tool, even Google declares that it is not user-friendly, and various open source packages have been developed over time to add a user-friendly interface to the algorithm. You can go online and access word2vec at https://code.google.com/p/word2vec.
As with many implementations in statistics, there is some disagreement as to exactly what word2vec is or how the logic has ultimately been implemented. Is it an example of the classical machine learning model? An example of implemented deep learning? Or, can we say that it is some sort of hybrid model?
A bit of online research reveals numerous opinions, for example, this one from A. Thakker, June 18, 2017:
"…Word2Vec is considered (by some within the industry) as a starter of "Deep Learning in NLP". However, Word2Vec is not deep. But the output of Word2Vec is what Deep Learning models can easily understand. Word2vec is basically a computationally efficient predictive model for learning word embeddings from raw text. The purpose of Word2Vec is to group words that are semantically similar in vector space. It computes similarities mathematically. Given a huge amount of data…."
Let's go over the architectures of deep learning, starting in the next section.
We indicated earlier in this chapter, under the Deep Learning section, that there are currently (or at least at the time of writing) four basic deep learning architectures. We'll briefly look at three (unsupervised pre-trained, convolutional neural, and recursive neural networks) now and then take a deeper dive into one of the most stimulating and effective (at least for appropriate use cases): recurrent neural networks.
Artificial neural networks (ANNs) are computing systems, algorithms, or models that are inspired by and based upon how biological neural networks in our human brains work.
These systems learn to perform work and solve problems by considering patterns found in data (referred to as gaining experience), generally without having to program specific logic prompts.
ANNs are a big part of deep learning.
Most artificial neural networks bear only a slight resemblance to their more complex biological counterparts, but are very effective at intended tasks such as classification or segmentation. For more information, refer to: https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks.
In Chapter 5, Neural Networks, we covered neural networks in some detail, specifically ANNs. In the next section of this chapter, we pick up that thread again and move on to the topic of recurrent neural networks (RNNs).
Generally speaking, it is accepted within the industry that there really are just two chief types of neural networks.
These are feed forward neural networks and recurrent neural networks.
The feed forward neural network was the first and simplest type that was developed.
In a feed forward network, activation is pushed through the network from the input layers to the output layers. In this network the information moves only from the input layer straight through any hidden layers to the output layer without cycles or looping.
In other words, feed forward neural networks are a one-way street.
Almost all types of neural networks are organized in layers. Layers are made up of interconnected nodes, each of which contains what is known as an activation function. Patterns are presented to the network by the input layer, which then communicates to one or more hidden layers. Hidden layers are where the real work is done using a system of weighted connections.
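The one-way flow through layers of weighted connections and activation functions can be sketched as follows. The weights and biases are hypothetical, hand-picked numbers (a real network would learn them during training), and the sigmoid is assumed as the activation function purely for illustration.

```python
import math

def sigmoid(x):
    """A common activation function that squashes any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One layer: each node takes a weighted sum of its inputs, adds a bias,
    and applies the activation function."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical weights: 2 inputs -> 3 hidden nodes -> 1 output node.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.5, 0.2]]
output_biases = [0.05]

inputs = [1.0, 0.5]
hidden = layer_forward(inputs, hidden_weights, hidden_biases)  # input -> hidden
output = layer_forward(hidden, output_weights, output_biases)  # hidden -> output

print("hidden layer:", hidden)
print("output layer:", output)
```

Information moves strictly from the input to the hidden layer and then to the output. Nothing computed for one input is carried over to the next, which is the property that the recurrent networks discussed next relax.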
Let's continue on with our dialogue by stating that a recurrent neural network (or RNN) is an interesting and unique class of ANN.
The objective of using RNN logic is to make use of sequential or chronological data. This is much different to the logic used by a traditional neural network, where it is assumed that all inputs and outputs are independent of each other, or have no relevance to each other. This kind of presumption (or limitation) works or is at least sufficient for some applications, but for many tasks this is not an acceptable premise. For example, if you are trying to predict the next word someone is typing in a search engine, you need to know which words were typed before it.
RNNs are called recurrent because they perform the same task for every element in a sequence, with the output being dependent on all of the previous computations.
Another way to think about RNNs is that they can remember information about what has been calculated thus far within a sequence. This allows them to exhibit dynamic temporal (or related) behaviors.
Remember, RNNs use a special layer that is called a state layer, which is updated not only with the external input information of the network, but also with activation information from the previous forward propagation.
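The state layer update can be sketched in a few lines. This is a simplified illustration under stated assumptions: the state is updated with tanh applied to a weighted sum of the current input and the previous state (a common textbook formulation), the weights are hypothetical, hand-picked values, and no training or output layer is shown.

```python
import math

def rnn_step(x, prev_state, w_in, w_state, bias):
    """Compute the new state from the current input and the previous state."""
    new_state = []
    for i in range(len(prev_state)):
        total = bias[i]
        total += sum(w * xi for w, xi in zip(w_in[i], x))           # external input
        total += sum(w * s for w, s in zip(w_state[i], prev_state))  # previous state
        new_state.append(math.tanh(total))
    return new_state

def run_sequence(sequence, w_in, w_state, bias):
    """Apply the same step to every element, carrying the state forward."""
    state = [0.0] * len(bias)
    for x in sequence:
        state = rnn_step(x, state, w_in, w_state, bias)
    return state

# Hypothetical weights: 1 input feature, 2 state units.
w_in = [[0.8], [-0.4]]
w_state = [[0.5, 0.1], [-0.2, 0.6]]
bias = [0.0, 0.0]

forward = run_sequence([[1.0], [0.0]], w_in, w_state, bias)
reverse = run_sequence([[0.0], [1.0]], w_in, w_state, bias)

print("state after 1 then 0:", forward)
print("state after 0 then 1:", reverse)
```

The two sequences contain the same inputs in a different order, yet they end in different states. This is the sense in which the network remembers what it has seen so far.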
There is an interesting blog that provides valuable insight into how RNNs work. The following figure is based upon that information. The reader can review the information at: https://shapeofdata.wordpress.com/2015/10/20/recurrent-neural-networks.
To show how valuable this feature is, consider that the word aliens might have a different meaning if it was part of the sequence ancient aliens.
As we stated earlier in this section, unlike a feed forward neural network, where connections between layers do not form a loop, RNNs can use their internal memory to process arbitrary sequences of inputs. This makes them a great choice for applications such as handwriting recognition or speech recognition.