What is deep learning?

Deep learning (also known by some within the industry as deep structured learning or hierarchical learning, among other titles) is part of a wider family, or branch, of machine learning methods, as mentioned earlier. These methods are based on learning representations (that is, the model discovers from the data the representations, patterns, or rules needed to carry out a desired task or meet an objective), as opposed to task-specific algorithms (that is, detailed, predefined rules describing how to perform a specific task).

Note

Representations, or feature representations, are critical to all types of learning. Feature representations can be predefined manually or learned automatically by the model while it analyzes the data.

An alternative to manual instruction

As an alternative to manually creating the rules, instructions, or equations deemed essential to solving a problem, and then organizing data to be run through them, deep learning simply sets up fundamental parameters about the problem to be solved and then trains the computer to learn on its own by recognizing patterns within the data.

This is accomplished by using multiple layers of processing. For example, the first layer may establish the most basic features by finding a simple or basic pattern. The next layer is then fed this identified information and works to extract the next level of information, which it feeds to another layer, and so on, until the final layer can determine an outcome or make a prediction.
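
As a minimal sketch of this layered idea, here is a stacked model written with the Keras API (the layer sizes, activations, and input width are arbitrary, illustrative assumptions, not a prescribed design):

```python
# A minimal sketch of stacked processing layers using Keras.
# Layer sizes and activations are illustrative assumptions only.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),              # raw input features
    layers.Dense(32, activation="relu"),  # first layer: simple, basic patterns
    layers.Dense(16, activation="relu"),  # next layer: higher-level combinations
    layers.Dense(1),                      # final layer: the outcome or prediction
])
model.compile(optimizer="adam", loss="mse")
```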

This process is typically illustrated using the tree-like flow of a decision tree or decision flow diagram, which visually shows decisions and their possible consequences, including chance event outcomes, and so on.

If we revisit our previously mentioned physical height and body weight example: using classical machine learning, one would have to define features, instructions, or rules based upon whether an individual was male or female, their age and ethnicity, and perhaps their BMI (body mass index). In short, you would outline the physical attributes to be used to meet the objective (predicting the correct body weight) and then let the system use the most important features to determine a subject's likely body weight.

So, deep learning automatically discovers the features that are important for making the prediction. This discovery process might be described by the following steps (again, using the body height and weight use case example):

  • First, the process attempts to identify which physical attributes are most relevant to determining body weight
  • Next, it builds a hierarchy, rather like the decision flowchart we mentioned earlier, which it can use to determine a subject's body weight (for example, whether a subject is male or female, or is within a certain height range, and so on)
  • After consecutive hierarchical identification (or classification) of these combinations, it then decides which of these features are responsible for predicting the answer (that is, the subject's body weight)

To summarize: classical machine learning requires the extraction and establishment of rules or features from data, followed by the preprocessing or organizing of that data (steps that are typically 85 to 90 percent human effort), before the model can be used to make predictions. Deep learning, by contrast, performs its own feature learning and is then able to make its predictions.

At the time of writing, deep learning implementations are typically assumed to follow one of four fundamental architectures.

These are:

  • Unsupervised pre-trained networks
  • Convolutional neural networks
  • Recurrent neural networks
  • Recursive neural networks

These deep learning architectures have been successfully applied to various fields, producing results comparable to (and in some cases superior to) those of appropriately skilled human subject matter experts (SMEs). These fields include:

  • Computer vision
  • Speech recognition
  • Natural language processing
  • Audio recognition
  • Social network filtering
  • Machine translation
  • Bioinformatics

Growing importance

Today, deep learning has been established as a key instrument for practical machine learning use cases. As computers become ever more powerful, deep learning techniques can learn from ever-growing data sources (even big data), and we can expect to process data and make predictions more quickly, and with higher rates of accuracy, than ever before.

Note

Big data is a term used for data that is so large or complex that traditional algorithms and system software are insufficient to deal with it.

Furthermore, the concept of deep learning has been described many times in the media as more than a method or practice of machine learning (as we mentioned earlier in this chapter), and more of a ground-breaking approach to learning, one that uses cognitive skills such as the ability to analyze, create, solve problems, and think meta-cognitively in order to construct long-term understanding.

Note

Cognitive skills usually refers to the capacity to develop meaning and/or knowledge from reviewing data (also called experience or information).

The use of deep learning techniques promotes understanding and application at a much more advanced, more effective, and quicker rate than other forms of learning; it is therefore an area with extremely high potential to impact the world as we know it.

Deeper data?

Pretty much everyone, everywhere, has heard the term big data. Although there may still be some debate or disagreement as to what the term actually means, the bottom line is that there is a lot more data available today than there was yesterday (and there will be even more tomorrow!).

What this means is that data is available to build neural networks with many more, and deeper, layers, providing even more accurate (or at least perhaps more interesting) outcomes.

Deep learning for IoT

Also new and exciting is the fascinating world of the Internet of Things (IoT). The acronym IoT describes the way devices, vehicles, buildings, and many other items speak to, or communicate with, each other. Almost all devices today (and certainly in the future) have, or will have, the ability to become smart, connected devices, capturing information about their usage, surrounding environments, and conditions, and then connecting and sharing the information and events they collect.

Machine and deep learning models and algorithms will play a significant role in IoT analytics. Data from IoT devices is often sparse and/or has a temporal element to it, and deep learning algorithms can be trained on this information to yield significant insights.

Recent advances in the areas of distributed cloud computing and graphics processing units (GPUs) have made incredible computing power available, which in turn advances the effectiveness of deep learning applications.

Use cases

Many real-life use cases exist today for applying deep learning algorithms, including (just to name a few):

  • Fraud detection
  • Image recognition
  • Voice recognition
  • Natural language processing

Now becoming more mainstream, the growing field of predictive analytics is using deep learning in the areas of finance, accounting, government, security, hardware manufacturing, search engines, e-commerce, and medicine.

One newer, very exciting, and perhaps increasingly important use case for deep learning is motion detection for situation evaluation, security, and defense.

Word embedding

Natural language processing (NLP) is an area of computer science (or, more specifically, computational linguistics) that focuses on the interactions between computers and human language.

In a natural language application, there is an attempt to process extremely large amounts of real-world text, formally called a natural language corpora data source.

Note

Corpora (the plural of corpus) is roughly equivalent to the word samples. In this context, a natural language corpora data source would be a database or file filled with actual words and phrases of text, in an expected language.

Speech recognition is one of the most well-known and perhaps most developed applications of NLP. Even so, challenges remain, and they typically include:

  • Natural language understanding
  • Natural language generation
  • Connecting language and machine perception
  • Dialog systems
  • Some combination of all of these

Word embedding is a very popular language modeling and feature learning technique used in many natural language processing applications.

This is the practice of taking words or phrases from a vocabulary and mapping them to vectors of real numbers. Simply put, word embedding is the process of turning text into numbers. This text-to-numeric transformation is required because most deep learning algorithms require their input to be vectors of continuous numeric values (they don't work on strings of plain text) and, well, computers simply process numbers better.
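
As a minimal sketch of this text-to-numbers idea (the toy vocabulary, the dimension count, and the random values below are illustrative assumptions, not a trained model):

```python
import numpy as np

# A toy vocabulary mapped to row indices in an embedding matrix.
vocab = {"martini": 0, "shaken": 1, "stirred": 2, "automobile": 3}

embedding_dim = 5  # real models use hundreds of dimensions
# In practice these weights are learned; here they are random placeholders.
embeddings = np.random.rand(len(vocab), embedding_dim)

def embed(word):
    """Map a word to its vector of continuous real numbers."""
    return embeddings[vocab[word]]

print(embed("martini"))  # a 5-element numeric vector, not a string
```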

So, with the preceding definition in mind, word embedding is used to map words or phrases from a vocabulary to corresponding vectors of real numbers, which also provides the following benefits:

  • Dimensionality reduction: a compact numeric vector is a far more efficient representation than raw text
  • Contextual similarity: a numeric vector is a more expressive representation, capable of capturing the contexts in which a word appears

"Contextual Word Similarity is nothing but identifying different types of similarities between words. It is one of the goals of NLP. Statistical approaches are used for computing the degree of similarity between words."

– Robin, December 10th, 2012
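
One common statistical approach to the degree-of-similarity idea mentioned in the quote is cosine similarity between word vectors. Here is a sketch (the toy vectors stand in for trained embeddings and are invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Degree of similarity between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for trained word embeddings.
martini = np.array([0.1, 0.9, 0.8])
stirred = np.array([0.2, 0.8, 0.9])
automobile = np.array([0.9, 0.1, 0.0])

print(cosine_similarity(martini, stirred))     # high: related context
print(cosine_similarity(martini, automobile))  # low: unrelated context
```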

Word prediction

For a statistical language model to be able to predict the meaning of some text, it needs to be aware of the contextual similarity of words.

For example, you would probably agree that you would expect to find words such as martini or cosmopolitan within sentences containing words like dry, shaken, stirred, and chilled, but would not expect to find those same concepts in close proximity to, say, the word automobile.

Note

Another form of word prediction is autocomplete, or word completion. This is when an algorithm predicts the rest of the word a user is typing.
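
A deliberately simplistic sketch of word completion follows (real autocomplete systems rank candidates statistically; this plain prefix filter is an illustrative assumption):

```python
vocabulary = ["martini", "market", "marble", "automobile"]

def complete(prefix):
    """Return vocabulary words that could complete the typed prefix."""
    return [word for word in vocabulary if word.startswith(prefix)]

print(complete("mar"))  # ['martini', 'market', 'marble']
```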

Word vectors

The word vectors (actually, numeric vectors) produced by applying the logic of word embedding expose these similarities, so words that regularly occur near each other in text will also be in close proximity within the vector space.

It is very important to understand how these word (or numeric) vectors work, so let's go over a short (and hopefully simple) explanation of this notion.

Suppose a word vector is divided into several hundred elements; each word in a vocabulary is then represented by a distribution of weights across those elements. So, instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words. Such a vector comes to represent, in some abstract way, the meaning of a word.
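
Here is a tiny, hand-made illustration of this distributed idea (the element labels and values are invented for clarity; real vectors have hundreds of learned, unlabeled elements):

```python
# Each element contributes to the definition of many words at once;
# no single element "is" a word. Values are invented for illustration.
word_vectors = {
    #           element0, element1, element2 (loosely: royalty, gender, beverage)
    "king":    [0.95,  0.90, 0.01],
    "queen":   [0.97, -0.88, 0.02],
    "martini": [0.02,  0.05, 0.93],
}
```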

Note

A really easy to understand tutorial along with some nice illustrations on word or numeric vectors can be found online at: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/.

So, again, let's answer the question: what is word embedding?

"…Word Embedding is a means of creating a low-dimensional vector representation from corpus of text, which preserves the contextual similarity of words…"

Numerical representations of contextual similarities

An additional bonus of implementing word vectors is that they can be manipulated arithmetically (just like any other numeric vectors). Since words in a vocabulary are translated into numeric vectors, and there are semantic relationships in the positions of those vectors, one can apply simple arithmetic to the vectors to find additional meanings and insights.

Many examples exist to illustrate this concept, including the classic operation of moving across the embedding space from King to Queen by subtracting Man and adding Woman.
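
As a hedged sketch of this kind of vector arithmetic, using the gensim library's downloadable pre-trained vectors (the glove-wiki-gigaword-50 model is just one example set; any trained word vectors would do):

```python
# Requires gensim; downloads a small set of pre-trained vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
result = vectors.most_similar(positive=["king", "woman"],
                              negative=["man"], topn=1)
print(result)  # typically [('queen', ...)]
```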

Note

The arithmetic manipulation performed on word or numeric vectors is known within the field as vector math.

By exploiting this technique, groupings of words become not simply close variations or synonyms, but rather distinct words that make up a contextual collection, words that just belong together.

Netflix learns

One of my favorite machine learning use case examples is Netflix (a website that specializes in, and provides, streaming media and video on demand online).

A typical view of the Netflix service (movies and videos available for streaming) presents over 40 rows of possible selections. As with any other business, a consumer loses interest after about two minutes of window shopping for a video to watch, so Netflix has very little time to catch the customer's attention.

Rather than rely on customer ratings and surveys, Netflix leverages a very broad set of data assets: what each member watches, when they watch it, where on the Netflix screen the customer found the video, recommendations the customer didn't pick, and the popularity of videos in the catalogue.

"All of this data is read by numerous algorithms powered by machine-learning techniques. Approaches use both supervised (classification, regression) and unsupervised (dimensionality reduction through clustering or compression) approaches…,"

- C. Raphel.

Note

The report mentioned is available online here: https://www.rtinsights.com/netflix-recommendations-machine-learning-algorithms.

"A video-to-video similarity algorithm, or Sims, makes recommendations in the 'Because You Watched' row."

- C. Raphel.

One might assume that selections are made by genre alone, but the idea of contextual similarity surely plays a role in mining selections that fit the consumer's or viewer's mindset. Words that fit together can spawn ideas for films that the viewer might enjoy. Manipulating word vectors can produce an almost endless list of ideas.

As the following paragraph reports, results from the Netflix algorithms actually have a better success rate in making recommendations than what is intuitively believed:

"…as an example, the authors describe recommendations for shows similar to "House of Cards." While one might think that political or business dramas such as "The West Wing" or "Mad Men" would increase customer engagement, it turns out that popular but outside-of-genre titles such as "Parks and Recreation" and "Orange Is the New Black" fared better. The authors call this a case of "intuition failure..."

– C. Raphel

Implementations

So how do we implement word or numeric vectors in a typical word embedding application?

One of the most popular algorithms available for producing word embedding models is word2vec, created by Google in 2013. Word2vec, written in C++ but also implemented in Java/Scala and Python, accepts a text corpus as input (informally speaking, it expects a sequence of sentences, with each sentence being a list of words) and produces word vectors as output.

Another note about the input to word2vec: it only requires that your input data be provided as sequential sentences; you do not have to store it all in memory at one time in order to process it. This means that you can:

  • Provide one sentence
  • Process it
  • Load another sentence
  • Process it
  • Repeat…

This means that large data sources, such as those that qualify as big data (discussed in Chapter 11, Topic Modeling, of this book), which may consist of data spread over several files in multiple locations, can be processed one sentence at a time (instead of loading everything into an in-memory list, input file by file, line by line). This kind of architecture also allows preprocessing, such as converting to Unicode, lowercasing, removing numbers, extracting named entities, and so on, to occur without word2vec even being aware of it.
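
As a minimal sketch of this streaming pattern, assuming the popular gensim library's Word2Vec implementation and a hypothetical directory of plain-text files with one sentence per line:

```python
import os

class SentenceStream:
    """Yields one tokenized sentence at a time; nothing is held in memory."""
    def __init__(self, directory):
        self.directory = directory

    def __iter__(self):
        for filename in os.listdir(self.directory):
            with open(os.path.join(self.directory, filename)) as f:
                for line in f:
                    # Preprocessing (lowercasing, and so on) happens here,
                    # without word2vec being aware of it.
                    yield line.lower().split()

sentences = SentenceStream("/path/to/corpus")  # hypothetical location
```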

Note

Word2vec isn't a good choice for very small data sets. For real results, as reported through trials, you should have a minimum of a million words; smaller data files or sources are not enough for meaningful word similarity or proper word vector creation.

Word2vec also accepts a number of parameters, such as min_count.

This parameter sets the lower limit on how many times a word must appear in the data to be considered. For example, any word that appears only a few times in a million-word data source is probably a typo or garbage and should be ignored during word vector creation. This parameter lets you automatically drop such uninteresting or unimportant words. The default is set to 5.
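
Continuing the streaming sketch above, a hedged example of training with gensim (parameter names follow gensim 4.x; the vector_size, workers, and queried word are illustrative choices):

```python
from gensim.models import Word2Vec

# min_count=5 (the default) drops words seen fewer than five times.
model = Word2Vec(sentences, vector_size=100, min_count=5, workers=4)

# The learned vectors can then serve as features in NLP applications.
print(model.wv.most_similar("martini", topn=3))  # assumes the word is in the corpus
```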

Word2vec first constructs a vocabulary from the text data provided as input and then learns vector representations of the words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

The following is a partial word vector image created by word2vec:

[Figure: a partial word vector output produced by word2vec]

Note

Even though word2vec is a powerful tool, even Google declares that it is not user-friendly, and various open source packages have been developed over time to add a friendlier interface to the algorithm. You can go online and access word2vec at https://code.google.com/p/word2vec.

As with many implementations in statistics, there is some disagreement as to exactly what word2vec is, or how its logic has ultimately been implemented. Is it an example of a classical machine learning model? An example of deep learning? Or can we say that it is some sort of hybrid model?

A bit of online research reveals numerous opinions; for example, A. Thakker, June 18, 2017:

"…Word2Vec is considered (by some within the industry) as a starter of "Deep Learning in NLP". However, Word2Vec is not deep. But the output of Word2Vec is what Deep Learning models can easily understand. Word2vec is basically a computationally efficient predictive model for learning word embeddings from raw text. The purpose of Word2Vec is to group words that are semantically similar in vector space. It computes similarities mathematically. Given a huge amount of data…."

Let's go over the architectures of deep learning, starting in the next section.

Deep learning architectures

We indicated earlier in this chapter, under the Deep learning section, that there are currently (or at least at the time of writing) four basic deep learning architectures. We'll briefly look at three of them (unsupervised pre-trained networks, convolutional neural networks, and recursive neural networks) now, and then do a deeper dive into one of the most stimulating and effective (at least for appropriate use cases): recurrent neural networks:

  1. Unsupervised pre-trained neural networks: Think of stacking the deck by making weighting adjustments before the model training actually begins.
  2. Convolutional neural networks: A feed-forward model that uses a variation of multilayer perceptrons (individual learning units), designed to require minimal preprocessing; used for visual imagery processing and natural language processing.
  3. Recursive neural networks: These are created by applying the same set of weights recursively over a structure, in an attempt to produce a structured prediction (that is, the ability to predict structured objects rather than discrete or real values).

Artificial neural networks

Artificial neural networks (ANNs) are computing systems, algorithms, or models that are inspired by, and based upon, how the biological neural networks in our human brains work.

These systems learn to perform work and solve problems by considering patterns found in data (referred to as gaining experience), generally without having to be programmed with task-specific logic.

ANNs are a big part of deep learning.

Note

Most artificial neural networks bear only a slight resemblance to their more complex biological counterparts, but are very effective at intended tasks such as classification or segmentation. For more information, refer to: https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks.

In Chapter 5, Neural Networks, we covered neural networks, specifically ANNs, in some detail. In the next section of this chapter, we pick up that thread again and move on to the topic of recurrent neural networks (RNNs).

Recurrent neural networks

Generally speaking, it is accepted within the industry that there are really just two chief types of neural networks.

These are:

  • Feed forward
  • Recurrent

The feed forward neural network was the first and simplest type to be developed.

In a feed forward network, activation is pushed through the network from the input layer to the output layer. Information moves only from the input layer, straight through any hidden layers, to the output layer, without cycles or loops.

In other words, feed forward neural networks are a one-way street.

Note

Almost all types of neural networks are organized in layers. Layers are made up of interconnected nodes, which contain what is known as an activation function. Patterns are presented to the network by the input layer, which then communicates with one or more hidden layers. The hidden layers are where the real work is done, using a system of weighted connections.
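
As a minimal numeric sketch of this one-way flow through weighted connections (the layer sizes, random weights, and ReLU activation are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # a common activation function

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 4))  # input (3) -> hidden (4) weighted connections
W_output = rng.normal(size=(4, 1))  # hidden (4) -> output (1)

x = np.array([0.5, -1.2, 3.0])   # pattern presented to the input layer
hidden = relu(x @ W_hidden)      # activation flows forward only, no loops
output = hidden @ W_output
print(output)
```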

Let's continue our dialogue by stating that a recurrent neural network (RNN) is an interesting and unique class of ANN.

The objective of using RNN logic is to make use of sequential or chronological data. This is very different from the logic of a traditional neural network, where it is assumed that all inputs and outputs are independent of each other, having no relevance to one another. This presumption (or limitation) is sufficient for some applications, but for many tasks it is not an acceptable premise. For example, if you are trying to predict the next word someone is typing into a search engine, you need to know which words were typed before it.

RNNs are called recurrent because they perform the same task for every element in a sequence, with the output being dependent on all of the previous computations.

Another way to think about RNNs is that they can remember information about what has been calculated thus far within a sequence. This allows them to exhibit dynamic temporal behavior.

Remember, RNNs use a special layer called a state layer, which is updated not only with the network's external input information, but also with activation information from the previous forward propagation.
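
Here is a minimal sketch of that state update (the sizes, random weights, and tanh activation are illustrative assumptions; real RNN layers learn these weights during training):

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))  # external input -> state weights
W_h = rng.normal(size=(4, 4))  # previous state -> state weights (the recurrence)

state = np.zeros(4)            # the state layer starts empty
sequence = [rng.normal(size=3) for _ in range(5)]  # five sequential inputs

for x_t in sequence:
    # The new state mixes the external input with the previous activation.
    state = np.tanh(W_x @ x_t + W_h @ state)

print(state)  # carries information about the whole sequence so far
```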

There is an interesting blog that provides valuable insight into how RNNs work. The following figure is based upon that information. The reader can review the information at: https://shapeofdata.wordpress.com/2015/10/20/recurrent-neural-networks.

[Figure: recurrent neural network structure, based on the blog referenced above]

To show how valuable this feature is, consider an example: the word aliens might have a different meaning if it is part of the sequence ancient aliens.

Note

In theory, RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps.

As we stated earlier in this section, unlike feed forward neural networks, where connections between logic layers do not form a loop, RNNs can use their internal memory to process arbitrary sequences of inputs. This makes them a great choice for applications such as handwriting recognition or speech recognition.
