Chapter 11. Future Trends: Toward Robust AI

This book has been about techniques for fooling AI that would not fool a human being. We should not forget, however, that we are susceptible to optical and auditory illusions. Interestingly, research has shown that some adversarial examples can also fool time-limited humans.1 Conversely, some optical illusions can also trick neural networks.2

These cases suggest that there may be some similarities between biological and artificial perception, but adversarial inputs exploit the fact that deep learning models process data in fundamentally different ways from their biological counterparts. While deep learning may create models that match or exceed human capability in processing sensory input, these models are likely to be a long way from how humans actually learn and perceive visual and auditory information.

There are fascinating areas of investigation opening up in the field of deep learning that are likely to bring about greater convergence between artificial and biological perception. Such research may result in AI that has greater resilience to adversarial examples. Here is a selection.

Increasing Robustness Through Outline Recognition

Neuroscientists and psychologists have known for many years that our understanding of the world around us is built through movement and physical exploration. A baby views items from different angles by moving, or because the items themselves move. Visual perception depends heavily on movement and viewing angles, allowing us to learn about objects and their boundaries. If DNNs placed greater emphasis on object outlines, perturbation attacks would be far less feasible. A similar principle applies to audio, where certain broad temporal patterns and relative pitches determine understanding. The salient features that we extract from sensory data to understand the world are clearly very different from those extracted by a DNN.

Researchers Geirhos et al.3 argue that CNNs trained on ImageNet data (such as ResNet50) place greater emphasis on textures in images than on object outlines. In contrast, humans place greater emphasis on object shape in making a decision. The researchers tested this with images generated from ImageNet that had conflicting shape and texture information. For example, Figure 11-1 from the paper illustrates how ResNet50 makes a classification biased toward texture when presented with an image where the texture and shape conflict.

Figure 11-1. ResNet50 classification of elephant skin (a), a cat (b), and a cat with elephant skin texture cues (c) (from Geirhos et al. 2019)
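
This texture bias is straightforward to probe. The following is a minimal sketch, in Python with Keras, of a shape-bias measurement in the spirit of the Geirhos et al. experiment: given cue-conflict images annotated with both a shape class and a texture class, it counts how often a pretrained ResNet50's decision follows the shape cue. The input arrays and labels here are assumptions for illustration; the researchers' actual protocol differs in its details.

    import numpy as np
    from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

    model = ResNet50(weights="imagenet")

    def shape_bias(images, shape_labels, texture_labels):
        """Fraction of cue-conflict decisions that follow shape, not texture.

        images:         array of shape (N, 224, 224, 3), RGB values 0-255
        shape_labels:   ImageNet class index of each image's shape cue
        texture_labels: ImageNet class index of each image's texture cue
        """
        probs = model.predict(preprocess_input(images.astype("float32")))
        top1 = probs.argmax(axis=1)
        follows_shape = top1 == np.asarray(shape_labels)
        follows_texture = top1 == np.asarray(texture_labels)
        decided = follows_shape | follows_texture  # ignore off-target predictions
        return follows_shape[decided].mean() if decided.any() else 0.0

A texture-biased model such as the standard ResNet50 scores low on this kind of measure; humans, per the paper, score far higher.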

By generating a new training dataset of “stylized” images, with training labels that corresponded to the shape of the object rather than its texture, the researchers were able to retrain CNNs to reduce their bias toward texture and make them more reliant on outlines in the image. This approach had the added benefit that the retrained classifiers were more robust to distortions in an image, because such distortions tend to affect texture, whereas the object outline stays relatively stable.

From the perspective of creating neural networks robust to adversarial examples, this research is promising. If the CNN places greater emphasis on outlines to make its decision, the decision is less likely to be affected by adversarial perturbation that spans the image.

Multisensory Input

There are other fascinating new ideas originating from theoretical neuroscience on how the brain might work. These ideas may provide new approaches to building AI that more accurately mimics human perception.

For example, the human brain successfully combines multiple sensory inputs (such as sight, hearing, touch, temperature sensitivity) in order to establish an understanding of the world. Multiple sensory inputs provide constant validation of data around us (though sometimes we can get it wrong, as described in the following note).

The McGurk Effect

The McGurk effect is an auditory illusion demonstrating that the brain’s perception of speech is based on both visual and auditory information.

When presented with an audio stimulus (such as someone making the sound “bah”) paired with conflicting visual information (such as a video of someone mouthing “fah”), our brains override the audio with the visual stimulus and we “hear” the sound “fah.”

Watch the BBC Two video Try The McGurk Effect!—Horizon: Is Seeing Believing? and try it for yourself.

As AI systems evolve, they will increasingly fuse data from disparate sources—both sensor data and nonsensor data—to verify their understanding of the world. If we can understand nature’s approach to this problem, we may be able to create AI systems that work to higher levels of accuracy when fusing unstructured data such as audio and images.
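
As a loose analogy for such fusion, here is a minimal sketch of “late fusion” across two senses: each modality produces a probability distribution over the same classes, and a confidence-weighted combination makes the final call. The weighting scheme (each modality's peak probability standing in for its confidence) is an illustrative assumption, not an established method, but it lets a sure sense override an unsure one, roughly as happens in the McGurk effect.

    import numpy as np

    def fuse_modalities(vision_probs, audio_probs):
        """Pick the class with the highest confidence-weighted evidence."""
        # Assumption: a modality's peak probability is a crude confidence score.
        w_vision, w_audio = vision_probs.max(), audio_probs.max()
        combined = w_vision * vision_probs + w_audio * audio_probs
        return int(combined.argmax())

    # Vision is confident about class 2; audio weakly prefers class 0.
    vision = np.array([0.05, 0.05, 0.90])
    audio = np.array([0.40, 0.35, 0.25])
    print(fuse_modalities(vision, audio))  # 2 -- the confident sense wins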

In their research, Hawkins et al.4 present the “Thousand Brains Model of Intelligence.” They theorize that the neocortex comprises many mini-brains called “cortical columns,” all working in parallel to process sensory information. Each of these cortical columns has a complete understanding of objects in the world. Each ingests a specific sensory input (say, from a small area of the visual field or from the touch of a finger) and uses this data to establish which object the sensory input comes from (for example, a cup or a hat), along with the location of the input relative to the rest of the object.

The researchers give the example of cortical columns simultaneously receiving sensory input from a mug that someone is looking at and touching with one finger. One column might receive “touch” data originating from the finger that is touching the mug, while other columns receive data from the visual cortex representing different visual parts of the mug. Based on the learned model of the world that each cortical column holds, each column makes an educated guess as to which object it is sensing and where on that object the sensed data lies. The information from all the columns is then collated to determine what the object is, with some kind of “voting mechanism” resolving disputes.
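
That collation step can be pictured as a simple majority vote. The toy sketch below pools each column's guess into a single decision; the labels and the flat vote are hypothetical stand-ins, not the theory's actual mechanism.

    from collections import Counter

    def pool_column_guesses(guesses):
        """Majority vote over per-column object guesses."""
        label, votes = Counter(guesses).most_common(1)[0]
        return label, votes / len(guesses)

    # Four visual columns agree on "mug"; one touch column guesses "hat".
    print(pool_column_guesses(["mug", "mug", "mug", "hat", "mug"]))
    # -> ('mug', 0.8)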

The idea that thousands of models of a single object are being generated by the brain simultaneously is very different from the strict hierarchical approach of a DNN, where one object is understood by gradually extracting higher-level features. This parallelism also merges multiple inputs from different senses, enabling greater resilience to error, and perhaps goes some way to explaining the McGurk effect.

Object Composition and Hierarchy

The Thousand Brains theory also proposes a mechanism for how the brain processes composition of objects within other objects. The researchers posit that, along with location information, the cortical columns also understand displacement and therefore the relationship between one object and another.

Other researchers recommend rethinking the CNNs that have become the staple approach for image processing. A CNN’s convolutional layers extract features from across the image but do not incorporate learning based on the relative positions of those features within it. For example, a CNN might recognize a face based on the existence of features such as eyes, nose, and mouth, while the relative positions of those features carry little weight in its decision.

Geoffrey Hinton and his team propose capsule networks5 to better formulate hierarchical representations and relationships of objects in the world. Because part-to-whole relationships are core to a capsule network’s calculations, these networks incorporate an understanding of how the parts that make up an object relate to one another, enabling them to better recognize objects in images regardless of viewing angle. Capsule networks have been shown to display greater resilience to adversarial attacks thanks to this richer contextual understanding of images.
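
To give a flavor of the mechanics, the sketch below implements the “squash” nonlinearity from the Sabour et al. paper: a capsule outputs a vector whose direction encodes an entity’s pose and whose length, squashed into the range [0, 1), encodes the probability that the entity is present. This is one small piece of the architecture, shown in NumPy for illustration; the full dynamic-routing algorithm is considerably more involved.

    import numpy as np

    def squash(s, eps=1e-8):
        """Sabour et al.'s squash: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).

        Shrinks short vectors toward zero and long vectors toward unit
        length, so the output length behaves like a probability.
        """
        norm_sq = float(np.sum(s * s))
        norm = np.sqrt(norm_sq) + eps  # guard against division by zero
        return (norm_sq / (1.0 + norm_sq)) * (s / norm)

    v = squash(np.array([3.0, 4.0]))  # input length 5
    print(np.linalg.norm(v))          # ~0.96: the entity is strongly "present"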

Finally…

There is increasing interest in the reconvergence of the neuroscience and AI disciplines. Better understanding of how the brain works and the application of these new ideas to AI methodologies are likely to result in better imitations of human learning. This, however, implies that our AI algorithms will be susceptible to the same foibles as our brains, unless we program deliberate strategies to prevent this. From a neuroscience perspective, AI provides a way to test hypotheses about the workings of the brain. Researching how and when humans and AI interpret data incorrectly will aid this convergence.

As AI moves closer to biological intelligence, we might remove any discrepancy between how humans and machines are fooled. Perhaps, as the neural networks evolve to better mimic human perception, image and audio adversarial examples will become a thing of the past.

1 Gamaleldin F. Elsayed et al., “Adversarial Examples that Fool both Computer Vision and Time-Limited Humans” (2018), http://bit.ly/2RtU032.

2 Eiji Watanabe et al., “Illusory Motion Reproduced by Deep Neural Networks Trained for Prediction,” Frontiers in Psychology (March 2018), http://bit.ly/2FkVxmZ.

3 Robert Geirhos et al., “ImageNet-Trained CNNs Are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness” (2019), http://bit.ly/2N0FuB2.

4 Jeff Hawkins et al., “A Framework for Intelligence and Cortical Function Based on Grid Cells in the Neocortex,” Frontiers in Neural Circuits 12, 121 (2018), http://bit.ly/2ZtvEJk.

5 Sara Sabour et al., “Dynamic Routing Between Capsules” (2017), http://bit.ly/2FiVAjm.
