6

Perception of Objects in the World

The study of perception consists … of attempts to explain why things appear as they do.

J. Hochberg
1988

INTRODUCTION

In the last chapter we introduced the visual system and some of the perceptual effects that can arise from the way that the visual system is put together. We continue this discussion in this chapter, emphasizing now the more complicated aspects of perceptual experience. Whereas previously we talked about how intense a perceptual experience is (in terms of brightness or lightness), now we focus on less quantifiable experiences such as color or shape.

You know something now about the basic signals that the brain uses to construct a perception. An amazing phenomenal characteristic of perception is how automatically and effortlessly a meaningful, organized world is perceived given these very simple neural signals. From a two-dimensional (2D) array of light energy, somehow we are able to determine how its pieces go together to form objects, where those objects are located in three dimensions, and whether changes in position of the image on the retina are due to movements of objects in the environment or to our own movements.

To drive a car down a highway, fly a plane, or even just walk across a room, a person must accurately perceive the locations of objects in either 2D or 3D space. Information presented on gauges, indicators, and signs must be not only detected but also identified and interpreted correctly. Consequently, the design of control panels, workstations, or other environments often relies on information about how people perceive objects around them. Design engineers must recognize how people perceive color and depth, organize the visual world into objects, and recognize patterns. These are the topics that we cover in the present chapter.

COLOR PERCEPTION

In daylight, most people see a world that consists of objects in a range of colors. Color is a fundamental part of our emotional and social lives (Davis, 2000). In art, color is used to convey many emotions. The color of your wardrobe tells others what kind of person you are. Using color, we can discriminate between good and bad foods or decide if someone is healthy or sick. Color plays a crucial role in helping us acquire knowledge about the world. Among other things, it aids in localizing and identifying objects.

At the most basic level, color is determined by the wavelength of light reflected from or emitted by an object (Malacara, 2011; Ohta and Robertson, 2005). Long-wavelength light tends to be seen as red and short-wavelength light tends to be seen as blue. But your experience of blue may be very different from your best friend’s experience of blue. As with brightness, the perception of color is psychological, whereas wavelength distinctions are physical. This means that other factors, such as ambient lighting and background color, influence the perception of color.

COLOR MIXING

Most colors that we see in the environment are not spectral colors. That is, they are not composed of light of a single wavelength. Rather, they are mixtures of light of different wavelengths. We call colors from these mixtures nonspectral colors. Nonspectral colors differ from spectral colors in their degree of saturation, or color purity. By definition, spectral colors, consisting of a single wavelength, are pure, or completely saturated. Nonspectral colors are not completely saturated.

There are two ways to mix colors. First, imagine the colors that result when you mix two buckets of paint together. Paint contains different pigments that reflect light of different wavelengths. Mixtures of pigments result in what is called a subtractive color mixture. Next, imagine shining each of two light sources through a gel of a different color, like the lighting systems on a theatrical stage. If the gels placed in front of the light sources are of different colors, then when those two light sources are focused on the same location, their combination is an additive color mixture. Most of the rules of color mixing that you can recall (e.g., “blue plus yellow makes green”) refer to subtractive color mixtures. Because of the different pigments that color different substances, it is harder to predict the results of a subtractive color mixture than of an additive one.

What happens when light of two wavelengths is mixed additively? It depends on the specific wavelengths and the relative amounts of each. In some cases, a color may look very different from its components. For example, if long-wavelength (red) light and middle-wavelength (yellow) light are mixed in approximately equal amounts, the color of the combination will be orange. If the middle-wavelength component is increased, then the mixture will appear more yellowish. Combinations of other spectral light sources may yield no color. For example, if a short-wavelength (blue) light and an upper-middle-wavelength (yellow) light are mixed in approximately equal amounts, the resulting combination will have no hue. More generally, we can reconstruct any hue (with any saturation) as an additive mixture of three primary colors (one long, one middle, and one short wavelength).

A color system that describes the dimensions of hue and saturation is the color circle (see Figure 6.1). Isaac Newton created the color circle by imagining the spectrum curved around the outside of a circle. He connected the long-wavelength (red) and short-wavelength (blue) ends of the spectrum with nonspectral purples. Thus, the outer boundary of the color circle corresponds to the monochromatic or spectral colors plus the highly saturated purples. The center of the circle is neutral (white or gray). If we draw a line from the center to a point on the rim, the hue for any point on this line corresponds to the hue at the rim. The saturation increases as the point shifts from the center to the rim.

FIGURE 6.1 The color circle.

We can estimate the appearance of any mixture of two spectral colors from the color circle by first drawing the chord that connects the points for the spectral colors. The point corresponding to the mixture falls on this chord, with the specific location determined by the relative amounts of the two colors. If the two are mixed in equal percentages, the mixture will be located at the midpoint of the chord. The hue of the mixture will be the hue at the rim at that particular angle, and its saturation will be indicated by its distance from the center: the farther the point lies from the rim, the less saturated the mixture.
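To make the geometry concrete, here is a minimal Python sketch of the chord construction just described. It is an idealization, not a colorimetric standard: each spectral color is placed on the rim of a unit circle at an assumed hue angle, the mixture is the weighted average of the two points, and hue and saturation are read off as the angle and the distance from the center.

```python
import math

def mix_on_color_circle(hue1_deg, hue2_deg, weight1=0.5):
    """Mix two fully saturated rim colors on an idealized color circle.

    Each color is a point on the unit circle at its hue angle; the mixture
    lies on the chord joining them, at a position set by the relative amounts.
    Returns (hue_deg, saturation), where saturation is the distance from the
    neutral center (0 = gray, 1 = fully saturated).
    """
    a1, a2 = math.radians(hue1_deg), math.radians(hue2_deg)
    x = weight1 * math.cos(a1) + (1 - weight1) * math.cos(a2)
    y = weight1 * math.sin(a1) + (1 - weight1) * math.sin(a2)
    return math.degrees(math.atan2(y, x)) % 360, math.hypot(x, y)

# Complementary hues 180 degrees apart cancel to a neutral (saturation near 0):
print(mix_on_color_circle(0, 180))
# Nearby hues mix to an intermediate hue that stays fairly saturated:
print(mix_on_color_circle(0, 60))
```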

A more sophisticated color mixing system is the one developed in 1931 by the Commission Internationale de l’Eclairage (CIE; the International Commission on Illumination; Oleari, 2016). This system incorporates the fact that any color can be described as a mixture of three primaries. The CIE system uses a triangular “chromaticity” space (see Figure 6.2). In this system, a color is specified by its location in the space according to its values on three imaginary primaries, called X, Y, and Z. These primaries correspond to long-, medium-, and short-wavelength light, respectively. The coordinates in the chromaticity space are determined by finding the proportions of the color mixture that are X and Y:

FIGURE 6.2 The CIE color space.

x = X/(X + Y + Z)
y = Y/(X + Y + Z)

Because x + y + z = 1.0, z is determined when x and y are known, and we can diagram the space in terms of the x and y values.
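As a quick illustration of these formulas, the following Python snippet computes the chromaticity coordinates for an assumed set of tristimulus values (the numbers are made up for the example).

```python
def chromaticity(X, Y, Z):
    """CIE tristimulus values (X, Y, Z) -> chromaticity coordinates (x, y).

    Because x + y + z = 1, z is recoverable as 1 - x - y and is not returned.
    """
    total = X + Y + Z
    return X / total, Y / total

# Assumed tristimulus values for illustration:
x, y = chromaticity(X=41.2, Y=21.3, Z=1.9)
print(round(x, 3), round(y, 3))   # the point locating this color in Figure 6.2
```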

TRICHROMATIC THEORY

The fact that any hue can be matched with a combination of three primary colors is evidence, recognized as early as the 1800s, for the view that human color vision is trichromatic (Helmholtz, 1852; Young, 1802; see Mollon, 2003). Trichromatic color theory proposes that there are three types of photoreceptors, corresponding to blue, green, and red, that determine our color perception. According to trichromatic theory, the relative activity of the three photoreceptors determines the color that a person perceives.

As trichromatic theory predicted, there are three types of cones with distinct photopigments. Color information is coded by the cones in terms of the relative sensitivities of the pigments. For example, a light source of 500 nm will affect all three cone types, with the middle-wavelength cones being affected the most, the short-wavelength cones the least, and the long-wavelength cones an intermediate amount (see Figure 5.11). Because each color is signaled by the relative levels of activity in the three cone systems, any spectral color can be matched with a combination of three primary colors.

Because there is only one rod photopigment, which is sensitive to a range of wavelengths across the visual spectrum, there is no way to determine whether a high level of rod activity is being caused by high-intensity light of a wavelength to which the photopigment is not very sensitive or by lower-intensity light of a wavelength to which the photopigment is more sensitive. Any single photopigment is ambiguous in this way, which is why it is the relative levels of activity across the three cone subsystems that allow the perception of color.
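The following toy calculation (with made-up sensitivity values) shows the ambiguity: a dim light at a well-matched wavelength and a bright light at a poorly matched one can produce identical responses in a single receptor type, whereas the pattern of relative responses across three cone types differs between the two wavelengths.

```python
def receptor_response(intensity, sensitivity):
    # Idealized receptor: response is intensity scaled by pigment sensitivity.
    return intensity * sensitivity

# Hypothetical rod sensitivities at two wavelengths:
rod_sensitivity = {"500 nm": 0.9, "600 nm": 0.3}
print(receptor_response(10, rod_sensitivity["500 nm"]))   # 9.0
print(receptor_response(30, rod_sensitivity["600 nm"]))   # 9.0 -- indistinguishable

# Hypothetical (short, middle, long) cone sensitivities at the same wavelengths:
cone_sensitivity = {"500 nm": (0.2, 0.8, 0.5), "600 nm": (0.01, 0.4, 0.9)}
for wavelength, sens in cone_sensitivity.items():
    responses = [receptor_response(10, s) for s in sens]
    total = sum(responses)
    print(wavelength, [round(r / total, 2) for r in responses])
# The relative pattern differs between the wavelengths, so hue can be recovered
# even though no single cone's response is unambiguous on its own.
```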

Approximately 200 million people worldwide are color blind, or, more accurately, have a congenital color vision deficiency (Machado, Oliveira, & Fernandes, 2009). Men are more likely to have deficient color vision, with as many as 8% of men affected but only 0.5% of women (Simunovic, 2010). Deficient color vision is characterized by how many primary colors a person needs to match any hue. We say that someone is color blind when they need fewer than three primaries to match any color. Most color blind individuals are dichromats: they have dichromatic vision, meaning that pairs of colors that look different to a person with normal trichromatic vision (a trichromat) may look identical to them. These people are usually missing one of the three types of cone photopigments, although the total number of cones is similar to that of a normal trichromat (Cicerone & Nerger, 1989). The most common form of dichromatic color vision is deuteranopia, which is attributed to a malfunction of the green cone system. Deuteranopes have difficulty distinguishing between red and green, although research findings have shown that they have a richer color experience than one might expect and use the words “red” and “green” consistently to label their color percepts (Wachtler, Dohrmann, & Hertel, 2004).

Some people still need three primaries to match all spectral colors but are said to have anomalous color vision, because the color matches that they make are not the same as those made by normal trichromats (Frane, 2015). Finally, there are rare individuals (monochromats) with no cones or only one type of cone who have monochromatic vision, as we mentioned briefly in Chapter 5.

Commercial products that use color filters, often in the form of a tinted contact lens, have been developed for use by people with color blindness to try to reduce their color confusions (Simunovic, 2010). A red-green color blind individual wears a red lens monocularly, which passes light primarily in the long-wavelength region of the spectrum. The basic idea is that a green color will look relatively darker in the filtered image than in the unfiltered image at the other eye, whereas a red color will not, providing a cue to help differentiate red from green. Unfortunately, the benefits of such filters are limited (Sharpe & Jägle, 2001). In fact, they may have serious side effects: They reduce luminance (since some light is filtered out), which may be particularly harmful for other aspects of vision at night, and impair depth perception (by altering binocular cues; see later in this chapter).

OPPONENT PROCESS THEORY

Although human color vision is based on trichromatic physiology, there are some characteristics of color perception that seem to be due to the way that the signals from cone cells interact in the retina. We mentioned already that if equal amounts of blue and yellow light are mixed additively, the result is an absence of any hue, a white or gray. The same effect occurs when red and green are mixed additively. Also, no colors seem to be combinations of either blue and yellow or red and green. For example, although orange seems to be a combination of red and yellow, there is no color corresponding to a combination of red and green.

The relations of red with green and blue with yellow show up in other ways, too. If you stare at a yellow (or red) patch of color for a while (a procedure known as adaptation) and then look at a gray or white surface, you will see an afterimage that is blue (or green). Similarly, if you look at a neutral gray patch surrounded by a region of one of the colors, the gray region will take on the hue of the complementary color. A gray square surrounded by blue will take on a yellow hue, and a gray square surrounded by red will take on a green hue. Therefore, in addition to the three primary colors red, green, and blue, yellow appears to be a fourth basic color.

These phenomena led Ewald Hering to develop the opponent process theory of color vision in the 1800s. He proposed that neural pathways linked blue and yellow together and red and green together. Within each of these pathways, one or the other color could be signaled, but not both at the same time. Neurophysiological evidence for such opponent coding was obtained initially from the retina of a goldfish (Svaetichin, 1956) and later in the neural pathways of rhesus monkeys (De Monasterio, 1978; De Valois & De Valois, 1980). The nature of the cells in these pathways is such that, for example, red light will increase their firing rate and green light will decrease it. Other cells respond similarly for blue and yellow light.

There are a number of other perceptual phenomena that support the idea of opponent color processes. Many phenomena depend on the orientation of the stimulus (linking color perception to processing in the visual cortex), direction of motion, spatial frequency, and so on. We can explain most of these phenomena by the fact that the initial sensory coding of color is trichromatic and that these color codes are wired into an opponent-process arrangement that pairs red with green and blue with yellow (see, e.g., Chichilnisky & Wandell, 1999). By the time the color signal reaches the visual cortex, color is evidently coded along with other basic features of the visual scene.

Most of the environments that we negotiate every day contain important information conveyed by color. Traffic signals, display screens, and mechanical equipment of all types are designed under the assumption that everyone can see and understand color-coded messages. For most color blind people, this bias is not too much of a concern. After all, the stop light is red, but it is also always at the top of the traffic signal, so it does not matter very much if one out of every ten male drivers cannot tell the difference between red and green.

However, there are other situations where color perception is more important. Commercial pilots, for example, must have good color vision so that they can quickly and accurately perceive the many displays in a cockpit. Electricians must be able to distinguish wiring of different colors, because the colors indicate which wires are “hot,” and also (for more complex electronics) which wires connect to which components. Paint and dye manufacturing processes require trained operators who can distinguish between different pigments and the colors of the products being produced. Therefore, although most color blind individuals do not perceive themselves as being disabled in any way, color blindness can limit their performance in some circumstances.

The human factors engineer must anticipate the high probability of color blindness in the population and, when possible, reduce the possibility of human error due to confusion. The best way to do this is to use dimensions other than color to distinguish signals, buttons, commands, or conditions on a graph (Frane, 2015; MacDonald, 1999). The redundant coding of location and color for traffic lights described above is an example of using more than one dimension. This guideline is followed inconsistently, as illustrated by the fact that the standard default coding in current Web browsers uses redundant coding for links (blue color and underlined) but only color to distinguish sites that have recently been visited from ones that have not.

PERCEPTUAL ORGANIZATION

Our perceptual experience is not one of color patches and blobs, but one of objects of different colors at specific locations around us. The perceptual world we experience is constructed; the senses provide rough cues, for example, similarities and differences of color, that are used to evaluate hypotheses about the state of the world, but it is these hypotheses themselves that constitute perception. A good example involves the blind spot, which we discussed in Chapter 5. Sensory input is not received from the part of the image that falls on the blind spot, yet no hole is perceived in the visual field. Rather, the field is perceived as complete. The blind spot is filled in on the basis of sensory evidence provided by other parts of the image. In the rest of this chapter we will discuss how the perceptual system operates to construct a percept.

Perceptual organization is how the brain determines what pieces in the visual field go together (Kimchi, Behrmann, & Olson, 2003), or “the process by which we apprehend particular relationships among potentially separate stimulus elements (e.g., parts, features, dimensions)” (Boff & Lincoln, 1988, p. 1238). A widely held view around the beginning of the 20th century was that complex perceptions are simply additive combinations of basic sensory elements. A square, for example, is just a combination of horizontal and vertical lines. However, a group of German psychologists known as the Gestalt psychologists demonstrated that perceptual organization is more complicated than this. Complex patterns of elementary features show properties that emerge from the configuration of features that could not be predicted from the features alone (Koffka, 1935).

A clear demonstration of this point was made by Max Wertheimer in 1912 with a phenomenon of apparent movement that is called stroboscopic motion (Wade & Heller, 2003). Two lights are arranged in a row. If the left light alone is presented briefly, it looks like a single light turning on and off in the left location. Similarly, if the right light alone is presented briefly, then it looks like a single light turning on and off in the right location. Based on these elementary features, when the left and right lights are presented in succession, the perception should be that the left light comes on and goes off, and then the right light comes on and goes off. However, if the left and right lights are presented one after the other fairly quickly, the two lights now look like a single light moving from left to right. This apparent movement is the emergent property that cannot be predicted on the basis of the elementary features.

One of the most fundamental tasks the perceptual system must perform is the organization of the visual scene into figure and ground (Wagemans et al., 2012). Visual scenes are effortlessly perceived as objects against a background. Sometimes, however, the visual system can be fooled when the figure–ground arrangement is ambiguous. For the images shown in Figure 6.3, each part of the display can be seen as either figure or ground. Figure–ground ambiguity can produce problems with perception of signs, as for the one shown in Figure 6.4.

FIGURE 6.3 Factors that determine figure–ground organization: (a) surroundedness; (b) symmetry; (c) convexity; (d) orientation; (e) lightness or contrast; and (f) area.

FIGURE 6.4 Road sign intended to depict “no left turn.”

Examples like those shown in Figure 6.3 illustrate some major distinctions between objects classified as figure and those classified as ground. The figure is more salient than the ground and appears to be in front of it; contours usually seem to belong to the figure; and the figure seems to be an object, whereas the ground does not. Six principles of figure–ground organization are summarized in Table 6.1 and illustrated in Figure 6.3. The cues for distinguishing figure from ground include symmetry, area, and convexity. In addition, lower regions of a figure tend to be seen as figure more than upper regions (Vecera, Vogel, & Woodman, 2002). Images, scenes, and displays that violate the principles of figure–ground organization will have ambiguous figure–ground organizations and may be misperceived.

TABLE 6.1
Principles of Figure–Ground Organization

Surroundedness: A surrounded region tends to be seen as figure, while the surrounding region is seen as ground.
Symmetry: A symmetric region is perceived as figure in preference to a region that is not symmetric.
Convexity: Convex contours are seen as figure in preference to concave contours.
Orientation: A region oriented horizontally or vertically is seen as figure in preference to one that is not.
Lightness or contrast: A region that contrasts more with the overall surround is preferred as figure over one that does not.
Area: A region that occupies less area is preferred as figure.

Probably more important for display design are the principles of Gestalt grouping (Gillam, 2001; Wagemans et al., 2012), which are illustrated in Figure 6.5. This figure demonstrates the principles of proximity, similarity, continuity, and closure. The principle of proximity is that elements close together in space tend to be perceived as a group. Similarity refers to the fact that similar elements (in terms of color, form, or orientation) tend to be grouped together perceptually. The principle of continuity is embodied in the phenomenon that points connected in straight or smoothly curving lines tend to be seen as belonging together. Closure refers to a tendency for open curves to be perceived as complete forms. Finally, an important principle called common fate, which is not shown in the figure, is that elements that are moving in a common direction at a common speed are grouped together.

FIGURE 6.5 The Gestalt organizational principles of proximity, similarity, continuity, and closure.

Figure 6.6 shows a very complicated arrangement of displays in an interior view of a simulation of the cockpit of the now-decommissioned space shuttle Atlantis. This simulator is a faithful reproduction of the real Atlantis. Several Gestalt principles are evident in the design of the cockpit. First, displays and controls with common functions are placed close to each other, and the principle of proximity ensures that they are perceived as a group. This is particularly obvious for the controls and indicators in the upper center of the cockpit. The principles of proximity and similarity organize the linear gauges below the ceiling panel into three groups. The digital LEDs to the right of the array of gauges use both proximity and continuity to form perceptual groups.

FIGURE 6.6 Interior of a simulator of the cockpit of the space shuttle Atlantis.

There are two ways that grouping can be artificially induced by the inclusion of extra contours (Rock & Palmer, 1990). Dials or gauges that share a common function can be grouped within an explicit boundary on the display panel or connected by explicit lines (see Figure 6.7). Rock and Palmer call these methods of grouping common region and connectedness, respectively. They seem to be particularly useful ways to ensure that dials are grouped by the observer in the manner intended. Returning to the cockpit of the Atlantis, you can see several places where the cockpit designers exploited these principles to ensure the groupings of similar displays.

FIGURE 6.7 Displays grouped by common region and connectedness.

Wickens and Andre (1990) demonstrated that when a task (like landing the shuttle) requires integration across display elements, organizational factors have different effects on performance than when the task requires focused attention on a single display element. The task that they used involved three dials that might be found in an aircraft cockpit, indicating air speed, bank, and flaps. Pilots either estimated the likelihood of a stall (a task that required integrating the information from all three dials) or indicated the reading from one of the three dials (a task that required focused attention on a single dial).

Spatial proximity of the dials had no effect in Wickens and Andre’s (1990) experiments. However, they found that performance for focused attention was better when display elements were of different colors than when they were all the same color. In contrast, integration performance was best when all display elements were the same color. Wickens and Andre also experimented with displays that combined the information given by the three elements into a single rectangular object, the position and area of which were determined by air speed, bank, and flaps. They concluded that the usefulness of such an integrated display depends upon how well the emergent feature (the rectangle) conveys task-relevant information.

Another feature of displays that determines perceptual organization is the orientation of different components in the display. People are particularly sensitive to the orientation of stimuli (e.g., Beck, 1966). When forms to be discriminated differ only in orientation (e.g., upright Ts from tilted Ts), responses are fast and accurate. However, when the component lines of the forms all share the same orientations, as when discriminating upright Ts from backward Ls (see Figure 6.8), it is much harder to discriminate between them.

FIGURE 6.8 Example of orientation as an organizing feature.

An example where grouping by orientation can be useful is shown in Figure 6.9. This figure shows two example display panels for which check reading is required. In check reading, panels of gauges or dials must each be checked to determine whether they all register normal operating values. The bottom of Figure 6.9 shows a configuration where the normal settings are indicated by pointers at the same orientation, whereas the top shows a configuration where they differ. Because orientation is a fundamental organizing feature, it is much easier to tell from the bottom display than the top that one dial is deviating from normal (Mital & Ramanan, 1985; White, Warrick, & Grether, 1953). With the bottom arrangement, the dial that deviates from the vertical setting would “pop out” and the determination that a problem existed would be made rapidly and easily.

FIGURE 6.9 Displays grouped by proximity and similarity (a), and display groups with similar and dissimilar orientations (b).

More generally, the identification of information in displays will be faster and more accurate when the organization of the display is such that critical elements are segregated from the distracting elements. For example, when observers must indicate whether an F or a T is included in a display that has noise elements composed of features from both letters (see Figure 6.10), they are slower to respond if the critical letter is “hidden” among the distractors by good continuity, as in Figure 6.10b, or proximity (Banks & Prinzmetal, 1976; Prinzmetal & Banks, 1977).

FIGURE 6.10 Example stimuli used to illustrate how good continuation influences target (F) identification when it is grouped (a) separately from and (b) together with distractors.

When designing pages for the World Wide Web, organizing the page in a manner consistent with the Gestalt grouping principles can facilitate a visitor’s perception of the information on the page. Because of the difficulty of evaluating the overall organizational “goodness” of Web pages based on the various individual principles, Hsiao and Chu (2006) developed a mathematical model based on five grouping principles: proximity, similarity, continuity, symmetry, and closure. Web-page designers use a seven-point scale (from very bad to very good) to rate the extent to which each of these principles is used on a Web page for (1) layout of graphics, (2) arrangement of text, and (3) optimal use of colors. The model generates a value from 0 to 1 from these ratings, with a higher value for a Web page indicating more effective use of the Gestalt principles. This measure can be used by Web-page designers to evaluate whether a page is organized well visually, which should be correlated with the ease with which the content of the page can be comprehended and navigated by users. Though developed specifically for Web-page design, the method may also be useful for visual interface design more generally.
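Hsiao and Chu’s (2006) model has its own weighting scheme, which is not reproduced here; the sketch below only illustrates, under the assumption of simple averaging, how fifteen seven-point ratings (five principles by three page aspects) could be collapsed into a single 0 to 1 score.

```python
def organization_score(ratings):
    """Collapse Gestalt-usage ratings into a 0-1 score (simple-average assumption).

    `ratings` maps each grouping principle to three ratings on a 1-7 scale:
    layout of graphics, arrangement of text, and use of colors.
    """
    values = [r for triple in ratings.values() for r in triple]
    return sum((r - 1) / 6 for r in values) / len(values)

example_page = {
    "proximity":  (6, 5, 6),
    "similarity": (5, 5, 4),
    "continuity": (6, 6, 5),
    "symmetry":   (4, 5, 4),
    "closure":    (5, 4, 5),
}
print(round(organization_score(example_page), 2))   # closer to 1 = better use of the principles
```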

In summary, we can use Gestalt organizational principles to help determine how visual displays will be perceived and the ease with which specific information can be extracted from them. A good display design will use these principles to cause the necessary information to “pop out.” Similarly, if we wish to obscure an object, as in camouflage, the object can be colored or patterned in such a manner that the parts will blend into the background.

DEPTH PERCEPTION

One of the most amazing things that our visual system does is transform the 2D image that falls on the retina into a complex 3D scene, where objects fall behind other objects in depth. As a first guess, you might think that our ability to see depth is a function of binocular cues associated with having two eyes. This is, in fact, part of the story. However, by closing one eye, you can see that it is not the entire story. Depth can still be perceived to some extent when the world is viewed with only a single eye.

The visual system uses a number of simple cues to construct depth (Howard, 2002, 2012; Proffitt & Caudek, 2013), and most of them are summarized in Figure 6.11. Notice that while many of them are derived from the retinal image, some come from the movement of the eyes. Many depth cues are monocular, explaining why a person can see depth with a single eye. In fact, depth perception from monocular cues is so accurate that the ability of pilots to land aircraft is not degraded by patching one eye (Grosslight, Fletcher, Masterton, & Hagen, 1978), nor is the ability of young adults to drive a car (Wood & Troutbeck, 1994). Another study examined the driving practices of monocular and binocular truck drivers, and found that monocular drivers were just as safe as binocular drivers (McKnight, Shinar, & Hilburn, 1991).

FIGURE 6.11 A hierarchical arrangement of the cues to depth.

The extent to which the cues outlined in Figure 6.11 contribute to the perception of a 3D image is something to be considered when designing displays for virtual environments of the type used in simulators (see Box 6.1). The view outside the simulator window shown in Figure 6.6 is an artificial scene constructed using simple depth cues. How people use these cues to perceive depth is a basic problem that has been the focus of a great deal of study. We will now discuss each type of cue, oculomotor and visual, and explain how the visual system uses these cues in the perceptual organization of depth.

OCULOMOTOR CUES

Oculomotor depth cues are provided proprioceptively. Proprioception is the ability to feel what your muscles are doing and where your limbs are positioned. The position of the muscles of the eye can also be perceived proprioceptively. We have already discussed (in Chapter 5) how the muscles of the eye work and even how abusing these tiny muscles can lead to eye strain and fatigue. The two motions that these muscles accomplish are accommodation and vergence, and the states of accommodation and vergence are two oculomotor cues to depth.

Recall that accommodation refers to automatic adjustments of the lens that occur to maintain a focused image on the retina, and vergence refers to the degree to which the eyes are turned inward to maintain fixation on an object. The information about the position of the muscles controlling the degree of vergence and accommodation could potentially be used as feedback in the visual system to help determine information about depth. Because the extent of both accommodation and vergence depends on the distance of the fixated object from the observer, high levels of accommodation and vergence signal that an object is close to the observer, whereas the information that the eye muscles are relatively relaxed signals that an object is farther from the observer.

Accommodation only varies for stimuli that are between approximately 20 and 300 cm from the observer. This means that proprioceptive information about accommodation could only be useful for objects that are very close. Vergence varies for objects up to 600 cm from the observer, so proprioceptive information about vergence is potentially useful over a wider range of distances than accommodation.
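The geometry behind the vergence cue can be sketched with a few lines of Python (assuming a typical interocular separation of about 6.3 cm). The vergence angle changes steeply for nearby objects and flattens out beyond a few meters, which is consistent with vergence being informative mainly in near space.

```python
import math

INTEROCULAR_DISTANCE_CM = 6.3   # assumed typical value

def vergence_angle_deg(fixation_distance_cm):
    """Angle between the two lines of sight when fixating at a given distance."""
    return math.degrees(2 * math.atan(INTEROCULAR_DISTANCE_CM / (2 * fixation_distance_cm)))

for distance in (20, 100, 300, 600):
    print(f"{distance:>4} cm -> vergence angle {vergence_angle_deg(distance):5.2f} deg")
```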

Morrison and Whiteside (1984) performed an interesting experiment to determine how important vergence and accommodation were to the perception of depth. They asked observers to guess how far away a light source was. They did this in such a way that in some situations the observers’ degree of vergence was held constant and accommodation varied with distance, but in other situations accommodation was held constant while vergence varied with distance. They determined that changes in vergence were useful for making accurate distance estimates over a range of several meters, but changes in accommodation were not. Mon-Williams and Tresilian (1999, 2000) reached a similar conclusion that vergence plays a significant role in near-space perception and that “accommodation is almost certain to play no direct role in distance perception under normal viewing conditions” (Mon-Williams & Tresilian, 2000, p. 402).

We should take note of one important factor in Morrison and Whiteside’s (1984) experiment. The light was presented very briefly, too briefly for the observers to actually make the necessary vergence changes, and so the proprioceptive information provided by vergence posture could not have been the source of the information used to make distance estimates. Morrison and Whiteside proposed instead that the observers relaxed into the constant dark vergence posture (see Chapter 5) and used other cues, like binocular disparity (see later), as a source of information about depth. This finding suggests that although in some cases vergence cues may contribute directly to depth perception, in others their contribution may occur indirectly through joint effects with other cues.

BOX 6.1 THREE-DIMENSIONAL DISPLAYS

A standard computer display screen is 2D, and many of the displays presented on them are also 2D. For example, the 2D start screen for any version of the Windows operating system contains any number of 2D icons displayed at various locations on the screen. Also, a word processor used for preparing and editing documents displays part of a page of the document, framed by toolbars that contain icons for various operations and (sometimes) rulers that specify horizontal distance from the left side and vertical distance from the top. One reason why the 2D display works well for icon selection is because the input device used for selection, typically a computer mouse, operates in two dimensions. Similarly, the 2D display for text editing is deliberately representative of the paper on which a copy can be printed or on which one can choose to write or type instead.

However, our interactions with the world and knowledge of the relations among objects involve the third dimension of depth. For example, an air-traffic controller must be able to comprehend the locations and flight paths of many aircraft within the flight environment that he is controlling. Likewise, the operator of a telerobotic system must be able to manipulate the movements of a remotely controlled robot in three dimensions. In situations like these, a person’s performance may benefit from 3D displays. Depth can be represented on a two-dimensional screen using many of the monocular cues described in this chapter. Static monocular cues can be used to provide depth information, as in many Windows icons intended to represent objects. One common icon depicts a document contained in a folder, an image that uses interposition as a cue to depth. Some monocular cues can also be used to create more complex perspective displays of 3D relationships. The perception of depth can be particularly compelling when movement is introduced to the display. Using specialized goggles, stereoscopic views can be created by presenting different images to the two eyes, resulting in an even more compelling experience of depth. These kinds of tricks are used in gaming and virtual reality software.

3D displays are aesthetically and intuitively appealing because they depict shapes of objects and the relations among them in a realistic way. However, there are problems with 3D displays that may limit their effectiveness. Rendering a 3D image on a 2D screen means that some information has been lost. This loss can introduce ambiguity about the location of objects along lines of sight, and distortions of distances and angles, making it difficult to determine exactly where an object is supposed to be. One way to overcome the effects of these ambiguities and distortions is (for certain tasks) to use a multiple-view 2D display instead of a single 3D display.

Park and Woldstad (2000) examined a person’s performance of a simulated telerobotic task, where the goal was to use a Spaceball® 2003 3D controller (a sphere that responds to pressure in the appropriate direction and that has buttons for specific operations such as picking up) to pick up an object and place it in a rack. They provided the people performing the task with either a multiple-view 2D, monocular 3D, or stereoscopic 3D display of the work area. The multiple-view 2D display consisted of two rows of three displays each: a force-torque display, a plan view, a right side view, a left side view, a front view, and a task status display.

People performed the task best when using the multiple-view 2D display. When visual enhancement cues (e.g., reference lines extending from the face of the gripper to the object to be grasped) were added to the 3D displays, the performance differences were eliminated, but the 3D displays still produced no better performance than the multiple-view 2D display.

It seems a bit surprising that the 3D displays do not result in better performance than the multiple-view 2D display. However, this finding has been replicated (St. John, Cowen, Smallman, & Oonk, 2001). Observers asked to make position judgments about two objects or two natural terrain locations performed better with multiple-view 2D displays than 3D displays. However, when the task required identifying the shapes of a block figure or terrain, they did better with 3D displays. The advantage of the multiple-view 2D displays in relative position judgment (which, it should be noted, was also an important component of Park and Woldstad’s, 2000, simulated telerobotic task) is due to the fact that those displays minimize distortion and ambiguity. The advantage for 3D perspective displays in understanding shape and layout is due to the fact that the three dimensions are integrated in the display, rather than requiring the user to expend effort to integrate them mentally. The 3D displays also allow the rendering of extra depth cues and the depiction of hidden features, both of which can aid in shape identification. Therefore, it should not be too surprising that a recent review concluded that stereoscopic 3D displays are most useful for tasks that require manipulation of objects or locating, identifying, and categorizing objects (McIntire, Havig, & Geiselman, 2014).

One use of 3D is in the area of virtual environment, or virtual reality, displays. In virtual reality, the goal is not just to depict the 3D environment accurately, but also to have the user experience a strong sense of “presence,” that is, of actually being in the environment. Because vision is only one sensory modality involved in virtual environments, we will delay discussion of them until Box 7.1 in the next chapter.

MONOCULAR VISUAL CUES

The monocular visual cues sometimes are called pictorial cues, because they convey impressions of depth in a still photograph. Artists use these cues to portray depth in paintings. Figure 6.12 illustrates several of these cues.

FIGURE 6.12 The depth cues of relative size (1), linear perspective (2), interposition (3), and texture gradient (4) in the complete image.

The top panel (a) in Figure 6.12 shows a complex 3D scene. The scene seems to consist of a rectangular object lying flat on a field, with three monoliths to the left and two monoliths to the right. The separate components of the scene are unpacked in the bottom panel (b). The changes in the texture gradient of the field aid in the perception that the field recedes in depth toward a horizon. The changes in size of the three monoliths provide a relative size cue, which makes them appear to be three equally sized monoliths placed at different distances from the observer. The linear perspective implied by the unequal angles of the quadrangle in the foreground suggests that a flat rectangle recedes into the distance.

Probably the most important cue, that of interposition, is based on the fact that a near object will block the view of a more distant one if they are in the same line of vision. For the two monoliths on the right, the view of one monolith is partially obscured by the other, suggesting that the obscured monolith is farther away than the other. Interposition also contributes to the perceived locations of the other objects in Figure 6.12 because of the way the receding field is obscured by each piece.

Interposition can be very compelling. Edward Collier’s painting Quod Libet, shown in Figure 6.13, relies heavily on interposition to portray a collage of 3D objects. In this painting, Collier also makes very clever use of the attached shadow cue, which we discuss below. This type of painting is referred to as a “trompe l’oeil,” a French phrase that means “fool the eye.” An artist’s expert use of pictorial depth cues can sometimes, as in Quod Libet, give the impression that the images in a painting are real and not rendered.

FIGURE 6.13 Edward Collier, Quod Libet (1701).

Another important source of information about depth is the size of the perceived objects. Size cues can be of two sorts. First, an object, like a coffee cup, might have a familiar size that you have learned through experience. If the image of a coffee cup is very small, you might conclude that the cup is very far away. Beginning around the late 1970s, the size of the average car began to decrease. In 1985, when larger cars perhaps were still more familiar than small cars, small cars tended to be involved in accidents more frequently than large cars (Eberts & MacMillan, 1985). One reason for this is that the small retinal image of a less familiar small car resembled that of a more familiar large car seen far away. This meant that the smaller cars were routinely judged to be farther away than they really were, resulting in a higher accident rate.

Second, the image of the object has a retinal size, referring to the area on the retina taken up by the image. This cue depends on the idea of a visual angle, which we discussed in Chapter 5. For an object of constant size, like a quarter, the closer it is to you, the larger the size of the retinal image will be. Thus, the relative size of images within the visual field can be used as a cue to distance.
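A short worked example of the retinal size cue: the sketch below computes the visual angle subtended by an object of a fixed physical size (a coin roughly 2.4 cm across, an assumed value) at several viewing distances. Halving the distance roughly doubles the visual angle, so relative image size can signal relative distance.

```python
import math

def visual_angle_deg(object_size_cm, distance_cm):
    """Visual angle subtended by an object of a given size at a given distance."""
    return math.degrees(2 * math.atan(object_size_cm / (2 * distance_cm)))

COIN_DIAMETER_CM = 2.4   # assumed, roughly the size of a quarter
for distance in (30, 60, 120, 240):
    print(f"{distance:>3} cm away -> {visual_angle_deg(COIN_DIAMETER_CM, distance):.2f} deg")
```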

Perspective is another important cue to depth. There are two types of perspective: aerial and linear. We saw an example of linear perspective in Figure 6.12. More formally, linear perspective refers to the fact that parallel lines receding into depth converge to a point in an image. This is true not only for visible lines, but also for the relations among objects that can be captured by invisible lines (see Figure 6.14).

FIGURE 6.14 Vanishing points for linear perspective.

Aerial perspective refers to interference in an image produced by particles in the air. The farther away an object is, the more opportunity there is for some of the light from it to be scattered and absorbed. This causes the image from a faraway object to be bluer than images from nearby objects and not as sharply defined. The blue coloration comes from the fact that short-wavelength blue light scatters more than longer-wavelength light.

Linear perspective and relative size are combined in texture gradients (see Figure 6.15 and also Figure 6.12). A gradient is characterized by the elements of a textured surface becoming smaller and more densely packed as they recede in depth. A systematic texture gradient specifies the depth relations of the surface. If the texture is constant, it must come from a surface facing the observer directly in the frontal plane (Figure 6.15, panel a). If the texture changes systematically, it indicates a surface that recedes in depth. The rate of change specifies the angle of the surface: the faster the texture increases in density, the more steeply the surface is slanted away from the frontal plane.

FIGURE 6.15 Texture gradients for surfaces (a) parallel to the frontal plane and (b and c) receding in depth.

The attached shadow cue is based on the location of shadows in a picture (Ramachandran, 1988; see Figure 6.16). Regions with shadows at the bottom tend to be perceived as elevated. Regions with shadows at the top tend to be perceived as depressed into the surface. These perceptions are what we expect to see when the light source projects from above, as is typically the case. The light source in Collier’s Quod Libet is from above, so all of the objects shaded from below tend to project forward from the surface of the painting. For instance, the sheaves of paper curl outward because of the shadows he painted below each curl.

FIGURE 6.16 The attached shadow cue.

In situations where the light on an image projects from below, the attached shadow cue can be misleading. Take another look at Figure 6.16, but this time turn the book upside-down. The bubbles that popped out when the image was upright should now appear to be depressions. This happens because we tend to see the figure with the light source coming from above no matter what the orientation of the figure is. If the light source is actually from below, what appear to be bubbles in the upright image are actually depressions (and vice versa).

All the monocular cues described to this point are available to a stationary observer. It is because the observer cannot move relative to the objects in an image that sometimes our perceptions can be fooled. For instance, there is usually one best place to look at a trompe l’oeil painting. If you move around, the illusion can be much less compelling. This means that some information about depth is conveyed through movement. One important movement-based cue is called motion parallax (Ono & Wade, 2005). If you are a passenger in a car, fixate on an object to the side of the car, such as a cow. Objects in the foreground, like telephone poles or fence posts, will appear to move backward, whereas objects in the background, like trees or other cows, will seem to move forward in your direction. Also, the closer an object is to you, the faster its position in the visual field will change. The fence posts will travel by very rapidly, but the trees in the background will move very slowly. Similar movement cues can be produced on a smaller scale by turning your head while looking at an image.

Motion parallax is perceived when an observer moves laterally relative to the objects in a scene. Motion also provides depth information when you move straight ahead. The movement of objects as you look straight ahead is called optical flow, which can convey information about how fast you are moving and how your position is changing relative to those of environmental objects. For example, as you drive down the road, the retinal images of trees on the roadside expand and move outward to the edges of the retina (see Figure 6.17). When the relation between the speed of your movement and the rate of the optical flow pattern changes, the perception of speed is altered. This is apparent if you watch from the window of an airplane taking off. As the plane leaves the ground and altitude increases, the size of the objects in the image decreases, the optical flow changes, and the speed at which the plane is moving seems to decrease.

FIGURE 6.17 The optical flow of a roadway image for a driver moving straight ahead.
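One simple way to quantify the relation between self-motion and optical flow (a simplification, not the only formulation) is the global optical flow rate: forward speed divided by the observer's height above the ground. The sketch below, with assumed values, shows why the same ground speed produces far less optical flow at a higher altitude, matching the impression that a climbing airplane seems to slow down.

```python
def global_optical_flow_rate(speed_m_per_s, eye_height_m):
    """Eye-heights of ground texture passed per second: speed divided by height."""
    return speed_m_per_s / eye_height_m

# Same ground speed, different heights above the ground (assumed values):
for height_m in (1.7, 30.0, 300.0):
    rate = global_optical_flow_rate(70.0, height_m)
    print(f"height {height_m:>6.1f} m -> flow rate {rate:6.2f} eye-heights per second")
```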

BINOCULAR VISUAL CUES

Although you can see depth relatively well with only one eye, you can perceive depth relations more accurately with two. This is most obvious when comparing the perception of depth obtained from a 2D picture or movie with that provided by 3D, or stereoscopic, pictures and movies. Stereoscopic pictures mimic the binocular depth information that would be available from a real 3D scene. People can perform most tasks that involve depth information much more rapidly and accurately when using both eyes (Sheedy, Bailey, Burl, & Bass, 1986). For example, surgeons’ perceptual-motor performance during operations is worse with image-guided surgical procedures, such as laparoscopic surgery, than with standard procedures, in part because of degraded depth perception caused by the elimination of binocular cues (DeLucia, Mather, Griswold, & Mitra, 2006).

The cues for binocular depth perception arise from binocular disparity: each eye receives a slightly different image of the world because of the eyes’ different locations. The two images are merged through the process of fusion. When you fixate on an object, the image from the fixated area falls on the fovea of each eye. An imaginary, curved plane (like the wall of a cylinder) can be drawn through the fixated object, and the images from any objects located on this plane will fall at the same locations on each retina. This curved plane is called the horopter (see Figure 6.18). Objects in front of or behind the horopter will have retinal images that fall on different points in the two retinas.

FIGURE 6.18 The horopter, with crossed and uncrossed disparity regions indicated.

Objects that are further away than the point of fixation will have uncrossed disparity, whereas those closer than fixation will have crossed disparity. The amount of disparity depends on the distance of the object from the horopter, and the direction of disparity indicates whether an object is in front of or behind the horopter. Thus, disparity provides accurate information about depth relative to the fixated object.
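For objects on the midline, an object's disparity is simply the difference between the vergence angle it would require and the vergence angle of the fixated point. The Python sketch below uses this relation with an assumed 6.3 cm interocular separation; positive values correspond to crossed disparity (nearer than fixation) and negative values to uncrossed disparity (farther than fixation).

```python
import math

INTEROCULAR_DISTANCE_CM = 6.3   # assumed typical value

def vergence_deg(distance_cm):
    return math.degrees(2 * math.atan(INTEROCULAR_DISTANCE_CM / (2 * distance_cm)))

def disparity_deg(fixation_distance_cm, object_distance_cm):
    """Disparity of a midline object relative to the fixated (horopter) distance."""
    return vergence_deg(object_distance_cm) - vergence_deg(fixation_distance_cm)

print(round(disparity_deg(100, 90), 3))    # crossed disparity: object in front of fixation
print(round(disparity_deg(100, 110), 3))   # uncrossed disparity: object behind fixation
print(round(disparity_deg(100, 100), 3))   # on the horopter: zero disparity
```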

Stereoscopic pictures take advantage of binocular disparity to create an impression of depth. A camera takes two pictures at a separation that corresponds to the distance between the eyes. A stereoscope presents these disparate images to each respective eye. The red and green or polarized lenses used for 3D movies accomplish the same purpose. The lenses allow each eye to see a different image. A similar effect occurs while viewing random-dot stereograms (Julesz, 1971), pairs of pictures in which the right stereogram is created by shifting a pattern of dots slightly from the locations in the left stereogram (see Figure 6.19). This perception of objects in depth takes place in the absence of visible contours. “Magic Eye®” posters, called autostereograms, produce the perception of 3D images in the same way but in a single picture (see Figure 6.20). This happens when you fixate at a point in front of or behind the picture plane, which then allows each eye to see a different image (Ninio, 2007). We don’t yet understand how the visual system determines which dots or parts of an image go together to compute these depth relations in random-dot stereograms.

FIGURE 6.19 A random-dot stereogram in which the left and right images are identical except for a central square region that is displaced slightly in one image.

FIGURE 6.20 Random-dot autostereogram.

SIZE AND SHAPE CONSTANCY

Depth perception is closely related to the phenomena of size constancy and shape constancy (Walsh & Kulikowski, 1998). These refer to the fact that we tend to see an object as having a constant size and shape, regardless of the size of its retinal image (which changes with distance) and the shape of its retinal image (which changes with slant). The relationship between these constancies and depth perception is captured by the size-distance and shape-slant invariance hypotheses (Epstein, Park, & Casey, 1961). The size-distance hypothesis states that perceived size depends on estimated distance; the shape-slant hypothesis states that perceived shape is a function of estimated slant. The strongest evidence supporting these relations is that size and shape constancy are not very strong when depth cues are eliminated. Without depth cues, there is no way to estimate the distance and slant of an object (Holway & Boring, 1941).
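The size-distance invariance hypothesis can be written as a simple scaling rule: perceived size is the visual angle scaled by the perceived distance. The sketch below is an idealized statement of that rule, not a model of the data; it shows constancy holding when distance is registered correctly and failing when distance is misjudged.

```python
import math

def perceived_size_cm(visual_angle_deg, perceived_distance_cm):
    """Size-distance invariance: perceived size = visual angle scaled by perceived distance."""
    return 2 * perceived_distance_cm * math.tan(math.radians(visual_angle_deg) / 2)

# A 20 cm object at 200 cm subtends about 5.7 degrees of visual angle.
angle = math.degrees(2 * math.atan(20 / (2 * 200)))

print(round(perceived_size_cm(angle, 200), 1))   # distance registered correctly -> 20 cm
print(round(perceived_size_cm(angle, 400), 1))   # distance overestimated -> object seems larger
```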

ILLUSIONS OF SIZE AND DIRECTION

In most situations, the Gestalt organizational principles and depth cues we have discussed contribute to an unambiguous, accurate percept of objects in 3D space. However, many illusions occur that attest to the fallibility of perception. Figures 6.21 and 6.22 show several such illusions of size and direction.

FIGURE 6.21 Illusions of size: (a) the Müller-Lyer illusion; (b) the Ponzo illusion; (c) the vertical-horizontal illusion; (d) a variation of the Delboeuf illusion; and (e) the Ebbinghaus illusion.

FIGURE 6.22 Illusions of direction: (a) the Poggendorff illusion; (b) the Zöllner illusion; (c) the Hering illusion; (d) the Wundt illusion; (e) the Ehrenstein illusion; and (f) the Orbison illusion.

Figure 6.21 illustrates five size illusions. In each panel, there are two lines or circles that you should compare. For instance, in the Müller-Lyer illusion (panel a), which of the two horizontal lines is longer? Because of the contours at the end of each line, the left line appears to be longer than the right. However, the two lines are exactly the same size (measure them to convince yourself). In each of the panels in Figure 6.21, the forms to be compared are exactly the same size.

Figure 6.22 shows several illusions of direction. In each panel, perfectly straight or parallel lines appear bent or curved. For instance, the Poggendorff illusion (panel a) shows a straight line running behind two parallel vertical lines. Although the line is perfectly straight, the upper part of the line does not seem to continue from the bottom of the line: it looks offset by at least a small amount. Using a straight edge, convince yourself that the line is really straight. In each of the panels in Figure 6.22, the presence of irrelevant contours causes distortions of linearity and shape.

There are many reasons why these illusions occur (Coren & Girgus, 1978; Robinson, 1998). These include inaccurate perception of depth, displacement of contours, and inaccurate eye movements, among others. Consider, for example, the Ponzo illusion (see Figure 6.21, panel b). The defining feature of this illusion is the two vertical lines that converge toward the top of the figure. Although the two horizontal lines are exactly the same length, the top line appears to be slightly longer than the bottom line. Recall from Figure 6.12 that vertical lines, like these, that converge at the top suggest (through linear perspective) a recession into the distance. If this depth cue is applied here, where it shouldn’t be, the horizontal line located higher in the display is interpreted as farther away than the one located lower in the display. Now, the retinal images of these two lines are exactly the same size. Therefore, if the top one is farther away than the bottom one, it must be larger than the bottom one. Hence, the top line is perceived as longer than the bottom.

Note that a similar illusion can be seen in the two monoliths illustrating interposition in Figure 6.12. The occluded monolith appears more distant than the monolith in front of it, and this impression of distance is exaggerated by the receding texture gradient of the field. However, the two monoliths are exactly the same size (measure them). Because the apparently more distant monolith projects an image of the same size, it is perceived, in panel a, as larger than the monolith in front of it.

These seemingly artificial illusions can create real-world problems. Coren and Girgus (1978) describe a collision between two commercial aircraft that were approaching the New York City area at 11,000 and 10,000 ft, respectively. At the time, clouds were protruding above a height of 10,000 ft, forming an upward-sloping bar of white against the blue sky. The crew of the lower aircraft misperceived the planes to be on a collision course and increased their altitude quickly. The two aircraft then collided at approximately 11,000 ft. The U.S. Civil Aeronautics Board attributed the misjudgment of altitude to a naturally occurring variant of the Poggendorff illusion (see Figure 6.22a) created by the upward-sloping contours of the cloud tops. The clouds gave the illusion that the two flight paths were aligned even though they were not, and the altitude correction brought the lower plane into a collision course with the upper plane.

A recurring problem for pilots flying at night occurs when landing under “black hole” conditions in which only runway lights are visible. In such situations, pilots tend to fly lower approaches than normal, with the consequence that a relatively high proportion of night flying accidents involve crashes short of the runway. Experiments have shown that the low approaches arise from overestimates of approach angles due to the insufficiency of the available depth cues, like motion parallax and linear perspective (Mertens & Lewis, 1981, 1982). Because the pilot must evaluate the few cues provided by the runway lights according to some familiar standard, he or she will tend to make lower approaches when the runway has a larger ratio of length to width than that of a familiar runway with which the pilot has had recent experience (Mertens & Lewis, 1981).
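A rough geometric sketch shows why the runway's length-to-width ratio matters; this is an illustrative approximation under simplifying assumptions (a flat runway, a shallow approach angle, small visual angles), not the specific analysis reported by Mertens and Lewis. Seen from an approach angle θ, a runway of length L and width W is foreshortened along its length, so the ratio of its image dimensions is roughly

\[ \frac{\text{image length}}{\text{image width}} \approx \frac{L}{W}\,\sin\theta . \]

If a pilot adjusts the approach so that this image ratio matches the value learned on a familiar runway, a runway with a larger physical L/W produces the familiar-looking image at a smaller θ, that is, at a lower-than-normal approach angle.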

PERCEPTION OF MOTION

Not only do we perceive a structured, meaningful world, but we see it composed of distinct objects, some stationary and others moving in various directions at different speeds. How is motion perceived? An initial answer you might think of is that motion is perceived whenever the image of an object is displaced on the retina. However, how can the visual system determine when an object moves versus when the observer moves? Changes in retinal location can be due either to movement of objects in the environment or to the observer's own movement. How the perceptual system resolves the locus of movement constitutes the primary problem of motion perception.

OBJECT MOTION

Motion perception can be thought of in terms of two separate kinds of systems (Gregory, 2015). The image-retina system responds to changes in retinal position, whereas the eye-head system takes into account the motion produced by our own eye and head movements. The image-retina system is very sensitive: people are good at detecting movement from changes in retinal position alone. Movement can be seen if a small dot moves against a stationary background at speeds as low as 0.2° of visual angle per second (at a 1 m viewing distance, 0.2° corresponds to approximately 3.5 mm). Sensitivity to movement is even greater if a stationary visual reference point is present (Palmer, 1986). In such situations, displacements as small as 0.03° of visual angle per second (approximately 0.5 mm per second at a 1 m viewing distance) produce a perception of movement.
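The linear distances quoted above follow from the small-angle approximation s ≈ dθ (with θ in radians); as a check, assuming a viewing distance of d = 1 m:

\[ s_{0.2^\circ} \approx 1000\ \text{mm} \times 0.2^\circ \times \frac{\pi}{180^\circ} \approx 3.5\ \text{mm}, \qquad s_{0.03^\circ} \approx 1000\ \text{mm} \times 0.03^\circ \times \frac{\pi}{180^\circ} \approx 0.5\ \text{mm}. \]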

Displacement of a retinal image does not necessarily mean that an object is moving, because the displacement may be due to movement of the observer. However, if an object is moving, we might track that object by moving our eyes. Such eye movements are called smooth-pursuit movements. During smooth-pursuit movements, the image remains on the fovea but we perceive that the object is moving. This sort of motion perception is due to the eye-head movement system.

Two theories have been proposed to explain how the eye-head system can tell the difference between an observer’s own movements and movement of objects in the world (Bridgeman, 1995). Sherrington (1906) proposed what is often called inflow theory. According to this theory, feedback from the muscles that control eye movements is monitored by the brain. The change in the position of the eyes is then subtracted from the shift in location of the image on the retina. In contrast, outflow theory, proposed by Helmholtz (1867), states that the motor signal sent to the eyes is monitored instead. A copy of this outgoing signal, which is called a corollary discharge, is used to cancel the resulting movement of the image on the retina.
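The cancellation at the heart of outflow theory can be written schematically; this notation is our own shorthand for the verbal account above, not Helmholtz's. Let m_retina be the motion of the image across the retina and e_command be the eye velocity specified by the outgoing motor command (the corollary discharge). The estimate of motion in the world is then approximately

\[ \hat{m}_{\text{world}} \;=\; m_{\text{retina}} + e_{\text{command}} . \]

When the eye moves voluntarily across a stationary scene, the two terms are equal and opposite and the estimate is zero; during smooth pursuit of a moving object, the image is stable (m_retina ≈ 0) but e_command is not, so motion is correctly attributed to the object; and when the eye is displaced without a command, as in the demonstrations described next, nothing cancels the retinal motion and the world appears to move.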

Research on motion perception has tended to favor outflow theory over inflow theory. Helmholtz noticed that if you press (gently!) on your eyelid while looking at an object, the object appears to move. In this situation, your eye moves because you have pushed it with your finger, not because a command was sent to the eye muscles. Because no motor command was issued, there is no corollary discharge. According to outflow theory, the corollary discharge is what normally cancels the movement of the retinal image; without it, the retinal motion is not corrected, and the object appears to move.

One prediction of outflow theory is that if a corollary discharge is generated but the retinal image remains fixed, motion should also be perceived. This prediction has been confirmed (Bridgeman & Delgado, 1984; Stark & Bridgeman, 1984). Imagine a situation in which pressure is applied to your eye, as with your finger, but you use the eye muscles to hold fixation and keep the eye from moving. A corollary discharge occurs, but the retinal image remains fixed, and the object appears to move. More complicated experiments have been performed using curare to temporarily paralyze an observer. When the observer tries to move his or her eyes (which do not actually move), the scene appears to move to a new position (Matin, Picoult, Stevens, Edwards, & McArthur, 1982; Stevens, Emerson, Gerstein, Kallos, Neufeld, Nicholas, & Rosenquist, 1976).

Induced Motion

Although we are very good at perceiving very small movements against a stationary background, stationary backgrounds can also lead to illusions of movement. In such illusions, movement is attributed to the wrong parts of the scene. One example of this is called the waterfall effect, which can take many forms. If you stare closely at a waterfall, a downward-moving pattern of water against a stationary background of rocks, you may experience the perception that the water stops falling and the rocks begin moving upward. You can also experience a waterfall effect while watching clouds pass over the moon at night. Often, the moon appears to move and the clouds remain still.

Motion illusions are easy to reproduce in a laboratory setting by presenting observers with a test patch of stationary texture surrounded by a downward-drifting inducing texture. When the test and inducing objects are in close spatial proximity, the effect is called motion contrast. When they are spatially separated, the phenomenon is called induced motion. Induced motion can be demonstrated when one of two stimuli is larger than, and encloses, the other. If the larger stimulus moves, at least part of the movement is attributed to the smaller, enclosed stimulus. The enclosing figure serves as a frame of reference relative to which the smaller stimulus is displaced (Mack, 1986).

Apparent Motion

We usually perceive the movement of retinal images as smooth, continuous movement of objects through a visual scene. However, discrete jumps of a retinal image can produce the same perception of smooth movement. We introduced this phenomenon, called apparent motion, when discussing Gestalt organization. Apparent motion is the basis for the perceived movement of lights on a theater marquee, as well as for the movement perceived in motion pictures and on television. The fact that we perceive smooth movement from motion pictures conveys the power of apparent motion.

We know a lot about when apparent motion will be perceived from experiments conducted with very simple displays, such as the two lights used to illustrate stroboscopic motion discussed earlier. Two factors determine the extent to which apparent motion will be perceived: the distance and the time between successive retinal images. Apparent motion can be obtained over separations as large as 18°, and the interval that gives the strongest impression of apparent motion depends on the distance: as the spatial separation increases, progressively longer intervals produce the strongest impression of motion.

Our current understanding of apparent motion is that there are two processes involved. A short-range process is responsible for computing motion over very small distances (15 min of visual angle or less) and rapid presentations (100 ms or less). Another long-range process operates across large distances (tens of degrees of separation) and over time intervals of up to 500 ms. Whereas the short-range process is probably a very low-level visual effect, the long-range process appears to involve more complex inferential operations.

PATTERN RECOGNITION

Up to this point, we have been discussing how our perceptual system uses different kinds of visual information to construct a coherent picture of the world. Another important job that the perceptual system must perform is the recognition of familiar patterns in the world. In other words, we have to be able to identify what we see. This process is called pattern recognition.

Because pattern recognition seems to be a skill fundamental to almost every other cognitive process a person might engage in, it has been the focus of a tremendous amount of basic research. Many experiments have examined performance in a task called "visual search," which requires observers to decide whether a predetermined target item is present in a visual display. Earlier in this chapter, we talked about how grouping of display elements can make a target letter "F" more or less easy to find (Figure 6.9). This is an example of a visual search task. Knowing how people perform this task is critical to the good design of certain displays and task environments.

The idea that objects in a visual scene can be taken apart in terms of their basic "features" is again an important concept in understanding pattern recognition (Treisman, 1986). Visual search based on a single primitive feature such as color or shape can be performed very rapidly and accurately: it is very easy to find the one green object in a display of red objects, no matter how many red objects there might be. However, if a target is defined by a combination of more than one primitive feature, and those features are shared by other objects in the display, the time to determine whether the target is present increases with the number of nontarget objects in the array (the array size). Whereas search for a single primitive feature is rapid and effortless, search for conjunctions of features requires attention and is effortful.
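These two patterns are often summarized as idealized response-time (RT) functions of array size N; the equations below are a standard textbook idealization rather than exact empirical values:

\[ \text{feature search: } RT \approx a, \qquad \text{conjunction search: } RT \approx a + bN, \]

where a is a baseline time and the slope b is positive, typically being roughly twice as large on target-absent trials as on target-present trials, as expected if items are checked one at a time and search stops as soon as the target is found.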

This basic fact of pattern recognition in visual search has implications for the design of computer interfaces. For menu navigation, highlighting subsets of options by presenting them in a distinct color should shorten the time for users to search the display. This has been confirmed in several studies (Fisher, Coury, Tengs, & Duffy, 1989; Fisher & Tan, 1989). Users are faster when a target is in the highlighted set and slower when it is not or when no highlighting is used. Moreover, even if the target is not always guaranteed to be in the highlighted set, the benefit of highlighting is greater when the probability is high that the target will be in the highlighted set than when the probability is low.

Another characteristic of primitive features that is important for display design is that of integral and separable dimensions (Garner, 1974). Whereas a feature is a primitive characteristic of an object, a dimension is the set of all possible features of a certain type that an object might have. For example, one feature of an object might be that it is red. The dimension that we might be interested in is the object's color, whether red, green, or blue.

Dimensions are said to be integral if it is not possible to specify a value on one dimension without also specifying a value on the other. For example, the hue and brightness of a colored object are integral dimensions. If values on the dimensions can be specified independently of one another, the dimensions are called separable. For example, color and form are separable dimensions. You can pay attention to each of two separable dimensions independently, but you cannot do so for two integral dimensions. Thus, if a judgment about an object requires information from only one of its dimensions, that judgment can be made faster and more accurately if the object's dimensions are separable. On the other hand, if a judgment requires information from all of an object's dimensions, it will be easier if the dimensions are integral. Another way to think about integrality is in terms of correlations between object features: if a set of objects has correlated dimensions, a specific value on one dimension always occurs together with a specific value on the other.

Another kind of dimension is called a configural dimension. Configural dimensions interact in such a way that new emergent features are created (Pomerantz, 1981). Emergent features can either facilitate or interfere with pattern recognition, as Figure 6.23 shows. The figure shows an array for a visual search task in which the target to be detected is the line slanting downward (in the lower right-hand corner). For both the top and bottom panels, exactly the same set of contextual features is added to each object in the array. Because the same features are added to every object, these features alone provide no help in recognizing the target. However, when we examine the final configuration of the objects after the new features are added, we see that in the top panel the contextual features have enhanced the differences between the objects, and the time to recognize the target (the response time, RT) is greatly reduced. In the bottom panel, the contextual features have obscured the target, and RT is greatly increased.

FIGURE 6.23 The additional configural context that facilitates (top row) and impedes (bottom row) performance.

Up to this point, our discussion of pattern recognition has focused on the analysis of elementary features of sensory input. This analysis alone does not determine what we perceive. Expectancies induced by the context of an object also affect what we perceive. Figure 6.24 shows a famous example of the influence of context. This figure shows two words, “CAT” and “THE.” We easily perceive the letter in the middle of CAT to be an A and the letter in the middle of the word THE to be an H. However, the character that appears in the middle of each word is ambiguous: it is neither an A nor an H, and it is in fact identical for both words. The context provided by the surrounding letters determines whether we recognize an A or an H.

FIGURE 6.24 The effect of context on perception. The same symbol is seen as an H in THE and as an A in CAT.

Similar expectancy effects occur for objects in the world. Biederman, Glass, and Stacy (1973) presented organized and jumbled pictures and had subjects search for specific objects within these pictures. They presumed that the jumbled pictures would not allow viewers to use their expectations to aid in searching for the object. Consistent with this hypothesis, search times for coherent scenes were much faster than those for jumbled scenes. Biederman et al. also examined the influence of probable and improbable objects within the coherent and jumbled scenes. They found that for both kinds of pictures it was much easier to determine that an improbable object was not present than to determine that a probable object was not present. This finding indicates that observers develop expectations about the objects that are possible in a scene with a particular theme. What we perceive thus is influenced by our expectancies as well as by the information provided by the senses.

The influence of expectations is critical when objects fall into the peripheral visual field (Biederman et al., 1981). It is difficult to detect an unexpected object in the periphery, particularly when it is small. The rate at which targets are missed in visual search increases to 70% as the location of an unexpected object shifts from the fovea to 4° in the periphery. The miss rate for a peripheral object is reduced approximately by half when the object is expected.

SUMMARY

Perception involves considerably more than just passive transmission of information from the sensory receptors. The perceived environment is constructed around cues provided by many sensory sources. These cues allow both 2D and 3D organization of visual, auditory, and tactile information as well as pattern recognition. The cues consist of encoded relations among stimulus items, such as orientation, depth, and context.

Because perception is constructed, misperceptions can occur if cues are false or misleading, or if the display is inconsistent with what is expected. It is important, therefore, to display information in ways that minimize perceptual ambiguities and conform to the expectancies of the observer. In Chapters 5 and 6, we have concentrated on visual perception because of its importance to human factors and the large amount of research conducted on the visual sense. The next chapter discusses auditory perception and, to a lesser extent, the senses of taste, smell, and touch.

RECOMMENDED READINGS

Cutting, J. E. (1986). Perception with an Eye to Motion. Cambridge, MA: MIT Press.

Hershenson, M. (1999). Visual Space Perception. Cambridge, MA: MIT Press.

Palmer, S. E. (1999). Vision Science: Photons to Phenomenology. Cambridge, MA: MIT Press.

Palmer, S. E. (2003). Visual perception of objects. In A. F. Healy and R. W. Proctor (Eds.), Experimental Psychology (pp. 179–211). Volume 4 in I. B. Weiner (Editor-in-Chief) Handbook of Psychology. Hoboken, NJ: Wiley.

Snowden, R., Thompson, P., and Troscianko, T. (2012). Basic Vision: An Introduction to Visual Perception (revised edn.). Oxford, UK: Oxford University Press.

Wagemans, J. (Ed.) (2015). The Oxford Handbook of Perceptual Organization. Oxford, UK: Oxford University Press.

Wagemans, J., Elder, J. H., Kubovy, M., Palmer, S. E., Peterson, M. A., Singh, M., and von der Heydt, R. (2012). A century of Gestalt psychology in visual perception: I. Perceptual grouping and figure–ground organization. Psychological Bulletin, 138, 1172–1217.
