CHAPTER  7

Feed Forward Versus Feed Backward

The typical reader of this book is a human being who hopes to learn something by reading it. Human readers are seldom hungry or thirsty when they read and learn. Eating and drinking during reading is more likely to distract than to fix human attention on a book. The traditional experiment on rat or pigeon learning submits the learners to severe deprivation to make them hungry or thirsty and rewards their primary needs with food or water. Meanwhile, human beings learn a great deal every day without any deprivation of food or water or any other primary need and without immediate, or even prompt, primary reward. Indeed, human beings probably learn better when they are comfortable and well fed. Rats and pigeons, dogs and cats, not to mention cows and horses, and pigs and chickens, learn all sorts of things without being hungry and without prompt primary reward. What could the traditional theory of learning by contingent reinforcement possibly have to do with the bulk of human and nonhuman animal learning?

The traditional answer to this question is the claim that there are secondary or conditioned rewards, such as money and praise, that serve as surrogates for primary rewards, such as food and drink, as a result of past association. Without experimental evidence to support this claim, the traditional theory is irrelevant to most of the learning of live human and nonhuman beings. The purpose of this chapter is to take a hard look at the experimental evidence for this claim.

OPERATIONAL DEFINITION

As chapter 5 showed, decades of debates and experiments on the question of latent learning failed to produce an operational definition of S*, the foundation concept of the law of effect. In the course of the dispute, it became clear that whenever animals learned, reinforcement theorists could find an S* that reinforced the learning—but only post hoc, after the fact. At the same time, whenever reinforcement theorists found an S* that could reinforce learning, cognitive theorists easily reinterpreted it as an incentive that governed performance rather than learning. Both contingent reinforcement and cognitive expectancy share the same post hoc position. Perhaps the following example explains why post hoc explanations are unacceptable in modern science.

Suppose that you met a distinguished psychologist who specialized in the theory of coin behavior. Suppose that the coin psychologist invited you to toss a coin and then predicted that the coin would land with its head up or its tail up, unless it happened to balance on its edge. Every time the coin landed head up, the coin psychologist could explain that this meant that the coin wanted to land head up more than tail up. Every time the coin landed tail up, the coin psychologist could explain that this meant that the coin wanted to land tail up more than head up. It might take many tries before the coin balanced on its edge, but when it did the coin psychologist could explain that this meant that the coin wanted both head and tail up equally. Clearly, you could not win any bets with the help of coin psychology, because the theory never tells you anything in advance.

You might argue that everyone knows that coins cannot have wants. Suppose we humored you on this (you really have no proof; it is just an assumption) and asked you how you know that food-deprived rats want to eat food. If you answered that you know that they want to eat food because they eat it, and that when they stop eating food that means that they stopped wanting to eat, you would be agreeing with traditional cognitive psychology, but your theory would be exactly the same as the theory of coin psychology. You would be saying that the animal must have wanted to eat, because it ate. This is why both hedonistic reinforcement theories and hedonistic cognitive theories must find an operational definition of pleasure and pain. Otherwise, they tell us nothing in advance, like the coin psychologist.

The notion of reward and punishment as a principle of learning appears throughout the history of Western civilization, since biblical times at least. The law of effect is so ancient and so deeply ingrained in Western culture that many readers will be surprised to find that it has ever been questioned. Progress in modern science largely depends on demanding evidence in place of faith in tradition.

The first formal expression of the modern law of effect appears in the writings of Thorndike.

The Law of Effect is that: Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (1911, p. 244)

This is the commonsense view that pleasing consequences strengthen responses and displeasing consequences weaken responses. More recent statements of this principle differ chiefly in substituting other terms (reward, reinforcement, drive-reduction, homeostasis) for satisfaction and annoyance.

There have been many attempts to find an operational definition of satisfaction and annoyance. Thorndike tried the following: “By satisfying state of affairs is meant one which the animal does nothing to avoid, often doing such things as to attain and preserve it. By a discomforting or annoying state of affairs is meant one which the animal commonly avoids and abandons” (1911, p. 245).

Other definitions emphasized the tension-reducing character of reinforcement. They pointed out that an animal in a problem box or maze is aroused, restless, hypertensive, until food appears. After ingesting food, an animal becomes relaxed and quiescent. Many writers have described the end of goal activity as reaching a state of “complacency” similar to that which Cannon (1932) called “homeostasis.” T. L. McCulloch (1939) proposed that reward may be effective because it causes the disappearance of restless, excited behavior. C. L. Hull’s principle of drive-reduction is similar.

So far all attempts at operational definition have failed. The following definition written by Morse and Kelleher (1977) is typical. Note that this definition appeared in the Handbook of Operant Behavior, considered one of the most authoritative reference books in the Skinnerian (also called behaviorist) tradition:

The increased occurrence of responses similar to one that immediately preceded some event identifies that event as a reinforcer … the decreased occurrence of responses similar to one that immediately preceded some event identifies that event as a punisher … There is no concept that predicts reliably when events will be reinforcers or punishers; the defining characteristics of reinforcers and punishers are how they change behavior. Events that increase or decrease the subsequent occurrence of one response may not modify other responses in the same way. (p. 176)

With this statement, Morse and Kelleher seem to concede (a) that it is impossible to say in advance which events will reinforce or punish which responses, and (b) worse still, that the reinforcing effect of any particular event on any particular response may or may not predict the effect of that same event on a different response. In other words, all that the language of reinforcement theory has to offer is names for things that we know already, as in coin psychology.
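
To see how little this definition predicts, it may help to restate it as a procedure. The following sketch is not from the handbook; it simply expresses the Morse and Kelleher criterion as a small classification routine, with illustrative names and numbers:

# A minimal sketch of the Morse and Kelleher criterion as a procedure.
# The label can only be applied after the change in responding has already
# been observed, which is exactly the circularity discussed in the text.
def classify_event(rate_before, rate_after):
    """Label an event by the change in response rate that followed it."""
    if rate_after > rate_before:
        return "reinforcer"
    if rate_after < rate_before:
        return "punisher"
    return "neither"

# The routine issues labels only after the fact; it never says in advance
# which events will reinforce or punish which responses.
print(classify_event(rate_before=10, rate_after=25))  # reinforcer
print(classify_event(rate_before=10, rate_after=3))   # punisher

Like coin psychology, the procedure only names outcomes that we already know.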

The basic lack of an operational definition of primary reinforcement is sometimes obscured by describing an S* as a “biologically significant stimulus.” This seems obviously true in cases where hungry animals receive pellets of food, but what about the latent learning experiments in which rats seemed to learn the path to the goal box for the reinforcement of being taken out of the maze?

There are other interesting cases. Kish (1955), for example, reported the first of a series of experiments demonstrating that mice pressed a lever more when the only contingent stimulus was a brief lighting of a lamp. When the response failed to change the light, extinction set in. Kish and others (J. F. Hall, 1976, pp. 238–239; Kimble, 1961, pp. 254–256; Kish, 1966; Osborne & Shelby, 1975) called this stimulus change reinforcement because either light on or light off increased lever-pressing. All this tells us, however, is that light change must have been a reinforcer because lever-pressing increased. Without some independent way of specifying in advance which effects will be reinforcing and which will not, this circularity must continue.

Reports that direct brain stimulation reinforces lever-pressing and maze-running are supposed to prove that there are specific pleasure centers in the brain. This is the same sort of post hoc reasoning. How could anyone have known that these consequences reinforce before they saw the experimental results? And, what good does it do if someone tells us that these are cases of reinforcement after we already know that the rats learned? This chapter considers how the concept of secondary or conditioned reward can offer an experimental way out of this problem. The next section of this chapter looks at the operational definition of secondary reward.

DEFINING SECONDARY REWARD

Even though the operational definition of primary reinforcement may always elude us, secondary or conditioned reward may be easier to define in terms of experimental operations. The following typical definition appears in Hulse, Egeth, and Deese (1980), a highly respected review of this field:

A conditioned reinforcer is a neutral stimulus that acquires the functional properties of a primary reinforcer by being paired with a primary reinforcer or with another conditioned reinforcer … conditioned reinforcers are stimuli that acquire the power to reinforce by being paired with other stimuli that already possess the power to reinforce (p. 52). (Note that Hulse et al. uses the term neutral stimulus where this book uses the term arbitrary stimulus.)

Actual experiments can fulfill this definition without defining primary reward in advance. All we need is agreement that a particular stimulus has the power of primary reinforcement. This has always been easy. Everyone who believes in primary reinforcement agrees that food is a primary reward for lever-pressing. Given that agreement, we can pair an arbitrary stimulus, Sa, with food and then test whether this Sa then functions as a secondary reward, Sr. Now we have an operational definition of secondary reinforcement. With this definition we can find out whether there is indeed a class of phenomena that fit this definition or whether, instead, the definition specifies a scientifically empty class, like unicorns.

Before reviewing the evidence, consider why the Hulse et al. definition is both typical of all published definitions of secondary reward and also why it is the only possible definition of secondary reward. To begin with, a primary reward must be a stimulus. This is true because the only way that information enters a nervous system is in the form of stimuli or inputs. Consequently, the only way that a nervous system can tell that it has been reinforced is by receiving a reinforcing stimulus.

The fact that the actual reinforcing stimulus often originates inside the animal may obscure this slightly. Dryness of the throat and salinity of the blood are signals of thirst. Levels of sugar in the blood and contractions of the stomach are signals of hunger. The basis of primary reinforcement must be sensory, even if the sensors are monitoring conditions inside the animal.

To serve as a surrogate for primary reward, a secondary reward must also be a stimulus. If the secondary reward acquires its surrogate power by conditioning, it must be by S-S* conditioning—that is, by Pavlovian higher order conditioning. The trouble with this source of secondary reward is the evidence reviewed in chapter 4 that Pavlovian higher order conditioning is difficult to establish in experiments and also weak and easily extinguished. This chapter describes the parallel problem with experimental attempts to fulfill the operational definition of secondary reward.

Magazine Clicks

The following experiment by Bugelski (1938) is often cited as evidence for secondary reward in the Skinner box. Bugelski first rewarded rats with food pellets for lever-pressing. Then he divided the animals into two groups for extinction. For both groups the food magazine, the device that holds the pellets and drops them into the food dish, was empty during extinction. Bugelski extinguished the experimental group with the lever still wired to the food magazine so that the magazine continued to operate making its usual audible clicks without delivering any pellets. Bugelski extinguished the control group with the lever disconnected from the magazine so that they received neither clicks nor pellets when they pressed the lever. The experimental group made many more responses under extinction than the control group. Since the test was conducted without any primary reward, the clicks alone must have maintained performance. Is this evidence for secondary reward?

Although some textbooks continued to cite Bugelski (1938) as evidence for secondary reward right up to the 1990s (e.g., Schwartz & Reisberg, 1991, p. 154), most modern critics reject this evidence for the following reason. The Bugelski experiment only demonstrates transfer from one stimulus situation to another. All things being equal, transfer from a training situation to a testing situation depends on the similarity between training and testing conditions. College students, for example, who memorize a list of words in one room recall more of the words on the list when tested later in the same room than when tested in a different room. Similarly, rats conditioned to press a lever in one Skinner box in one room make more responses in an extinction test carried out in the same box and the same room than in a test carried out in a different box in a different room. If there is a special light in the box and it is lighted during training, then there are more responses in an extinction test carried out with the light on than with the light off (Thomas & Morrison, 1994). In the Bugelski experiment, the clicks are part of the stimulus situation during training. Therefore, we must expect more responses during extinction with clicks than without clicks based on stimulus transfer, whether or not there is any such thing as secondary reward.

The Bugelski (1938) experiment fails to demonstrate secondary reward for an even more significant reason. The clicks in this experiment only maintain responding in extinction, but secondary rewards must reinforce new learning. Consequently, the only appropriate test is one in which the secondary reward acquired in one learning situation reinforces new responses in a new situation. Once again, the discipline of operational definition reveals critical implications of a scientific notion.

New Learning

Saltzman (1949) reported the first experiment that fulfilled the defining operations of secondary reward. Saltzman’s apparatus consisted of two separate parts, a straight alley and a single-unit U-maze as shown in Fig. 7.1. The purpose of the straight alley was to establish a distinctive goal box as a secondary reward by associating it with food in Phase I. The purpose of the U-maze was to measure the secondary reward value of the distinctive goal box as the only incentive for new learning in Phase II.

In Phase I, all rats found food in one of two distinctive goal boxes at the end of the straight alley. One goal box was painted black and the rats had to climb over a low hurdle to enter it. The other goal box was painted white and the rats had to climb down to a lower level to enter it. Two groups, the continuous reward group and the control group, received exactly the same treatment during Phase I. Over a 5-day period they found food in the same distinctive goal box, either the black one or the white one, on every trial for 25 trials. A third group, the partial reward group, also received 25 rewarded trials during Phase I, but, in addition, they received 14 nonrewarded trials in the same distinctive goal box. These nonrewarded trials appeared haphazardly in the series but were always preceded and followed by a rewarded trial. A fourth group, the differential reward group, received the same treatment as the partial reward group except that on all rewarded trials the food was in one of the distinctive goal boxes, either the black one or the white one, and on all nonrewarded trials they were detained briefly in the other distinctive box.

In Phase II, all of the rats received 15 test trials in the U-maze. All groups except the control group found the black goal box at the end of one arm of the U-maze and the white goal box at the end of the other. This was counterbalanced so that the goal box associated with reward in Phase I was at the end of the left arm for half of the animals and at the end of the right arm for the other half. The control group found new, identical gray goal boxes at the ends of both arms of the U-maze. The animals in the experimental groups never received any food during Phase II. Their only incentive for choosing one arm or the other of the U-maze was to get to the goal box that had contained food during Phase I. The control group found food in one of the new gray boxes in one arm or the other of the U-maze; for half of the animals this was the left arm and for the other half this was the right arm.

FIG. 7.1. Typical apparatus used in demonstrations of secondary reward. Rats run first to distinctive goal boxes in the runway, next to the same goal boxes in the U-maze. S indicates start boxes; G indicates goal boxes. Copyright © 1997 by R. Allen Gardner.

The U-shape of the maze prevented the animals from seeing either of the distinctive goal boxes until after they made their choice. Experimenters frequently use U-mazes for this purpose. Gates were lowered behind the animals to prevent retracing.

The Saltzman experiment fulfills the operational definition of secondary reward. During Phase I, an Sa was associated with an S*. During Phase II, the animals learned a new response in a new apparatus without any primary reward, so the only possible source of reinforcement was the Sa.

Results were as follows. In the 15 trials of Phase II, the control group was rewarded with food in fresh new goal boxes and chose the rewarded side on an average of 10.0 trials. The continuous, partial, and differential reward groups chose the side leading to the goal box in which they had received reward during Phase I on an average of 8.3, 9.0, and 10.7 trials, respectively. Since there were 15 trials in all, and since the U-maze is a two-choice situation, the average level of choice expected by chance would be one half of 15 or 7.5. Other experimenters have replicated the general results of Saltzman’s (1949) experiment, although some have failed to replicate them. In general, however, a good argument for a weak positive result can be sustained from the existing evidence (see J. F. Hall, 1976, pp. 246–254, for an extensive review).
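
A back-of-the-envelope calculation, which is not part of Saltzman’s report, may help put these averages in perspective. Treating an individual animal’s 15 choices as 15 independent flips of a fair coin, the chance expectation and its spread are

\[
E = np = 15 \times \tfrac{1}{2} = 7.5,
\qquad
\sigma = \sqrt{np(1-p)} = \sqrt{15 \times 0.5 \times 0.5} \approx 1.9 \text{ trials}.
\]

On that rough standard, the group means of 8.3 to 10.7 correct choices lie less than two such units above chance for a single animal, which is one way to see why reviewers describe the effect as positive but weak.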

You might at first suppose that, if the experimental groups were learning the correct path in the U-maze because they were secondarily rewarded by the distinctive goal box, they could easily have improved further and raised their performance higher above chance, if only Saltzman had given them more trials. The reason that Saltzman and others ended the experimental tests when they did, however, was that the secondarily rewarded rats were showing signs of extinction. Performance was declining rather than improving. This is a problem that cannot be avoided because tests of the power of secondary reward to reinforce new learning must be carried out without any primary reward.

The principle of secondary reward bears a heavy burden in response reinforcement theories. It is supposed to account for the bulk of human learning as well as a huge amount of learning by nonhuman beings, which also occurs without any primary reward. Secondary reward is supposed to account for the value of money and prestige, love of home and love of country. Secondary rewards, such as money, are supposed to be more powerful and more resistant to extinction than food and water. How can secondary reward that is too weak to support robust amounts of behavior in the Saltzman type of experiment carry the immense theoretical load assigned to it in reinforcement theories? Evidence for secondary reinforcement must be strong precisely under extinction conditions. Chapter 4 already considered a similar requirement and a similar failure of evidence for higher-order conditioning.

Partial Reward

Zimmerman (1957, 1959) proposed that the problem was one of weak resistance to extinction after 100% reinforcement. He reasoned that other demonstrations of secondary reward found weak evidence because they failed to use Skinnerian schedules of reinforcement. Since schedules of partial primary reward generate much higher rates of responding and much greater resistance to extinction, they should generate more robust effects of secondary reward.

Zimmerman’s (1957) demonstration used a procedure like the one in Coate (1956). Chapter 6 describes how Coate designed a special Skinner box to measure lever-pressing and food-finding separately. Coate’s apparatus delivered food pellets into a dish that was outside of the box. Rats reached the food by poking their heads through a hole in the wall of the box. There was a curtain over the hole so that the rats could not see the food until they poked their heads through the curtain.

Using a similar apparatus, Zimmerman first trained rats to find water by putting their heads through an opening in the box when they heard a buzzer. There was no lever in the box during this phase of the experiment. To discourage the animals from waiting near the opening, Zimmerman sounded the buzzer only when they were in some other part of the box. Soon the rats rushed to the opening as soon as they heard the buzzer. Next, he set the mechanism so that it delivered water only half the time when the buzzer sounded, but haphazardly so that the rats could not tell in advance when they would find water. The animals continued to rush to the water opening every time they heard the buzzer. Gradually, Zimmerman increased the ratio to 10 buzzes for each water reward. The animals continued to rush to the watering place every time they heard the buzzer.

In Phase II of this demonstration, Zimmerman introduced a lever into the box for the first time, and wired the lever to the mechanism in such a way that each lever-press sounded the buzzer although the rat only found water once on the average for every 10 buzzes. The animals soon began to press the lever and then go to the opening when they heard the buzzer. Gradually, Zimmerman increased the requirement so that the rats had to press the lever more and more times before they heard the buzzer. As in the case of the buzzer, he required a variable and random number of presses for each buzz. Eventually, the animals were pressing the lever an average of 10 times for each time that the buzzer sounded. Water followed the buzzer after 1 out of 10 buzzes, so the rats were pressing the lever about 100 times for every tiny drink that they got. They also demonstrated robust resistance to extinction. Reasoning that the buzzer was secondarily reinforcing lever-pressing, Zimmerman concluded that partial primary reinforcement could generate robust amounts of secondary reinforcement.
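
The figure of roughly 100 presses per drink follows directly from multiplying the two schedules; because both schedules were irregular, this is only the expected cost of a single drink:

\[
10 \ \frac{\text{presses}}{\text{buzz}} \times 10 \ \frac{\text{buzzes}}{\text{drink}} = 100 \ \frac{\text{presses}}{\text{drink}}.
\]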

That delivering rewards for a fraction of the responses rather than for every response generates higher rates of responding and greater resistance to extinction is probably the best-documented phenomenon in all of experimental psychology. But this is the opposite of what should happen if rewards such as food reinforce responses such as lever-pressing. That is, if food reinforces lever-pressing, how is it possible for 10% reinforcement to be more reinforcing than 100% reinforcement? If the buzzer acquires its secondary reinforcing power by association with food, then how could a buzzer that was paired with food only 10% of the time acquire more power than Saltzman’s distinctive goal boxes that were paired with food 100% of the time? This demonstration that partial reward is more reinforcing than 100% reward is actually a further embarrassment to reinforcement theory.

This is a good place to point out the advantages of referring to the food, water, and other commodities dispensed in these experiments as rewards rather than as reinforcements. The statement that 10% reinforcement is more reinforcing than 100% reinforcement is a contradiction in terms. This usage “does violence to language” because it uses the same word in two contradictory senses in the same argument. Using words in this way can only lead to confusion. Distinguishing between rewards and their theoretical reinforcing effect avoids the problem. In this book, commodities such as food and water that the experimenter delivers are rewards, and the effect of rewards on learning is reinforcement.

It is illogical to say that fewer reinforcements are more reinforcing than more reinforcements, but it is logically possible for fewer rewards to be more reinforcing than more rewards. The argument that reward refers to the intentions of the trainer and reinforcement refers to what happens to the trainee favors the terminology used in this book. By separating rewards—what the experimenter does that is contingent on what the animal does—from reinforcement—the theoretical effect of what the experimenter does—we can evaluate the theory objectively.

Zimmerman’s (1959) demonstration carried his argument a step further. The apparatus in the second demonstration appears in Fig. 7.2. It consisted of a 9-inch-square starting box with a 4-inch-wide door that opened onto a 40-inch-long alley which ended in another door that opened into a 4-inch by 12-inch goal box. In Phase I, the rats first found food in the goal box after running from the start box through the alley to reach the goal box. Next, Zimmerman kept the door leading from the start box to the alley closed until he sounded a buzzer. At that point, the start box door opened and the rats could run through the alley to the goal box. At first, they found food in the goal box on every trial, but later Zimmerman reduced the schedule to the point where they found food less than one out of every four times that they got to the goal box.

FIG. 7.2. Apparatus used by Zimmerman (1959). Copyright © 1997 by R. Allen Gardner.

In Phase II, Zimmerman introduced a lever into the start box and never again put any food in the goal box. Thus, the animals had to learn to press the lever for secondary reward without any primary reward. Now Zimmerman arranged the contingencies so that when a rat pressed the lever, the buzzer sounded and the start box door opened. The rats quickly learned to press the lever and then run out into the alley as before. Gradually, Zimmerman raised the ratio of responses to buzzes to 20 to 1 (FR 20).

Zimmerman found that the animals pressed the lever at high rates that are quite comparable to rates found in similar experiments when the apparatus delivers primary food rewards. But they never received any food reward for pressing the lever. Indeed, they only received the buzzer as a secondary reward for 1 out of every 20 times that they pressed the lever.

Zimmerman (1959) included two important control groups. Control 1 received the same treatment as the experimental group during both Phase I and Phase II, with one exception. The buzzer never sounded during Phase II. Control 2 received the same treatment as the experimental group during Phase II, including the buzzer, but they received no treatment at all during Phase I. That is, they never found any food in the goal box at the end of the alley.

These two control groups served an essential function. Rats and many other animals tend to explore new openings. A very effective type of mousetrap consists of nothing more than an opening into a mouse-sized tunnel; mice that enter the tunnel are trapped there, with the attraction of the open tunnel as the only bait. The opportunity to run through the door and explore the runway could have been rewarding all by itself. Because the control groups also ran out into the alley, they controlled for the effect of opening a door and letting rats run into an alley. The control groups did press the lever, but much less than the experimental group. Even if a run in the alley was a primary reward, the experimental group still pressed the lever much more than control rats that only received that primary reward.

Zimmerman interpreted his 1957 experiment as a demonstration of the power of partial secondary reward. Zimmerman (1959) reported another result that alters this interpretation, however. The response of running in the alley extinguished during acquisition of the response of pressing the lever. At first, the animals ran through the alley all the way to the empty goal box. Gradually, they began to slow down as they approached the goal box to the point where they stopped before they entered it. They stopped sooner and sooner until they ran out only a short distance into the alley and waited for Zimmerman to pick them up and return them to the start box. Thus, at the same time that the goal box and the later segments of the alley were losing their secondary reward value, the secondary reward value of the buzzer was effective enough to increase the rate of lever pressing. In fact, lever-pressing only declined after the rats stopped running out into the alley at the sound of the buzzer.

Now, if the power of secondary reward depends on S-S* contiguity, then the goal box and the last segments of the alley should have acquired more secondary reward than the buzzer. Food, when present, was contiguous with the goal box in space and time. The buzzer sounded before the door to the alley opened, long before the food in time, and far away from the food in space. Yet, at the same time that the goal box was losing its power to reinforce, the buzzer was gaining power, at least according to the theory that attributes reinforcing power to these stimuli. At this point it may be helpful for readers to review the section on Latent Extinction in chapter 6 that analyzes S-R units in a T-maze.

Interstimulus Interval

Schoenfeld, Antonitis, and Bersh (1950) attempted to demonstrate secondary reward in the following way. They delivered food one pellet at a time in a Skinner box with the lever removed. To be sure that a 1-sec light stimulus would be maximally associated with the food, they watched each rat take each pellet and only lighted the light at the moment when a rat put the food in its mouth. Next, they replaced the lever and lighted the light every time that a rat pressed the lever, but never delivered any more food to the rats in the box. A control group of rats had the same conditions except that they never experienced any pairing of light and food. Schoenfeld et al. found no difference between the experimental and control groups.

After this failure to demonstrate secondary reward, Schoenfeld et al. remembered Pavlov’s much replicated discovery that simultaneous presentation of CS and UCS yields very weak classical conditioning or none at all. If secondary rewards acquire their reinforcing power by classical conditioning, then simultaneous presentation of light and food was a mistake. All that trouble to light the light at just the moment that the rats started eating only made sure that the light appeared at an unfavorable time for classical conditioning. This set the stage for the next experiment.

Bersh (1951) systematically varied the ISI between the lighting of the light and the delivery of the food pellets in a Skinner box. In Phase I, Bersh paired light with food with no lever present. Rather than waiting and watching to see what the rats did, Bersh set the mechanism to deliver light and food automatically. The light always stayed lighted for 2 seconds after the apparatus delivered each pellet into the food dish, regardless of what the rat did. Bersh divided the rats into six groups that differed in the interval between the onset of the light and the delivery of the food. The intervals were 0.0 seconds, 0.5 seconds, 1.0 seconds, 2.0 seconds, 4.0 seconds, and 10.0 seconds. In Phase II, the lever was present but lever-pressing only lighted the light; no food was delivered. Figure 7.3 shows the results of Bersh’s experiment. Just as in so many other classical conditioning experiments, the most favorable ISI was half a second.

Bersh (1951) concluded that the light had acquired secondary reinforcing power, and that the most favorable ISI for the pairing of primary and secondary rewarding stimuli was the same as the most favorable interval in classical conditioning. From the point of view of reinforcement theory, the trouble with this finding is that the light was supposed to acquire its secondary reinforcement power by S-S* contiguity. But the shorter the ISI, the closer the contiguity. Hence the findings (a) that simultaneous presentation of light and food is unfavorable, and (b) that one half second is the most favorable interval, directly contradict the notion that S-S* contiguity is the mechanism of secondary reinforcement. Chapter 4 discusses how these same findings contradict the notion that S-S* contiguity is the mechanism of Pavlovian conditioning. Once again, the time interval strongly implicates some response to the Sa as the basis of the experimental results—something that the animal does between the onset of the Sa and the onset of the S*.

FIG. 7.3. Secondary reward during extinction as a function of ISI during training (after Bersh, 1951). Lever-pressing in the first 10 minutes of extinction depends on the light-food interval in acquisition: light then food without lever during acquisition; lever then light without food during testing. Copyright © 1997 by R. Allen Gardner.

Summary

Secondary reward (Sr) plays a vital role in theories based on response contingent reinforcement. It is supposed to link laboratory studies of hungry and thirsty rats and pigeons rewarded with food and water to the vast amount of learning by comfortable, well-fed human and nonhuman animals in their everyday lives outside of the laboratory. It is also supposed to account for the incentive value of arbitrary symbols such as money. The operational definition of secondary reward avoids the central weakness of all S-R-S* theories: their inability to provide an operational definition of primary reinforcement that is independent of its effects on conditioning.

The best existing research shows that the amount of Sr that can be demonstrated experimentally is rather weak. The phenomena that appear in the laboratory seem entirely too weak to bear the theoretical burden placed on Sr in response contingent reinforcement theories. The effects that appear under laboratory conditions seem very different from the effects specified by the theories.

Demonstrations of an Sr effect have succeeded only when the interval between the Sr and the S* was long enough for the subjects to make some anticipatory response to the S*. In chapter 6, when experimenters such as Coate (1956) and Zimmerman (1959) observed and reported responses to the Sr, they found that the effect involves a distinctive response that the subject makes to the alleged Sr.

S-R CONTIGUITY

In reinforcement theories, an Sr acquires secondary reinforcing power by classical conditioning. Chapter 4 reviews the evidence against an S-S or an S-S* mechanism of classical conditioning. The favorable ISI in classical conditioning must be long enough to permit the subject to make an anticipatory response to the S*. The length and asymmetry of the ISI support an S-R contiguity mechanism of classical conditioning. According to S-R contiguity, if a human or nonhuman animal can be induced to make some response shortly after an Sa (usually within one half second), then that response will be conditioned to the Sa. This chapter considers the possibility that instrumental conditioning also depends on the mechanism of S-R contiguity.

S-R Chains in a Skinner Box

Figure 7.4 analyzes the series of stimuli and responses that occur in a Skinner box in detail. This diagram and similar diagrams in chapters 4 and 6 rely on the fact that every response of a human or nonhuman animal has some stimulus effect. When people walk, receptors in all of their joints send back new information about the movement and position of the joints, the resistance and angle of the ground, balance with respect to gravity; their eyes send back information about the passing visual scene, and so on. In the same way, as a rat runs through a maze, each movement changes the stimulus situation. Figure 7.4 describes some details of the series of movements and movement-produced stimuli in a Skinner box, but it is still a simplified description.

FIG. 7.4. The chain of stimuli and responses in the Skinner box. Copyright © 1997 by R. Allen Gardner.

Figure 7.4 represents lever-pressing in the Skinner box as a chain of stimuli and responses. The food gets into the gut after it is swallowed. It is swallowed after it reaches the proper mushy consistency in the mouth. It becomes mush after it is chewed. It is chewed after it is put into the mouth. It is put into the mouth after it is in the paws. It gets into the paws after it is grabbed from the dish. It is grabbed after it is found in the dish. It is found in the dish after the rat inspects the dish. The rat inspects the dish after hearing the click. The rat hears the click after pressing the lever. The rat presses the lever after seeing that the lever is near. The rat sees that the lever is near after approaching the lever. The rat approaches the lever after seeing that the lever is far. The rat sees that the lever is far after orienting its head and eyes around the box. The rat orients after the experimenter places it in the box.
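
The same chain can be listed in forward order, from the start of a trial to the final S*. The following sketch is not part of Fig. 7.4; it merely restates the figure as a small data structure, with illustrative labels, so that the order of events is easy to scan:

# Each entry pairs a stimulus with the response it evokes; the stimulus
# produced by that response is simply the next entry in the list.
skinner_box_chain = [
    ("placed in box",       "orient head and eyes"),
    ("lever seen far away", "approach lever"),
    ("lever seen near",     "press lever"),
    ("magazine click",      "inspect food dish"),
    ("pellet seen in dish", "grab pellet"),
    ("pellet felt in paws", "put pellet in mouth"),
    ("food felt in mouth",  "chew"),
    ("mush felt in mouth",  "swallow"),
    ("food felt in gut",    "end of chain (S*)"),
]

def run_chain(chain):
    """Walk the chain forward: each stimulus evokes the next response."""
    for stimulus, response in chain:
        print(f"{stimulus:>20}  ->  {response}")

run_chain(skinner_box_chain)

The question taken up in the sections that follow is whether each stimulus in this list acts backward, as an Sr that reinforces the response that produced it, or forward, as an Sd that evokes the next response.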

Preliminary Training

Most textbooks describe instrumental conditioning in the Skinner box as though the experimenter puts a hungry rat in the box and then waits for the first response, which causes the apparatus to deliver food or water reward. This procedure may work, but virtually all experimenters divide the procedure into steps. First, the experimenter adapts a rat in its home cage to eat the kind of food that the apparatus will deliver during conditioning. Next, the experimenter adapts the rat to eat small pellets of this food in a Skinner box from which the lever has been removed. A delivery mechanism, called a food magazine, delivers the pellets one at a time and makes a sharp click each time it operates (see Fig. 7.5).

The sharp click of the magazine is critical. Early Skinner boxes used inexpensive mechanical technology. Modern electronic technology provides a number of improved ways of delivering food, including a mechanism based on silent laser disk technology. Experimenters report little or no conditioning with silent food magazines unless they add some distinctive sound between the lever-pressing response and the appearance of the food (Bolles, 1988; Plonsky et al., 1984).

Some experimenters watch the rat carefully during magazine training and wait until the animal has moved away from the food dish before they operate the magazine with a remote switch. They fear that, otherwise, the rats might get conditioned to wait at the food dish for the pellets to appear. Soon, the animals learn to run to the dish when they hear the click of the magazine. That is the procedure that Skinner recommended, and the procedure of Zimmerman (1957; chap. 6). In the automatic procedure, the experimenter sets the apparatus to deliver pellets automatically at irregular intervals and this procedure also conditions rats to rush to the food dish when they hear the sound of the magazine delivering food. The automatic procedure obviously saves a lot of experimenter time. Coate (1956; chap. 6) used such an automatic procedure in his experiment on latent extinction.

FIG. 7.5. Diagram of a typical Skinner box or operant conditioning chamber showing the Lever that a rat must press, the Magazine that makes a sharp noise as it delivers food to a chute, and the Food Dish that receives the food inside the chamber. Copyright © 1997 by R. Allen Gardner.

Traditionally, the next step in the procedure is to replace the lever in the box and shape the rat to approach the lever and press it. The experimenter watches the animal carefully and operates the magazine (thus producing a click followed by a pellet) whenever the rat comes closer to making the desired response, in this case pressing the lever. The rat has already been conditioned to inspect the dish at the click of the magazine, grab each pellet when it appears in the dish, pop the pellet in its mouth, and eat. Gradually, the rat becomes conditioned to press the lever. Chapter 8 discusses an automated procedure, called autoshaping, that works as well and saves a great deal of experimenter time. For the discussion in the present chapter, manual shaping and autoshaping have equivalent effects.

S-R-Sr

According to S-R-S* theory, the click of the magazine serves as an Sr that rewards the rat for pressing the lever. The close association of the click with food pellets during magazine training established the association between click and food. Rats have to learn to eat the pellets in the first place, so the sight of the pellets must itself be an Sr. If the sight of the pellet became an Sr by contiguity with a previously conditioned Sr or S*, then the sound of the magazine can become an Sr by contiguity with the sight of food.

What is the S* that makes the sight (or smell) of the pellets into an Sr? Experimenters have conditioned rats to accept or reject foods on the basis of arbitrary tastes and colors. Where is the S* in that case? And certainly, human beings accept and reject food arbitrarily on the basis of learned tastes and sights. Perhaps it is the feel of the food in the stomach that finally must be the S*. Reinforcement theories have so far failed to produce an operational definition that specifies the original S* in advance for any case of learning. Fortunately, for the purposes of the discussion in this chapter, all we have to assume is that the final S* happens somewhere between the sight of the pellet in the dish and the sensation of food in the gut. The feel of food in the gut is the last possible stimulus that could serve as an S*; Fig. 7.4 labels this stimulus as the S*. Whichever stimulus we choose as the theoretical final S*, however, the result is the same as long as we choose a stimulus that comes later than the sight of the pellet in the dish. The stimulus that comes just before the S* becomes the first Sr by S-S* contiguity, then the stimulus just before that becomes the second Sr by S-S* contiguity with the first Sr, and so on to the beginning of the chain. Each stimulus becomes an Sr by contiguity with the Sr that follows it and each Sr rewards the response that it follows.

Zimmerman (1959) raises doubts here. In that experiment, the Sr at the end of a runway was the goal box that was closely contiguous with the only food that rats received. Yet, at the same time that the response of running to the goal box was extinguishing, presumably because the empty goal box was losing its power to reward running, a buzzer that signaled the opening of a door to the runway continued to enhance lever-pressing. The stimuli in the goal box were much closer in space and time to food than the buzzer. How could the closer goal box lose its Sr power before the buzzer lost its Sr power?

DISCRIMINATIVE STIMULUS: Sd VERSUS Sr

Feed Forward Versus Feed Backward

Traditionally, secondary reward is a feed backward concept in which the stimulus consequences of behavior act backward to reinforce past S-R connections. According to this description, each S-R unit in Fig. 7.4 is rewarded by the appearance of the stimulus that begins the next unit in the series. The Sd is a feed forward concept in which the stimulus consequences of behavior feed the animal forward from one action to the next. According to this description, each S-R unit starts with an Sd and ends when the next Sd starts the next S-R unit. What matters is not the positive or negative effect of the Sd on the S-R unit that just happened, but rather the response that the Sd evokes next; not how the Sd makes learners feel after the last response, but what it makes them do next.

Have you ever wondered why everything you lose is always in the last place you look for it? The reason, of course, is that you always stop looking as soon as you find it. But, you do more than stop looking. When you find your lost car keys after a long search, for example, you take the keys out to start the car. Finding the keys ends the search because the keys are a stimulus for the next thing you do. This is the feed forward principle. The response that produced the Sd is always the last response made in the stimulus situation before the Sd appeared. This is because the Sd always evokes the next response in the chain.

Conditioned Eating

Like most other baby mammals, rat pups try putting just about everything in their mouths. It is a way to learn the difference between food and nonfood. When they chew food, it turns into mush and they swallow it. Other materials—pebbles, twigs, sawdust, sand, and so forth—are difficult or impossible either to chew or to swallow, and the pups soon spit them out. Consequently, the last thing they do to food in their mouths is chew because chewing turns food to mush, and they swallow it. The last thing they do to other objects, say pebbles or sawdust, is spit them out. Soon they drop nonfood objects before putting them in their mouths and may not pick them up at all. This is how conditioned eating could take place by S-R contiguity alone. The sight and smell of food becomes a discriminative stimulus for grabbing the food and putting it in the mouth.

Magazine Click as Sd

For a rat that enters a Skinner box, the sight and smell of familiar laboratory food is an Sd for grabbing. The feel of food in the paws is an Sd for popping the food in the mouth. The taste and feel of food in the mouth is an Sd for chewing. And finally, mush is an Sd for swallowing. Swallowing is the last response after feeling mush in the mouth, because after the rat swallows, the mush is gone. Chewing is the last response after feeling food in the mouth, because after the rat chews, the food turns to mush and mush is the Sd for swallowing. Putting the food in the mouth is the last response after feeling food in the paws, because after the rat puts the food in its mouth, the food is gone from the paws. Grabbing the food in the dish is the last response after seeing the food in the dish, because after the rat picks up the food it is gone from the dish.

Conditioning the rat to inspect the dish after each click of the magazine is only one simple step further. The experimenter places the rat in the Skinner box with the lever removed and delivers pellets of food by operating the magazine. Whenever the rat inspects the dish about one half second after hearing the magazine click, it finds food in the dish. That is the last response to the click because the next stimulus is the sight of the pellet, which is the Sd for grabbing the food and removing it from the dish.

At first, of course, the rat may take longer than one half second to inspect the dish. In those cases, some extraneous stimulus—say a creaking sound in the woodwork, an odd smell, or a slight change in temperature—may be the last stimulus before the rat finds food in the dish. Next time that extraneous stimulus appears, the rat will inspect the dish because that was the last response to the extraneous stimulus. Most of the time that the rat inspects the dish after an extraneous stimulus, however, there is no food in the dish. This gives the rat time to do something else, say scratch itself, rise up on its hind legs, or clean its whiskers. These extraneous responses become the last thing the rat did after extraneous stimuli. The only stimulus that is always the last stimulus before the response of inspecting the dish is the magazine click because, if the rat inspects the dish at that time, it always finds the Sd for grabbing the pellet and that initiates the rest of the chain.

When the rat is inspecting the food dish promptly after each magazine click, the experimenter inserts the lever in the Skinner box and after that each food delivery is contingent on lever-pressing. If the rat approaches the lever and presses it, then that will be the last response to the lever, because the next stimulus will be the magazine click, which is now the Sd for inspecting the dish. After inspecting the dish, the rat finds a pellet, which is the Sd for grabbing. After grabbing the pellet, the rat finds food in its paws, which is the Sd for putting food in its mouth. After putting the food in the mouth, the rat finds food in its mouth, which is the Sd for chewing. After chewing, it finds mush, which is the Sd for swallowing. After swallowing, the food is gone. If the rat does anything else near the lever—say scratch, rise up, or clean its whiskers—the stimulus situation remains the same. So, the rat goes on to do other extraneous things and none of them remains the last response for long. Only pressing the lever becomes the last response to the lever, because after the rat presses the lever, the magazine clicks and this sets the chain of responses in motion once more.

Viewed in this way, instrumental conditioning in the Skinner box is an inefficient kind of classical conditioning. In classical conditioning, the experimenter would place the rat in a harness in front of the food dish and arrange for a response, such as salivation, to appear about one half second after a stimulus such as a magazine click. In the usual classical conditioning procedure, food evokes salivation, but inspecting the dish can serve just as well as the UCR evoked by magazine clicks. In the classical conditioning procedure, the experimenter makes sure that the CS and the UCR evoked by the UCS appear on every trial. In the Skinner box, the experimenter makes sure that the sight of the lever is paired with the sound of the magazine whenever the rat presses the lever. When the rat presses the lever, the pairing of lever and magazine click is the same as in classical conditioning. The rat makes many other extraneous responses, such as sniffing in corners, rearing up on its hind legs, exploring the box, and so on. Extraneous behavior postpones lever-pressing and clicks, wasting the rat’s time compared with classical conditioning in which the experimenter schedules each trial. But the procedure is only inefficient from the point of view of the rat. This is a powerful reason for the popularity of the Skinner box: it is a very convenient and efficient conditioning apparatus from the experimenter’s point of view, who only needs to set the apparatus to pair lever with clicks and let the rat present its own trials at its own pace.

Token Rewards for Chimpanzees

The stage is now set to consider the most famous of all demonstrations of secondary reward. Cowles (1937) and Wolfe (1936) demonstrated that chimpanzees could learn to operate a vending machine that dispensed grapes by inserting poker chips into a slot, and then learn to pull a lever (much like a Nevada slot machine lever) to earn the poker chips themselves. Textbooks and teachers often cite this result as a demonstration that the value of money depends on secondary reinforcement.

In later experiments, Kelleher (1956, 1957a, 1957b, 1958a, 1958b) replicated and extended Cowles’ (1937) and Wolfe’s (1936) findings. Basically, Kelleher taught two young male chimpanzees first to get grapes by operating the vending machine with poker chips, and then to earn poker chips by pressing a telegraph key. Kelleher next varied the schedules of poker chip reward for key-pressing. When the chimpanzees were working for poker chips, he lighted a white earning light; when they could spend their poker chips, he turned off the earning light and lighted a red spending light.

Time has to be divided into earning periods and spending periods in this experiment. If Kelleher had allowed the chimpanzees to press the key to earn poker chips and then let them spend each poker chip in the vending machine as soon as they received it, the demonstration would be much less significant. Without the division into earning and spending periods, Kelleher’s procedure would only be an example of a chain of responses starting with a key-press, followed by a poker chip, followed by picking up the chip and putting it in the slot, followed by receiving a grape. This is practically the same thing as the chains analyzed by the goal gradient principle for rats in mazes discussed in chapter 5. The only difference is that chimpanzees have hands that they can use to manipulate objects so that they can execute chains of movement that are superficially more complex than the chains of running executed by rats in mazes. Other attempts to demonstrate an Sr effect through complex chains of reinforcement schedules in a Skinner box fail for the same reason (e.g., Jacob & Fantino, 1988). An Sd can maintain an earlier link in a chain by feeding forward to the next link without feeding backward to reinforce the link that it follows.

By lighting one light to signal working periods and a second light to signal spending periods, Kelleher produced a much more interesting situation, something much more like a chimpanzee working to earn poker chips in order to spend them later. Human beings also work during designated times, and spend during other, quite separate, times. Even street vendors who get paid in coins, transaction by transaction, normally collect the coins during designated earning times and spend the money later.

The rate at which Kelleher’s chimpanzees worked at pressing the key depended on the schedule of poker chip reward. Like many human factory workers, however, each chimpanzee eventually worked at a stable rate for a given schedule of payment, so that it took about the same amount of time, roughly 4 hours each day, to earn the allotted number of poker chips, about 50 chips depending on the condition. Consequently, they could have been working for a stable period of time or for a stable number of chips. The spending light came on after about the same amount of working time either way. Note that this is because of the stable rate of working maintained by a stable schedule of reward (see chap. 3).

Kelleher’s chimpanzees lived in rather boring cages when they were not serving in experiments. The experimental enclosure was larger and more interesting. At the beginning of the experiment, the chimpanzees naturally spent a certain amount of time running, jumping, climbing, playing with the apparatus, and otherwise enjoying the place before they settled down. This period of playfulness before settling down to work persisted through hundreds of hours of experimental sessions. At the beginning of each session, the chimpanzees took between 20 and 40 minutes before they pressed the key for the first time. The next key-press came a little sooner, the next sooner, and so on, faster and faster until they reached their top speed, usually with a spurt at the end of the earning period.

Kelleher (1958b) reasoned that this pattern of results could be interpreted in either of two ways. First, if the poker chips acted as secondary rewards, then the pause at the beginning of each session might be the result of lack of reward. The first chip rewarded the first few responses, which reinforced key-pressing so that responding increased, which resulted in more rewards, which further increased responding, and so on until the chimpanzees reached their top speed.

Perhaps the poker chips acted, instead, as discriminative stimuli. As chips collected in a pile beside the lever, the steadily growing pile was a kind of clock telling the chimpanzees how close they were coming to the end of the earning period and the beginning of the spending period with its delicious grapes. At the beginning of the session with no pile at all, a chimpanzee could see that spending time was a long way off. Even after accumulating a few chips he could see that grape time was still far away. He might respond more than he had at the start, but still sluggishly. As the pile grew, he could tell that spending time was getting closer and this stimulated him to press faster and faster until the end spurt when the pile was highest. In the secondary reward description, the poker chips act backward to reinforce what the chimpanzee had done before. In the discriminative stimulus description, the poker chips act forward to stimulate the next thing the chimpanzee does.

With this in mind, Kelleher tried the following ingenious test. He put 50 poker chips in the experimental enclosure before each chimpanzee arrived for his daily session. If the poker chips were acting as response contingent secondary rewards (Sr), then the chimpanzees would be finding a large heap of free Srs as they entered the enclosure. The free chips would then reward the beginning laziness and playfulness, and reward this behavior at a better rate than key-pressing ever had. If the chips were Srs, the chimpanzees should take much longer to settle down to work, or they might never settle down to work at all. If the pile of chips was acting as a discriminative stimulus, however, we would expect just the opposite. The chimpanzees would arrive to find the clock set ahead, telling them that they were near to spending time. If the chips were Sds, the chimpanzees should begin at their high middle-of-the-session rate immediately and improve from then on as more chips piled up.

Kelleher’s results were decisive. When the chimpanzees found the pile of free chips, they omitted their usual 20- to 40-minute period of no responding. They went directly to work pressing the key at a high rate.

Kelleher’s experiment tests whether the pile of poker chips acts backward as an Sr to reward the chimpanzees for pressing the key, or acts forward as an Sd to stimulate them to press the key more rapidly. When the chimpanzees got a pile of poker chips for doing nothing, they immediately started to press the key rapidly as if the pile of chips stimulated them forward to intense activity rather than rewarding them backward for doing nothing.

Money Feeds Forward

The poker chip experiments originally aimed to support the traditional view that money is an Sr that controls human beings by feeding backward to reward past behavior. Suppose, however, that the poker chip experiments had been designed to investigate the role of money, to discover something new rather than to shore up traditional beliefs. What would these experiments teach us then? Do Kelleher’s experiments with chimpanzees agree with the role of money in human life?

When a customer purchases a pair of shoes at a store, the customer pays before taking the shoes out of the store. Rather than rewarding the store for giving up the shoes, the customer pays first and takes possession later. The money feeds forward to induce the cashier to give up the shoes.

The cashier puts the money in a safe place, or stores the credit card receipt. Most cashiers never spend the money they receive in the name of the store. The owner, who may also be the cashier, seldom spends the money soon after earning it. Usually the owner deposits the money in a bank and only much later spends the money credited to the store from checks and credit cards. When the owner spends the money it induces someone else to give up a purchased item or to perform a service.

When the owner pays the cashier, people say that the owner is rewarding the cashier for past service. But why should the owner pay the cashier for work done last week or last month? When owners fail to pay in real life, there is usually little or nothing that a worker can do about it except to refuse to return to work. Owners pay up promptly, because prompt payment induces workers to return to work. Salaries also feed forward.
