Chapter 8 Web Topics
8.1 Computing Red Lines
The payoff matrix
Consider a thirsty spider monkey that has to choose between two possible actions it will perform next: descend from its tree to drink, or stay in the canopy for now. The consequences of each action depend on which of two alternative conditions is currently the case: either there is a predator, such as a jaguar or ocelot, lurking in the bushes at the foot of the tree, or there is not. The monkey therefore must consider four possible outcomes (or payoffs) of her decision. We can organize these alternative outcomes into a payoff matrix as follows:
Suppose we can quantify the four possible consequences using some common currency. Ignoring for the moment what currency might work for all consequence types, let us assume that the larger the payoff in this currency, the better off the monkey will be. Now suppose that we make the relevant measurements in the field and obtain the following table for this monkey:
Action              Predator present    No predator present
Stay in canopy      15                  25
Descend to drink    3                   35
Clearly, the best consequence for the monkey is to descend and drink when no predator is present. However, if it stays in the canopy, even if no predator is present, it still does fairly well. If there is a predator present and it descends, it has a very high chance of being eaten and thus its payoff for this situation is the lowest. If it stays in the tree and there is a predator present, it avoids being eaten for now, but it also is vulnerable to being tracked by the predator when it moves through the canopy with a risk that it will be eaten at some later point. In this example, the right choice for the monkey when a predator is present is to stay in the tree, and the right choice when no predator is present is to descend and drink. The problem facing the monkey is that it does not know which condition is currently the case: is there a predator present or not?
Conditional payoffs
While the monkey is not sure whether or not predators are lurking nearby, she can usually come up with some general estimate of the probability that a predator is nearby based on prior experiences. Let us suppose that in the past, predators have turned out to be present on 20% of such occasions, or writing the probability as a fraction, 0.20, and thus the probability that no predator is nearby is 1.00 – 0.20 = 0.80. If the monkey stays in the canopy and a predator is present, her payoff is 15; if she stays in the canopy and no predator is present, her payoff is 25. The average payoff (called “expected value” in economics) of staying in the canopy is the sum of each of these payoffs discounted (multiplied) by the probability of the relevant situation. In this case, her estimate of the average payoff of staying in the canopy is:
PO_{stay} = (0.20)(15) + (0.80)(25) = 23.0
Such a calculation is called a conditional payoff because, in fact, no monkey will get a payoff of 23 if it stays in the canopy: instead, the payoff will be conditional upon whether a predator is present (when it will get 15) or not (when it will get 25). However, the average payoff of 23 is the best overall guess for what staying in the canopy will give the monkey since it cannot know for sure before it decides which condition will turn out to be true.
We can now calculate the average payoff of descending to drink when there is a 20% chance that a predator is nearby:
PO_{descend} = (0.20)(3) + (0.80)(35) = 28.6
We can see that when we compare the conditional payoffs for staying in the canopy versus descending to drink when the probability of a nearby predator is 20%, the monkey will do better (on average) if it descends and drinks (28.6 versus 23.0). This makes sense in that if there is a low chance of a predator lurking in the bushes, it is probably better to take a chance and descend.
But suppose the monkey thinks that there is a 70% chance that a predator is lurking nearby. What is the best average strategy then? We again compute the conditional payoff of staying in the canopy to get:
PO_{stay} = (0.70)(15) + (0.30)(25) = 18.0,
and the corresponding conditional payoff of descending to drink to get:
PO_{descend} = (0.70)(3) + (0.30)(35) = 12.6.
We see that staying in the canopy is now the better action. Again, this makes sense: if there is a good chance of a predator nearby, it is better to stay in the canopy and live to drink another day.
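The four conditional payoffs computed so far can be checked with a few lines of code. This is only an illustrative sketch: the payoff values come from the text, while the names `PAYOFFS`, `conditional_payoff`, and `best_action` are invented for the example.

```python
# Payoffs from the text: PAYOFFS[action] = (payoff if predator present, payoff if absent).
PAYOFFS = {
    "stay": (15, 25),
    "descend": (3, 35),
}


def conditional_payoff(action, p_predator):
    """Average payoff of an action: each outcome weighted by its probability."""
    with_predator, without_predator = PAYOFFS[action]
    return p_predator * with_predator + (1 - p_predator) * without_predator


def best_action(p_predator):
    """Return the action with the larger conditional payoff."""
    return max(PAYOFFS, key=lambda action: conditional_payoff(action, p_predator))


for p in (0.20, 0.70):
    stay = conditional_payoff("stay", p)
    descend = conditional_payoff("descend", p)
    print(f"P = {p:.2f}: stay = {stay:.1f}, descend = {descend:.1f}, best = {best_action(p)}")
```

Running it with other values of `p` shows the preferred action changing somewhere between 0.20 and 0.70.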
These calculations suggest that there is some intermediate probability of a predator being present at which the optimal action switches between staying in the canopy versus descending to drink. That switch-point value should be somewhere between 20% (where descending was optimal) and 70% (where staying in the canopy was optimal). At the switch-point probability, it should not matter whether the monkey stays in the canopy or descends: it can expect to get the same conditional payoff. For probabilities less than the switch-point probability, the monkey should descend and drink, and at probabilities above the switch-point, it should stay in the canopy. The switch-point is the red line that we are seeking. How can we compute it?
Computing the red line
Let us denote the monkey’s current probability estimate that a predator is nearby by P, and the switch-point probability at which staying in the canopy and descending to drink give the same payoff by P_{s}. Substituting P_{s} into our conditional payoff computations as before, and setting the two conditional payoffs equal to each other, we get the following equation:
PO_{stay} = (P_{s})(15) + (1.0 – P_{s})(25) = PO_{descend} = (P_{s})(3) + (1.0 – P_{s})(35).
Solving for P_{s}, we get P_{s} = 0.45. Whenever the monkey estimates that P < 0.45, it should descend and drink; when it estimates that P > 0.45, it should stay in the canopy. When P = 0.45, it can do either and expect the same average payoff.
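The algebra behind this solution can be written out step by step. The intermediate lines below are filled in here as a worked sketch and do not appear in the original text:

```latex
\begin{align*}
15\,P_s + 25\,(1 - P_s) &= 3\,P_s + 35\,(1 - P_s) \\
25 - 10\,P_s &= 35 - 32\,P_s \\
22\,P_s &= 10 \\
P_s &= \tfrac{10}{22} \approx 0.45
\end{align*}
```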
There is a quick and dirty way to estimate P_{s}. If we define P as the estimated probability that the left-hand situation in a 2 × 2 payoff matrix is true, and (1 – P) as the estimated probability that the other situation is true, then P_{s} can be computed as the difference in payoffs between getting it right versus wrong in the right column divided by the sum of the differences between getting it right versus wrong in both columns. In this case, we would find (again):
P_{s} = (35 – 25) / [(35 – 25) + (15 – 3)] = 10/22 = 0.45
Payoff differences versus absolute values
The computational shortcut for P_{s} demonstrates an important point: where the red line is drawn on a decision maker’s meter depends on the relative differences in payoffs between right and wrong decisions in the two columns, and not on the absolute values of each payoff. Doubling the values of each cell in the matrix will not change the location of the red line. Another way to look at the red line is to consider the difference in payoffs in any column as the cost of errors when the condition for that column is the current one. As errors in the right-hand column become more costly relative to those in the left column, the red line probability moves to higher values; as the costs in the left column increase relative to those in the right column, the red line moves to lower values. This simplifies the task for a decision maker: it only needs to estimate the relative differences in payoffs between right and wrong decisions to know where to set its red line. Then all it needs to do is compare its current estimate of the probability that a predator is present to the red line.
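A short sketch can illustrate that only payoff differences matter. The function name `red_line` and the modified matrices are hypothetical, introduced only for this example. The shortcut assumes that a different action is best in each column, since otherwise one action is always better and there is no switch point.

```python
def red_line(matrix):
    """Shortcut switch-point probability for a 2 x 2 payoff matrix.

    matrix[action] = (payoff in left column, payoff in right column);
    P is the probability that the left-column condition is true.
    Assumes a different action is best in each column.
    """
    (a_left, a_right), (b_left, b_right) = matrix.values()
    left_error_cost = abs(a_left - b_left)     # right minus wrong, left column
    right_error_cost = abs(a_right - b_right)  # right minus wrong, right column
    return right_error_cost / (left_error_cost + right_error_cost)


monkey = {"stay": (15, 25), "descend": (3, 35)}
doubled = {action: (2 * left, 2 * right) for action, (left, right) in monkey.items()}
shifted = {action: (left + 100, right + 100) for action, (left, right) in monkey.items()}
# Hypothetical variant: errors in the right-hand column cost 30 instead of 10.
costlier_right = {"stay": (15, 25), "descend": (3, 55)}

for name, matrix in [("monkey", monkey), ("doubled", doubled),
                     ("shifted", shifted), ("costlier_right", costlier_right)]:
    print(name, round(red_line(matrix), 3))
```

Doubling every cell, or adding the same constant to every cell, leaves the red line where it was. Raising the cost of errors in the right-hand column moves it to a higher probability.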
8.2 Types of Probabilities
Sampling and base probabilities
Both male and female northern cardinals (Cardinalis cardinalis) can sing, although males do most of the loud advertisement singing. Suppose we are interested in estimates of the probabilities that a male, his mate, or both will sing in response to a song recorded from another male and played back just inside the territory of the pair. We find 10 different cardinal pairs, and we perform the experiment once with each. We get the following table summarizing our results by putting an “X” in a cell if the individual identified in that row sang in the trial identified in that column:
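[Table of playback results: one row for each sex (male sings, female sings) and one column for each of the 10 trials, with an “X” marking each trial in which that bird sang.]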
Based on this (small) sample, we can estimate the probabilities that each sex will sing in reply to a playback by counting the number of times a member of that sex sang and dividing it by the total number of trials in which that sex could have sung. Letting P(M) be the estimated probability that a male will sing and P(F) be the estimated probability a female will sing, we get:
P(M) = 6/10 = 0.60 and P(F) = 4/10 = 0.40
Combinatorial probabilities
The same table allows us to compute some probabilities involving both members of a pair. For example, the probability that either the male or the female sings on an average trial, P(M or F), is computed by counting the number of times at least one of them sang (in our case, 8 times) and dividing by the total trials (here, 10 trials):
P(M or F) = 8/10 = 0.80
Alternatively, we might want to know the probability that both members of a pair will sing when stimulated with playback, P(M and F). We thus count the number of trials in which both birds sang (in the table, we see this is only 2 times) and divide by the number of trials (again, 10 total trials):
P(M and F) = 2/10 = 0.20
Conditional probabilities
The preceding probabilities were based on the entire suite of trials. However, sometimes we are interested in subsets of the sample. For example, we might want to know the probability that the female will sing given that the male also sang, P(F|M). Here the notation uses a “|” to separate the events of interest (females singing) and the relevant subset of the sample (males singing). We compute P(F|M) by counting how many times males sang (here 6 trials), and counting in how many of those 6 trials the female also sang (2). We then compute
P(F|M) = 2/6 = 0.33
We can also compute the conditional probability that a male would sing given that a female sang. By examining the table, we see that this is:
P(M|F) = 2/4 = 0.50
We observe that P(F|M) and P(M|F) are not necessarily equal. In fact, they are often different values.
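The counting behind these estimates can be sketched in code. The trial-by-trial lists below are hypothetical: they are arranged only to match the counts given in the text (the male sang in 6 trials, both birds sang in 2, and at least one sang in 8, which implies that the female sang in 4). They are not the original data.

```python
# Hypothetical trial outcomes arranged to match the counts in the text.
male_sang = [True, True, True, True, True, True, False, False, False, False]
female_sang = [True, True, False, False, False, False, True, True, False, False]

n_trials = len(male_sang)
n_male = sum(male_sang)
n_female = sum(female_sang)
n_both = sum(m and f for m, f in zip(male_sang, female_sang))
n_either = sum(m or f for m, f in zip(male_sang, female_sang))

p_m = n_male / n_trials          # base probability that the male sings
p_f = n_female / n_trials        # base probability that the female sings
p_m_or_f = n_either / n_trials   # at least one of them sings
p_m_and_f = n_both / n_trials    # both sing
p_f_given_m = n_both / n_male    # female sings, within trials where the male sang
p_m_given_f = n_both / n_female  # male sings, within trials where the female sang

print(p_m, p_f, p_m_or_f, p_m_and_f, round(p_f_given_m, 2), p_m_given_f)
```

The two conditional probabilities differ because they share a numerator (trials in which both sang) but divide by different subset sizes.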
Relations between probability types
It is easy to show that these various probabilities have simple relationships to each other. One important relationship is the following:
P(M or F) = P(M) + P(F) – P(M and F)
This says that the fraction of occasions on which at least one of the members of the pair sang is the fraction of trials in which males sang plus the fraction of trials in which females sang minus the fraction of trials in which both sang. The latter term corrects for the fact that the sum of the number of times that males sang and that females sang will count the number of occasions on which both occurred twice. Since we should count these events only once, we need to subtract out the number of times both occurred from either the male sum or the female sum. We saw earlier that P(M or F) = 0.8. We can get that same number using the equation above as:
P(M or F) = 0.60 + 0.40 – 0.20 = 0.80
Note that if, for some reason, males and females never sang in the same trial—that is, if male and female songs were exclusive events—then P(M and F) = 0 and P(M or F) = P(M) + P(F). This last form of the equation is very commonly used in computing payoffs because many events are in fact exclusive. This equation is sometimes called the “OR Rule”: the probability that one or the other of several exclusive events will occur is simply the sum of the probabilities of the individual events.
In those cases where events can occur jointly, we may invoke a second equation:
P(M and F) = P(M) × P(F|M) = P(F) × P(M|F)
This also makes intuitive sense: the fraction of times that both the male and the female sing, P(M and F), cannot be greater than the total fraction of times that the male sings, P(M). P(F|M) is the fraction of those times that a male did sing in which the female also sang. The product of the fraction of trials in which a male sang and the fraction of those trials in which the female also sang is clearly the fraction of trials that both males and females sang. The same logic can be applied if we start with the fraction of trials in which females sang, P(F), and multiply this by the conditional probability that a male will sing, given that the female did, P(M|F).
Earlier, we used the table to show that P(M and F) = 0.2. We can get the same answer using the second equation as follows:
P(M and F) = P(M) × P(F|M) = (0.60)(2/6) = 0.20
P(M and F) = P(F) × P(M|F) = (0.40)(2/4) = 0.20
As with the “OR Rule,” the second equation has a simpler form in special cases. What if we find that P(M|F) = P(M)? This means that the probability that a male will sing is the same for the full 10 trials as it is for the 4 trials in which females also sang. Put another way, what if the presence of female song has no effect on the probability of males singing? It is easy to show that if P(M|F) = P(M) then it also has to be true that P(F|M) = P(F). When these conditions are met, we say that male and female songs are stochastically independent: the occurrence of one does not alter the likelihood of the other. This allows us to reduce the second equation to the following “AND Rule”: if two events are stochastically independent and can co-occur, then
P(M and F) = P(M) × P(F)
To use these basic definitions and relations, just substitute the events in question for the responses, M and F, in these examples.
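As a rough check, the relations in this section can be verified numerically from the cardinal counts. The variable names are illustrative, and the count of 4 female-song trials is inferred from the other counts reported in the text rather than read from the original table.

```python
# Counts from the cardinal playback example (10 trials).
n_trials, n_male, n_female, n_both = 10, 6, 4, 2

p_m = n_male / n_trials
p_f = n_female / n_trials
p_m_and_f = n_both / n_trials
p_f_given_m = n_both / n_male
p_m_given_f = n_both / n_female

# General OR relation: subtract the joint term so it is not counted twice.
p_m_or_f = p_m + p_f - p_m_and_f

# General AND relation, computed from either starting point.
joint_from_male = p_m * p_f_given_m
joint_from_female = p_f * p_m_given_f

# Independence would require P(M|F) to equal P(M).
independent = abs(p_m_given_f - p_m) < 1e-9

print(round(p_m_or_f, 2), round(joint_from_male, 2), round(joint_from_female, 2), independent)
```

In this small sample P(M|F) = 0.50 while P(M) = 0.60, so the simplified “AND Rule” would not apply to these data: P(M) × P(F) = 0.24 rather than the observed 0.20.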
Utility of the rules
These simple probability rules are used repeatedly in this book. The “backbone” for computing conditional payoffs is the relevant probability rule with each component probability weighting (discounting) the relevant payoff value. Similarly, updating using Bayes’ Rule (see Web Topic 8.3) also relies on these simple rules. It pays to become familiar with them!
8.3 Bayesian Updating
The task
A female songbird is searching for a healthy mate. She has learned that, on average, about 60% of males are healthy and 40% are sick with parasites. Her prior probabilities are thus P(Healthy) = 0.60 and P(Sick) = 0.40. The female has also learned that the speed of courtship songs is often an index of the health of the singer: healthy males sing fast songs about 80% of the time, and sing slow songs the remaining 20%. In contrast, sick males sing fast songs only 30% of the time, and sing slow songs 70% of the time. She assembles these conditional probabilities into a coding matrix that appears as follows:
Condition    Fast song    Slow song
Healthy      0.80         0.20
Sick         0.30         0.70
She then encounters a male who sings a Fast song. What is her best estimate of the a posteriori probability, P(Healthy|Fast Song), that this male is healthy?
The method
Bayes’ Theorem provides a way to compute an updated (a posteriori) probability given (a) the prior probabilities, (b) detection of a signal or cue, say S1, and, (c) knowledge of the conditional probabilities (coding matrix) relating signals to alternative conditions. Given two possible alternative conditions, A and B, prior probabilities P(A) and P(B) respectively, and a coding matrix of the form
Condition    Signal S1    Signal S2
A            P(S1|A)      P(S2|A)
B            P(S1|B)      P(S2|B)
then detection of a signal S1 allows the computation of the a posteriori probability
P(A|S1) = P(A) × P(S1|A) / [P(A) × P(S1|A) + P(B) × P(S1|B)]
Without access to additional information, the probability on the left side of this equation is the best possible estimate that can be computed. It is the Bayesian estimate.
If the signal detected had been an S2, then the Bayesian estimate would be
P(A|S2) = P(A) × P(S2|A) / [P(A) × P(S2|A) + P(B) × P(S2|B)]
When there are more than two alternative conditions, the method is the same except that there will be one term in the denominator for each possible condition.
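The method can be sketched as a small function that works for any number of alternative conditions. The name `bayes_update` and the dictionary layout are choices made for this illustration. The numbers are the prior and conditional probabilities of the female songbird described under “The task”.

```python
def bayes_update(priors, likelihoods):
    """Posterior probability of each condition after detecting one signal.

    priors:      {condition: prior probability}
    likelihoods: {condition: P(detected signal | condition)}
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())  # one denominator term per condition
    if total == 0:
        raise ValueError("signal has zero probability under every condition")
    return {c: value / total for c, value in joint.items()}


priors = {"healthy": 0.60, "sick": 0.40}
coding = {  # coding[signal][condition] = P(signal | condition)
    "fast": {"healthy": 0.80, "sick": 0.30},
    "slow": {"healthy": 0.20, "sick": 0.70},
}

after_fast = bayes_update(priors, coding["fast"])
after_slow = bayes_update(priors, coding["slow"])
print(round(after_fast["healthy"], 2), round(after_slow["healthy"], 2))
```

Adding a third condition only requires another entry in `priors` and in each row of `coding`; the denominator picks up the extra term automatically.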
Updating using probabilities
Applying this method to the situation facing the female bird who just heard a candidate male sing a fast song, we can substitute her prior and conditional probabilities into the first Bayesian equation to get
P(Healthy|Fast Song) = (0.60)(0.80) / [(0.60)(0.80) + (0.40)(0.30)] = 0.48/0.60 = 0.80
If, instead, the female heard a slow song, her Bayesian estimate would be
P(Healthy|Slow Song) = (0.60)(0.20) / [(0.60)(0.20) + (0.40)(0.70)] = 0.12/0.40 = 0.30
Updating using frequencies
Gigerenzer and Hoffrage (1995) have pointed out that people are more likely to use frequencies (whole numbers) than probabilities (fractions) in their updating. To see how they would do this, note that the numerator on the right side of the equation (above) for P(Healthy|FastSong) is the fraction of the population that is both healthy and singing fast songs. The denominator includes this same number and adds to it the fraction of the population that is sick but also singing fast songs. Thus the denominator of the Bayesian equation is simply the total fraction of the population of males that are singing fast songs at any given moment. The Bayesian a posteriori probability is then the fraction of males who are singing fast songs that are also healthy.
Now, simply replace the fractions in the original Bayesian equation with the actual numbers of males in each case. This is equivalent to multiplying each term in the Bayesian equation by the total number of males available. For example, suppose there were 100 males that the female was likely to encounter. Of these, 60 would be healthy and 40 would be sick. Of the 60 healthy males, 80% or 48 would be singing fast songs at any given time. Similarly, of the 40 sick males, 30% or 12 would be singing fast songs at a given moment. The total number of males singing fast songs will thus be 48 + 12 = 60. The Bayesian estimated a posteriori probability that a male who sings a fast song is healthy is then just 48/60 = 0.80, the same value we got using probabilities.
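The same bookkeeping with whole numbers can be sketched as follows. The function name and its parameters are hypothetical, and a population of 100 males is assumed, as in the example.

```python
def frequency_posterior(n_males, p_healthy, p_fast_if_healthy, p_fast_if_sick):
    """Posterior P(healthy | fast song) computed from head counts, not fractions."""
    n_healthy = round(p_healthy * n_males)
    n_sick = n_males - n_healthy
    fast_healthy = round(p_fast_if_healthy * n_healthy)  # healthy males singing fast
    fast_sick = round(p_fast_if_sick * n_sick)           # sick males singing fast
    counts = {
        "healthy": n_healthy,
        "sick": n_sick,
        "fast_healthy": fast_healthy,
        "fast_sick": fast_sick,
    }
    return fast_healthy / (fast_healthy + fast_sick), counts


posterior, counts = frequency_posterior(100, 0.60, 0.80, 0.30)
print(counts, posterior)
```

Rounding to whole males is harmless here because every count comes out as an integer. With other population sizes the rounded counts would only approximate the probability calculation.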
Cumulative updating
How does a female update her estimate that a given male is healthy after hearing him sing several successive songs? The Bayesian process is the same except that after the second song, she should replace the prior probabilities that she used in the first computation with the new a posteriori probabilities obtained after updating from the first song. For example, we saw that if the male’s first song was fast, her estimated probability that he is healthy would increase from her initial prior value of 60% to 80%. Suppose his second song is also fast. She should compute her second update as follows:
P(Healthy|Fast, Fast) = (0.80)(0.80) / [(0.80)(0.80) + (0.20)(0.30)] = 0.64/0.70 = 0.914
The conditional probabilities remain the same, but the prior probabilities change in both the numerator and the denominator. Suppose the third song that he sings is slow. She would then update again to give a new estimated probability that he is healthy of
P(Healthy|Fast, Fast, Slow) = (0.9143)(0.20) / [(0.9143)(0.20) + (0.0857)(0.70)] = 0.1829/0.2429 = 0.753
This successive process can continue as long as the same male sings songs and the female has the patience to listen and do the relevant updating. Figure 1 shows the average trajectory of the female’s estimated probability that the male is healthy for each kind of male:
Figure 1: Cumulative Bayesian estimates of male health given songs heard. Results are based on computer simulations of a female using the initial prior probabilities and coding matrix outlined in the text above. Any one simulation would show a jagged approach to probabilities of 1.0 (if the male were healthy) or 0 (if he were sick). Dots show mean values of 1000 simulations and error bars show range of variation around those means. In both cases, cumulative sampling eventually asymptotes to the axis representing the true state of the singing male.
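A simulation in the spirit of Figure 1 can be sketched as below. It is not the authors' program: the number of songs per male (20), the random seed, and all function names are assumptions made for illustration. Only the prior of 0.60 and the coding matrix come from the text.

```python
import random

P_FAST = {"healthy": 0.80, "sick": 0.30}  # P(fast song | condition), from the text


def update_healthy(p_healthy, song):
    """One Bayesian update of P(healthy) after hearing a 'fast' or 'slow' song."""
    if song == "fast":
        like_h, like_s = P_FAST["healthy"], P_FAST["sick"]
    else:
        like_h, like_s = 1 - P_FAST["healthy"], 1 - P_FAST["sick"]
    numerator = p_healthy * like_h
    return numerator / (numerator + (1 - p_healthy) * like_s)


def simulate(true_state, n_songs, rng, prior=0.60):
    """Trajectory of P(healthy) while listening to one male of known true state."""
    p = prior
    trajectory = [p]
    for _ in range(n_songs):
        song = "fast" if rng.random() < P_FAST[true_state] else "slow"
        p = update_healthy(p, song)  # the posterior becomes the next prior
        trajectory.append(p)
    return trajectory


def mean_trajectory(true_state, n_songs=20, n_runs=1000, seed=1):
    """Average P(healthy) after each song across many simulated males."""
    rng = random.Random(seed)
    runs = [simulate(true_state, n_songs, rng) for _ in range(n_runs)]
    return [sum(step) / n_runs for step in zip(*runs)]


healthy_mean = mean_trajectory("healthy")
sick_mean = mean_trajectory("sick")
print(round(healthy_mean[-1], 3), round(sick_mean[-1], 3))
```

Calling `update_healthy` by hand with the sequence fast, fast, slow starting from 0.60 gives approximately 0.80, 0.914, and 0.753.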
Sender versus receiver errors
The examples above assume that receivers always identify correctly whether a male’s song was fast or slow: the only “errors” in the system arise because healthy males sometimes sing slow songs and sick males sometimes sing fast songs. More realistically, the signal system will exhibit errors by both parties. Do receiver errors have a similar impact on the trajectory of successive sampling as sender errors? The answer is yes: similar increases in errors by either party have similar overall effects on successive sampling. To see this, the coding matrix for senders can be combined with the matrix describing how accurately females identify which signal was sent, giving an overall coding matrix (see Web Topic 8.8 for matrix details). Figure 2 plots the average trajectories from 1000 computer simulations with differing error rates for senders and receivers.
Figure 2: Effects of sender versus receiver errors on updated probabilities. Each point is the mean of 1000 random simulations of a receiver using cumulative Bayesian updating to estimate the probability that a singing male, who is in fact healthy, is healthy. Filled circles indicate sender and receiver coding matrices that are both 90% consistent (e.g., healthy males sing fast songs 90% of the time and only err 10% of the time; receivers correctly identify a song as fast 90% of the time and only misclassify a song 10% of the time). Open circles indicate males that sing the appropriate song for their health only 70% of the time but receivers that correctly classify songs by speed 90% of the time. Open triangles indicate males that sing the appropriate song for their health 90% of the time but females that classify song speeds correctly only 70% of the time. Filled squares indicate males that sing appropriate songs only 70% of the time and females that correctly classify songs by speed only 70% of the time. As both parties make more errors, trajectories rise to asymptote much more slowly. An increase in error by one party while the other remains unchanged has the same effect regardless of which party experiences the change in error rates.
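One plausible way to combine the two matrices is sketched below. It assumes that what the female perceives depends only on which signal was actually sent, and it may differ in detail from the treatment in Web Topic 8.8. All names are hypothetical.

```python
def combine(sender, receiver):
    """Overall coding matrix P(perceived signal | condition).

    sender[condition][sent]   = P(sent signal | condition)
    receiver[sent][perceived] = P(perceived signal | sent signal)
    """
    signals = list(receiver)
    return {
        condition: {
            perceived: sum(sent_probs[sent] * receiver[sent][perceived] for sent in signals)
            for perceived in signals
        }
        for condition, sent_probs in sender.items()
    }


def symmetric(consistency, rows, columns):
    """2 x 2 matrix whose rows each favor their 'own' column with the given probability."""
    (row_a, row_b), (col_a, col_b) = rows, columns
    return {
        row_a: {col_a: consistency, col_b: 1 - consistency},
        row_b: {col_a: 1 - consistency, col_b: consistency},
    }


SIGNALS = ("fast", "slow")
sender_70 = symmetric(0.70, ("healthy", "sick"), SIGNALS)
sender_90 = symmetric(0.90, ("healthy", "sick"), SIGNALS)
receiver_70 = symmetric(0.70, SIGNALS, SIGNALS)
receiver_90 = symmetric(0.90, SIGNALS, SIGNALS)

noisy_sender = combine(sender_70, receiver_90)    # open circles in Figure 2
noisy_receiver = combine(sender_90, receiver_70)  # open triangles in Figure 2
print(noisy_sender["healthy"], noisy_receiver["healthy"])
```

Under these assumptions both mixed cases reduce to the same overall matrix (a healthy male is perceived as singing a fast song 66% of the time), which is consistent with the statement that it does not matter which party's error rate changes.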
References Cited
Gigerenzer, G. and U. Hoffrage. 1995. How to improve Bayesian reasoning without instruction: frequency formats. Psychological Review 102: 684–704.
8.4 Signal Detection Theory
Discrete versus overlapping signals
When a receiver detects a signal stimulus, the first task is to assign it to one of several possible categories. Once this assignment has been made, a receiver can use the prior probabilities and the relevant signal coding matrix to update the probabilities of alternative conditions being true, compare expected values of alternative actions, and make a decision on how to respond to this signal.
If the patterns in the signal are completely non-overlapping with those of alternative signals, we say that the signals are discrete. We assumed that signals were in fact discrete in outlining the red line decision process in Web Topic 8.1. But what if signals are not discrete, either because senders emit signals with at least partially overlapping patterns, or because initially discrete signals become distorted and more overlapping during propagation between the sender and the receiver? Can a receiver faced with overlapping signals still define an optimal red line to use in decision making?
An example
The answer is yes, and the method that describes this process is called signal detection theory. To see how a red line can be defined with overlapping signals, consider a female bird trying to assess the health of a potential mate by listening to his courtship song. Suppose that in this species, sick and parasitized males tend to sing slower songs, and healthy males tend to sing faster songs. However, the signals are not discrete and there is considerable overlap in song speeds between males in different states of health:
This plot shows the conditional probability that a male will sing a song at a given speed w depending on his health. There are two distributions shown: one for sick males, P(w|sick), and one for healthy males, P(w|well). We assume here that the two distributions are normal (bell-shaped), but that is for computational convenience and the general conclusions below do not depend on that assumption.
Red lines and types of errors
Suppose the female draws a red line on this plot: whenever she hears a song at a speed lower than the red line value, she will reject the male; whenever she hears a song at a speed higher than the red line value, she will accept that male as a mate. Where is the optimal place to draw this line?
Looking at the same graph with a red line on it, we can see that the red line divides each of the two song speed distributions into two parts. The area under the sick male distribution bounded on the right by the red line and on the bottom by the X axis is the total probability that the female will correctly reject a male when he is in fact sick. We shall denote this probability by P(Correct Rejection). Similarly, the area under the well male distribution that is bounded on the left by the red line and on the bottom by the X axis is the total probability that a female will correctly accept a male when he is healthy. We shall denote this by P(Hit).
With the line in this location, the female cannot avoid making two kinds of errors. The area in the dark blue region to the right of the red line defines the overall probability that the female will erroneously accept a sick male as a mate. This type of error is denoted as P(False Alarm). The area of the dark red region to the left of the red line defines the overall probability that the female will erroneously reject a well male and is denoted by P(Miss).
It should be obvious by looking at this graph that moving the red line to the right will reduce P(False Alarm) but it will increase P(Miss). Similarly, moving the red line to the left will reduce P(Miss) but increase P(False Alarm). Since the female cannot reduce the total probability of making some errors, the only way to find an optimal location for the red line is by minimizing the costs of the errors. For example, if false alarms are more costly than misses, then the optimal location for the red line will be at faster song speeds; if misses are more costly than false alarms, then she should set the red line at a lower song speed. To find the optimal location, we therefore need to consider the relative payoffs of each outcome, and her estimated probabilities that a given male is sick or healthy after hearing him sing.
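The trade-off between the two kinds of error can be illustrated with a short sketch. The means and standard deviations of the two song-speed distributions are hypothetical values chosen for the example, not measurements from the text. Normal curves are used only because the text assumes them.

```python
from statistics import NormalDist

# Hypothetical song-speed distributions (arbitrary units); not values from the text.
sick = NormalDist(mu=4.0, sigma=1.0)
well = NormalDist(mu=6.0, sigma=1.0)


def outcome_probabilities(red_line):
    """Areas on either side of the red line under each distribution."""
    return {
        "correct_rejection": sick.cdf(red_line),  # sick male, song below the line
        "false_alarm": 1 - sick.cdf(red_line),    # sick male, song above the line
        "miss": well.cdf(red_line),               # well male, song below the line
        "hit": 1 - well.cdf(red_line),            # well male, song above the line
    }


for line in (4.5, 5.0, 5.5):
    probs = outcome_probabilities(line)
    print(line, {name: round(value, 3) for name, value in probs.items()})
```

As the red line moves to faster song speeds, the false alarm probability falls while the miss probability rises, so no position removes both errors.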
Fitting the red line to payoffs and probabilities
Suppose that the payoffs to a female of accepting or rejecting well versus sick males can be summarized in the following payoff matrix:
Action    Male is well    Male is sick
Accept    R11             R12
Reject    R21             R22
Suppose that on average, a fraction P of the males in the population are well and (1–P) are sick. When the female hears a male sing at song speed w, she will update her estimate that he is healthy from P to P(well|w) and that he is sick from (1–P) to P(sick|w). She can now combine these updated probabilities with the relevant payoffs to compute expected values (average payoffs) for each action. The expected value for accepting this male as a mate will be
EV(accept) = P(well|w) × R11 + P(sick|w) × R12;
the expected value for rejecting this male will be
EV(reject) = P(well|w) × R21 + P(sick|w) × R22.
The optimal red line will occur at that w for which the expected value of accepting a male is equal to that for rejecting him; at higher song speeds, the female should accept males, and at slower song speeds, she should reject males. Setting the two expected values equal to each other and rearranging, we get that the critical song speed, w_{c}, is that for which
P(well|w_{c}) / P(sick|w_{c}) = (R22 – R12) / (R11 – R21)
We can simplify this further by assuming that the female used Bayesian methods (Web Topic 8.3) to update the probabilities that the male was well after hearing him sing. Specifically, she could have updated using the following formula:
P(well|w) = P × P(w|well) / [P × P(w|well) + (1 – P) × P(w|sick)]
Plugging the right-hand side of the Bayesian equation into the left side of the previous equation and rearranging, we get that the critical song speed, w_{c}, is the one for which
P(w_{c}|well) / P(w_{c}|sick) = [(1 – P) / P] × [(R22 – R12) / (R11 – R21)]
The left side of this equation is simply the ratio of the Y axis values for the well versus sick distributions at w_{c}. It is called the likelihood ratio and is usually denoted by β. If we increase w_{c}, the likelihood ratio becomes larger since we move further into the well distribution and out of the sick distribution:
The right side of this equation includes the ratio of the prior probabilities (odds ratio) and the ratio of the differences in payoffs between right and wrong choices in the two conditions (payoff ratio). The entire right side is called the operating level in signal detection theory. All of these numbers are fixed before the male begins to sing or the female begins to make a decision. One can also think of the operating level as the ratio of the costs of the two types of errors noted earlier, each discounted by the prior probability that it will occur. As sick males become more common ((1–P) increases), or the cost of false alarms increases (R12 versus R22), the numerator of the operating level increases relative to the denominator, and the appropriate value of w_{c} on the left side of the equation has to increase. If healthy males become more common (P increases), and/or the relative cost of misses increases (R21 versus R11), the right side of the equation decreases, and the optimal location for the red line, w_{c}, moves to lower song speeds.
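A minimal sketch of locating the red line follows, under several stated assumptions: the payoff indices are read as R11 = accept well, R12 = accept sick, R21 = reject well, R22 = reject sick; the two song-speed distributions are normal with a shared standard deviation; and every numerical value is hypothetical. With a shared standard deviation the likelihood ratio has a closed-form solution, which the code checks against the ratio of the two curves.

```python
import math
from statistics import NormalDist


def operating_level(p_well, r11, r12, r21, r22):
    """Prior odds (sick : well) times the payoff ratio.

    Assumed layout: r11 = accept well, r12 = accept sick,
                    r21 = reject well, r22 = reject sick.
    """
    return ((1 - p_well) / p_well) * ((r22 - r12) / (r11 - r21))


def critical_speed(mu_sick, mu_well, sigma, level):
    """Speed at which P(w|well) / P(w|sick) equals the operating level.

    Closed form for two normal curves that share one standard deviation.
    """
    midpoint = (mu_sick + mu_well) / 2
    return midpoint + sigma ** 2 * math.log(level) / (mu_well - mu_sick)


# Hypothetical distributions and payoffs, chosen only for illustration.
sick = NormalDist(mu=4.0, sigma=1.0)
well = NormalDist(mu=6.0, sigma=1.0)
payoffs = dict(r11=10, r12=0, r21=4, r22=6)

level_even = operating_level(0.50, **payoffs)  # well and sick males equally common
level_rare = operating_level(0.25, **payoffs)  # well males rarer
w_even = critical_speed(sick.mean, well.mean, 1.0, level_even)
w_rare = critical_speed(sick.mean, well.mean, 1.0, level_rare)

print(round(w_even, 3), round(w_rare, 3))
print(round(well.pdf(w_rare) / sick.pdf(w_rare), 3), round(level_rare, 3))
```

With these numbers the red line sits midway between the two means when well and sick males are equally common, and it moves to faster song speeds when well males become rarer.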
Discrete versus overlapping signals
The strategies for drawing red lines on meters when signals are discrete (Web Topic 8.1), and on pattern axes when signals overlap (this Web Topic), use the same ingredients. Both approaches depend on the differences in the payoffs of right versus wrong decisions, and not on the absolute values of individual payoffs. Both approaches require access to the signal coding scheme and the prior probabilities. Both invoke Bayesian updating upon receipt of a signal to define the optimal probability estimates before computing expected values of alternative actions. And both permit shortcuts to decision making if prior probabilities and payoff differences remain sufficiently stable for reasonable periods. In the case of overlapping signals, a receiver need only compare perceived signal properties to threshold values to make a quick decision. It does not even have to compute a Bayesian update since the ingredients for that update are incorporated into the determination of the optimal red line.
Further reading:
Macmillan, N. A. and C.D. Creelman. 2004. Detection Theory: A User’s Guide. 2^{nd} Edition. New York: Cambridge University Press.
Wiley, R.H. 1994. Errors, exaggeration, and deception in animal communication. In, Behavioral Mechanisms in Evolutionary Biology, (L.A. Real, ed.). Chicago: University of Chicago Press. pp. 157–189.
8.5 Prospect Theory
Background
While both animals and people often make decisions that are consistent with comparisons between expected values of alternative actions, there are also many exceptions that do not fit the classical model of decision making (Real 1996; Kahneman and Tversky 2000). Several deviations from classical predictions recur often enough in human economics and animal decision making that they have received special attention. They have all been called “paradoxes” because they represent behaviors contrary to what a rational person (or animal) should do. Two of the most widely cited examples are:
The St. Petersburg Paradox
Consider a game in which a coin is tossed. If it comes up heads, you win $2 and the game is over. If it comes up tails, you get a second toss. If this second toss comes up heads, you get $4 and the game ends, and if it is tails, you get another toss, and so on. At each stage, the payoff doubles the previous value. What is the expected value of this game? The probability of getting heads on the first toss is 0.5 and the payoff is $2. The discounted payoff if the game ends at this first stage is thus (0.5)($2) = $1. The probability of getting tails on the first toss and heads on the second toss is (0.5)(0.5) = 0.25, and the payoff at this stage is $4. The discounted payoff for a game that ends at the second stage is thus (0.25)($4) = $1. At each possible end point in this game, the discounted payoff will be again $1 since the probability of getting to the next stage is halved at each step, whereas the payoff is doubled. Since the game could go on forever, the expected value for the game is $1 + $1 + $1 + … = infinity. The classical prediction is that someone given a chance to play this game for a fee should agree to pay any amount since the average outcome is an infinitely large number. However, in practice, people are only willing to pay small amounts to play this game, and are thus acting in a risk averse manner. This is known in the economics literature as the St. Petersburg Paradox.
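The growth of the expected value can be seen by capping the game at a fixed number of tosses. The cap, and the rule that a run of tails reaching the cap pays nothing, are assumptions added for this sketch; the function name is likewise hypothetical.

```python
def truncated_expected_value(max_tosses):
    """Expected winnings if the game is stopped after at most max_tosses tosses."""
    total = 0.0
    for k in range(1, max_tosses + 1):
        probability = 0.5 ** k  # k - 1 tails followed by one head
        payoff = 2 ** k         # $2, $4, $8, ...
        total += probability * payoff  # each stage contributes exactly $1
    return total


for cap in (1, 10, 100):
    print(cap, truncated_expected_value(cap))
```

Each additional permitted toss adds one dollar to the expected value, so the sum has no upper bound as the cap is removed.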
The Allais Paradox
The St. Petersburg Paradox suggests that people generally avoid risky situations. However, it turns out that they are not consistent in this regard. For example, suppose that you are asked to make two decisions. For the first decision, you are invited to pick one of two lotteries to enter. A ticket to Lottery A has a 100% chance of paying you $30, and a ticket to Lottery B has an 80% chance of paying you $40 and nothing otherwise. If you are like most people, you will pick Lottery A over Lottery B. In this decision, you will favor the less risky option (100% vs. 80%).
In the second decision, you are invited to enter Lottery C, in which a ticket has a 25% chance of winning $30 and otherwise pays nothing, or Lottery D, in which a ticket has a 20% chance of paying $40 and otherwise pays nothing. If you are like most people, you will this time select Lottery D. You will thus choose the more risky option (20% vs. 25%). Since the first lottery in each decision is 1.25 times more likely to pay out winnings than the second (1.00/0.80 = 0.25/0.20 = 1.25), the relative odds of winning are the same in the two decisions. The payoffs are also the same ($30 vs. $40). Despite these similarities, people are routinely risk averse in the first decision and risk prone in the second.
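The numbers in the two decisions can be laid out in a few lines of Python. This is a minimal sketch: the dictionary layout and the function name are hypothetical conveniences, and the only inputs are the probabilities and payoffs stated above.

```python
import math

# (probability of winning, payoff in dollars) for each lottery described above
lotteries = {"A": (1.00, 30), "B": (0.80, 40), "C": (0.25, 30), "D": (0.20, 40)}


def expected_value(probability, payoff):
    """Payoff discounted by its probability; a losing ticket pays nothing."""
    return probability * payoff


ev = {name: expected_value(p, x) for name, (p, x) in lotteries.items()}
odds_ratio_first = lotteries["A"][0] / lotteries["B"][0]    # 1.00 / 0.80
odds_ratio_second = lotteries["C"][0] / lotteries["D"][0]   # 0.25 / 0.20

print(ev)
print(odds_ratio_first, odds_ratio_second)
print(math.isclose(odds_ratio_first, odds_ratio_second))
```

On expected values alone, Lottery B ($32) beats Lottery A ($30) and Lottery D ($8) beats Lottery C ($7.50), so it is the common preference for A in the first decision that departs from simple expected value maximization.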
The Four-Way Table
Inspired by the Allais Paradox, researchers again presented subjects with a choice between a non-risky lottery (Lottery A) in which there was a 100% probability of winning $X, and a risky lottery (Lottery B) in which there was a probability P less than 100% of winning $Y and otherwise winning nothing. They then varied both P and $Y systematically to produce different combinations of probability and payoff. For each choice situation, the researchers asked the subjects how large $X would have to be before the subjects felt equally inclined to choose Lottery A or Lottery B. The researchers were thus asking the subjects to indicate their subjective evaluations of risky gambles relative to sure bets. If the value of $X equaled the expected value of Lottery B, then the subjects would be considered risk insensitive: the fact that the second lottery was risky had no effect on their decision and they would be acting rationally. If the value of $X was less than the expected value of Lottery B, then they would be risk averse, since they would prefer a guaranteed lower amount to a larger average amount in a gamble. If the value of $X was larger than the expected value of Lottery B, then they would be treating this choice in a risk prone manner, since they overvalued the more risky option.
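The classification rule in this paragraph can be written as a small Python function. This is a sketch only: the function name, argument names, and the numerical tolerance used to decide when two amounts are "equal" are assumptions introduced for illustration.

```python
def risk_attitude(sure_amount, win_probability, win_payoff, tolerance=1e-9):
    """Classify a subject from the sure amount $X they treat as equal to the gamble."""
    expected = win_probability * win_payoff  # expected value of Lottery B
    if abs(sure_amount - expected) <= tolerance:
        return "risk insensitive"
    if sure_amount < expected:
        return "risk averse"   # accepts a smaller sure amount than the gamble's average
    return "risk prone"        # demands more than the gamble's average to give it up


# Hypothetical subjects facing an 80% chance of $40 (expected value $32)
for x in (30, 32, 35):
    print(x, risk_attitude(x, 0.80, 40))
```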
The results of such studies are very interesting and can be summarized in the following table (from Tversky and Kahneman 1992):
These examples suggest that people are risk prone when faced with gains of low probability or losses of high probability; they are risk averse when faced with gains of high probability, and losses of low probability. Clearly, these inconsistencies depend both on the value of the payoffs and on the absolute values of their probabilities.
The theory
Prospect Theory was proposed by psychologist Daniel Kahneman and cognitive scientist Amos Tversky in 1979 (Kahneman and Tversky 1979) and later extended as cumulative prospect theory (Tversky and Kahneman 1992) to reconcile the actual decision making of people with the classical economic theories based on comparisons of expected values. This theory basically replaces both the probabilities and the payoffs in classical theory with nonlinear transformations of each. Specifically:
Rescaling payoffs
The St. Petersburg Paradox can be understood if people (and animals) do not value a given payoff in an absolute sense, but instead compare it to their current state and needs. A hungry animal might value a given item of food much more highly than a well-fed animal. Bernoulli had suggested in 1738 that payoffs should be transformed into a new variable, which he called utility, that better reflected the value of a given payoff to a given decision maker. The transformation of payoffs into utility can follow any of several types of functions. If an animal is already well off, access to further payoffs is likely to have a decelerating (concave) relationship with utility:
Bernoulli pointed out that if one replaces the payoff at each stage of the St. Petersburg game with a utility value based on a decelerating curve like the one above, the expected value of the game is now equivalent to at most several dollars, and thus fits the amount of money that most people will pay to participate.
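Bernoulli's resolution can be illustrated numerically. The sketch below assumes a logarithmic utility function, the form usually attributed to Bernoulli, and ignores the player's starting wealth; both simplifications, along with the function names, are assumptions made for illustration rather than details given in the text above.

```python
import math


def expected_log_utility(max_tosses):
    """Sum over stages of (probability of ending at stage k) * ln(payoff at stage k)."""
    return sum((0.5 ** k) * math.log(2 ** k) for k in range(1, max_tosses + 1))


def certainty_equivalent(max_tosses):
    """Sure dollar amount whose log utility equals the game's expected log utility."""
    return math.exp(expected_log_utility(max_tosses))


for n in (5, 20, 200):
    print(n, certainty_equivalent(n))
```

Unlike the dollar expected value, the expected utility converges as more stages are allowed, and under these assumptions the sure amount judged equal to the game settles near $4, which is in line with the "several dollars" mentioned above.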
Conversely, if a decision maker were desperate, a small payoff might not help it much, but a larger payoff could greatly improve its condition. The appropriate transformation would then be an accelerating (convex) function like the following:
Prospect theory combines these two possible transformations into a single sigmoid curve centered around a reference point which we denote as a payoff of 0:
A decision maker starts with some existing state defined by its position on the X axis of this graph. The change in utility that is generated by access to a new payoff is computed by moving the appropriate direction and distance along the X axis from the starting point and then comparing the new utility value at that new location to the initial one. We can see that if the decision maker starts with a payoff above the reference level, each incremental positive payoff results in a decelerating change in its utility. If it starts out below the reference point, any positive increase in payoffs results in an accelerating change in utility. The process of imposing a reference point at the payoff where accelerated curves shift to decelerated ones is called framing. If the reference point is in fact the status quo, then any additional payoff that moves a decision maker to a higher utility level is considered a gain, and any payoff that leaves the decision maker at a lower utility is considered a loss.
It is easy to show that risky outcomes produce smaller average increases in utility than non-risky ones when a decision maker begins to the right of the reference point (Smallwood 1996). Such individuals should thus be risk averse. Conversely, if the decision maker begins to the left of the reference point, risky outcomes produce larger average increases in utility than non-risky ones. Such individuals should be risk prone. The fact that people in general are more strongly risk prone than risk averse has led to a slight reshaping of the typical framing curve so that it looks as follows:
Note that the part of the curve to the right of the reference point asymptotes much sooner and to a lower absolute value than does the curve to the left of the reference point. This reflects the fact that most people weigh losses more heavily than gains of the same size. In prospect theory, this asymmetrical curve is called the subjective value function to distinguish it from the earlier symmetrical utility function.
Rescaling probabilities
While rescaling payoffs into subjective values explained some of the biases and deviations in human decision making, it was not enough. Tversky and Kahneman thus proposed rescaling probabilities as well (Tversky and Kahneman 1992; Kahneman and Tversky 2000). There are two possible reference points when dealing with probabilities: certainty (P = 1.00) and impossibility (P = 0.00). As we saw in the Allais Paradox and the Four-Way Table, people tend to overweight low probabilities (which favors risk aversion for losses and risk prone behavior for gains), and underweight high probabilities (which favors risk aversion for gains and risk prone behavior for losses). Research also suggests that any transformation of probabilities should be asymmetrical: an increase in probability from 0.20 to 0.25 is perceived as much less important than a change from 0.95 to 1.00. According to these observed biases, a function that transforms probabilities into a new variable, called weights in prospect theory, is shown below:
Were weights simply proportional to probabilities, the function would lie along the dashed line. The weighting function rises above this line to the left of the graph (reflecting people’s overweighting of low probabilities), and falls well below the line on the right (reflecting underweighting of higher probabilities). The function is also obligingly asymmetrical with a more gentle concave curvature at low probabilities but a very acute convex curvature at higher values.
Combining Values and Weights
To compute an overall expected value for an action using prospect theory, each payoff that could result from that action is transformed into an appropriate subjective value and multiplied by the relevant weight (as determined by the prior graph and the probability of that payoff being realized). The discounted subjective values are then summed for all possible outcomes of that action to give a net subjective expectation. Subjective expectations for different alternative actions are then compared to make a decision.
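The combination step just described can be sketched in Python. The text above does not give equations for the subjective value or weighting functions, so the power-function value curve, the one-parameter weighting curve, and the parameter values (0.88, 2.25, 0.61) are assumptions: they are the forms and estimates commonly associated with Tversky and Kahneman (1992). Two simplifications should be noted. A power function does not level off the way the curves described above do, and a single weighting parameter is used here for both gains and losses.

```python
def subjective_value(payoff, alpha=0.88, beta=0.88, loss_aversion=2.25):
    """Assumed value function: concave for gains, convex and steeper for losses."""
    if payoff >= 0:
        return payoff ** alpha
    return -loss_aversion * ((-payoff) ** beta)


def decision_weight(probability, gamma=0.61):
    """Assumed weighting function: overweights low and underweights high probabilities."""
    numerator = probability ** gamma
    denominator = (numerator + (1.0 - probability) ** gamma) ** (1.0 / gamma)
    return numerator / denominator


def subjective_expectation(outcomes):
    """Sum of weight * subjective value over (probability, payoff) pairs."""
    return sum(decision_weight(p) * subjective_value(x) for p, x in outcomes)


# The four lotteries from the Allais Paradox section (losing outcomes pay 0)
lottery_a = subjective_expectation([(1.00, 30)])
lottery_b = subjective_expectation([(0.80, 40), (0.20, 0)])
lottery_c = subjective_expectation([(0.25, 30), (0.75, 0)])
lottery_d = subjective_expectation([(0.20, 40), (0.80, 0)])
print(lottery_a, lottery_b, lottery_c, lottery_d)
```

Under these assumed functions and parameters, the sketch ranks Lottery A above Lottery B but Lottery D above Lottery C, which is the pattern of choices described for the Allais Paradox.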
Biology and prospect theory
Prospect theory has been quite successful in predicting human decision making. As a result of his work on this and related theories, Daniel Kahneman received the Nobel Prize for Economics in 2002. Many of the biases that prospect theory seeks to explain in humans also show up in animals (Real 1996). Risk sensitivity is very common in animal decision making, and biases outlined in the Allais Paradox and the Four-Way Table have been described in animals (see Chapter 8). The theory thus seems to have generality.
One concern is that prospect theory has largely been derived by fitting arbitrary functions to observed behaviors: it is thus descriptive as opposed to being derived from and clearly linked to other fundamental principles. The subjective value and weighting functions have been tweaked and shifted until they fit the behaviors, and as a result, they provide reasonable predictive power. What they lack is explanatory power. Why is the subjective value function asymmetrical? Why are people and animals more loss averse than gain sensitive? These are questions that one would like to have anchored in other known processes and principles.
Some of the components of prospect theory can be rooted in biology (Trepel et al. 2005). For example, there are clear physiological reasons why an animal’s conversion of resources into survival and reproduction is likely to involve curved and not linear functions. Curved functions will automatically lead to optimal decision making that is risk sensitive (Smallwood 1996). Risk sensitivity can also be explained, in part, by known nonlinear scaling during sensory detection and perception (see Web Topic 8.6). However, many of the other characteristics of subjective valuation and probability weighting have no clear ties to known biological or physical laws and remain to be explained.
Literature cited
Kahneman, D. and A. Tversky. 1979. Prospect theory: an analysis of decision under risk. Econometrica 47: 263–291.
Kahneman, D. and A. Tversky. 2000. Choices, Values, and Frames. New York: Cambridge University Press.
Real, L.A. 1996. Paradox, performance, and the architecture of decision-making in animals. American Zoologist 36: 518–529.
Smallwood, P.D. 1996. An introduction to risk sensitivity: The use of Jensen's inequality to clarify evolutionary arguments of adaptation and constraint. American Zoologist 36: 392–401.
Trepel, C., C.R. Fox and R.A. Poldrack. 2005. Prospect theory on the brain? Toward a cognitive neuroscience of decision under risk. Cognitive Brain Research 23: 34–50.
Tversky, A. and D. Kahneman. 1992. Advances in prospect theory: cumulative representation of uncertainty. Journal of Risk and Uncertainty 5: 297–323.
8.6 Weber’s Law and Risk
The problem
Animals (and people) often have to choose between an action that has a sure and known consequence and an alternative action that can lead to any of several alternative consequences. The second option is said to be riskier than the first. Suppose the consequences in each case involve access to some resource such as food. If the difference between the consequences is in the amount of food provided, as opposed to the delay in receiving food after acting, animals (and people) tend to be risk averse: even if the expected value for the riskier action is somewhat larger than the sure bet, they will choose the sure bet. If, however, the consequences provide equal amounts of food but differ in the delay between the decision and receiving the food, animals (and people) are often risk prone: that is, they favor the riskier option (Kacelnik and Bateson 1996). This is a curious difference that demands an explanation.
Errors in payoff estimation
Optimal decision making requires comparisons between expected values of alternative actions. Expected values depend on the probabilities that different consequences of an action will occur, and the values of the payoffs for each consequence. Because animals cannot know the exact value of a given payoff until it is experienced, decision making relies on estimates of payoffs that are subject to error. These estimates are generated by pooling the information provided by ambient cues, signals from other animals, and prior experience. The result is a probability distribution of different possible values for a payoff. In many cases, this distribution will be bell-shaped (e.g., a normal distribution):
One can identify several key characteristics of such a probability distribution. The possible payoff value that is most likely is called the mode. In this example, that occurs at a payoff value of 30. The average or mean payoff value can be computed by discounting each possible payoff value by its probability and adding these all together (because the possible values are infinite, one should actually use an integral and not a sum here). The median is the payoff at which half of the overall probability is accounted for by payoffs less than it, and half by values greater than it. This particular probability distribution is symmetrical: as a result, the mean, median, and mode will occur at the same payoff value (30). Finally, we want a measure of the variation we might see if we randomly sampled this distribution 100 times. It seems intuitive that the wider the bell-shape of the distribution, the larger this measure of variation. A useful measure of variation when a distribution is normal is the standard deviation. For a normal distribution, moving one standard deviation away from the mean in one direction accounts for about 34% of the probability, so about 68% of the most likely payoff values lie within one standard deviation on either side of the mean. In this example, the standard deviation is 5.
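These properties can be checked for the example distribution (mean 30, standard deviation 5) using the NormalDist class in Python's standard statistics module. The variable names are illustrative; the distribution parameters are the ones given above.

```python
from statistics import NormalDist

payoff = NormalDist(mu=30, sigma=5)

# Probability between the mean and one standard deviation above it
within_one_sd_above = payoff.cdf(35) - payoff.cdf(30)

# Probability within one standard deviation on either side of the mean
within_one_sd_either_side = payoff.cdf(35) - payoff.cdf(25)

# Payoff that splits the probability in half
median = payoff.inv_cdf(0.5)

print(round(within_one_sd_above, 3))        # about 0.341
print(round(within_one_sd_either_side, 3))  # about 0.683
print(payoff.mean, median, payoff.mode)
```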
Not all probability distributions are so nicely symmetrical. In nature, distributions are often positively skewed such as this example:
With a positive skew in the distribution, the mean and median will occur at a higher payoff value than the mode. While less common in nature, distributions can also be skewed in a negative direction with the mean occurring at a smaller payoff value than the mode.
Weber’s Law
Weber’s Law builds on the recognition that animals (and people) estimate and measure quantities with some error. Suppose that a jay is trying to decide which of two peanuts is the heavier one before carrying it off to cache it for the winter. It lifts and shakes each peanut and then makes a decision on which to store. We can experimentally give the jay peanuts that are closer and closer together in weight. Eventually we will get to a point where the jay can still identify the heavier peanut, but giving it peanuts any more similar causes the jay to choose randomly. We have identified the just noticeable difference (JND) in peanut weight for these jays. Weber (1834) performed similar tests on people and observed that JNDs got larger as the average measurements on the compared items got larger. In fact, the ratio between the JND and the average magnitude of the two measurements tended to be a constant across a wide range of magnitudes. As a human example, suppose you can just barely identify the heavier of 95 g and 105 g weights. The average weight is 100 g and the JND is 10 g. The ratio between the JND and the average is 10%. According to Weber’s Law, you would also need a 10% difference to identify the heavier of two weights that averaged 1 kg. Now, however, the JND would be 10% of 1 kg or 100 g. This is a much larger minimum difference than for the 100 g weights. Put simply, Weber discovered that the perceptual error in measurement increases proportionally with the magnitude of the measurement. This finding has since been confirmed in many species and in each of the sensory modalities.
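The worked example with the 95 g and 105 g weights reduces to two lines of arithmetic, sketched here in Python. The function names are hypothetical; the numbers are those used in the paragraph above.

```python
def weber_fraction(lighter, heavier):
    """Ratio of the just noticeable difference to the average of the two magnitudes."""
    jnd = heavier - lighter
    average = (lighter + heavier) / 2
    return jnd / average


def predicted_jnd(fraction, average_magnitude):
    """Weber's Law: the JND is a constant fraction of the magnitude being judged."""
    return fraction * average_magnitude


k = weber_fraction(95, 105)       # 10 g / 100 g
jnd_at_1kg = predicted_jnd(k, 1000)  # grams
print(k, jnd_at_1kg)
```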
Some 26 years later, Fechner (1860) proposed that the fixed ratio between JND and average measurement value was likely a consequence of animals trying to achieve a large dynamic range in their sensory systems. An animal with a large dynamic range can measure very small magnitudes and very large ones. However, the cost of such a broad dynamic range is greater error in measuring larger magnitudes. Fechner suggested that if sensory organs and brains perceived stimulus magnitudes on a logarithmic scale, large dynamic range and the larger measurement errors could both be explained. Stevens (1957) challenged Fechner’s logarithmic scaling and proposed a power function alternative. While the dispute over what kind of scaling is actually used in animal sensory organs and brains continues to this day (Shettleworth 2001; Copelli et al. 2002; Johnson et al. 2002; Dehaene 2003), the original proposition that most sensory systems obey Weber’s Law remains widely accepted.
Weber’s Law and risk
All of this means that there are actually two levels at which chance can affect expected values during decisions. The first, which we have dealt with in prior discussions, concerns which of several alternative consequences will occur when an animal chooses a risky action. However, now we see that whether the decision maker selects a “sure bet” option or a risky option, the actual payoff experienced always depends on a random draw from some probability distribution. The difference between sure bets and risky options is that in the former, the draw will be made from a single distribution, whereas the draw for a risky option could be taken from any of several alternative distributions. Another way to look at risky options is to pool the probability distributions for each possible consequence into one single probability distribution. If there are two equally likely consequences for a risky action, the pooled distribution is the sum of the probabilities of the two distributions for each possible payoff value divided by two. If alternative consequences are not equally likely, then the pooled probability for any payoff will be more similar to that of the more likely consequence than to the less likely one.
What does Weber’s Law have to do with decision making? Perhaps quite a bit (Gibson et al. 1988; Reboreda and Kacelnik 1991; Bateson and Kacelnik 1995; Kacelnik and Bateson 1996; Kacelnik and Abreu 1998). Consider a decision maker who has to choose between action A, which will always result in a payoff drawn from distribution X, or action B, which will draw a random payoff from distribution Y 50% of the time or distribution Z the other 50% of the time. Suppose the modes for these three alternative distributions have values such that mode(Y) < mode(X) < mode(Z). According to Weber’s Law, the error in measuring payoffs from distribution Z will be greater than that for the two other distributions because the most likely payoffs in distribution Z are larger than for the other two distributions. The error for distribution Y will be less than for the other distributions by the same logic. Since the total probability in a distribution has to add to one, increasing the error in a distribution will decrease the probability at the mode, and decreasing the error will increase the probability at the mode. This is shown graphically below:
We can see that the probability of drawing the mode value of a distribution decreases with the magnitude of the mode and the resulting error in that distribution.
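This relationship can be sketched numerically. The example below assumes normal distributions whose standard deviation is a fixed fraction of the mode, as Weber's Law would suggest. The fraction (0.25) and the three mode values (10, 50, and 90) are hypothetical choices made only for illustration.

```python
from statistics import NormalDist

ASSUMED_WEBER_FRACTION = 0.25  # hypothetical ratio of standard deviation to mode


def payoff_distribution(mode):
    """Normal distribution whose spread grows in proportion to its mode."""
    return NormalDist(mu=mode, sigma=ASSUMED_WEBER_FRACTION * mode)


modes = (10, 50, 90)  # hypothetical modes for distributions Y, X, and Z
peaks = [payoff_distribution(m).pdf(m) for m in modes]

for m, peak in zip(modes, peaks):
    print(m, payoff_distribution(m).stdev, round(peak, 4))
```

Because the area under each curve must equal one, the distribution with the largest mode is the widest and therefore has the lowest peak.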
Now, let us just consider the two possible outcomes for action B, the risky action. If the decision maker elects to perform action B, it will draw the payoff from either the Y or the Z distribution:
In our particular example, an animal electing to perform action B is equally likely to draw a payoff from distribution Y as it is to draw the payoff from distribution Z. We can thus combine these two distributions with equal weighting to get the pooled payoff distribution for action B:
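The pooling step can be written out as a weighted average of the component distributions. In this sketch the normal shape, the Weber fraction of 0.25, and the modes of 10 and 90 for distributions Y and Z are assumptions made for illustration. The component weights are left as parameters so that consequences that are not equally likely can be pooled in the same way.

```python
from statistics import NormalDist

ASSUMED_WEBER_FRACTION = 0.25  # hypothetical ratio of standard deviation to mode
Y = NormalDist(mu=10, sigma=ASSUMED_WEBER_FRACTION * 10)  # hypothetical distribution Y
Z = NormalDist(mu=90, sigma=ASSUMED_WEBER_FRACTION * 90)  # hypothetical distribution Z


def pooled_pdf(value, components):
    """Probability density at value for (probability, distribution) components."""
    return sum(weight * dist.pdf(value) for weight, dist in components)


def pooled_mean(components):
    """Mean of the pooled distribution: the weighted average of component means."""
    return sum(weight * dist.mean for weight, dist in components)


action_b = [(0.5, Y), (0.5, Z)]  # each consequence is equally likely

print(pooled_mean(action_b))
print(pooled_pdf(10, action_b), pooled_pdf(90, action_b))
```

With these assumed values the pooled distribution has two peaks, and the taller one sits at the lower payoff because the narrow Y distribution concentrates its probability near its mode.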
We now combine the distribution for action A with the pooled distribution for action B:
There are several ways that a decision maker could use this information (Kacelnik and Abreu 1998). The means for these two distributions are approximately equal (payoff = 50), so the fact that many animals faced with such a decision favor action A when the issue is the amount of food acquired and action B when the difference is in the delay in reward delivery suggests that they do not rely only on the mean values. The mode for action B (payoff = 10) falls at a much lower payoff than that for action A (payoff = 50). This would explain the data because animals generally seek to maximize their food intake, but to minimize their delay in getting that food. The same is true for the medians (median for action A = 50 and for action B = 42).
If decisions are preceded by randomly drawing a single sample from each of the two distributions and comparing the results, predictions are a bit more complicated but still possible. Sometimes, by chance, the decision maker will draw a higher payoff from the distribution for action B than for A; other times, it will draw the reverse. The question is then what fraction of the time it will draw a better value for the risky option than for the less risky one. The methods for computing this fraction are given in Kacelnik and Abreu (1998). In the example that we have been considering, action B would provide a larger payoff than action A in 47% of the draws. This means that action A will provide a larger draw more often than will action B. Again, if the choice involves different amounts of food, we would expect such decision makers to choose action A because they maximize food intake and are more likely to draw a larger payoff from the X distribution than from the pooled Y and Z distributions. If the choice involves different delays in obtaining that food, then they would select action B because they are more likely to draw the smaller delay from the pooled Y and Z distributions.
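A rough version of this calculation can be sketched in Python. This is not the method of Kacelnik and Abreu (1998), which is not reproduced here. Instead it uses a standard shortcut that applies only when the distributions are assumed to be normal and independent: the difference between two such draws is itself normally distributed. The Weber fraction (0.25) and the mode for distribution Z (90) are hypothetical; the modes for X (50) and Y (10) follow the values quoted earlier for actions A and B.

```python
from math import hypot
from statistics import NormalDist

ASSUMED_WEBER_FRACTION = 0.25  # hypothetical ratio of standard deviation to mode


def payoff_distribution(mode):
    """Normal distribution whose spread grows in proportion to its mode."""
    return NormalDist(mu=mode, sigma=ASSUMED_WEBER_FRACTION * mode)


X = payoff_distribution(50)  # action A
Y = payoff_distribution(10)  # action B, first consequence
Z = payoff_distribution(90)  # action B, second consequence (assumed mode)


def prob_first_exceeds_second(first, second):
    """P(first draw > second draw) for independent normal distributions."""
    difference = NormalDist(first.mean - second.mean,
                            hypot(first.stdev, second.stdev))
    return 1.0 - difference.cdf(0.0)


# Action B draws from Y or Z with equal probability
p_b_beats_a = (0.5 * prob_first_exceeds_second(Y, X)
               + 0.5 * prob_first_exceeds_second(Z, X))
print(round(p_b_beats_a, 3))
```

Under these assumed values the fraction comes out somewhat below one half, so action A yields the larger draw more often, in line with the reasoning above. The exact figure depends on the assumed spread and should not be read as a reconstruction of the published example.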
Supporting data
Risk sensitive decision making was at first assumed to depend on the absolute variation among the payoffs of a risky choice. This meant that one could use the standard deviation of the relevant distribution, or its square (the variance), as an index of risk. Given two alternative actions each with multiple possible payoffs, the one with the higher variance in payoffs would be avoided by risk averse animals and favored by risk prone ones. However, if Weber’s law is playing an important role in decision making, the best predictor of risk sensitivity would not be the absolute measures of variability but instead the ratio between the error and the average values being measured. One commonly used ratio that fits this description is the coefficient of variation (CV). This is computed as the ratio between the standard deviation and the mean of a distribution. If Weber’s Law does contribute to risk sensitivity as we have described earlier, it should be the case that animals (and people) consider a choice with a higher CV as more risky than one with a lower CV, and they should be indifferent to a choice in which both alternatives had the same CV, even though one might have higher absolute variability as measured by variance or standard deviations. This is a prediction that can be studied.
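The contrast between absolute and relative measures of variability can be seen in a small Python example. The two payoff sets are hypothetical and were chosen so that they differ a hundredfold in variance but have the same coefficient of variation.

```python
from statistics import mean, pstdev, pvariance


def coefficient_of_variation(payoffs):
    """Standard deviation divided by the mean of the possible payoffs."""
    return pstdev(payoffs) / mean(payoffs)


small_option = [8, 12]     # hypothetical equally likely payoffs, mean 10
large_option = [80, 120]   # hypothetical equally likely payoffs, mean 100

for option in (small_option, large_option):
    print(option, pvariance(option), pstdev(option), coefficient_of_variation(option))
```

If variance were the relevant index of risk, the second option would count as far riskier than the first. If the CV is the relevant index, as Weber's Law would predict, the two options are equally risky.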
Recent work suggests that, as predicted, the CV of the relevant distributions is a better predictor of risk sensitivity than either standard deviations or variances in a wide variety of animals (Weber et al. 2004; Shafir et al. 2005). It must be remembered, however, that a good fit of data to a particular model is a necessary but not sufficient condition for believing that this model is the true cause of those results. One must also consider alternative hypotheses that can explain the same data, and then identify critical experiments or observations that will discriminate between them (Platt 1964). In this regard, several authors have suggested that similar data could arise from learning processes that do not need to invoke Weber’s Law per se (Lockhead 2004; Weber et al. 2004). As noted by several authors, Weber’s Law, components of Prospect Theory (see Web Module 1.5), Jensen’s inequality, and processes such as associative learning may all play roles in the observed risk sensitivity of animals and people.
Literature cited
Bateson, M. and A. Kacelnik. 1995. Preferences for fixed and variable food sources—variability in amount and delay. Journal of the Experimental Analysis of Behavior 63: 313–329.
Copelli, M., A.C. Roque, R.F. Oliveira and O. Kinouchi. 2002. Physics of psychophysics: Stevens and Weber-Fechner laws are transfer functions of excitable media. Physical Review E 65: Art. No. 060901.
Dehaene, S. 2003. The neural basis of the Weber-Fechner law: a logarithmic mental number line. Trends in Cognitive Sciences 7: 145–147.
Fechner, G.T. 1860. Elemente der Psychophysik. Leipzig: Breitkopf und Härtel.
Gibson, J., R.M. Church, S. Fairhurst and A. Kacelnik. 1988. Scalar expectancy theory and choice between delayed rewards. Psychological Review 95: 102–114.
Johnson, K.O., S.S. Hsiao and T. Yoshioka. 2002. Neural coding and the basic law of psychophysics. Neuroscientist 8: 111–121.
Kacelnik, A. and F.B.E. Abreu. 1998. Risky choice and Weber’s law. Journal of Theoretical Biology 194: 289–298.
Kacelnik, A. and M. Bateson. 1996. Risky theories—The effects of variance on foraging decisions. American Zoologist 36: 402–434.
Lockhead, G.R. 2004. Absolute judgments are relative: A reinterpretation of some psychophysical ideas. Review of General Psychology 8: 265–272.
Platt, J. 1964. Strong inference. Science 146: 347–353.
Reboreda, J.C. and A. Kacelnik. 1991. Risk-sensitivity in starlings: variability in food amount and food delay. Behavioral Ecology 2: 301–308.
Shafir, S., G. Menda and B.H. Smith. 2005. Caste-specific differences in risk sensitivity in honeybees, Apis mellifera. Animal Behaviour 69: 859–868.
Shettleworth, S.J. 2001. Animal cognition and animal behaviour. Animal Behaviour 61: 277–286.
Stevens, S.S. 1957. On the psychophysical law. Psychological Review 64: 153–181.
Weber, E.H. 1834. De Pulsu, Resorptione, Auditu et Tactu: Annotationes Anatomicae et Physiologicae. Leipzig: Koehler.
Weber, E.U., S. Shafir and A.R. Blais. 2004. Predicting risk sensitivity in humans and lower animals: Risk as variance or coefficient of variation. Psychological Review 111: 430–445.
8.7 Brains and Decision Making
Overview
One of the most exciting developments in the last decade is the increasing ability to monitor the neurobiology of decision makers. In animals, this has traditionally involved inserting electrodes into selected brain areas and monitoring relative activity as the animal makes a decision. Recent technical advances such as two-photon imaging and optogenetics have pushed the envelope even further by identifying events at the individual neuron and sub-neuron levels (Homma et al. 2009; Knopfel et al. 2010). Popular animal systems include roundworms, sea slugs, leeches, honeybees, zebrafish, zebra finches, mice, rats, and monkeys. In people, non-invasive functional magnetic resonance imaging (fMRI) uses the increased metabolic activity of working neurons to track the steps as a person makes different kinds of decisions. These technologies have encouraged neurobiologists to determine whether animals or people really have the necessary machinery for Bayesian updating and optimal decision making, or instead are just relying on some very clever heuristics. They also have looked for possible causes for the biases seen in animal and human decision making. Suitable areas of the brain for these functions have now been located in animals and humans, and this has encouraged a melding of economics and neurobiology into a field called neuroeconomics (Glimcher and Rustichini 2004).
Basic structure of the human brain
The human brain is divided into a series of lobes and internal regions (Figure 1):
Figure 1: (A) The four lobes of the human cerebral cortex and some of their functions. Decision making involves the most anterior (near the face) parts of the frontal lobes, parts of the parietal lobes, and deep inside, parts of the temporal lobes. (B) Top view of human cerebral cortex. Note that the brain is largely divided into left and right hemispheres. Decisions based on more confident probability and payoff estimates tend to activate relevant areas in the left hemisphere; less certain decisions may activate the corresponding regions but in the right hemisphere.
The large and prominently fissured mass on the top is called the cerebrum or cortex. It is divided at most points into right and left hemispheres. Each hemisphere hosts four regions called lobes. While each lobe is in fact multifunctional, one can roughly assign reasoning and voluntary movements to the frontal lobe, audition and some higher level visual processing to the temporal lobe, touch and skin sensations to the parietal lobe, and vision to the occipital lobe. Deep inside the folds in each hemisphere where the temporal and frontal lobes meet are the insulae; these deal with visceral functions and taste. Below the cortex is a complex of regions known as the subcortex. This includes the striatum (also known as the basal ganglia), which consists of two parallel structures that begin in the frontal lobe and arc backwards into each temporal lobe, down, and then forwards again (Figure 2).
Figure 2: The basal ganglia (collectively called the striatum) are located in the center of the brain just below the outer cortex. The major components are the caudate nucleus, the putamen, and the globus pallidus. The ventral striatum (an important area that stores utility values for rewards of actions) consists of ventral and medial parts of the caudate nucleus and the putamen.
This area starts, regulates, and stops voluntary motor actions. Interdigitated with the striatum is another set of subcortical structures called the limbic system (Figure 3):
Figure 3: The limbic system (dark orange in picture) is located in the same subcortical region as the basal ganglia and the two actually wrap around each other. The major components are the cingulate gyrus, an arcing region in each hemisphere at the top of the subcortex, the amygdala and the hippocampus in the lower region of each temporal lobe, and the hypothalamus in the center. Many parts of the prefrontal cortex and the ventral striatum interact extensively with the limbic system during decision making. The ventral striatum is thought to store weighted reward payoff information, and the amygdala to store weighted costs. The hippocampus oversees the storage of memories, including templates of signals and cues that are associated with payoffs and probabilities. The hypothalamus generates “somatic marker” physiological responses to stimuli. The anterior region of the cingulate seems to monitor all decision making, at times helping focus on rational decisions, and at others, alerting the rest of the brain that actual payoffs are different from expected values. The prefrontal lobes play major roles in comparing probabilities, payoffs, and expected values.
These include the cingulate cortex which is a large band above the striatum, and the amygdala and hippocampus zones in the lower parts of each temporal lobe. The limbic system controls emotion (cingulate and amygdala) and regulates what gets stored as memories in the brain (hippocampus). The cerebrum connects to the spinal cord successively through the diencephalon, the midbrain, and the brainstem, all of which handle switching and routing functions for nerve traffic coming into and out of the brain. The lower part of the diencephalon, the hypothalamus, regulates a number of autonomic functions such as body temperature, hunger, thirst, reproduction, and circadian rhythms. The brainstem hosts two systems that modulate activity throughout the brain using specific transmitter substances. One, the dopamine system, plays an important role in assigning reward values to recent stimuli. The second, the norepinephrine system, modulates mental arousal and vigilance. The cerebellum, which controls posture, coordination, and balance, is nestled between the cortex and the midbrain at the rear of the brain.
Rational decision making in human brains
Economists and psychologists had speculated that the brain might have two separate decision-making systems: one that was fast and heuristic, and a second that was slower but more rational (Kahneman and Tversky 2000; Camerer et al. 2005). The reality has turned out to be a bit more complicated (McClure et al. 2004; Glimcher et al. 2005; Sugrue et al. 2005; Trepel et al. 2005; Sanfey et al. 2006). Some parts of the brain, such as the prefrontal areas of the frontal lobes, seem to be active during any decision process. At least three sub-regions within the prefrontal cortex are activated during decision making (Figure 4). The ventromedial zone is independently sensitive to changes in outcome probabilities and payoffs, but also contributes to their combination as expected values (Knutson et al. 2005; Daw et al. 2006; Sanfey et al. 2006). The adjacent orbitofrontal zone seems involved with contrasts between alternative possible payoffs, but may focus more on losses (and aversive actions) than on gains (and appetitive actions) (O’Doherty et al. 2003; Ursu and Carter 2005; Daw et al. 2006). The dorsolateral zone also appears to track current estimates of expected values and is tightly linked to a final decision zone in the posterior part of the parietal lobe (Kim and Shadlen 1999; Trepel et al. 2005; Sanfey et al. 2006).
Figure 4: Prefrontal lobe regions of human brain activated during decision making. See text for specific roles in this process.
Activity in this latter area remains sensitive to changes in probabilities and payoffs of alternative consequences suggesting that final commitments to action do not take place earlier in the prefrontal regions (Glimcher 2003; Glimcher et al. 2005; Sugrue et al. 2005). Once this parietal region is activated, the next step is performance of an action. Many of these steps may be lateralized: when probabilities and payoffs are relatively certain, the relevant parts of the left hemisphere handle the decision making; when alternatives are equally likely or probabilities are uncertain, the relevant regions in the right hemisphere dominate the decision process (Kim et al. 2004; Knutson et al. 2005). These and similar studies have verified the existence of explicit brain regions that can track probabilities, payoffs, and expected values for multiple alternatives and compare them for rational decision making.
Biased decisions in human brains
What happens when decisions are less than rational? Instead of invocation of a second and separate decision system, recent research suggests that biased decisions are generated when other brain centers such as the limbic system and striatum modulate the rational process (McClure et al. 2004; Yarkoni et al. 2005; Sanfey et al. 2006; Tom et al. 2006). The ventral striatum appears to be a general repository of positive (gain) payoff information (O’Doherty et al. 2004; Knutson et al. 2005; Daw et al. 2006). The amygdala has been proposed as a general repository of negative (loss) payoff information (Dalgleish 2004). Neither site appears to store absolute payoff values but instead converts estimates into “utilities” as a function of the current physiological state of the animal, levels of risk, the degree to which contexts limit choice, and historical associations with similar situations. The striatum and amygdala have tight links to the ventromedial and orbitofrontal prefrontal zones respectively where their weighted utility estimates are then played against the more direct estimates of rational decision making. The degree to which the decision is rational appears to depend on the strength of striatal and amygdala inputs relative to the ongoing rational process (McClure et al. 2004; De Martino et al. 2006). Note that the processing of losses and gains in separate brain regions could explain the observed higher biases for losses than for gains if the influence of the amygdala were generally greater than that of the ventral striatum.
Recent fMRI studies indicate that the amygdala is a primary source of risk sensitivity and framing biases (De Martino et al. 2006). A second source of decision bias is the presence of somatic markers (Bechara and Damasio 2005). Somatic markers are combinations of autonomic responses such as accelerated heart rate, perspiration, heat and cold flashes, or general muscle tension that are triggered (usually by the hypothalamus) when certain kinds of signals or cues are perceived by the decision maker. Some somatic markers are instinctive, whereas others are acquired from prior experience. They provide one possible mechanism for invoking the past in a linear operator process. When activated, a somatic marker acts as an additional cue that the ventromedial prefrontal cortex needs to consider during its melding of direct estimates and input from the striatum and amygdala.
Much of the remaining decision machinery in the mammalian brain seems to be devoted to updating and error correction following a decision. Once an action is complete, the prefrontal cortex, amygdala, and striatum all receive input that allows them to compare the actual versus previously anticipated payoffs. When there is a large difference between these values, the anterior cingulate, which is an active observer of all decision making, and the orbitofrontal cortex alert the brain to this situation; a particularly strong difference between expected and observed payoffs may also activate the insula (O’Doherty et al. 2003; Dalgleish 2004). If the rewards of the action exceed expectations, the dopamine system in the brainstem increases its activity, and if rewards are less than expected, then the dopamine system reduces its activity (Pessiglione et al. 2006). Where observed and expected rewards are similar, the dopamine system activity remains unchanged. Because the dopamine system projects throughout the brain, this provides a global broadcast of the effectiveness of the latest decision (Schultz 1998). The amygdala, striatum, and the dorsolateral cortex then play key roles in updating stored payoffs based on this recent experience. In this context, the amygdala is thought to rescale both gain and loss payoffs into weighted utilities and then feed these biases to the orbitofrontal cortex (Dalgleish 2004; De Martino et al. 2006; Paton et al. 2006). The dorsal striatum is also involved in updating once actual payoffs can be evaluated, and may play a role in establishing heuristic short cuts for future encounters (O’Doherty et al. 2004). Finally, the hippocampus and other nearby regions of the temporal lobe oversee the updating or addition of memory templates for any recent cues or signals that facilitated the decision (Greene et al. 2006; Moscovitch et al. 2006; Svoboda et al. 2006).
Other vertebrate brains and decision making
The basic brain structures and functions described above appear to be shared among rats, monkeys, and humans. What about other vertebrates? Until recently, the regions of birds’ brains followed a completely different nomenclature. Careful comparisons have now revealed that there are very strong parallels between avian and mammalian brains, and most of the structures identified above for mammals are now believed to have counterparts in the brains of birds (Jarvis et al. 2005). Since birds and mammals show similar evidence of rational decision making and similar biases when irrational, it seems likely that neurobiologists will eventually demonstrate similar processes in both taxa. Although amphibians have a much smaller cerebrum than birds or mammals, they also share many of the subcortical structures described earlier including basal ganglia, amygdala, striatum, hippocampus, and hypothalamus (Striedter 1997; Endepols et al. 2005; Medina et al. 2005). Several of these structures even appear to be present with similar functions in fish (Broglio et al. 2005; Portavella and Vargas 2005). This suggests that the basic processes of decision making evolved early in the vertebrate line and have only been elaborated by subsequent evolution (see also the multiple chapters relating brain regions to specific ecological and behavioral tasks in Dukas and Ratcliffe 2009).
Invertebrate brains and decision making
Invertebrates have quite different brain structures from vertebrates. However, research on slugs and leeches suggests that successive stimuli lead to cumulative updating, and that decisions depend on mutual levels of activity in multiple nerve cells as opposed to simple association and switching circuits (Esch and Kristan 2002; Esch et al. 2002; Jing and Gillette 2003; Briggman et al. 2005). The brains of most insects consist of sensory processing ganglia that provide input to mushroom bodies where associative learning and decision making take place (Fahrbach 2006; Menzel et al. 2006). While there has as yet been little neurobiological work on decision making in insects, genetic and biochemical studies suggest that decisions again involve multiple but interacting regions within the mushroom bodies (Heberlein et al. 2004; Abramson et al. 2005).
Figure 5: The brain of a honeybee. The large masses (ME and LO) on each side do most of the processing of visual stimuli. Olfactory stimuli are processed in the paired lobes on the bottom side of the brain (AL). Decision making and memory appear to be the main functions of the large mushroom bodies (MC and LC) positioned at the top and between the various sensory lobes. (After Menzel 1983.)
Literature cited
Abramson, C.I., C. Sanderson, J. Painter, S. Barnett and H. Wells. 2005. Development of an ethanol model using social insects: V. Honeybee foraging decisions under the influence of alcohol. Alcohol 36: 187–193.
Bechara, A. and A.R. Damasio. 2005. The somatic marker hypothesis: A neural theory of economic decision. Games and Economic Behavior 52: 336–372.
Briggman, K.L., H.D.I. Abarbanel and W.B. Kristan. 2005. Optical imaging of neuronal populations during decision-making. Science 307: 896–901.
Broglio, C., A. Gomez, E. Duran, F.M. Ocana, F. Jimenez-Moya, F. Rodriguez and C. Salas. 2005. Hallmarks of a common forebrain vertebrate plan: Specialized pallial areas for spatial, temporal and emotional memory in actinopterygian fish. Brain Research Bulletin 66: 277–281.
Camerer, C., G. Loewenstein and D. Prelec. 2005. Neuroeconomics: How neuroscience can inform economics. Journal of Economic Literature 43: 9–64.
Dalgleish, T. 2004. The emotional brain. Nature Reviews Neuroscience 5: 582–589.
Daw, N.D., J.P. O’Doherty, P. Dayan, B. Seymour and R.J. Dolan. 2006. Cortical substrates for exploratory decisions in humans. Nature 441: 876–879.
De Martino, B., D. Kumaran, B. Seymour and R.J. Dolan. 2006. Frames, biases, and rational decision-making in the human brain. Science 313: 684–687.
Dukas, R. and J. Ratcliffe (eds). 2009. Cognitive Ecology II. Chicago, IL: University of Chicago Press.
Endepols, H., K. Roden and W. Walkowiak. 2005. Hodological characterization of the septum in anuran amphibians: II. Efferent connections. Journal of Comparative Neurology 483: 437–457.
Esch, T. and W.B. Kristan. 2002. Decision-making in the leech nervous system. Integrative and Comparative Biology 42: 716–724.
Esch, T., K.A. Mesce and W.B. Kristan. 2002. Evidence for sequential decision making in the medicinal leech. Journal of Neuroscience 22: 11045–11054.
Fahrbach, S.E. 2006. Structure of the mushroom bodies of the insect brain. Annual Review of Entomology 51: 209–232.
Glimcher, P.W. 2003. The neurobiology of visual-saccadic decision making. Annual Review of Neuroscience 26: 133–179.
Glimcher, P.W., M.C. Dorris and H.M. Bayer. 2005. Physiological utility theory and the neuroeconomics of choice. Games and Economic Behavior 52: 213–256.
Glimcher, P.W. and A. Rustichini. 2004. Neuroeconomics: The consilience of brain and decision. Science 306: 447–452.
Greene, A.J., W.L. Gross, C.L. Elsinger and S.M. Rao. 2006. An fMRI analysis of the human hippocampus: Inference, context, and task awareness. Journal of Cognitive Neuroscience 18: 1156–1173.
Heberlein, U., F.W. Wolf, A. Rothenfluh and D.J. Guarnieri. 2004. Molecular genetic analysis of ethanol intoxication in Drosophila melanogaster. Integrative and Comparative Biology 44: 269–274.
Homma, R., B.J. Baker, L. Jin, O. Garaschuk, A. Konnerth, L.B. Cohen and D. Zecevic. 2009. Wide-field and two-photon imaging of brain activity with voltage- and calcium-sensitive dyes. Philosophical Transactions of the Royal Society B-Biological Sciences 364: 2453–2467.
Jarvis, E., O. Gunturkun, L. Bruce, A. Csillag, H. Karten, W. Kuenzel, L. Medina, G. Paxinos, D.J. Perkel, T. Shimizu, G. Striedter, J.M. Wild, G.F. Ball, J. Dugas-Ford, S.E. Durand, G.E. Hough, S. Husband, L. Kubikova, D.W. Lee, C.V. Mello, A. Powers, C. Siang, T.V. Smulders, K. Wada, S.A. White, K. Yamamoto, J. Yu, A. Reiner and A.B. Butler. 2005. Avian brains and a new understanding of vertebrate brain evolution. Nature Reviews Neuroscience 6: 151–159.
Jing, J. and R. Gillette. 2003. Directional avoidance turns encoded by single interneurons and sustained by multifunctional serotonergic cells. Journal of Neuroscience 23: 3039–3051.
Kahneman, D. and A. Tversky. 2000. Choices, Values, and Frames. New York: Cambridge University Press.
Kim, H., L.H. Somerville, T. Johnstone, S. Polis, A.L. Alexander, L.M. Shin and P.J. Whalen. 2004. Contextual modulation of amygdala responsivity to surprised faces. Journal of Cognitive Neuroscience 16: 1730–1745.
Kim, J.N. and M.N. Shadlen. 1999. Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nature Neuroscience 2: 176–185.
Knopfel, T., M.Z. Lin, A. Levskaya, L. Tian, J.Y. Lin and E.S. Boyden. 2010. Toward the second generation of optogenetic tools. Journal of Neuroscience 30: 14998–15004.
Knutson, B., J. Taylor, M. Kaufman, R. Peterson and G. Glover. 2005. Distributed neural representation of expected value. Journal of Neuroscience 25: 4806–4812.
McClure, S.M., D.I. Laibson, G. Loewenstein and J.D. Cohen. 2004. Separate neural systems value immediate and delayed monetary rewards. Science 306: 503–507.
Medina, L., A. Brox, I. Legaz, M. Garcia-Lopez and L. Puelles. 2005. Expression patterns of developmental regulatory genes show comparable divisions in the telencephalon of Xenopus and mouse: insights into the evolution of the forebrain. Brain Research Bulletin 66: 297–302.
Menzel, R., G. Leboulle and D. Eisenhardt. 2006. Small brains, bright minds. Cell 124: 237–239.
Moscovitch, M., L. Nadel, G. Winocur, A. Gilboa and R.S. Rosenbaum. 2006. The cognitive neuroscience of remote episodic, semantic and spatial memory. Current Opinion in Neurobiology 16: 179–190.
O’Doherty, J., H. Critchley, R. Deichmann and R.J. Dolan. 2003. Dissociating valence of outcome from behavioral control in human orbital and ventral prefrontal cortices. Journal of Neuroscience 23: 7931–7939.
O’Doherty, J., P. Dayan, J. Schultz, R. Deichmann, K. Friston and R.J. Dolan. 2004. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304: 452–454.
Paton, J.J., M.A. Belova, S.E. Morrison and C.D. Salzman. 2006. The primate amygdala represents the positive and negative value of visual stimuli during learning. Nature 439: 865–870.
Pessiglione, M., B. Seymour, G. Flandin, R.J. Dolan and C.D. Frith. 2006. Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 442: 1042–1045.
Portavella, M. and J.P. Vargas. 2005. Emotional and spatial learning in goldfish is dependent on different telencephalic pallial systems. European Journal of Neuroscience 21: 2800–2806.
Sanfey, A.G., G. Loewenstein, S.M. McClure and J.D. Cohen. 2006. Neuroeconomics: cross-currents in research on decision-making. Trends in Cognitive Sciences 10: 108–116.
Schultz, W. 1998. Predictive reward signal of dopamine neurons. Journal of Neurophysiology 80: 1–27.
Striedter, G.F. 1997. The telencephalon of tetrapods in evolution. Brain Behavior and Evolution 49: 179–213.
Sugrue, L.P., G.S. Corrado and W.T. Newsome. 2005. Choosing the greater of two goods: Neural currencies for valuation and decision making. Nature Reviews Neuroscience 6: 363–375.
Svoboda, E., M.C. McKinnon and B. Levine. 2006. The functional neuroanatomy of autobiographical memory: A meta-analysis. Neuropsychologia 44: 2189–2208.
Tom, S.M., C.R. Fox, C. Trepel and R.A. Poldrack. 2006. The neural basis of loss aversion in decision-making under risk. Science 315: 515–518.
Trepel, C., C.R. Fox and R.A. Poldrack. 2005. Prospect theory on the brain? Toward a cognitive neuroscience of decision under risk. Cognitive Brain Research 23: 34–50.
Ursu, S. and C.S. Carter. 2005. Outcome representations, counterfactual comparisons and the human orbitofrontal cortex: Implications for neuroimaging studies of decision-making. Cognitive Brain Research 23: 51–60.
Yarkoni, T., J.R. Gray, E.R. Chrastil, D.M. Barch, L. Green and T.S. Braver. 2005. Sustained neural activity associated with cognitive control during temporally extended decision making. Cognitive Brain Research 23: 71–84.
8.8 Measures of Discrete Signal Effectiveness
Introduction
The most efficient way to summarize discrete coding schemes is with matrices. Typically, these are two dimensional matrices with inputs (such as ambient conditions) listed along one axis, outputs (such as signals emitted) listed along the perpendicular axis, and cell values containing some measure of how often a given input and a given output co-occur. Different matrices can be used to characterize sender coding, signal propagation effects, and receiver assignments of propagated signals to expected categories. Or one could combine all of these effects into a single overall matrix. Below, we outline how one might collect data for a sender’s matrix, combine this with similar data on propagation and receiver matrices to compute an overall matrix, and use matrix algebra to compute various measures of signal effectiveness.
Format conventions
A coding matrix summarizes the conditional probabilities that a given input will result in a given output. Coding matrices are thus a special case of transition (also called stochastic) matrices, which are widely used in statistics, ecology, and Markov chain analyses. To keep the geometry of our matrix axes similar to treatments of continuous coding schemes (in which inputs are assigned to the horizontal axis and outputs to the vertical axis), we shall use “left stochastic matrices” in which we assign inputs to matrix columns and outputs to matrix rows. The columns in such a matrix should each add to 1.0. This is the opposite configuration from the “right stochastic matrices” used in many mathematical and ecological studies.
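A quick way to confirm that a matrix follows this convention is to check that each of its columns sums to 1. The following is a minimal sketch only; the matrix M and the tolerance tol are illustrative choices and do not come from the monkey example:

```matlab
% Hypothetical 3 x 2 left stochastic matrix: inputs are columns, outputs are rows.
% (Values chosen only to demonstrate the check.)
M = [0.7 0.1;
     0.2 0.3;
     0.1 0.6];
tol = 1e-12;                          % allowance for floating-point round-off
colSums = sum(M, 1);                  % one total per input (column)
isLeftStochastic = all(abs(colSums - 1) < tol) && all(M(:) >= 0);

% A right stochastic matrix (rows sum to 1) is obtained by transposing.
Mright = M';
rowSums = sum(Mright, 2);             % one total per row of the transposed matrix
```

Transposing in this way is also how a matrix taken from a study that uses the right stochastic convention could be brought into the format used here.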
Matrix manipulation: Matlab
We shall demonstrate computations below using MATLAB. This is a commercially available program that is specifically designed for this kind of mathematics. It is now available in both professional and student versions. MATLAB is produced by The Mathworks (http://www.mathworks.com/) and is available through a site license at many universities and colleges. All of the examples and routines outlined below are presented in MATLAB formats, and the functions we have written can be copied from this text and pasted into MATLAB M-files or used directly at the MATLAB prompt (>>). If performed at the prompt, all command lines should be followed by hitting the return key. The routines may also be adapted fairly easily to other environments such as Wolfram Research’s Mathematica (http://www.wolfram.com/).
Obtaining a sender’s coding matrix
Suppose we are studying a species of African monkey that can emit any of three different call types when it spots predators: S1, S2, S3. The three predators are leopards, eagles, and other predators like snakes. We are interested in measuring how regularly the monkeys assign the same signal type to a given predator including emitting no signal when no predator is present. We pick a sample period that is shorter than the interval between successive appearances of predators. We then observe these monkeys for a large number of sample periods and record the presence or absence of a predator in each period and note any alarm calls given. Because an initial alarm call often triggers mimicked calling by others in the troop, regardless of whether they saw a predator or not, we only record the first alarm call given in a calling bout. Suppose we thus accumulate 1,205 sample periods. We can assemble the raw data into a contingency table in which each cell indicates the number of times that a given condition (predator type or no predator present) co-occurred with a given signal (including no signal emitted). This might look as follows:
RAW SENDER FIELD DATA:
We can convert these raw data into a table of probabilities that we shall call the AND table: this summarizes the joint probability, based on our samples, that a given condition and a given signal will co-occur. The AND table is created by dividing all cell values by the grand total (here 1,205). In Matlab, we would first create the raw matrix, W, as:
>>W=[49 9 11 27;4 81 7 9;2 1 89 54;8 8 36 810]
Then the AND matrix is simply
>> Wand=W/1205
This gives us:
SENDER “AND” MATRIX:
We next compute the marginal subtotals for the AND matrix: the sums of the rows indicate the overall fractions of sample periods in which each signal option was given in our sample, and the sums of the columns provide the fractions of sample periods that each predator situation occurred. In Matlab, we can denote the row sums as the vertical vector B, and the column sums as the row vector P. Thus:
>>B=sum(Wand,2)
>>P=sum(Wand)
If we add those subtotals (in blue) to the AND matrix, we would get:
SENDER “AND” MATRIX WITH SUBTOTALS
Note that the overall sum of all cell values in an AND matrix should equal 1.0.
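The steps so far can be collected into one short script that also checks the totals. This is a sketch under one small change from the commands above: the grand total is computed from W and stored in a variable N (a name introduced here), so that the same lines work for any contingency table:

```matlab
% Raw counts from the text: rows are signal options, columns are predator conditions.
W = [49 9 11 27; 4 81 7 9; 2 1 89 54; 8 8 36 810];
N = sum(W(:));          % grand total, computed rather than typed in (1205 here)
Wand = W / N;           % joint ("AND") probabilities
B = sum(Wand, 2);       % row subtotals: fraction of periods with each signal option
P = sum(Wand, 1);       % column subtotals: fraction of periods with each condition
total = sum(Wand(:));   % should equal 1 apart from round-off
```

Both sets of subtotals should themselves add to 1.0, which is a convenient check against data-entry errors.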
To obtain the sender’s coding matrix, we would divide each cell in the AND matrix by its column subtotal (P). In Matlab, this takes a few steps:
>>PP=[P;P;P;P]
>>S=Wand./PP % (note that this is the ./ operator, not the / operator)
Converting the resulting probabilities into percentages, this gives us the sender coding matrix S:
SENDER CODING MATRIX (S):
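The percentage values can be regenerated from the raw counts W given earlier. The sketch below uses repmat instead of typing P four times, so that it also works for matrices with a different number of rows; the variable name Spercent is introduced here for illustration:

```matlab
% Raw counts from the text: rows are signal options, columns are conditions.
W = [49 9 11 27; 4 81 7 9; 2 1 89 54; 8 8 36 810];
Wand = W / sum(W(:));               % AND matrix
P = sum(Wand, 1);                   % column subtotals (priors)
PP = repmat(P, size(Wand, 1), 1);   % copy P into every row
S = Wand ./ PP;                     % element-wise division: each column now sums to 1
Spercent = round(1000 * S) / 10;    % percentages rounded to one decimal place
disp(Spercent)
```

Each column of S should add to 1.0 (100%), since every condition must result in exactly one of the four signal options.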
We next consider the case of a monkey receiver who is at the outer limit of the troop dispersion while feeding. Thus it will experience any alarm signals only after they have propagated some distance from the sender. Careful measurements have allowed our researchers to compile the following propagation matrix, T, which lists the possible emitted signals by sender monkeys as the inputs and an acoustical classification of these sounds (by the researchers) after they have propagated this distance as the outputs. Note that the occasional addition of wind and insect noise has created one more propagated signal than is actually emitted by senders:
PROPAGATION MATRIX (T):
[Table: propagation matrix T, 5 x 4. Columns: the four sender signals emitted. Rows: the five propagated signal categories.]
The distant receiver monkey has to try to match what it hears with one of the four possible signal options that it knows senders might adopt. Based on a variety of lab and field tests, our researchers have assembled the following assignment matrix (R) used by a typical receiver monkey at that distance from the sender:
RECEIVER ASSIGNMENT MATRIX (R):
[Table: receiver assignment matrix R, 4 x 5. Columns: the five propagated signal categories. Rows: the four assigned signals.]
We shall need some estimates of the prior probabilities that each predator condition is likely to occur. The best data we have available for this are the column subtotals for the AND matrix, P. Writing these probabilities as percentages, we get for the vector P:
PRIOR PROBABILITIES (P):
Chaining
One thing we might want to do is combine the separate sender, propagation, and receiver matrices into an overall cumulative matrix. Note that in addition to summarizing cumulative errors, this table will reveal the degree to which errors in one stage are corrected by errors in a later stage. As the overall matrix will reveal, such error correction is overwhelmed by cumulative errors and the cell values for the overall matrix are always no larger than the smallest equivalent cell among the contributing matrices.
To generate the overall matrix, we load each component matrix, (S for sender, T for transmission, and R for receiver):
We can combine the effects of propagation and sender error by computing the simple matrix product T*S. To compute the entire chain, we add the effects of the receiver assignments to get: R*T*S. The only trick to remember is to place each successive effect ahead of the one being modified in the product (i.e., pre-multiply). The overall (O) result of combining the sender, transmission, and receiver matrices is then:
Note that the dimensions of multiplied matrices have to be compatible. This means that the number of columns in the matrix to the left in any product must equal the number of rows of the matrix to the right. Here, S is a 4 (row) x 4 (column) matrix that is pre-multiplied by T which is a 5 x 4 matrix. The result will be a (5 x 4)*(4 x 4) = (5 x 4) matrix. We then pre-multiply our T*S result by R which is a 4 x 5 matrix. The overall result is a (4 x 5)*(5 x 4) = (4 x 4) matrix. The values in this (R*T*S) table provide the overall conditional probabilities that a receiver will assign what it received to one of the expected signals (rows) when a given predator situation was true (columns). Note also that some rounding errors have occurred at this MATLAB default resolution (all columns in these tables should add to 1.0). These can be corrected by dividing each number in a column by the column total to get:
OVERALL CODING MATRIX:
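The chaining and renormalization steps can be sketched as follows. S is rebuilt from the raw sender counts given earlier, but the entries of T and R below are hypothetical values invented purely to illustrate the dimensions (5 x 4 and 4 x 5) and the order of multiplication; they are not the researchers' measurements:

```matlab
% Sender coding matrix from the raw counts in the text (4 x 4).
W = [49 9 11 27; 4 81 7 9; 2 1 89 54; 8 8 36 810];
S = W ./ repmat(sum(W, 1), size(W, 1), 1);

% HYPOTHETICAL propagation matrix (5 propagated categories x 4 emitted signals).
T = [0.85 0.05 0.05 0.00;
     0.05 0.80 0.05 0.00;
     0.03 0.05 0.80 0.00;
     0.02 0.05 0.05 0.95;
     0.05 0.05 0.05 0.05];

% HYPOTHETICAL receiver matrix (4 assigned signals x 5 propagated categories).
R = [0.90 0.05 0.05 0.00 0.10;
     0.05 0.90 0.05 0.00 0.10;
     0.05 0.05 0.85 0.05 0.10;
     0.00 0.00 0.05 0.95 0.70];

TS = T * S;                       % (5 x 4)*(4 x 4) = 5 x 4
O = R * TS;                       % (4 x 5)*(5 x 4) = 4 x 4
% Renormalize columns in case rounded inputs left the totals slightly off 1.
O = O ./ repmat(sum(O, 1), size(O, 1), 1);
```

When the component matrices are held at full precision, the renormalization changes nothing; it matters only when the matrices have been typed in from rounded tables.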
Forward measures of signal effectiveness
Forward measures of signal effectiveness rate the regularity with which a given output is produced when a given input is true. This can be contrasted with backward measures which rate how regularly the correct input is inferred when a given output is received.
There are two kinds of forward measures for signal effectiveness: those that focus only on the coding matrix, and those that require both the coding matrix and the prior probabilities of the matrix inputs. The latter essentially combine the coding matrix and the priors to regenerate the relevant AND matrix. By itself, this would argue for just using AND matrices and not extracting the coding matrices from them. The reason for going to coding matrices is that these are less likely to vary with context than are the priors. If the coding matrix for senders is fairly fixed, a receiver monkey can also use it effectively in a variety of situations by simply updating its estimates of the prior probabilities of the various predators for each situation. Ideally then, one would want some measures that focus only on the coding matrix, and others that characterize a particular situation and thus combine the coding matrix with relevant priors. Luckily, both kinds of measures are available.
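The relationship between a coding matrix, the priors, and the AND matrix can be sketched in a few lines. Multiplying S by a diagonal matrix of priors scales each column by its prior and so regenerates the AND matrix; substituting a different set of priors gives the AND matrix for a new situation with the same coding. The vector Pnew below is a hypothetical set of priors chosen only for illustration:

```matlab
% Sender data from the text.
W = [49 9 11 27; 4 81 7 9; 2 1 89 54; 8 8 36 810];
Wand = W / sum(W(:));                        % original AND matrix
P = sum(Wand, 1);                            % priors observed in the sample
S = Wand ./ repmat(P, size(Wand, 1), 1);     % coding matrix

WandBack = S * diag(P);                      % regenerates the original AND matrix

% HYPOTHETICAL priors for a riskier site; the coding matrix S is unchanged.
Pnew = [0.10 0.10 0.10 0.70];
WandNew = S * diag(Pnew);                    % AND matrix expected at the new site
```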
Forward measures based only on the coding matrix: determinants
The determinant of a square matrix (one with identical numbers of rows and columns) is a measure of how heterogeneously values are distributed among the matrix cells. For a coding matrix, the most heterogeneous distribution possible is perfect coding in which there is a single cell in each row and column with a conditional probability of 1.0, and zeros in all other cells. The determinant of such a perfect coding matrix is 1.0 in absolute value (or using percentages, 100%); its sign depends only on the order in which the rows and columns happen to be listed. At the other extreme is a completely homogeneous matrix in which all cell values are identical. This is the case when all outputs are equally likely for a given input. The determinant of such a matrix is 0 (or 0%). Levels of cell value heterogeneity between these two extremes will yield intermediate determinant values. However, note that it does not take much homogenizing to make the determinant value small: the more similar any two columns or rows in a matrix, the closer the determinant will be to 0.
What if the coding matrix is not square, as will be the case for some animal signal systems? A solution has been proposed by Yanai et al. (2006), who suggest multiplying such a matrix by itself (actually premultiplying it by its transpose, the same matrix with rows and columns reversed) to generate a new square matrix. The square root of the determinant of this new square matrix is called a “generalized determinant.” When applied to an initially square matrix, this method gives the absolute value of the determinant that one would get by computing it directly for that matrix. Thus generalized determinants can place all signal matrices on the same 0–100% scale with perfect coding at 100% and random assignment at 0%. In Matlab, the relevant step for extracting the generalized determinant from a matrix S is:
>>deta=sqrt(det(S'*S))
where S' is the transpose of S (the same matrix but with rows and columns reversed). Thus if S has n rows and p columns, S'*S has dimensions (p x n)*(n x p) = (p x p). This new matrix tabulates the sums of the squares of the inputs. It is similar to the first step in converting a rectangular (n x p) data matrix with samples as the rows and measures as the columns into the relevant variance-covariance and correlation matrices in statistics: one “squares” the n x p data matrix into a p x p sums of squares matrix and then uses this to compute variances, covariances, and correlations between measures.
While one can compute a generalized determinant for any coding matrix (including transmission and receiver matrices as well as sender matrices) without reference to prior probabilities, this tacitly assumes that none of the inputs has zero prior probability (Kåhre 2002): if a prior is zero, then it really does not matter what the values are in the coding matrix, as any set of conditional probabilities will yield the same outcomes. Differences in the determinants of two coding matrices for both of which the same input has a prior of zero may be meaningless. We thus should restrict determinant measures to coding matrices in which all inputs have non-zero prior probability.
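The two extremes and the sender matrix can be compared with a short sketch. The anonymous function gdet is a name introduced here; the abs() inside it is an added guard against a tiny negative determinant arising from round-off and is not part of the formula given above:

```matlab
% Generalized determinant; abs() only protects sqrt from negative round-off.
gdet = @(M) sqrt(abs(det(M' * M)));

perfect = eye(4);                 % a single 1.0 in each row and column
uniform = ones(4) / 4;            % every output equally likely for every input

% Sender coding matrix rebuilt from the raw counts in the text.
W = [49 9 11 27; 4 81 7 9; 2 1 89 54; 8 8 36 810];
S = W ./ repmat(sum(W, 1), size(W, 1), 1);

dPerfect = gdet(perfect);         % upper extreme of the scale
dUniform = gdet(uniform);         % lower extreme of the scale
dSender  = gdet(S);               % intermediate value
dDirect  = abs(det(S));           % agrees with dSender because S is square
```

Because gdet accepts any matrix with at least as many rows as columns, the same function could be applied to a 5 x 4 propagation matrix without modification.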
Forward measures requiring both priors and the coding matrix
Average Consistency
The simplest forward measure that is specific to a given set of input prior probabilities is consistency. One first identifies the cell in each column of the measured matrix that contains the maximal value for that column. The row containing that maximal cell value is the dominant output for that input. For our sender matrix S, we can color the maximal cells in blue:
SENDER CODING MATRIX (S):
The average consistency of this matrix is computed by weighting each input’s maximal cell value by the prior probability that that input will occur, and then adding these products across all inputs. This is done simply in MATLAB using:
>>AC=max(S)*P' % (note transpose of P using prime operator)
For our sample sender’s matrix S above, the average consistency is 85.4%.
Perfect coding will result in an average consistency of 100%; random assignment of outputs to inputs will produce an average consistency equal to the reciprocal of the number of possible outputs. Note that average consistency is oblivious to the presence of more than one maximal value in the same row of the matrix: consistency only characterizes deviations from perfect coding along the vertical axis; it ignores any deviations along the horizontal axis. As a result, average consistency values are invariably larger than the determinant for the same matrix.
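The calculation, together with the random-assignment baseline, can be sketched as follows. Asking max for its second output also identifies the dominant output (row) for each input; the variable names colMax, dominantRow, and ACuniform are introduced here for illustration:

```matlab
% Sender data from the text.
W = [49 9 11 27; 4 81 7 9; 2 1 89 54; 8 8 36 810];
P = sum(W, 1) / sum(W(:));                    % priors (column subtotals)
S = W ./ repmat(sum(W, 1), size(W, 1), 1);    % coding matrix

[colMax, dominantRow] = max(S, [], 1);        % largest cell in each column and its row
AC = colMax * P';                             % prior-weighted average consistency
fprintf('Average consistency: %.1f%%\n', 100 * AC);   % prints 85.4%

% Baseline: with four equally likely outputs, consistency is 1/4.
ACuniform = max(ones(4) / 4, [], 1) * P';
```

For these data the dominant rows are 1, 2, 3, and 4 for the four columns respectively, so no two inputs share a dominant output.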
Index of Association
This measure was proposed for rating the heterogeneity of contingency tables by Goodman and Kruskal (1954). Because the AND matrix is simply the original contingency table divided by the total number of samples, the index of association can also be computed directly from the AND matrix. It finds the maximum row subtotal in that matrix and compares it to the maxima in each of the AND matrix columns. In words, the index evaluates how much knowing the current input helps predict the output when compared to just knowing which output is most common overall. The index varies from 1.0 (100%) for perfect coding to 0 (0%) for a table with uniform values throughout. For our sender’s coding matrix S above, and the priors in vector P, the index of association (called “lambda” by Goodman and Kruskal) is 48.7%. A MATLAB routine for this index is:
function L=lambda(S,P)
% Computes Goodman-Kruskal index of association for coding matrix S
% with inputs as columns and outputs as rows, and prior probability
% vector P. Values range 0–1 with 100% for perfect coding and 0 when
% all matrix cell values are equal.
% Goodman, L.A. and W.H. Kruskal. 1954. Measures of association for cross
% classifications, Part I. Journal of the American Statistical
% Association 49: 732–764.
[a b]=size(S);
D=ones(a,b);
for i=1:a
    D(i,:)=P;
end
ND=D.*S;            % AND matrix of joint probabilities
B=S*P';             % row subtotals of the AND matrix
PB=max(B);          % largest row subtotal
PNB=sum(max(ND));   % sum of the column maxima of the AND matrix
L=(PNB-PB)/(1-PB);
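As a hand-checkable illustration of the same arithmetic, the sketch below repeats the calculation inline for a hypothetical 2 × 2 coding matrix with equal priors. These values are assumptions made for the example and are not the matrix S used in the text.

```matlab
% Hypothetical coding matrix: columns are inputs, rows are outputs
S = [0.7 0.4;
     0.3 0.6];
P = [0.5 0.5];                        % hypothetical priors

ND  = S .* repmat(P, size(S,1), 1);   % AND matrix: [0.35 0.20; 0.15 0.30]
PB  = max(sum(ND, 2));                % largest row subtotal: 0.55
PNB = sum(max(ND));                   % sum of column maxima: 0.35 + 0.30
L   = (PNB - PB)/(1 - PB)             % (0.65 - 0.55)/0.45 = 0.2222
```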
Other Forward Measures
A number of other forward measures have been proposed. Many, such as Theil’s index of inequality (Theil 1970), use logarithms of probabilities as part of their computation. Since animal coding matrices (particularly those approaching perfect coding) contain zeros in some cell values, these measures may not be computable without making some substitutions or approximations (as is done for mutual information, of which many of these measures are close relatives).
A posteriori probabilities: Bayes’ theorem
All backward measures of signal effectiveness are based on the a posteriori probabilities that a given input is true after a given output (signal) has been received, assuming a given coding matrix and set of prior probabilities. While animals may invoke a variety of shortcuts or approximations to estimate a posteriori probabilities, none can do better than by invoking Bayes’ theorem. Therefore, most backward measures of signal effectiveness first compute a table of a posteriori probabilities using Bayes’ theorem. If one knew how an animal not using Bayesian methods updated information, one could substitute the probabilities generated by those mechanisms for the Bayesian values used in standard measures.
Below, we provide a MATLAB routine for computing the matrix of a posteriori probabilities predicted by Bayes’ Theorem, given a coding matrix S, and the set of prior probabilities in the horizontal vector P. Note that the output matrix from our Bayes M–file continues to observe the “left stochastic” convention that inputs are assigned to columns and outputs to rows. In the case of the a posteriori probabilities, this means that what were outputs in the coding matrix (e.g. signals) are now assigned to columns and the former inputs (e.g. conditions) are now assigned to the rows. This is because the cell values in the matrix give the conditional a posteriori probabilities that a given condition is true after having received a given signal and assuming that the initial coding matrix and priors are valid. The relevant MATLAB routine is:
function B=Bayes(S,P)
% S is the coding matrix with conditions as columns and signals as rows
% P gives prior probabilities as row vector with columns as conditions
% D.*S computes the "Condition AND Signal" matrix from S and P
% S*P' computes the row totals in the AND matrix and thus the total
% fraction of time that each signal is given across all conditions
% The last half of the expression divides each AND cell value by
% the corresponding row total; this is the actual Bayes calculation
% Values in the output matrix cells are a posteriori probabilities of
% each condition being true (rows) after having received the signal assigned
% to that column. Note reversal of axis assignments from S matrix.
% Inclusion of signals that are never used results in 0 Bayesian estimates.
[a b]=size(S);
D=ones(a,b);
DD=D;
for i=1:a
    D(i,:)=P;
end
AND=D.*S;     %Compute joint (AND) matrix of conditions and signals
SS=S*P';      %compute total fraction of time each signal is given
for i=1:a
    if (SS(i)>0)
        SP(i)=1/SS(i);
    else
        SP(i)=0;
    end
end
for i=1:b
    DD(:,i)=SP;
end
B=AND.*DD;    %compute a posteriori prob of conditions given signal
B=B';         %Reverse axes so that original inputs and outputs reversed
              % and columns add to 1.0.
Let us apply this routine to compute the a posteriori probabilities for the overall matrix O and the prior probabilities P when there are no transmission alterations to the emitted signals and receivers identify emitted signals perfectly. This yields the following a posteriori matrix with the emitted signal alternatives as the columns and the inferred predator conditions as the rows:
Putting these into the appropriate format and correcting for rounding errors,
A POSTERIORI PROBABILITIES:
Note that the a posteriori probabilities for each of the predators remain low even after receiving signals; however, these values should be compared to the pre-signal priors of 5%, 8%, and 12% for leopards, eagles, and other predators respectively. Clearly, receipt of signals increases the receiver’s estimated probabilities that the respective predator is present, and reduces the prior 75% probability that no predator is present.
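For readers who prefer a loop-free form, the following is a minimal vectorized sketch of the same Bayesian calculation. It is not the routine above, and the 2 × 2 matrix and priors are hypothetical values chosen so the result can be checked by hand.

```matlab
% Hypothetical coding matrix: columns are conditions, rows are signals
S = [0.7 0.4;
     0.3 0.6];
P = [0.5 0.5];                          % hypothetical priors

AND = S .* repmat(P, size(S,1), 1);     % joint "condition AND signal" matrix
SS  = sum(AND, 2);                      % fraction of time each signal is given
B   = AND ./ repmat(SS, 1, size(S,2));  % divide each cell by its row total
B(SS == 0, :) = 0;                      % unused signals get 0, as in the routine
B   = B'                                % conditions as rows, signals as columns

colTotals = sum(B, 1)                   % each used signal's column adds to 1.0
```

With these inputs, B is approximately [0.636 0.333; 0.364 0.667].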
Backward measures of signal effectiveness
There are two useful sets of backward measures for discrete coding schemes. Both yield absolute measures of signal effectiveness, but both can be converted to relative measures that compare the effectiveness of the signal scheme to one with perfect coding.
Reliabilities
The first two measures concern signal scheme reliability. Suppose a given condition (one possible input in the original coding matrix) is true. Reliability is here defined as the probability estimated by a receiver that this condition is true after having processed a received signal and updated its estimates. If the relevant coding scheme is perfect, the receiver’s estimate that that condition is true will be 100%; if a coding scheme has the same values in all cells, and thus provides no information, the receiver’s estimate after receipt of a signal will be no different than before and thus equal to the prior probability for that condition. Average reliability weights the reliability of the signal scheme for each condition by that condition’s prior probability and adds up these products. Relative reliability is the average difference between the updated probability for each condition and its prior value divided by the maximal difference that would be obtained if coding were perfect. Thus if the reliability for a condition were R, and the prior probability for that condition were P, relative reliability=(R–P)/(1–P). Relative reliability thus varies between 0 and 1.0.
Computations of reliabilities in MATLAB are easy: one simply chains the various stages in the communication process. Thus if sender and receiver are far apart and thus we need to consider the sender’s matrix S, a transmission matrix T, and a receiver’s assignment of transmitted signals to expected categories matrix, R, the overall coding matrix O can be created in MATLAB as:
>>O=R*T*S
We thus have the first three steps in the communication process accounted for. We next need to characterize how the receiver uses assigned signals to update its probability estimates for each condition. As an upper limit to updating rates, we use the Bayes routine (above), the overall coding matrix O, and the prior probabilities vector P to create the final updating transition matrix BA as:
>>BA=Bayes(O,P)
Pre-multiplying O with BA will give us a square matrix G with the actual conditions as columns, the possible inferred conditions as rows, and cell values equal to the updated probabilities estimated by the receiver that each of the possible conditions is true when the condition listed in the columns of the matrix is in fact true:
>>G=BA*O
The main diagonal of this matrix indicates how likely the receiver thinks it is that the current condition is true; in a way, these values indicate how close the receiver is to getting it right. The off-diagonal values give the probabilities that the receiver assigns to conditions that are not currently true. The G matrix given our overall coding matrix above and priors P is:
OVERALL RELIABILITY:
The average reliability is then the product of the prior probabilities P and the values along the main diagonal of G:
>>R=P*diag(G)
In the example, the average reliability R is 63.5%. This value is high because of the high diagonal cell and high prior values for the no predator condition.
To compute the average relative reliability, RR, we need to compare each value along the main diagonal of G with the corresponding prior value in P, and divide this difference by the maximum possible were the coding scheme perfect (1–P). These ratios are then weighted by the prior probabilities to give an average relative reliability for the entire coding scheme. Assuming that no value of P equals 1.0 (which would make the denominator 1–P zero), we can use:
>>GG=diag(G) % main diagonal of G as a column vector
>>[a b]=size(P)
>>ON=ones(a,b)
>>W=(GG'-P)./(ON-P)
>>RR=W*P'
In our example above, the relative reliability (RR) is 14%.
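The chain of steps for the two reliability measures can be sketched end to end as follows. The overall matrix O and the priors here are hypothetical 2 × 2 values, not the predator example, and the Bayesian step is written inline so that the block runs on its own. The names avgRel and relRel are introduced only for this illustration.

```matlab
% Hypothetical overall coding matrix: columns are conditions, rows are signals
O = [0.7 0.4;
     0.3 0.6];
P = [0.5 0.5];                           % hypothetical priors

% Bayesian a posteriori matrix (conditions as rows, signals as columns)
AND = O .* repmat(P, size(O,1), 1);
BA  = (AND ./ repmat(sum(AND,2), 1, size(O,2)))';

G  = BA*O;                               % inferred (rows) by actual (columns)
GG = diag(G);                            % estimates for the true conditions

avgRel = P*GG                            % average reliability: 6/11 = 0.5455
W      = (GG' - P)./(1 - P);             % improvement relative to perfect coding
relRel = W*P'                            % relative reliability: 1/11 = 0.0909
```

Here signals raise the average estimate for the true condition only from 0.50 to about 0.55, which is why the relative reliability is just 9%.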
Mutual information
Mutual information was developed by Shannon (Shannon and Weaver 1949) and has been widely used by many disciplines in science and statistics. The uncertainty about which condition is true is typically given as a current probability estimate. In information theory, probabilities are rescaled as the number of binary questions (bits) that would need to be answered to remove all uncertainty. Mutual information is the uncertainty about which condition is true before signals are received minus the residual uncertainty after a receiver has detected and classified a signal and updated its probability estimates. It is thus a backward measure of signal effectiveness.
As with reliability, we can generate an absolute and relative average measure of coding scheme effectiveness. The absolute measure is simply the average reduced uncertainty (in bits) after receiving a signal from the scheme, and the relative measure is the fraction of the initial uncertainty that is resolved on average by receipt of a scheme signal.
We first compute the absolute mutual information using the same coding matrices presented earlier (sender matrix S, transmission matrix T, and receiver signal assignment matrix R). Again, we use the chain procedure to compute an overall coding matrix O (=R*T*S) and a prior probability vector P.
The conversion of all probabilities to bits uses logarithms to the base 2. Because 0 probabilities may occur, we need a routine that tells MATLAB what to do when such a value occurs. The following routine will do that:
function Y=mylog2(X)
%Computes log to base 2 if number>0 and returns 0 otherwise
[a b]=size(X);
Y=X;
for i=1:a
    for j=1:b
        Y(i,j)=0;
        if (X(i,j)>0)
            Y(i,j)=log2(X(i,j));
        end
    end
end
The average uncertainty in bits for any vector or matrix V can then be computed using:
function H=JH(V)
%Computes entropy from V which is either a vector or a matrix
H=-sum(sum(V.*mylog2(V)));
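As an illustration of the entropy calculation, the sketch below computes the initial uncertainty for the predator priors quoted earlier (5%, 8%, 12%, and 75%). It uses an inline anonymous function rather than the JH and mylog2 routines so that it can be run on its own; the helper name H is an assumption made for this example.

```matlab
% Entropy in bits; zero-probability cells are skipped so that they contribute 0
H = @(V) -sum(V(V > 0) .* log2(V(V > 0)));

Ppred = [0.05 0.08 0.12 0.75];   % leopard, eagle, other predator, no predator
Hpred = H(Ppred)                 % about 1.19 bits

Heven = H([0.5 0.5])             % two equally likely alternatives: 1 bit
Hsure = H([1 0])                 % no uncertainty: 0 bits
```

The predator priors give far less than the 2 bits that four equally likely conditions would produce, because the no-predator condition is so much more probable than the others.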
To compute the mutual information provided by a coding matrix, here denoted by S, (but note that it could be the overall matrix O if propagation distortions and receiver error are relevant), given priors in vector P, we can use the following:
function [HT RH]=MI(S,P)
% This function computes the mutual information in bits provided by a
% signal matrix S and a vector of prior probabilities P. The result is
% output as the variable HT. If the output is defined as a 2 element vector,
% the second element (RH) is the fraction of the original uncertainty resolved
% on average by this coding matrix. This assumes Bayesian updating.
% Uses functions mylog2 and JH.
[a b]=size(S);
D=ones(a,b);
for i=1:a
    D(i,:)=P;
end
ND=D.*S;     % computes AND matrix of joint prob of inputs and outputs
RC=S*P';     % computes sums of rows of AND matrix (signal totals)
XP=P;        %Get priors ready for entropy calc
if (sum(XP)~=1)
    XP=XP/sum(XP);   % Make sure priors add to 1.0
end
for i=1:numel(XP)    % Make sure priors >0
    if (XP(i)==0)
        XP(i)=1;
    end
end
HC=JH(XP);           %Compute max uncertainty at start
HT=HC+JH(RC)-JH(ND); %Compute mutual information provided
RH=0;
if HC~=0
    RH=HT/HC;        %Compute fraction of original information resolved
end
Single routine for all discrete measures
The following MATLAB routine, Rel, computes and prints out all of the above measures for a given coding matrix S and prior probability vector P. If no output vector is assigned, one only gets the list and results. If the function output is assigned to a vector such as [M VV], M will contain the list and values and VV will be a vector with only the values.
function [M VV]=Rel(S,P)
% S is a coding matrix with inputs as columns and outputs as rows.
% Column totals should add to 1.0. P is a horizontal vector containing
% the prior probabilities of each column in S being true.
% Average consistency (ACC) is defined as the sum of max cell values for each
% column in the coding matrix discounted by the prior probability of that
% input being true. It is a forward measure that varies from the reciprocal
% of the number of outputs for random coding to 1.0 for perfect coding.
% Another forward measure is the generalized determinant of the coding
% matrix. This also varies between 0 (no information provided) and 1.0 for
% perfect coding. A third forward measure is the Goodman-Kruskal index of
% association, lambda. This is the average reduction in uncertainty (on a
% scale of 0 to 1.0) that knowledge of the inputs provides in predicting the
% outputs when compared to a guess based on the relative frequencies with which
% each output in S is given.
% The remaining measures are backwards indices in that they require the
% computation of an updated (Bayesian) estimate of input probabilities
% given receipt of a signal, a set of priors, and access to the relevant
% coding matrices. Average reliability is the mean probability across inputs
% that a receiver using this coding system will assign to conditions when they
% are in fact true. Perfect coding will yield a value of 1.0 and chance
% coding (all cell values in the coding matrix are identical) will yield a
% weighted average of the priors (i.e. there is no change from prior values
% if attending to signals). Relative reliability is the fraction of the
% improvement above using priors to estimate inputs provided by signals
% when compared to not using signals and only relying on priors. It varies
% from 0 (no improvement) to 1.0 (perfect coding).
% Entropy (bits) is another backwards measure using the Bayesian estimates of
% probabilities. It computes the difference in uncertainty before receipt of
% a signal minus the residual uncertainty after signal receipt. Relative
% entropy is the fraction of initial uncertainty removed on average by use
% of this signal set. These routines require access to the Bayes, AbsRel,
% lambda, JH, and mylog2 routines.
% Jack Bradbury, October 2009.
[a b]=size(S);
D=ones(a,b);
for i=1:a
    D(i,:)=P;
end
ND=D.*S;     %computes AND matrix of joint prob of each input and output
RC=S*P';     %computes sums of rows of AND matrix
%Compute measures
ACC=max(S)*P';             %Compute average consistency
L=lambda(S,P);
deta=sqrt(det(S'*S));      %compute generalized determinant of matrix
[R RR G]=AbsRel(S,P);
XP=P;        %Get priors ready for entropy calc
if (sum(XP)~=1)
    XP=XP/sum(XP);   % Make sure priors add to 1.0
end
for i=1:numel(XP)    % Make sure priors >0
    if (XP(i)==0)
        XP(i)=1;
    end
end
HC=JH(XP);           %Compute max uncertainty at start
HT=HC+JH(RC)-JH(ND); %Compute mutual information provided
RH=0;
if HC~=0
    RH=HT/HC;        %Compute fraction of original information resolved
end
titles=char('Consistency:','Determinant:','Lambda:','Reliability:','Rel Reliabil:','Ht (bits):','Fraction Hmax:');
MV=num2str([ACC;deta;L;R;RR;HT;RH],'%6f');
VV=[ACC;deta;L;R;RR;HT;RH];
M=[titles MV];
Literature cited
Kåhre, J. 2002. The Mathematical Theory of Information. New York, NY: Springer Verlag.
Goodman, L.A. and W.H. Kruskal. 1954. Measures of association for cross classifications, Part I. Journal of the American Statistical Association 49: 732–764.
Shannon, C.E. and W. Weaver. 1949. The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.
Yanai, H., Y. Takane and H. Ishii. 2006. Nonnegative determinant of a rectangular matrix: Its definition and applications to multivariate analysis. Linear Algebra and Its Applications 417: 259–274.
8.9 Mutual Information Measures of Signal Effectiveness
Introduction
Mutual information is often used as a backward measure of signal effectiveness: that is, it computes the change in a receiver’s uncertainty (entropy) about which of several alternative conditions is true after receipt of a signal. Uncertainty and changes in uncertainty are measured in bits: the number of binary questions that would need to be answered to identify which of several alternatives is true. The uncertainty in bits associated with a given alternative can be computed as the negative of the logarithm to the base 2 of the probability that the alternative is true. We are most interested in the effectiveness of signal sets, and thus compute the weighted average of the binary questions required to clarify which alternative in a set is currently true.
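As a small worked illustration with hypothetical numbers, an alternative that is true with probability 1/8 corresponds to three binary questions:
$$-\log_{2}\left(\tfrac{1}{8}\right)=3\ \text{bits}$$
For a set of four alternatives with probabilities 1/2, 1/4, 1/8, and 1/8, the weighted average is
$$\tfrac{1}{2}(1)+\tfrac{1}{4}(2)+\tfrac{1}{8}(3)+\tfrac{1}{8}(3)=1.75\ \text{bits}$$
which is less than the 2 bits needed when all four alternatives are equally likely.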
Logic of method
The computation of the mutual information, H_{T}, provided by a discrete signal set requires access to the relevant coding matrix which tabulates the conditional probabilities P(S_{j} |C_{i}) that a given output, S_{j}, will be produced when a given input, C_{i}, is true. In many mathematical treatments, inputs of transition matrices are assigned to rows and outputs to columns, but inputs are then assigned to the horizontal axis in graphs of continuous processes. For consistency, we here adopt the convention of assigning alternative inputs to the horizontal axis of both coding schemes, and outputs to the vertical axis. Thus in our discrete coding matrices, inputs are assigned to columns and outputs are assigned to rows. Since the contents of the matrices are probabilities, and every input should result in one of the listed outputs, columns in these coding matrices should add to 1.0 (or 100%).
Mutual information also requires access to a vector listing the prior probability, P(C_{i}), that each input alternative will occur before signaling.
Given access to the coding matrix and the prior probabilities, there are two ways to compute the average mutual information provided by a signal set:
as
$$\overline{H}_{T}=\sum_{i}\sum_{j}P(C_{i}\ \mathrm{and}\ S_{j})\,\log_{2}\frac{P(C_{i}|S_{j})}{P(C_{i})}$$
or as
$$\overline{H}_{T}=\overline{H}(C)-\overline{H}(C|S)$$
where
$$\overline{H}(C)=-\sum_{i}P(C_{i})\,\log_{2}P(C_{i})$$
and
$$\overline{H}(C|S)=-\sum_{i}\sum_{j}P(C_{i}\ \mathrm{and}\ S_{j})\,\log_{2}P(C_{i}|S_{j})$$
The probabilities that a given input is true when a particular output has been received, P(C_{i}|S_{j}), are computed by combining the coding matrix with the prior probabilities using Bayes’ theorem.
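The two forms give the same number. A short derivation, taking both entropies to be defined with a leading minus sign so that they are non-negative, splits the logarithm of the ratio in the first form:
$$\sum_{i}\sum_{j}P(C_{i}\ \mathrm{and}\ S_{j})\,\log_{2}\frac{P(C_{i}|S_{j})}{P(C_{i})}=\sum_{i}\sum_{j}P(C_{i}\ \mathrm{and}\ S_{j})\,\log_{2}P(C_{i}|S_{j})-\sum_{i}\Big[\sum_{j}P(C_{i}\ \mathrm{and}\ S_{j})\Big]\log_{2}P(C_{i})$$
Because the joint probabilities for a given input sum over outputs to its prior, \(\sum_{j}P(C_{i}\ \mathrm{and}\ S_{j})=P(C_{i})\), the second term on the right is \(-\sum_{i}P(C_{i})\log_{2}P(C_{i})\) with its sign reversed, that is, \(+\overline{H}(C)\). The first term is \(-\overline{H}(C|S)\). The right-hand side is therefore \(\overline{H}(C)-\overline{H}(C|S)\), the second form.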
Method and example
Because the computations for mutual information can become complicated with multiple inputs and outputs, we outline below a simplifying recipe that requires generation of three successive tables. The first is the coding matrix tabulating the conditional probabilities that a given output will be generated when a given input is true. The second tabulates the joint probabilities, P(C_{i} and S_{j}), that any given input and output will co-occur. The third table lists the a posteriori probabilities, P(C_{i}|S_{j}), based on Bayes’ theorem, that a particular input is true given the coding matrix and the prior probabilities.
We shall use an example in which a female bird seeks to choose a mate and wishes to discriminate between healthy and sick candidates. Males can sing either fast or slow songs. In the first considered population, both sick and healthy males can sing songs at both speeds, but healthy males sing fast songs more often than do sick males. Given that healthy and sick males are equally common in the population, and given the coding matrix listed in the first table, what is the average mutual information provided by song speed?
1. Coding Matrix and Prior Probabilities: First, set up the coding matrix by observing the fraction of time that each of sick and healthy males produce fast and slow songs. This table contains the P(S_{j}|C_{i}) values. We also add the prior probability vector below this table:
2. Joint Probability Table: We can use the basic probability rule that a joint probability equals the product of the prior probability one component will occur with the conditional probability that the other will occur when the first is true. Algebraically, P(C_{i} and S_{j}) = P(C_{i})·P(S_{j}|C_{i}). We thus multiply the prior probability in each column by the respective cell values in the same column of the coding matrix. Thus:
Note that we can add up the subtotal across rows to obtain the overall fraction of time, P(S_{j}), that a given output, here song speed type, is produced overall.
3. A Posteriori Probability Table: We can now easily compute the Bayesian a posteriori probabilities that a female could compute after combining receipt of a given signal with her knowledge of the coding matrix and the prior probabilities of each male condition. Cells in this table are computed by dividing the cell values in the joint probability table by their corresponding row totals in that table:
This table now gives the probability that a given male condition (column) is true given receipt of a particular signal (row). Here the rows add to 1.0 because the sum of the fraction of healthy and sick males is all the males. Note that the a posteriori probability that a male is healthy after receiving a fast song is 0.636; this can be compared to the prior probability that a male was healthy of 0.50. Receipt of a fast signal moves the probability meter in the female’s head from 0.50 to 0.636. Receipt of a slow song moves the needle down from the prior value of 0.50 to the a posteriori value of 0.333. Clearly some information has been obtained. The next steps compute how much information was provided in bits.
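The three tables can be generated with a few lines of MATLAB. In this sketch the coding matrix entries (0.70, 0.40, 0.30, 0.60) are an assumption: they are inferred from the joint probabilities 0.35, 0.20, 0.15, and 0.30 that appear in step 4 together with the equal priors, and the variable names are chosen only for illustration.

```matlab
% Table 1: coding matrix P(Sj|Ci); columns = healthy, sick; rows = fast, slow
% (values inferred from the joint probabilities used in step 4)
S = [0.70 0.40;
     0.30 0.60];
P = [0.50 0.50];                  % prior probabilities: healthy, sick

% Table 2: joint probabilities P(Ci and Sj) = P(Ci) * P(Sj|Ci)
J  = S .* repmat(P, 2, 1)         % [0.35 0.20; 0.15 0.30]
PS = sum(J, 2)                    % row totals P(Sj): fast 0.55, slow 0.45

% Table 3: a posteriori probabilities P(Ci|Sj) = joint cell / its row total
A = J ./ repmat(PS, 1, 2)         % approx [0.636 0.364; 0.333 0.667]
rowTotals = sum(A, 2)             % each row adds to 1.0
```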
4. Using the first method listed at the beginning of this Web Topic, we can now compute the average mutual information provided by this signal set. This uses the formula:
$$\overline{H}_{T}=\sum_{i}\sum_{j}P(C_{i}\ \mathrm{and}\ S_{j})\,\log_{2}\frac{P(C_{i}|S_{j})}{P(C_{i})}$$
This uses the cell values in the second table, those in the third table, and the prior probabilities to compute the average information provided:
$$\overline{H}_{T}=0.35\log_{2}\left(\frac{0.636}{0.50}\right)+0.20\log_{2}\left(\frac{0.364}{0.50}\right)+0.15\log_{2}\left(\frac{0.333}{0.50}\right)+0.30\log_{2}\left(\frac{0.667}{0.50}\right)=0.067$$
bits.
5. Using the second method, the relevant formula is:
$$\overline{H}_{T}=\overline{H}(C)-\overline{H}(C|S)$$
where
$$\overline{H}(C)=-\sum_{i}P(C_{i})\,\log_{2}P(C_{i})$$
and
$$\overline{H}(C|S)=-\sum_{i}\sum_{j}P(C_{i}\ \mathrm{and}\ S_{j})\,\log_{2}P(C_{i}|S_{j})$$
We first compute H(C) = – (0.50 log_{2} 0.50 + 0.50 log_{2} 0.50) = 1.0 bit. We next compute
H(C|S) = – (0.35 log_{2} 0.636 + 0.20 log_{2} 0.364 + 0.15 log_{2} 0.333 + 0.30 log_{2} 0.667) = 0.933 bits.
Thus H_{T} = H(C) – H(C|S) = 1.0 – 0.933 = 0.0667 bits.
Note that H (C) is the total uncertainty that a female faces about a male’s health before he sings. H (C|S), (called the equivocation), is the residual uncertainty she faces about whether he is healthy after hearing him sing. It is the reduction in uncertainty after receiving the signal that is used here as a measure of signal effectiveness. Note also that it does not depend on any payoffs that might accrue to either party as a result of the signal exchange.
One can compute a relative amount of information received by dividing the actual amount by the maximum that could have been received. Complete resolution of uncertainty in this case would require H(C) = 1.0 bit. The average female obtains 0.067 bits by attending to song speed. Thus she reduces her uncertainty by 0.067/1.0 = 6.7%.
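Both methods, and the relative measure, can be checked numerically. The sketch below starts from the joint probabilities that appear in step 4; the variable names are illustrative. Because no cell is zero in this example, log2 can be applied directly.

```matlab
J = [0.35 0.20;                   % joint probabilities; rows = fast, slow
     0.15 0.30];                  % columns = healthy, sick
P = [0.50 0.50];                  % priors: healthy, sick

A  = J ./ repmat(sum(J,2), 1, 2); % a posteriori probabilities P(Ci|Sj)
PC = repmat(P, 2, 1);             % prior for each column, repeated per row

% Method 1: weighted log ratio of posterior to prior
HT1 = sum(sum(J .* log2(A ./ PC)))

% Method 2: initial uncertainty minus equivocation
HC  = -sum(P .* log2(P));         % 1.0 bit
HCS = -sum(sum(J .* log2(A)));    % about 0.933 bits
HT2 = HC - HCS

relative = HT2/HC                 % about 0.067, i.e. 6.7%
```

Both HT1 and HT2 come out at about 0.067 bits, matching the hand calculation.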
Extreme examples
1. Random coding matrix: What happens if signals are given randomly? What is mutual information in that case? The corresponding coding matrix will show equal values in all cells for a given column:
The joint probability matrix is then:
and the a posteriori probabilities are:
In this case, we can see that listening to song speed has no effect on a female’s estimate that a male is healthy: her probability meter for a given male starts at 0.50 and remains at 0.50 after he sings. We would thus expect that no information has been exchanged. This is in fact what the subsequent computations show:
Using the second method, we again have that H(C) = 1.0 bit, and now the equivocation, H(C|S) = – (0.25 log_{2} 0.5 + 0.25 log_{2} 0.5 + 0.25 log_{2} 0.5 + 0.25 log_{2} 0.5) = 1.0 bit. As expected, H_{T} = H(C) – H(C|S) = 1.0 – 1.0 = 0.0 bits. The relative gain in information is also 0%.
2. Perfect Coding: Consider the opposite extreme. What if healthy males always sing fast songs and sick males always sing slow songs? This is a case of perfect coding and the corresponding coding matrix will be:
The joint probability matrix is then:
and the a posteriori probabilities are:
These signals move a female’s probability meter all the way to 1.0 (if the male sings a fast song) or all the way down to 0 (if he sings a slow song). Song speed resolves ALL uncertainty and there should be no equivocation after receipt of the signal. This is in fact what the subsequent computations show:
Using the second method, we again have that H(C) = 1.0 bit, and now the equivocation, H(C|S) = – (0.5 log_{2} 1.0 + 0.0 log_{2} 0.0 + 0.0 log_{2} 0.0 + 0.5 log_{2} 1.0) = 0.0 bits (since 0.0 log_{2} 0.0 is taken to be 0.0 by convention). As expected, H_{T} = H(C) – H(C|S) = 1.0 – 0.0 = 1.0 bit. Since the amount of information gained is equal to the original uncertainty, the relative information gained in this case is 100%.
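The two extremes can be confirmed with a compact sketch. It uses the equivalent identity that mutual information equals H(C) + H(S) – H(C and S), and it skips zero cells so that they contribute nothing to the sums. The helper names H, joint, and mi, and the 0.5 entries of the random matrix, are assumptions made for this illustration.

```matlab
% Entropy in bits, ignoring zero cells (0*log2(0) treated as 0)
H     = @(V) -sum(V(V > 0) .* log2(V(V > 0)));
% Joint probability matrix from a coding matrix and a row vector of priors
joint = @(S,P) S .* repmat(P, size(S,1), 1);
% Mutual information: H(C) + H(S) - H(C and S)
mi    = @(S,P) H(P) + H(sum(joint(S,P), 2)) - H(joint(S,P));

P = [0.5 0.5];                    % healthy and sick equally common

Srandom  = [0.5 0.5; 0.5 0.5];    % song speed unrelated to health
Sperfect = [1 0; 0 1];            % healthy always fast, sick always slow

HTrandom  = mi(Srandom, P)        % 0 bits
HTperfect = mi(Sperfect, P)       % 1 bit
```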
Successive sampling and larger coding matrices
If a female samples the same male multiple times successively, she can gain further information with each successive song that he sings. The computations are similar except that the a posteriori probabilities computed for each alternative condition become the prior probabilities for the next song he sings. Although tedious, one can compute the final a posteriori probabilities after a series of male songs and compare them to the initial prior values to come up with a value for the average mutual information provided by that song series. For computing sequential sampling, or for matrices with large numbers of inputs and outputs, it is much easier to use matrix methods. Details on how to compute mutual information and other signal effectiveness measures using matrix methods are summarized in Web Topic 8.8.
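A minimal sketch of this successive updating is given below. It assumes that successive songs from the same male are independent of one another given his condition, that the coding matrix has the values inferred from the worked example above (0.70, 0.40, 0.30, 0.60), and that the male sings a hypothetical sequence of fast, fast, slow.

```matlab
% Coding matrix (inferred values): rows = fast, slow; columns = healthy, sick
S = [0.70 0.40;
     0.30 0.60];
est   = [0.50 0.50];              % initial priors: healthy, sick
songs = [1 1 2];                  % hypothetical sequence: fast, fast, slow

for k = 1:numel(songs)
    jointk = S(songs(k), :) .* est;   % P(Ci and the song just heard)
    est    = jointk / sum(jointk);    % a posteriori values become next priors
end

est                               % approx [0.605 0.395]
```

After three songs the estimate that the male is healthy has moved from 0.50 to about 0.60; the single slow song undoes part of the shift produced by the two fast songs.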
Further reading
Quastler, H. 1958. A primer on information theory. In, Symposium on Information Theory in Biology (Yockey, H.P. & R.L. Platzman, eds). New York: Pergamon Press. pp. 3–49.
Rosie, A.M. 1973. Information and Communication Theory. London: Van Nostrand Reinhold Company.
Wilson, E.O. 1975. Sociobiology: The New Synthesis. Cambridge, MA: Belknap/Harvard University Press.
8.10 Signal Detection Theory and Signal Effectiveness
Introduction
The major problem with using receiver responses as an index of signal effectiveness is that responses confound the effects of the amount of information provided by a signal, the receiver’s estimates of prior probabilities, and the relative payoffs of alternative actions. A female faced with choosing between two displaying males may fail to discriminate between their displays a) because the differences are too small for her to detect, or b) because it does not pay for her to expend the effort to compare them.
Signal detection theory provides tools for separating the roles of the amount of information in signals from the value of that information. It allows one to compute an index, called receiver sensitivity and denoted by d', that can be used as another measure of the effectiveness of a signal set. The following discussion assumes that the reader is familiar with the general approach of signal detection theory as summarized in Web Topic 8.4.
Logic of method
Consider a hypothetical example in which females seek to identify a healthy male instead of a sick male as a mate. All males sing songs and song speed varies continuously among males. However, the distribution of song speeds for healthy males has a higher mean value than that for sick males. Let male song speed be denoted by w. The task for each female is to define a “red line” at some critical value w_{c} such that any male whose song speed exceeds w_{c} will be considered an acceptable mate and any male with a slower speed will be rejected. The optimal value of w_{c} will depend upon a given female’s estimates of the prior probabilities of sick and healthy males, and the relative payoffs to her of correct versus wrong decisions. Thus different females, or the same female at different times, may set different values of w_{c}.
If the distributions of song speed for sick and healthy males are at all overlapping, a female invoking her particular w_{c} will make some correct choices and some errors. Let P_{hit} denote the fraction of time that a given female correctly selects a healthy male for a mate and P_{false alarm} denote the fraction of the time that she mistakenly selects a sick male for a mate because his song rate is greater than w_{c}.
Now consider three different populations of males that vary in the degree to which the distributions of song speed for sick and healthy males overlap. In each example, a graph on the left will plot song speed (w) on the horizontal axis and the probability that a given type of male (sick or healthy) will sing that song speed on the vertical axis. In all cases, we shall assume the distributions are roughly bell-shaped with the same variances. The mean song speed for each distribution is the speed at which its bell curve reaches its peak. Consider first a population (A) in which there is little or no difference in the mean values of sick and healthy male song speeds: the two distributions are completely overlapping. This is shown on the left graph below:
Suppose we select pairs of healthy and sick males at random from this population and record their songs. We then play the two songs back to a test female from that population and see which speaker she approaches. We do this multiple times with different pairs of randomly sampled males to get an estimate of how often she correctly selects the healthy males (P_{hit}) and how often she incorrectly selects sick ones (P_{false alarm}). The values of these two measures will depend on that female’s red line value of w_{c}. We plot the two values on the graph on the right and label it w_{c1}. We then select a second test female who is of a different age, has different nutritional condition, or because of different prior probabilities is likely to have a different red line value of w_{c}, and repeat our experiment. We then plot her values of P_{hit} and P_{false alarm} on the graph and label them. Adding values for more females allows us to see the relationship between P_{hit} and P_{false alarm} as the relative payoffs to females of right versus wrong decisions change. The graph on the right is called a receiver operating characteristic curve, or ROC graph.
If the distributions of song speed for sick and healthy males are completely overlapping, it will be impossible for females to make accurate discriminations between them using song speed: there is no correlation between song speed and health, and it should be obvious that attending to song speed provides no information to females. In this case, the ROC graph is a straight line as shown in this example: P_{hit} and P_{false alarm} remain proportional to each other at a fixed rate. Increasing one results in an equivalent increase in the other.
Next, consider a population (B) in which song speed is somewhat correlated with male health. This implies that the two distributions are not entirely overlapping, and there is thus a non-zero difference between their means. Let us call that difference d'.
If we now undertake playbacks of sick and healthy male songs to a series of females, we will get the plot of P_{hit} versus P_{false alarm} shown on the right. Now, the ROC relationship between P_{hit} and P_{false alarm} bends up towards the upper left corner of the graph. This means that for all values of w_{c}, the cost to a female in terms of numbers of false alarms is much lower for every correct choice than was the case in population A. This is because there is now a significant correlation between male song speed and male health, and the information provided by songs reduces errors in female decisions.
In population (C), the correlation between male song speed and male health is even stronger than in population (B). The difference between distribution means, d', is a much larger number and the curvature of the ROC plot towards the upper left corner of the graph is even stronger:
These examples suggest that one should be able to estimate the difference between the distribution means, d', by estimating the degree to which the curvature in the ROC plots deviates from the straight line expected when there is no correlation between signal and condition. And surprisingly, this measure of the amount of information can be obtained using receiver responses. Even more surprising is the observation that if both distributions are bell-shaped and have similar variances, any pair of P_{hit} and P_{false alarm} values will fall on only one possible ROC curve corresponding to only one d' value. This means that we could estimate d' from examining the P_{hit} and P_{false alarm} values of only a single female.
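The three populations can be compared numerically with a short sketch. It assumes equal-variance normal distributions expressed in standard deviation units (sick mean 0, healthy mean d', standard deviation 1) and a female who accepts songs with w > w_{c}. The function names roc_point and roc_curve, the d' values of 0, 1, and 2, and the grid of cutoffs are illustrative assumptions, not values from the text.

```python
from statistics import NormalDist


def roc_point(w_c, mu_healthy, mu_sick, sigma):
    """Return (P_false_alarm, P_hit) for a female accepting songs with w > w_c."""
    p_hit = 1.0 - NormalDist(mu_healthy, sigma).cdf(w_c)
    p_false_alarm = 1.0 - NormalDist(mu_sick, sigma).cdf(w_c)
    return p_false_alarm, p_hit


def roc_curve(d_prime, cutoffs):
    """ROC points for sick ~ N(0, 1) and healthy ~ N(d', 1), one per cutoff."""
    return [roc_point(w_c, d_prime, 0.0, 1.0) for w_c in cutoffs]


cutoffs = [x / 2 for x in range(-6, 11)]  # red lines from -3.0 to 5.0
for label, d_prime in (("A", 0.0), ("B", 1.0), ("C", 2.0)):
    print("population", label, "d' =", d_prime)
    for p_fa, p_hit in roc_curve(d_prime, cutoffs):
        print(f"  P_false_alarm = {p_fa:.3f}  P_hit = {p_hit:.3f}")
```

Because the sick distribution is held fixed in this sketch, a given cutoff yields the same false alarm rate in all three populations, while the hit rate at that cutoff rises with d'. That is the bending of the curve towards the upper left corner.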
Standard units
The major point of measures of signal effectiveness is to be able to compare one signal set to another, or perhaps obtain an average value for how effective most threat signals or most alarm signals are. Clearly, one cannot compare d' values if the units for one signal set are in songs/second and those for another are in brightness of red plumage coloration. As long as the relevant distributions are bell-shaped (Gaussian) or can be made so with appropriate transformations, one can convert the w values in any distribution plot into z scores. This is a scaling widely used in statistics and computed as follows. If the mean of a normal distribution is μ and its standard deviation is σ (where σ = √variance), then the z score for w is
$$z\left(w\right)=\frac{w-\mu}{\sigma}$$
We can thus replot any original probability distribution of w values as a probability distribution of z(w) values. This distribution will have its maximum when z(w) = 0 (i.e., when w = μ); all z(w) values to the left of this peak will be negative (w < μ), and all z(w) values to the right of the peak will be positive (w > μ). The difference between the means of two z-scaled distributions, d', will then be given as a multiple of their common standard deviation (if it is the same for both), or as a multiple of their average standard deviation (if they are different). Because d' is measured in standard deviation units, decreasing the average standard deviation of the distributions is equivalent to increasing the distance between their means: either reduces overlap between the distributions, and thus reduces errors.
Let the means for the two probability distributions be μ_{1} for healthy males and μ_{2} for sick males. We wish to convert the w axis for each distribution into z(w) values. For the first distribution,
$${z}_{1}\left(w\right)=\frac{w-{\mu}_{1}}{\sigma}$$
and for the second distribution and the same w,
$${z}_{2}\left(w\right)=\frac{w-{\mu}_{2}}{\sigma}$$
We note that
$${z}_{2}\left(w\right)-{z}_{1}\left(w\right)=\frac{{\mu}_{1}-{\mu}_{2}}{\sigma}=d'$$
which is the measure we seek. We can thus estimate d' if we can estimate z_{1}(w) and z_{2}(w) from observations of a female’s decisions.
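A quick numerical check of this identity, offered as a sketch: the means (6.0 and 4.5), the common standard deviation (1.5), and the helper name z_score are hypothetical values chosen so that d' = (μ_{1} – μ_{2})/σ = 1.

```python
import math


def z_score(w, mu, sigma):
    """Standard score of song speed w for a distribution with mean mu and sd sigma."""
    return (w - mu) / sigma


# Hypothetical population: healthy mean mu_1, sick mean mu_2, common sigma.
mu_healthy, mu_sick, sigma = 6.0, 4.5, 1.5
d_prime = (mu_healthy - mu_sick) / sigma

# The difference z_2(w) - z_1(w) is the same at every w and equals d'.
for w in (3.0, 5.0, 7.5):
    difference = z_score(w, mu_sick, sigma) - z_score(w, mu_healthy, sigma)
    assert math.isclose(difference, d_prime)
    print(w, round(difference, 6))
```

The w terms cancel in the subtraction, which is why the result does not depend on where a particular female places her red line.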
Applying the method
Suppose we perform our playback experiments on a female using songs of sick and healthy males from the same population. We now have values for P_{hit} and P_{false alarm} for that female. Most statistics texts have tables in the back listing the area below a normal probability curve to the right or the left of some cutoff value of a z score. Usually the reader has a z value and wants to know the corresponding probability. In our case, we know the probability, but would like to know the corresponding z score. We thus locate the measured probabilities P_{hit} and P_{false alarm} in this table, and then find the corresponding values z_{hit} and z_{false alarm}, respectively. Since these z scores are based upon the same w, in this case that female’s w_{c}, we can use their difference to compute d' = z_{hit} – z_{false alarm}. To provide a feeling for the scale of this measure, a receiver that correctly identifies both sick and healthy males 50% of the time (i.e., chance) has d' = 0, one that is accurate 70% of the time has d' = 1.05, one that is accurate 90% of the time has d' = 2.56, and one that is accurate 99% of the time has d' = 4.65.
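The table lookup can be replaced by an inverse normal function. The sketch below uses NormalDist.inv_cdf from Python's standard library; the function names z_from_probability and d_prime are illustrative. It assumes that "accurate p of the time" means P_{hit} = p and P_{false alarm} = 1 – p.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_from_probability(p):
    """z score with area p to its left under the standard normal curve.

    inv_cdf raises StatisticsError unless 0 < p < 1, so rates of exactly
    0 or 1 need to be adjusted before use.
    """
    return STANDARD_NORMAL.inv_cdf(p)


def d_prime(p_hit, p_false_alarm):
    """d' = z_hit - z_false_alarm for one female's hit and false alarm rates."""
    return z_from_probability(p_hit) - z_from_probability(p_false_alarm)


for accuracy in (0.5, 0.7, 0.9, 0.99):
    print(accuracy, round(d_prime(accuracy, 1 - accuracy), 2))
```

Under that assumption, the loop should give values close to the scale quoted above (about 0, 1.05, 2.56, and 4.65).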
Additional measures from signal detection theory
An additional parameter of signal detection theory that can be extracted from P_{hit} and P_{false alarm} data is bias: this is the degree to which a female is conservative about accepting males, and thus avoids false alarm errors at the expense of having more miss errors. It thus reflects the value of information independently of the amount of information. The simplest measure of bias is the criterion index c, which can be computed as c = – 0.5 (z_{hit} + z_{false alarm}). A female that has no bias accepts equal numbers of false alarms and miss errors (i.e., P_{false alarm} = 1 – P_{hit}), and her bias is c = 0. When females avoid false alarms, c > 0, and when they avoid misses, c < 0. For any observed combination of P_{hit} and P_{false alarm}, c depends upon the distance between that point and the diagonal running from the top left to the lower right corner of the ROC plot.
It is also possible to estimate the likelihood ratio parameter β, which is equal to the ratio of the likelihood that a male is healthy to the likelihood that he is sick (see Web Topic 8.4 for derivation). If the female is making optimal decisions, it can be computed using ln(β) = c d'. Substituting the expressions for c and d' in terms of hit and false alarm rates, we can rewrite this as ln(β) = – 0.5 (z_{hit} + z_{false alarm})(z_{hit} – z_{false alarm}) = – 0.5 [z_{hit}^{2} – z_{false alarm}^{2}].
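All three measures can be computed together from one pair of rates. This is a sketch under the equal-variance normal assumption used throughout; the function name sdt_measures and the example rates are hypothetical.

```python
from statistics import NormalDist


def sdt_measures(p_hit, p_false_alarm):
    """Return (d', c, ln_beta) from one female's hit and false alarm rates.

    Assumes equal-variance normal distributions; both rates must lie
    strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf
    z_hit = z(p_hit)
    z_fa = z(p_false_alarm)
    d_prime = z_hit - z_fa        # amount of information
    c = -0.5 * (z_hit + z_fa)     # criterion index (bias)
    ln_beta = c * d_prime         # equals -0.5 * (z_hit**2 - z_fa**2)
    return d_prime, c, ln_beta


# An unbiased female (P_false_alarm = 1 - P_hit) and a conservative one.
for name, p_hit, p_fa in (("unbiased", 0.80, 0.20), ("conservative", 0.60, 0.05)):
    d_prime, c, ln_beta = sdt_measures(p_hit, p_fa)
    print(name, round(d_prime, 2), round(c, 2), round(ln_beta, 2))
```

In this sketch, the unbiased female should have c and ln(β) near zero, while the conservative female, who trades misses for fewer false alarms, should have c > 0.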
Non-normal distributions or unequal variances
If we know that the distributions of w for healthy and sick males are normally distributed with equal variances, we saw that we do not have to compute an entire ROC curve to obtain estimates of d', c, and β: instead, one pair of hit and false alarm rates will do. However, distributions may not be normal or have equal variances. The only way to detect this is to plot the ROC curve by obtaining data from multiple females or by manipulating one female's prior probabilities or payoff values. We can still compute a single d', c, and β in such a situation; however, the analysis is more complicated than that given here. See Macmillan and Creelman (1991) for details.
Further reading
Macmillan, N. A. and C.D. Creelman. 2004. Detection Theory: A User’s Guide. 2^{nd} Edition. New York: Cambridge University Press.
Wiley, R.H. 1994. Errors, exaggeration, and deception in animal communication. In Behavioral Mechanisms in Evolutionary Ecology (L.A. Real, ed.). Chicago: University of Chicago Press. pp. 157–189.