In the book we discussed how you might want to fine-tune a diagnostic test to minimize the total number of false reports (both false positives and false negatives). Here we introduce a graphical aid to this: the receiver operating characteristic curve. Imagine people presenting to their general practitioner concerned that they have an eating disorder. Such disorders require a specialist to diagnose correctly and to identify proper treatment. Specialists are in high demand, so average waiting times to see one can be long, yet such conditions are easier to treat the earlier the intervention. We plan to introduce a questionnaire-based test that non-expert general practitioners can administer. Patients who score above a certain threshold on the test are deemed to have a high probability of having an eating disorder and are fast-tracked to see a specialist; those scoring below the threshold are placed in a second, slower-moving queue. The two types of error possible are placing people subsequently determined by the specialist not to have a disorder in the fast track (false positives) and placing people subsequently diagnosed with a disorder in the slow lane (false negatives). We want to find the threshold that minimizes the total number of errors, and we can do that with an experiment.
We take, say, 250 people presenting with concerns that they might have an eating disorder and carry out both the questionnaire test and an expert diagnosis on each. We design the experiment so there are no confounding factors: an individual's test score should not correlate with how long they wait to see a specialist or with which specialist they see. The specialists and the individuals concerned will also be blind to test scores until after specialist evaluation has occurred. We would first carry out statistical tests to see whether the questionnaire is any good at all at identifying those with genuine disorders; that is, we would test whether test scores for those diagnosed with a disorder by the specialist are higher than test scores of those identified as not having a disorder in need of treatment. If so, we can move on to identifying a suitable threshold score.
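The first check described above can be sketched as a simple permutation test on the difference in mean scores between the two groups. The scores below are small synthetic illustrations, not real study data, and in practice one might prefer a rank-based test such as Mann-Whitney.

```python
# Permutation test: are questionnaire scores higher among those the
# specialist diagnoses with a disorder? Scores here are synthetic.
import random

diagnosed = [50, 62, 70, 78, 82, 90]      # hypothetical scores, disorder confirmed
not_diagnosed = [30, 45, 55, 65]          # hypothetical scores, no disorder found

observed = (sum(diagnosed) / len(diagnosed)
            - sum(not_diagnosed) / len(not_diagnosed))

rng = random.Random(0)                    # fixed seed for reproducibility
combined = diagnosed + not_diagnosed
trials, extreme = 10_000, 0
for _ in range(trials):
    rng.shuffle(combined)
    # Reassign group labels at random and recompute the difference in means.
    perm_diff = sum(combined[:6]) / 6 - sum(combined[6:]) / 4
    if perm_diff >= observed:
        extreme += 1

p_value = extreme / trials  # a small p suggests the questionnaire discriminates
```

A small p-value here would justify moving on to choosing a threshold; with these synthetic numbers the observed gap in means is 23.25 points.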
Imagine the test score is out of 100. For every possible threshold (0 to 100) we can calculate the sensitivity and specificity of the test at that particular threshold. The receiver operating characteristic curve is generated by plotting a point for each possible threshold, with sensitivity on the y-axis and 1 - specificity on the x-axis (see Figure S11.2). The ideal test would have a sensitivity of 1 and a specificity of 1, corresponding to the top left-hand corner of the graph. Thus the threshold that minimizes the total number of errors is the one corresponding to the point on the curve that lies nearest to that corner.
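This recipe can be sketched in a few lines of code: compute each threshold's ROC point, then pick the threshold whose point lies closest to the ideal corner. The scores and diagnoses below are synthetic illustrations, not data from the study described.

```python
# Sketch of reading a threshold off the ROC curve, using synthetic data.
import math

# (questionnaire score out of 100, did the specialist diagnose a disorder?)
patients = [(30, False), (45, False), (50, True), (55, False),
            (62, True), (65, False), (70, True), (78, True),
            (82, True), (90, True)]

def roc_point(threshold):
    """Return (sensitivity, 1 - specificity), treating score >= threshold as positive."""
    tp = sum(1 for s, d in patients if d and s >= threshold)
    fn = sum(1 for s, d in patients if d and s < threshold)
    fp = sum(1 for s, d in patients if not d and s >= threshold)
    tn = sum(1 for s, d in patients if not d and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

def corner_distance(threshold):
    """Distance from this threshold's ROC point to the ideal corner (0, 1)."""
    sensitivity, false_positive_rate = roc_point(threshold)
    return math.hypot(false_positive_rate, 1 - sensitivity)

# The threshold whose point lies nearest the top left-hand corner.
best = min(range(101), key=corner_distance)
```

Note that when several adjacent thresholds give the same confusion counts (no score falls between them), they share an ROC point; `min` simply returns the lowest such threshold.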
Figure S11.2 Receiver operating characteristic curves for two tests with different sensitivity/specificity relationships.
But remember, this would only be a sensible way to pick the threshold if we felt that the costs of false negatives and false positives were similar. Consider how valid that assumption is in our case.
Answer: This is quite a complex business. There are two costs to false positives. There is a cost to the person given a false positive score in terms of needlessly increased anxiety until they get to see a specialist; however, this cost can be managed by sensitive communication from the general practitioner. False positives also impose a cost on everyone correctly flagged as positive, since (unless the availability of specialists is increased to meet demand) the more false positives there are, the longer everyone has to wait to see a specialist. In contrast, the costs of false negatives fall more firmly on the person mistakenly allocated to the slow track for assessment by a specialist: they will have to wait longer to be correctly diagnosed and treated, with costs to their mental and physical wellbeing and to those close to them. It is a difficult judgment call, but we think it is at least worth considering in this case whether we might reduce the threshold score a little and accept a slightly higher rate of false positives in order to reduce the rate of false negatives. A complicating factor here is what fraction of those who present with concerns to their general practitioner are in fact (in the judgment of a specialist) suffering from a condition requiring treatment. We can estimate this prevalence from our experiment above and use it to calculate numbers rather than rates of false negatives and positives, and that might influence how we set the threshold.
The receiver operating characteristic curve can also be used to compare the general performance of different tests. The area under the curve (sometimes called the c index) is often taken as a measure of the general effectiveness of a test. On that basis, test A in Figure S11.2 seems superior to test B.
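The area under the curve has a convenient interpretation: it equals the probability that a randomly chosen affected person scores higher than a randomly chosen unaffected one (with ties counting half). A minimal sketch of computing the c index this way, again on synthetic scores rather than real data:

```python
# c index (area under the ROC curve) as the probability that a randomly
# chosen affected person outscores a randomly chosen unaffected one.
# All scores below are synthetic illustrations.
with_disorder = [50, 62, 70, 78, 82, 90]
without_disorder = [30, 45, 55, 65]

pairs = [(d, n) for d in with_disorder for n in without_disorder]
auc = sum(1.0 if d > n else 0.5 if d == n else 0.0
          for d, n in pairs) / len(pairs)
```

An AUC of 0.5 corresponds to a test no better than chance; a perfect test scores 1.0, which is why a larger area (test A) indicates generally better discrimination than a smaller one (test B).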
Screening large numbers of people for a medical condition needs to be handled very carefully, even when a test's sensitivity and specificity both look impressive. Imagine that a questionnaire was developed that general practitioners (GPs) could administer with a patient in five minutes, and that it had 95% sensitivity and 92% specificity for identifying whether the individual is suffering from dementia. Imagine that health policymakers were very impressed with this as a screening tool and instructed GPs to carry out the test on all patients on their first visit after their 65th birthday. Our point is that, if this policy were introduced, it would be essential for GPs to be trained to give patients good advice on what testing positive actually means in terms of their probability of having dementia.
Let us imagine that the prevalence of dementia in 65-year-olds is 4%. If 1,000 people are tested, on average 40 of them will have dementia and 960 will not. The test is 95% sensitive, so on average there will be 38 true positives among the 1,000 people. The test has 92% specificity, so on 8% of occasions it will flag a person without dementia as having it (a false positive); in our group of 1,000, this gives 76.8 false positives on average. Thus, when someone is flagged by the test as having dementia, it is essential to emphasize to them that the test is not definitive and that they should now undergo more detailed testing that will give a definitive answer: of those flagged as having dementia by this initial screening test, only about a third (38 out of 114.8) will actually have dementia.
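The arithmetic above can be checked in a few lines. This sketch uses the stated figures (4% prevalence, 95% sensitivity, 92% specificity, 1,000 people tested) and computes the positive predictive value, i.e. the probability that a flagged person actually has dementia.

```python
# Screening arithmetic from the text: how many flagged people truly
# have the condition, given the stated prevalence and test properties.
n = 1000
prevalence, sensitivity, specificity = 0.04, 0.95, 0.92

with_dementia = n * prevalence                          # 40 people on average
without_dementia = n - with_dementia                    # 960 people
true_positives = with_dementia * sensitivity            # 38
false_positives = without_dementia * (1 - specificity)  # 76.8

# Positive predictive value: P(dementia | flagged positive).
ppv = true_positives / (true_positives + false_positives)
```

The same calculation makes it easy to see how strongly the answer depends on prevalence: doubling prevalence to 8% roughly halves the false-positive pool relative to the true positives and so raises the PPV substantially.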