*To the Editor:*

In a recent innovative study, Baker used relative Z scores (Z^{rel}) to correct for observer bias in the assessment of 108 anesthesiology residents.1 We have concerns about the statistical methodology used in this study and believe there is a need for caution before his approach is widely adopted.

Baker distinguishes three groups: those “reliably above average,” “reliably below average,” and “not reliably different from average.” His criterion for identifying a resident who is above average is that the 95% CI for that individual's mean Z score (the mean ± 1.96 times the SEM) does not overlap zero. A similar criterion is used to identify “below average” residents. This approach is problematic.
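This criterion can be made concrete in a few lines. The sketch below is a hypothetical Python reconstruction (our own model is written in R; the function name, score values, and two-sided form of the test are ours, not Baker's code):

```python
import math

def classify(z_scores, crit=1.96):
    """Classify a resident from repeated Z scores by the criterion above:
    'above'/'below' average when mean +/- crit * SEM excludes zero,
    otherwise 'not reliably different from average'."""
    n = len(z_scores)
    mean = sum(z_scores) / n
    sd = math.sqrt(sum((z - mean) ** 2 for z in z_scores) / (n - 1))
    sem = sd / math.sqrt(n)
    if mean - crit * sem > 0:
        return "above"
    if mean + crit * sem < 0:
        return "below"
    return "average"
```

For example, `classify([0.8, 1.1, 0.9, 1.2, 1.0])` yields `"above"`, whereas the same mean with a much wider spread of scores would fall back to `"average"`.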

Although Baker identifies 30% of residents as “reliably below average,” with sufficient assessments the proportion would approach 50%, because the width of the CIs shrinks as assessments accumulate. It is trivially true that, as long as the distribution is symmetric, 50% of people are “below average,” but this does not imply that all “below average” residents require what Baker terms “performance interventions.” Baker's Z scores could be applied to any group of residents, even a sample of entirely competent anesthesiologists, and would still identify a proportion as “below average.” Without a clinically relevant benchmark, Baker's approach cannot be used to identify anesthetic competence.
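The effect of accumulating assessments can be shown directly. In the sketch below (Python rather than the R of our model; all parameter values are illustrative assumptions), each resident's true offset is drawn from a normal distribution, and the fraction flagged as “reliably” above or below average rises steeply as the number of assessments per resident grows:

```python
import random
import statistics

def fraction_flagged(n_residents, n_assess, sd_between=0.5, sd_within=1.0,
                     crit=1.96, seed=1):
    """Fraction of residents whose 95% CI (mean +/- crit * SEM) excludes
    zero. Illustrative assumptions: true offsets ~ N(0, sd_between),
    per-assessment noise ~ N(0, sd_within)."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_residents):
        offset = rng.gauss(0, sd_between)
        scores = [offset + rng.gauss(0, sd_within) for _ in range(n_assess)]
        m = statistics.mean(scores)
        sem = statistics.stdev(scores) / n_assess ** 0.5
        if abs(m) > crit * sem:
            flagged += 1
    return flagged / n_residents

# More assessments -> narrower CIs -> more residents flagged:
# fraction_flagged(500, 10) is well below fraction_flagged(500, 1000).
```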

In translating an overall assessment of ‘anesthetic competence’ into a Z score, Baker makes certain assumptions. One of these is that the competence of anesthesiologists is an underlying, continuous variable that can be normalized. Although this assumption cannot be validated, it can be simulated using a Monte Carlo approach. Figure 1 shows the results of a single run of such a simulation. The assumptions are: that each of 100 individuals has intraindividual variation in Z^{rel} scores that is normally distributed, and that the mean score for each individual is offset by a value similarly sampled at random from a normal distribution (“interindividual variation”) with a known SD (SD^{adj}). As both the generated SD for each individual's variation in competence, and the interindividual offset of his/her mean are known, Baker's approach can be tested against this standard. This simulation produces results that are similar to Baker's figure 5.
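For readers without R to hand, the structure of one run can be sketched in Python (a simplified, hypothetical analogue of our model, with illustrative parameter values; the annotated R source in Supplemental Digital Content 1 remains the reference implementation):

```python
import random

def simulate_run(n_ind=100, n_assess=10, sd_adj=0.5, sd_intra=1.0, seed=0):
    """One run of the two-level model: each individual has a known true
    offset drawn from N(0, sd_adj) (interindividual variation) and
    n_assess observed Z scores perturbed by N(0, sd_intra) noise
    (intraindividual variation). Returns (true_offset, observed_scores)
    pairs, so any classification rule can be checked against the
    known truth."""
    rng = random.Random(seed)
    run = []
    for _ in range(n_ind):
        offset = rng.gauss(0, sd_adj)
        scores = [offset + rng.gauss(0, sd_intra) for _ in range(n_assess)]
        run.append((offset, scores))
    return run
```

Because the true offset of every simulated individual is retained, the classification produced from the observed scores can be compared with the generating truth, which is exactly the comparison Figure 1 makes.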

With this simulation, a “competence threshold” can be set, beyond which individuals are known to be outliers. Although it would be more usual to set the threshold at ±1.96 SD^{adj} (*i.e.*, to assume that just 2.5% of individuals are performing “over” or “under”), the number of individuals Baker categorizes as above or below average (27% and 30%, respectively) suggests a threshold of about ±1 SD^{adj}, which would on average identify 33.6% of residents as either above or below a threshold.

Figure 1 shows that with a competence threshold of just 1 SD^{adj}, Baker's approach misclassifies 33 of the 100 individuals in this run as “reliably” above or below average, despite their underlying competence being within 1 SD of the mean competence. Use of a higher threshold would result in even more misclassification. The paradox of Baker's approach is that the greater the number of evaluations of each individual, the more likely he/she is to be misclassified. Running the current simulation 10,000 times shows that, on average, almost 36% of residents would be misclassified at a competence threshold of 1 SD^{adj}.
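Counting misclassifications against the known truth takes only a few more lines. The following is again a hedged Python sketch with illustrative parameter values, not our R model, and it uses far fewer than 10,000 runs to keep it quick:

```python
import random
import statistics

def misclassified_fraction(n_runs=200, n_ind=100, n_assess=10,
                           sd_adj=0.5, sd_intra=1.0,
                           threshold=1.0, crit=1.96, seed=2):
    """Average fraction of individuals flagged as 'reliably' above or
    below average although their known true offset lies within
    +/- threshold * sd_adj of the mean, i.e. misclassified outliers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        wrong = 0
        for _ in range(n_ind):
            offset = rng.gauss(0, sd_adj)
            scores = [offset + rng.gauss(0, sd_intra)
                      for _ in range(n_assess)]
            m = statistics.mean(scores)
            sem = statistics.stdev(scores) / n_assess ** 0.5
            flagged = abs(m) > crit * sem
            if flagged and abs(offset) <= threshold * sd_adj:
                wrong += 1
        total += wrong / n_ind
    return total / n_runs
```

Raising `n_assess` in this sketch increases the misclassified fraction, which illustrates the paradox described above: more evaluations make “reliable” misclassification more, not less, likely.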

The annotated source code of our model (written in R, version 2.10.1; R Foundation for Statistical Computing, Vienna, Austria) is available in Supplemental Digital Content 1, http://links.lww.com/ALN/A838. The model is robust to repeated testing and to moderate alterations of the SDs used; more extreme changes produce plots that are incompatible with Baker's results. We find it difficult to retain Baker's Z-score normalization without concluding that his subsequent interpretation is flawed. Misclassification of residents based on Baker's approach clearly has implications for their management and even their future careers.

A more robust analytic approach would be to use analysis of means,2 provided the assumption of an underlying, continuous, and normalizable “competence” can be justified, and clinical benchmarking can be established.

The authors thank Associate Professor Ross Ihaka, Ph.D., Department of Statistics, University of Auckland, Auckland, New Zealand, for checking the R code used, and Martin J. Turner, Ph.D., University of Sydney, Sydney, Australia, for comments on the manuscript.