I wish to thank the letter-writers for effectively articulating their concerns with the paper.1 Cattano essentially poses a question and a concern. He asks for an explanation of the progressive bias that occurs when faculty members evaluate CA1 to CA3 residents. I offer no certain mechanism for this finding. As shown in figure 7 of the paper, more senior residents garner higher degrees of faculty confidence in their skills as they progress through residency. Perhaps this confidence creates a halo effect in normative rating even though the cohort is constant. One mechanism that can largely be ruled out is attrition: at most, we lose one resident per year to attrition.

Cattano also raises the concern that faculty members will positively bias their evaluations of residents in the hope that residents will reciprocate by rating their teaching skills in a similarly positive manner. Such reciprocal grade inflation is a significant concern. We attempt to minimize it by distributing only anonymous evaluations, so residents are unaware of which faculty members evaluated them. Similarly, faculty members receive anonymous resident evaluations of their teaching and are unaware of which residents evaluated them. As mentioned in the paper, we no longer provide residents with the actual scores of their performance, so residents do not know the evaluative numbers assigned by faculty members. Lastly, when individual faculty members' teaching scores are regressed against the clinical performance scores they assign, no important relationship is discernible (data not shown).
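As a rough illustration of that last check, the sketch below regresses the teaching score each faculty member receives on the mean clinical performance score that faculty member assigns. The data layout and function name are hypothetical; this is a minimal sketch of the idea under those assumptions, not the analysis actually performed for the paper.

```python
from statistics import correlation, linear_regression  # Python 3.10+

def reciprocity_check(pairs):
    """Regress the teaching score each faculty member receives on the mean
    clinical performance score that faculty member assigned to residents.

    `pairs` is a hypothetical list of (mean_score_given, teaching_score_received)
    tuples, one per faculty member. A slope and correlation near zero argue
    against reciprocal grade inflation.
    """
    given = [g for g, _ in pairs]
    received = [r for _, r in pairs]
    slope, _intercept = linear_regression(given, received)
    return slope, correlation(given, received)
```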

Watson correctly identifies the statistical sleight of hand whereby ordinal or categorical Likert scores are mapped onto a linear numerical scale. Watson also correctly points out that Likert score intervals are not necessarily equal or even well defined. Fortunately, the Z-score system is based on relative performance and does not use absolute numerical cutoff scores. Although Z-scores are not diagnostic of any particular absolute level of performance, they provide an excellent method for differentiating relative performance among a cohort of residents. This allows us to identify relatively poorer performers, examine the faculty-provided comments for clues as to why the scores were low, and, in turn, create strategies that often result in performance improvement, as shown in figure 9 of the paper.
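To make the relative-performance point concrete, here is a minimal sketch of one plausible way to compute such Z-scores: each rating is standardized against the issuing faculty member's own rating distribution before being averaged per resident, so only relative standing within the cohort survives. The data layout and function are illustrative assumptions, not the exact algorithm from the paper.

```python
import statistics

def cohort_zscores(evals):
    """Standardize each rating against the issuing faculty member's own
    rating distribution, then average the standardized scores per resident.

    `evals` is a hypothetical list of (faculty_id, resident_id, score)
    tuples; this sketches the idea of a normative Z-score, not the
    paper's exact algorithm.
    """
    by_faculty = {}
    for fac, _res, score in evals:
        by_faculty.setdefault(fac, []).append(score)
    # Each rater needs at least two ratings to define a spread.
    spread = {fac: (statistics.mean(s), statistics.stdev(s))
              for fac, s in by_faculty.items() if len(s) > 1}
    by_resident = {}
    for fac, res, score in evals:
        if fac in spread and spread[fac][1] > 0:
            mu, sd = spread[fac]
            by_resident.setdefault(res, []).append((score - mu) / sd)
    return {res: statistics.mean(z) for res, z in by_resident.items()}
```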

Van Schalkwyk, Campbell, and Short are primarily concerned with misclassification of a large fraction of residents based on the statistical methodology found in the paper. In particular, they note that as the number of evaluations grows very large, precisely one-half of all residents will be confidently labeled as below average according to the approach used in the paper. They appear concerned that fully one-half of residents could be inappropriately labeled as problematic or poorly performing, which could have implications for their management and even their future careers. Van Schalkwyk, Campbell, and Short appear to have interpreted “below average” as incompetent or problematic. Nowhere in the paper is this inference made.

The Z-score system allows relative ordering of residents within a cohort. As these authors correctly point out, the lowest-performing resident in the group may be perfectly competent. The Z-score system relates not to competence but to performance, as stated in the paper. The system correctly identifies the relative performance of one resident compared with another and does so with statistical confidence sufficient to differentiate levels of performance. It does not claim that the lowest-performing resident is incompetent. In fact, individual scores are meaningless per se; it is only in the context of the comments associated with Z-scores that concerning performance attributes are identified and intervened upon. Thus there is no threshold Z-score that separates a competent from an incompetent resident. However, as one moves further below average, the likelihood of finding concerning or problematic performance increases. This is precisely what was found in this study: as mean Z-scores fell further below zero, the faculty increasingly checked clinical-competency boxes indicating significant concerns with performance. This is expected if Z-scores relate to performance.

As mentioned in the paper, we tended to find actionable performance concerns associated with Z-scores of approximately −0.5 and below. When such scores are noted, the associated comments are examined for diagnostic clues that can explain the lower-than-average performance. If we find actionable concerns in the comments, we create interventions, and the resident is tasked with improving performance in the identified areas. Examples of success using this method were presented in the paper. At times, we also find residents whose average Z-scores are near −0.5 but whose evaluations contain no concerning comments. Such residents are simply exhibiting performance that is below average for the cohort but not concerning in terms of competence. Ideally, every resident would have a performance-improvement program, including those with above-average Z-scores. However, given limited resources, we focus on residents with below-average Z-scores, with the intent of improving their performance.
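The letter-writers' large-sample point can be illustrated with a small simulation: as the number of evaluations per resident grows, the standard error of each resident's mean Z-score shrinks, and roughly half of any cohort eventually sits confidently below the cohort mean, regardless of how competent every resident is. All parameters below are hypothetical and chosen purely for illustration.

```python
import random
import statistics

def fraction_confidently_below_average(n_residents=24, n_evals=200,
                                       true_spread=0.3, noise=1.0, z_crit=1.96):
    """Count residents whose mean Z-score is statistically below zero.

    Each simulated resident has a modest true offset from the cohort mean;
    each evaluation adds rating noise. Hypothetical parameters, for
    illustration only.
    """
    random.seed(1)
    flagged = 0
    for _ in range(n_residents):
        true_level = random.gauss(0, true_spread)
        scores = [random.gauss(true_level, noise) for _ in range(n_evals)]
        mean = statistics.mean(scores)
        sem = statistics.stdev(scores) / n_evals ** 0.5
        if mean + z_crit * sem < 0:  # entire confidence interval below the cohort mean
            flagged += 1
    return flagged / n_residents

# With more evaluations the flagged fraction climbs toward one-half,
# even though every simulated resident may be perfectly competent.
print(fraction_confidently_below_average(n_evals=50))
print(fraction_confidently_below_average(n_evals=5000))
```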

1. Baker K: Determining resident clinical performance: Getting beyond the noise. ANESTHESIOLOGY 2011; 115:862–78