Assessment of clinical competence is essential for residency programs and should be guided by valid, reliable measurements. We implemented Baker’s Z-score system, which produces measures of traditional core competency assessments and clinical performance summative scores. Our goal was to validate use of summative scores and estimate the number of evaluations needed for reliable measures.
We performed generalizability studies to estimate the variance components of raw and Z-transformed absolute and peer-relative scores and decision studies to estimate the evaluations needed to produce at least 90% reliable measures for classification and for high-stakes decisions. A subset of evaluations was selected representing residents who were evaluated frequently by faculty who provided the majority of evaluations. Variance components were estimated using ANOVA.
Principal component extraction from 8,754 complete evaluations demonstrated that a single factor explained 91 and 85% of variance for absolute and peer-relative scores, respectively. In total, 1,200 evaluations were selected for generalizability and decision studies. The major variance component for all scores was resident interaction with measurement occasions. Variance due to the resident component was strongest with raw scores, where 30 evaluation occasions produced 90% reliable measurements with absolute scores and 58 for peer-relative scores. For Z-transformed scores, 57 evaluation occasions produced 90% reliable measurements with absolute scores and 55 for peer-relative scores. The results were similar for high-stakes decisions.
The Baker system produced moderately reliable measures at our institution, suggesting that it may be generalizable to other training programs. Raw absolute scores required few assessment occasions to achieve 90% reliable measurements.
Resident evaluations are often idiosyncratic, making it difficult to fairly evaluate both absolute and relative performance
A previously published system overcomes some of these limitations by converting evaluation metrics into Z scores (deviation from average in SD units), adjusted for faculty grade range use, grade inflation, and resident training level
The investigators evaluated the system in their residency
The system was moderately reliable, requiring between 30 and 58 assessments for accuracy
Fewer assessments were needed with absolute scoring than with peer-relative scoring
The Accreditation Council for Graduate Medical Education (ACGME) has provided a framework of six core competencies for evaluating residents: medical knowledge, patient care, practice-based learning and improvement, professionalism, interpersonal and communication skills, and systems-based practice.1 These competencies are intended to constitute a system for evaluating residents based on outcomes and performance, but there is no defined evaluation methodology for accurately and reliably assessing these core skills after each rotation. In anesthesiology training, faculty anesthesiologists evaluate resident performance using clinical and professional observations from the immediate perioperative period and other care settings, such as the intensive care unit, preoperative clinic, and pain clinic. However, faculty may have different opinions about acceptable performance,2 and trainee performance in one situation may not generalize to another.3 Accordingly, making reliable assessments of resident performance is a challenge and requires multiple points of assessment.
Accordingly, residency programs must provide periodic formative, as well as summative, evaluation on all six ACGME core competencies.1 Developing and interpreting these formative evaluations can be challenging, owing to evaluation biases that include grade inflation and idiosyncratic grade-range usage. Baker4 described a system that normalized resident evaluations to adjust for variations in individual faculty anesthesiologist assessments, idiosyncratic grade-range usage, and resident level of training. This system consists of an instrument to measure peer-relative (relative-to-peer) and absolute (anchored) performance in all six core competencies, as well as Z-score transformations that convert the raw measurement data from the assessment tool into normalized Z scores. This system is intended to be used in transforming “noisy” evaluation data into valid and reliable signals that can be utilized to rank-order residents by performance and identify residents that need an intervention to address performance difficulties in the core competencies.
The Z-transformed scores combine and normalize the instrument’s Likert items into standard scores. These standard scores set the mean score to 0 and represent all scores in terms of SDs above or below the mean score. Thus, a Z-transformed score of −0.5 would represent a score that is half an SD below the mean. Baker grouped the peer-relative scores together into one combined measurement (Zrel) and the absolute scores together into another combined measurement (Zabs). Thus, the system relies primarily on two cumulative scores rather than on individual instrument items, and these scores can be used in formative or summative assessments and feedback. Baker proposed thresholds for residents in need of intervention, those experiencing a challenge, and those facing serious performance issues.5
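The standardization step described above can be sketched as follows. This is a minimal illustration only: the function name `z_transform`, the example scores, and the use of the sample SD are assumptions, and the additional adjustment for resident training level described by Baker is not shown.

```python
from statistics import mean, stdev


def z_transform(scores):
    """Standardize one rater's raw scores as (score - mean) / SD."""
    m = mean(scores)
    # Sample SD is an assumption; the source does not state which SD is used.
    # A rater who gives every resident the same score has SD 0 and would
    # need separate handling.
    sd = stdev(scores)
    return [(s - m) / sd for s in scores]


# Hypothetical raw scores from two faculty raters who use different grade ranges.
lenient_rater = [6, 7, 7, 6, 5]
strict_rater = [3, 4, 4, 3, 2]

lenient_z = z_transform(lenient_rater)
strict_z = z_transform(strict_rater)

for raw_a, z_a, raw_b, z_b in zip(lenient_rater, lenient_z, strict_rater, strict_z):
    print(f"lenient {raw_a} -> {z_a:+.2f}   strict {raw_b} -> {z_b:+.2f}")
```

Because the strict rater's scores are the lenient rater's scores shifted down by a constant, both sets map to the same Z scores, which is the sense in which the transformation removes rater-specific grade inflation.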
In July 2012, our institution implemented an evaluation system identical to Baker’s, with absolute and peer-relative measurements that represent the six core competencies, as well as flags for concerns about essential competency attributes, faculty confidence assessment, and free text comments. Absolute measurements are comparisons to fixed competency standards (i.e., criterion-based), whereas peer-relative measurements are comparisons to peers (i.e., norm-referenced), as described further below. Although the system described by Baker was based on a large sample of 14,469 evaluations, it was performed at a single center, and it is unclear whether the findings and methodology would be applicable at another institution or whether the utilization of summative scores is justified. Additionally, the number of measurement occasions (clinical encounters with subsequent evaluation) necessary for dependable measurements was not defined. Thus, we undertook an implementation, validation, and analysis of the Baker Z-score system at our institution. Our study was conceived as a planned attempt to reproduce a notable finding in a domain of educational research for valid and reliable trainee assessment.
Materials and Methods
This study was deemed exempt by the Human Research Protection Program/Institutional Review Board of Vanderbilt University (protocol 130507), as it was research conducted in an existing educational setting. At our anesthesiology residency training program, faculty anesthesiologists are assigned one evaluation per week for each resident they supervise. Our institution uses a web-based platform provided by New Innovations (New Innovations, Inc., USA) to solicit, store, and aggregate residency evaluations. An evaluation instrument containing absolute and peer-relative measurements identical to that described by Baker was used during the time period evaluated (appendix 1). Using this instrument through the New Innovations platform, faculty anesthesiologists may create and enter resident evaluations at any time with a minimum of one evaluation per resident per week requested from each faculty member with whom the resident worked in a given week (Monday through Sunday). Thus, residents receive more evaluations when working with several faculty members in a week than when working with the same faculty member for a week. As a result, residents are evaluated more frequently while on operating room–based rotations compared to non–operating room–based rotations, such as pain medicine clinic or critical care medicine rotations. At the end of each week after the evaluations were assigned, faculty members received an email reminder if they had not completed all assigned resident evaluations.
These evaluations were transmitted in an aggregated file by secure file transport protocol on a monthly basis. Once an evaluation file was received, it was processed by an automated system task that updated a local SQL Server (Microsoft, USA) database. The resulting Z-scores and deidentified raw evaluation data were provided to one of the authors (J.P.W.), who performed manual validation. Comparison of the results from the initial Z-score SQL Server implementation and manual review revealed two significant interpretation discrepancies. Specifically, the original description of the methodology was ambiguous as to how scores should be aggregated for both the relative and absolute measurements and to which set of data means should be applied.5 The technical implementation was applied as a mean of the competencies aggregated by means, i.e., a mean of means, whereas manual validation was performed by taking a mean of all of the individual competency scores. The differing interpretations of the original methodology were resolved after discussion with Dr. Keith H. Baker, M.D., Ph.D., Department of Anesthesia, Critical Care and Pain Medicine, Massachusetts General Hospital, Boston, Massachusetts (personal communication, November 2012), resulting in our final SQL Server implementation (appendix 2, Supplemental Digital Content, https://links.lww.com/ALN/B546). For analysis, we extracted raw scores from this system between July 2012 and June 2015.
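The two interpretations described above (a mean of per-competency means versus a single mean over all individual scores) can give different results whenever the competencies contribute unequal numbers of scores. The sketch below uses hypothetical data and hypothetical function names to show the discrepancy; it is not taken from the SQL Server implementation in appendix 2 and does not indicate which interpretation was ultimately adopted.

```python
from statistics import mean

# Hypothetical item scores grouped by competency; group sizes differ on purpose.
scores_by_competency = {
    "patient_care": [5, 6, 7],
    "medical_knowledge": [4],
    "professionalism": [6, 6],
}


def mean_of_means(groups):
    """Average each competency first, then average those averages."""
    return mean(mean(items) for items in groups.values())


def pooled_mean(groups):
    """Average every individual score, ignoring competency grouping."""
    return mean(score for items in groups.values() for score in items)


print(f"mean of means: {mean_of_means(scores_by_competency):.3f}")
print(f"pooled mean:   {pooled_mean(scores_by_competency):.3f}")
```

When every competency contributes the same number of scores, the two calculations agree, so a discrepancy of this kind may only surface on evaluations with missing or unevenly grouped items.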
Generalizability and Decision Studies
Generalizability studies examine the dependability of behavioral measurements, taking into consideration the magnitude of the multiple sources of measurement error imposed by the situations under which measures were obtained—the universe of generalization. When assessing learner achievement, such situations—facets—usually are represented by item characteristics, rater biases, measurement occasions, and evaluation designs. The error variance of each of these components affects the resulting measure of persons’ behavior that is the object of measurement. By identifying the sources of error and their respective size—the percentage of total error variance—decision makers may identify the factors that contribute to the dependability of performance measures. Decision studies use the results of generalizability studies to inform decisions about changes in the universe of observation.
Four generalizability studies were conducted to assess the reliability of resident measures produced by raw absolute, raw peer-relative, Z-transformed absolute, and Z-transformed peer-relative scores. For these studies, we created a balanced sample from the original data set, using the criterion of 15 evaluations per resident provided by different faculty. This best represented the most frequent situation in our actual environment: one random faculty providing an evaluation of one resident only once, representing 49% of faculty/resident interactions. We chose 15 evaluations per resident because that was the harmonic mean of the number of evaluations per resident in the original data set.5 Our balanced sampling strategy produced data from 80 resident (person facet) evaluations (78% of the number of residents in the original data set) by 78 faculty members (rater facet) (55% of the faculty evaluators in the original data set) on 15 unique weekly evaluations (occasion facet), resulting in 1,200 unique resident evaluations available for analyses. This sample size exceeds the 50 to 500-observation samples adequate to produce robust generalizability studies.6,7
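A balanced sample of this kind could be drawn roughly as sketched below. The tuple layout `(resident_id, rater_id, score)`, the function name `balanced_sample`, and the toy data are assumptions made for illustration; the sketch ignores the weekly occasion structure and is not the authors' sampling procedure.

```python
import random
from collections import defaultdict
from statistics import harmonic_mean


def balanced_sample(evaluations, per_resident, seed=0):
    """Pick `per_resident` evaluations per resident, each from a different rater.

    evaluations: iterable of (resident_id, rater_id, score) tuples (hypothetical layout).
    Residents evaluated by fewer than `per_resident` distinct raters are dropped.
    """
    rng = random.Random(seed)
    by_resident = defaultdict(lambda: defaultdict(list))
    for resident, rater, score in evaluations:
        by_resident[resident][rater].append(score)

    sample = {}
    for resident, by_rater in by_resident.items():
        if len(by_rater) < per_resident:
            continue  # too few distinct raters to keep the design balanced
        raters = rng.sample(sorted(by_rater), per_resident)
        sample[resident] = [(r, rng.choice(by_rater[r])) for r in raters]
    return sample


# Hypothetical evaluation counts per resident; their harmonic mean sets the cell size.
counts = [10, 15, 30]
print(harmonic_mean(counts))

toy = [("A", f"r{i}", 5 + i % 2) for i in range(4)] + [("B", "r0", 4), ("B", "r1", 6)]
print(balanced_sample(toy, per_resident=3))
```

The harmonic mean is pulled toward the smaller counts, which makes it a conservative choice of cell size when residents differ widely in how often they were evaluated.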
To assess potential raters × item interactions, two generalizability studies were performed on the scores of the items of the evaluation instrument (item). These studies included items, raters, and occasions as potential sources of error variance on persons’ scores. The P × R:O × I design (persons crossed with raters within occasions crossed with items) was chosen based on the assumptions that all persons had been rated the same number of times (15 occasions) by one different rater on each occasion on the same seven items that comprised each scale. Items were considered fixed, and all other facets were random. Individual item reliability estimators were extracted from these generalizability studies.
Once the absence of rater × item interaction was confirmed (variance component = 0), four generalizability studies were conducted to investigate the reliability of person measures obtained with raw relative and absolute scores and their counterpart Z-transformed scores. For these studies, we chose a partially nested design P × R:O (persons crossed with raters within occasions), because all persons were evaluated the same number of times by one different rater on each measurement occasion. From these studies, decision studies were performed to estimate the number of occasions on which a resident would need to be evaluated to produce 90% reliable performance measures. Using data from generalizability studies on Z-scores, the dependability of the thresholds proposed by Baker for suboptimal performance (−0.5, −0.6, and −0.8 SD) was investigated.
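As an illustration of how ANOVA yields variance components, the sketch below handles a simplified persons × occasions table with one score per cell, using the standard expected-mean-square equations for a two-way random design. It is an assumption-laden simplification: with one rater nested in each occasion, rater effects are confounded with the occasion and residual terms, and the function name and toy data are hypothetical. The analyses in this study were run in EduG, not with this code.

```python
def variance_components(table):
    """Estimate variance components for a persons x occasions table.

    table[p][o] holds one score per person (row) and occasion (column).
    Negative estimates are truncated to 0, a common convention.
    """
    n_p, n_o = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_p * n_o)
    p_means = [sum(row) / n_o for row in table]
    o_means = [sum(row[o] for row in table) / n_p for o in range(n_o)]

    # Mean squares for the person and occasion main effects.
    ms_p = n_o * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_o = n_p * sum((m - grand) ** 2 for m in o_means) / (n_o - 1)

    # Residual: person x occasion interaction confounded with error.
    ss_res = sum(
        (table[p][o] - p_means[p] - o_means[o] + grand) ** 2
        for p in range(n_p)
        for o in range(n_o)
    )
    ms_res = ss_res / ((n_p - 1) * (n_o - 1))

    return {
        "person": max((ms_p - ms_res) / n_o, 0.0),
        "occasion": max((ms_o - ms_res) / n_p, 0.0),
        "person_x_occasion_residual": ms_res,
    }


# Hypothetical, purely additive scores: three residents on two occasions.
toy_table = [[1, 2], [2, 3], [3, 4]]
print(variance_components(toy_table))
```

In the additive toy table the residual component is zero, so all variance is split between persons and occasions; real evaluation data would leave a substantial residual.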
To assess the variance components of the rater × occasion interaction and their impact on the reliability coefficients, further generalizability studies were conducted using a P:R × O (persons-nested-within-raters-crossed with occasions) design. For these studies, a random balanced sample containing 1,040 data units was drawn from the original data set. The sample comprised evaluations performed by 52 faculty raters of 10 different residents each, on two occasions for each resident (52 × 10 × 2). Generalizability studies were performed on raw and Z-transformed relative and absolute scores.
Scores in the samples used for generalizability studies were compared with scores of data units not used for the generalizability studies by Student’s t tests for independent samples to ensure that the samples used for generalizability studies were representative of the data set regarding the summative scores. A two-sided P value of less than 0.05 was considered statistically significant. EduG software (Swiss Society for Research in Education Working Group, Switzerland) was used to perform the analysis.
From July 2012 to June 2015, 10,525 evaluations were identified for analysis. After discarding incomplete evaluations, 8,754 evaluations remained. These evaluations were entered by 141 faculty members for 102 residents (CA-1 = 250; CA-2 = 3,456; CA-3 = 2,983; and CA-4 = 2,065). The number of evaluations per faculty member ranged from 1 to 349, with a median of 42 evaluations. The number of evaluations received by residents ranged from 1 to 203, with a median of 83 evaluations.
Factor analysis with principal component extraction and orthogonal rotation identified a single factor in each scale. For the absolute scale, the Eigenvalue was 6.33 with 91% explained variance. For the peer-relative scale, the Eigenvalue was 5.95 with 85% explained variance. Given the presence of a single dominant factor, scores for generalizability studies were estimated as the average of the item scores.
Generalizability and Dependability
As described above, 1,200 evaluations were selected for generalizability and decision studies (fig. 1). Out of the four types of scores analyzed, raw absolute scores had the highest degree of variance due to differences between residents (23.2%), followed by raw peer-relative scores (15.9%), Z-transformed peer-relative scores (14.5%), and finally Z-transformed absolute scores (13.8%), as described in table 1. Z-transformed scores had higher standard errors compared to raw scores, relative to score means (table 2). Variance due exclusively to differences between residents was more strongly captured by raw scores compared to the Z-transformed scores. We noted that the greater the variance in the persons facet, the greater the reliability of the measures, as can be observed in the respective generalizability coefficients. Measurement occasions per se did not contribute substantively to the error variance. The major component of score variance was the interaction between persons and measurement occasions.
Based on the decision studies, the number of evaluations needed to produce 90% reliable measures for classification purposes was estimated as 30 for raw absolute scores, 47 for raw peer-relative, 57 for Z-transformed absolute, and 55 for Z-transformed relative scores. Similar figures were estimated for 90% reliable absolute (summative) decisions: 30 for raw absolute scores, 48 for raw peer-relative, 57 for Z-transformed absolute, and 54 for Z-transformed relative scores (figs. 2 and 3). Phi coefficients (dependability indexes) were estimated for high-stakes decisions based on the threshold Z-scores defined in Baker’s studies at −0.8, −0.6, and −0.5 SD. High dependability (reliability) of decisions based on these thresholds was predicted (table 3).
Table 4 shows the results of the generalizability studies designed to disclose the amount of score variance due to rater × occasion interactions. The values varied from 0.2 through 0.6% of total variance, suggesting that faculty were consistent in their ratings across measurement occasions. The variance attributable to differences among raters was apparent only for raw scores. As expected, Z scores were not affected by differences among raters’ rating styles or preferences. However, greater residual error variance was found in Z scores, resulting in lower reliability, as indicated by their respective generalizability coefficients and standard errors.
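The link between the resident variance shares and the number of occasions can be approximated with a back-of-envelope decision-study calculation. The sketch below assumes that all variance not attributable to residents acts as error for a single occasion and that this error shrinks in proportion to the number of occasions; the function name is hypothetical, and this is not the EduG computation used in the study.

```python
import math


def occasions_needed(person_share, target=0.90):
    """Smallest n with person / (person + error / n) >= target.

    person_share: fraction of total variance due to residents (0 to 1).
    Assumption: everything else is single-occasion error variance.
    """
    error_share = 1.0 - person_share
    return math.ceil(target / (1.0 - target) * error_share / person_share)


# Resident (person) variance shares reported in the text above.
person_variance_share = {
    "raw absolute": 0.232,
    "raw peer-relative": 0.159,
    "Z-transformed absolute": 0.138,
    "Z-transformed peer-relative": 0.145,
}

for label, share in person_variance_share.items():
    print(f"{label}: {occasions_needed(share)} occasions")
```

Under this simplifying assumption the calculation gives 30, 48, 57, and 54 occasions, which agrees with the figures reported for absolute decisions and lies within one occasion of those reported for classification.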
Raw absolute scores were significantly higher in the sample used for the P × R:O × I and P × R:O studies as compared to the remaining data set. In the sample used for the P:R × O generalizability study, the raw relative scores were significantly higher than those of the remaining data set. No other differences were observed between study samples and the remaining data set (table 5).
Having a valid, reliable, quantitative, and stable measurement of resident training performance is crucial for informing decisions regarding professional development, promotion, and, when needed, remediation.8 Along with robust conventional mechanisms, including informal feedback by faculty and free text comments, these measurements could potentially identify those in need of remediation and could serve as an early warning system for others.5 Accordingly, we have described the technical aspects of a real-world implementation of the Baker Z-score system in a large residency program (18 residents/yr). Our study has four important findings that add to the literature on resident evaluation. First, we performed a psychometric analysis of the assessment instruments in the Baker evaluation system and demonstrated that the raw scores account for the variance between persons being rated (residents) better than the Z-transformed scores, which was unexpected. The variance attributed to differences among residents was smaller than that associated with the interaction between residents and measurement occasions, indicating that the scores attributed to the residents were homogeneous but varied across measurement occasions. Second, our work demonstrates that the Baker evaluation system using raw or Z-transformed scores produces reliable scores and confirms that it is appropriate to use in formative and summative assessments for measures of resident performance. Third, we demonstrated that fewer rating occasions are needed to reach 90% reliability of the scores produced when using raw absolute scores as compared to Z scores. Finally, we demonstrated high dependability of the Z-transformed score thresholds identified by Baker, which could be readily operationalized by a clinical competency committee using these data as part of a structured assessment process.
A key finding of Baker’s study was “the low correlation between first and second Zrel scores when a faculty member evaluated the same resident on two occasions.”5 The author concluded that “a single Zrel score has only a small amount of clinical performance ‘truth’ associated with it.” This finding matches ours, justifies the approach used in generalizability analysis of choosing a single rater for each measurement occasion, and is consistent with the large amount of variance associated with the interaction of persons and measurement occasions. Baker justified his finding by invoking the context sensitivity theory. Our findings, achieved with different raters in each measurement occasion, also could be explained by context sensitivity. Both studies agree that reliability is dependent on multiple measurement occasions. Our study went further in estimating how many occasions and the amount of consistency of measures depending on the number of measurement occasions. We cannot compare our results regarding raw scores, because they were not analyzed in Baker’s original study. However, greater reliability was found for raw scores compared to Z-transformed scores. This was caused by the greater measurement error associated with Z-transformed scores.
Both generalizability studies presented in this manuscript show that the interactions between residents and measurement occasions contribute substantially to the error variance of scores produced by Baker’s evaluation system. We have also shown that raters are highly consistent across measurement occasions for the same resident, as the negligible amounts of variance associated with the rater × occasion interaction suggest. This occurs in the presence of high heterogeneity among raters in attributing raw scores, as the substantive amount of error variance associated with the rater facet suggests. Taken together, our results are consistent with the conclusion that Baker’s system captures differences in situation-specific resident performance. This is highly desirable, because resident performance is expected to remain unstable (to fluctuate) during the learning curves of complex anesthetic procedures.9
A final point of clarification for interpreting the results of this study is that by nesting only one rater within each measurement occasion, we were unable to estimate variance for the rater within the occasion facet. This was necessary to explore the effect of measurement occasions. For this reason, we conducted another set of generalizability studies designed to explore the consistency of raters’ scoring across repeated occasions. The negligible amount of variance associated with rater × occasion interactions suggests that raters are consistent in their ratings of each resident in at least two consecutive measurement occasions. Such behavior applies to raw and Z scores.
To put our results into practical context, the level of reliability desired for making formative or summative (high-stakes) decisions should be understood. High-stakes exams (e.g., licensing board exams) have a goal of at least 90% reliability, whereas formative assessments accept anything more than 70% as being sufficient.10,11 The decision studies in our analysis demonstrated that when using raw absolute scores, having 30 evaluations per resident would produce 90% reliability, whereas 15 evaluations would produce a reliability of 82% (fig. 3). As a practical example, if an anesthesiology resident received two faculty evaluations per week of work in the operating room, then the compilation of evaluation scores for consideration by the clinical competency committee and for use by the program director in their quarterly review (formative assessment) would include 24 evaluations and have greater than 80% reliability concerning statements made about their performance if the raw scores are used. If this is extended to the 6-month evaluation period with input required for high-stakes reporting to the ACGME and American Board of Anesthesiology, a resident would have 48 evaluations, and both raw absolute and raw peer-relative scores would produce greater than 90% reliability (fig. 3). Of note, raw and Z-transformed absolute and peer-relative scores produce adequate reliability to be used for formative assessment if more than 15 evaluations are completed on the trainee (fig. 2). This level of reliability in formative and summative assessments could be of great assistance to educators in making decisions about resident progression or remediation throughout training.
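The practical example above can be checked with a short projection. The sketch assumes that all non-resident variance is single-occasion error that averages out across evaluations, uses the resident variance shares reported in the Results (23.2% for raw absolute and 15.9% for raw peer-relative scores), and treats a quarter as 12 working weeks and 6 months as 24; the function names and week counts are assumptions for illustration.

```python
def projected_reliability(person_share, n_evaluations):
    """Projected reliability of a mean over n evaluations.

    person_share: fraction of single-evaluation variance due to residents.
    Assumption: the remaining variance is error that averages out with n.
    """
    error_share = 1.0 - person_share
    return person_share / (person_share + error_share / n_evaluations)


def evaluations_accrued(per_week, weeks):
    """Evaluations collected at a steady weekly rate (hypothetical schedule)."""
    return per_week * weeks


quarterly = evaluations_accrued(per_week=2, weeks=12)   # 24 evaluations
semiannual = evaluations_accrued(per_week=2, weeks=24)  # 48 evaluations

print(f"raw absolute, 15 evaluations: {projected_reliability(0.232, 15):.2f}")
print(f"raw absolute, {quarterly} evaluations: {projected_reliability(0.232, quarterly):.2f}")
print(f"raw peer-relative, {semiannual} evaluations: {projected_reliability(0.159, semiannual):.2f}")
```

Under these assumptions the projection gives about 0.82 for 15 raw absolute evaluations, about 0.88 for the quarterly total of 24, and just over 0.90 for 48 raw peer-relative evaluations, in line with the values cited in the paragraph above.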
Of note, this project was implemented before the start of the ACGME Milestone era for anesthesiology. In the Milestone system, which now delineates 25 subcompetencies spread among the 6 core competencies, individual absolute rankings are more desired than peer-relative or training year-relative rankings, because the goal is individual progression toward unsupervised practice with recognition that trainees may progress on different learning trajectories.1 This finding may be of particular importance to training programs across the country, because program directors have been given no specific direction on how to implement evaluation schemas in the Milestone era, and numerous questions remain and are debated. For instance, should the subcompetencies be used verbatim as an assessment tool? Or should an alternative evaluation system be used, which clinical competency committees and program directors then map to the Milestones system for reporting? Based upon our results, use of absolute scoring within the five levels of the Milestones rubric is still needed, but peer-relative assessments could be abandoned in favor of absolute ranking scales focused on the individual learner. Although these recommendations can be made based upon our results, further psychometric evaluation should be undertaken to ensure that reliability of the scores is maintained when using Milestone rankings.
Finally, external validation of competency assessment tools, as in this study, is important in testing whether original research findings are robust to generalization to other settings. As described above under Materials and Methods, unintentional ambiguity in initial descriptions of systems can lead to erroneous implementation if not carefully checked during implementation. Providing reproducible work in the form of shared code, as we have done (appendix 2, Supplemental Digital Content, https://links.lww.com/ALN/B546), can reduce these risks.
The present study does have several limitations. First, because the evaluation system utilized before 2012 by our residency program had a different set of questions, we were unable to evaluate the Z-score system on our historical evaluation data collected before that time to account for historical trends. Second, we did not analyze all aspects of the system that Baker described, omitting analysis of the case confidence scores, essential competency attributes, and qualitative assessment of free text comments. This approach was in large part tactical, because we were attempting to determine the feasibility and appropriateness of incorporating a summary metric into our clinical competency committee process and have modified our case confidence scores from Baker’s description to fit our rotations in a more specific manner. Third, generalizability theory deals basically with random effects. For our generalizability studies, random facets were created by randomly sampling from the original database levels of each facet included in the study according to the intended study design. Therefore, the random nature of our balanced samples does not imply that data were collected from faculty following any random assignment scheme. Finally, our results were from a single institution, which may impact their generalizability. Our approach to discussing education and evaluating residents likely has important differences compared to other institutions, which may have influenced our results.
The future of education research should include studies that determine the predictive value of assessment scores in identifying residents who will have difficulty during residency training. One additional study, for instance, would investigate the relationship between our evaluation data and educational outcomes, such as clinical competency committee referrals and board examination results, which would complement the analysis described above. These outcomes have been positively linked elsewhere.12 Additionally, future studies need to determine whether assessment data can be leveraged to facilitate clinical education within an anesthesiology residency program. For instance, we have described a decision support system for resident operating room assignments, which provided summaries of resident case experience to assist with creating appropriate clinical assignments.13 Highly reliable assessment scores generated by a valid scoring system could be incorporated into such a decision support system, providing information to faculty about ongoing assessments of trainee competence, in addition to the number of cases performed. However, because absolute scores had higher reliability, the next step would be to perform a standard setting procedure (e.g., Angoff) to create criteria for passing and failing.
In conclusion, we report on the implementation and external validation of a resident assessment tool. We determined that the Baker assessment system produces moderately reliable measures from a reasonable number of measurement occasions such that formative and summative assessment decisions can be made, with raw absolute scores requiring the fewest measurement occasions for comparable reliability.
The authors thank Nimesh Patel for his efforts in developing our SQL implementation of the Z-score system.
This work was supported by the Department of Anesthesiology, Vanderbilt University Medical Center, Nashville, Tennessee. Dr. Wanderer was funded by the Foundation for Anesthesia Education and Research and the Anesthesia Quality Institute’s Mentored Research Training Grant-Health Services Research.
The authors declare no competing interests. Dr. McEvoy received funding (not related to this article) from the GE Foundation for educational research work in Kenya, from Edwards Lifesciences for research in goal-directed fluid therapy, and from Cheetah Medical for research in goal-directed fluid therapy.