Background

Grade inflation is pervasive in educational settings in the United States. One driver of grade inflation may be faculty concern that assigning lower clinical performance scores to trainees will cause those trainees to retaliate by assigning lower teaching scores to the faculty member. A finding of near-zero retaliation would therefore be important to faculty members who evaluate trainees.

Methods

The authors used a bidirectional confidential evaluation and feedback system to test the hypothesis that faculty members who assign lower clinical performance scores to residents subsequently receive lower clinical teaching scores. From September 1, 2008, to February 15, 2013, 177 faculty members evaluated 188 anesthesia residents (n = 27,561 evaluations), and 188 anesthesia residents evaluated 204 faculty members (n = 25,058 evaluations). The authors used linear regression to analyze the relationship between the clinical performance scores assigned by faculty members and the clinical teaching scores they received. The authors used complete dyads of faculty member–resident pairs to conduct a mixed effects model analysis. All analyses were repeated for three different epochs, each with different administrative attributes that might influence retaliation.

Results

There was no relationship between mean clinical performance scores assigned by faculty members and mean clinical teaching scores received in any epoch (P ≥ 0.45). Using only complete dyads, the authors’ mixed effects model analysis demonstrated a very small retaliation effect in each epoch (effect sizes of 0.10, 0.06, and 0.12; P ≤ 0.01).

Conclusions

These results imply that faculty members can provide confidential evaluations and written feedback to trainees with near-zero impact on their mean teaching scores.

What We Already Know about This Topic
  • Teaching evaluations are important in medical education

  • Inflation of student grades is common; one driver is faculty members inflating the scores they assign to students in an attempt to receive reciprocal positive evaluations of their teaching skills, or avoiding low scores to avoid a reciprocal low teaching score

What This Article Tells Us That Is New
  • In a residency training program, faculty members who assigned lower clinical performance scores to residents did not receive lower clinical teaching scores

  • In this institution’s residency program, there was little or no retaliatory effect when faculty members assigned low clinical scores to residents through confidential evaluations and written feedback

EVALUATION systems are a cornerstone of medical education. Clinical performance evaluations are judgments by educators of a learner’s clinical progress. Evaluations ensure that performance standards are being met. Clinical teaching evaluations are judgments by learners of an educator’s clinical teaching skillfulness. Teaching evaluations are important because they are often used in decisions about the educator’s promotion, tenure, access to teaching venues, and merit raises.1,2 

Grade inflation threatens the validity of evaluations, and in the worst case, faculty members have passed a medical student they felt should have failed a clinical rotation.3,4  Grade inflation has been documented in high schools,5  higher education,6,7  third-year medical school clerkships,3,8,9  fourth-year medical school subinternships,4  applications to residencies,10  and residency.11  The mechanisms driving grade inflation include faculty members inflating scores in an attempt to receive reciprocal positive evaluations of their teaching skills.2,3,6,12–14  Conversely, they may avoid assigning a low score to a trainee to avoid a reciprocal low teaching score.3  This concern has some merit because retaliation (also termed reciprocity) was demonstrated in a general surgical residency when faculty members gave lower clinical scores to residents and the name of the faculty member was known to the trainee.15 

The evaluation and feedback system used in our study keeps the evaluator’s name confidential. As such, it may appear to be impossible for residents to retaliate for low scores or negative feedback. However, characteristics of optimal feedback (i.e., specific, timely, nonjudgmental, and aimed at helping the learner improve16,17 ) will often provide enough information to identify the author of the feedback. Additionally, a significant amount of communication is nonverbal,18,19  and it is possible that a negative evaluation is communicated nonverbally when a faculty member interacts with a resident.20  Some faculty members have expressed concern to the residency program director (K.B.) about receiving lower clinical teaching scores if they submit a negative evaluation of a resident. This concern is shared by surgical faculty members who are leery of providing poor evaluations to trainees, even when done anonymously, due to concern that residents can identify the faculty member who rendered the poor evaluation.21 

Our program exhibits both grade inflation and faculty member concern about retaliation, so we sought evidence of retaliation using two different approaches. Using our confidential evaluation and feedback system, we addressed macroscopic retaliation by investigating whether faculty members who assigned, on average, lower clinical performance scores to residents were assigned, on average, lower clinical teaching scores by residents. This is termed the leniency hypothesis (teachers who provide higher mean scores to learners are awarded higher mean teaching evaluations).14  We also addressed microscopic retaliation using dyads of individual faculty member–resident pairs in a mixed effects model. Dyads allowed us to study whether there was direct retaliation between individual faculty member–resident pairs. This is termed the reciprocity hypothesis (a learner will assign a higher teaching evaluation to a faculty member if they received a higher evaluative score from that teacher).14  Last, we evaluated sex, since it has been shown to influence the assessment of faculty teaching.22–24 

The Massachusetts General Hospital Institutional Review Board (Boston, Massachusetts) waived the need for informed consent and classified this study as exempt (protocol no. 2013P000912, May 21, 2013). Three distinct periods (epochs) were identified during the study period (September 1, 2008, until February 15, 2013). Each epoch was characterized by a unique combination of administrative details pertaining to how evaluation and feedback information was obtained and distributed (table 1). The evaluator’s name was kept confidential on all evaluations in all three epochs.

Faculty Member Evaluation of Resident Clinical Performance

Each week, faculty members were assigned to provide numerical evaluation and written feedback on resident clinical performance. Evaluation assignments were based on our anesthesia information management system, which tracks which faculty members supervised which residents during the previous week. When anesthesia information management system data were not available (intensive care unit, preoperative clinic, recovery room, and pain clinic), we used weekly schedules to determine who worked with whom, as previously published.11  Faculty evaluation of resident clinical performance was based on the peer comparison section of our evaluation form, which had seven elements, each with a Likert score ranging from 1 to 5. We used the mean of the seven subscores, rescaled onto a 0 to 100 range, as the overall clinical performance score,11  even though the subscores were from a Likert ordinal scale. Our use of means for summarizing ordinal data has been criticized,25  but the pragmatic use is supported in instances where the sample size is large and the data are approximately normally distributed,26,27  as we have shown to be the case with our data.11  Importantly, in the peer comparison section of our form, we previously published and defined a score of 3 to mean peer average when compared with other Massachusetts General Hospital anesthesia residents at the same level of training.11  Thus, the average peer comparison score should not rise as residents advance in the program. Each evaluation form also has areas for faculty members to provide written feedback to the resident. The complete evaluation form has been published.11  Reminder emails were sent at least weekly in response to delinquent evaluations. Details of faculty member evaluation and feedback regarding resident clinical performance for each epoch are found in table 2.
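For illustration, a minimal sketch in Python (not the software used in this study; the function name and the assumption of linear min–max rescaling are ours) of how a seven-element peer comparison evaluation collapses to a 0 to 100 score:

```python
# Minimal sketch, assuming linear rescaling of the 1-5 mean onto 0-100;
# the study reports only that the average "was rescaled onto a 0 to 100 range."

def clinical_performance_score(subscores):
    """Mean of seven 1-5 Likert subscores, rescaled so that 1 -> 0 and 5 -> 100."""
    assert len(subscores) == 7 and all(1 <= s <= 5 for s in subscores)
    mean = sum(subscores) / len(subscores)
    return (mean - 1) / (5 - 1) * 100

# A resident scored "peer average" (3) on every element maps to 50 on the 0-100 scale.
print(clinical_performance_score([3, 3, 3, 3, 3, 3, 3]))  # 50.0
```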

Resident Evaluation of Faculty Member Clinical Teaching

During epochs 1 and 2, residents were assigned to evaluate their faculty members’ clinical teaching based on monthly billing data, which enabled us to know which faculty member supervised which resident. Pairings were extracted about 2 to 3 weeks after the completion of each month-long rotation, as previously published.28  During epoch 3, we used the weekly list detailing which faculty member was assigned to evaluate which resident (see Faculty Member Evaluation of Resident Clinical Performance) to then assign residents to evaluate the corresponding faculty members. Thus, in epoch 3, we had a weekly bidirectional evaluation process. Each raw teaching evaluation contained seven clinical teaching subscores, each with a Likert score ranging from 0 to 10; thus, composite teaching scores ranged from 0 to 70.28  Teaching scores were rescaled onto a 0 to 100 range. Each evaluation form also has areas for residents to provide written feedback to the faculty member. Details of resident evaluation and feedback regarding faculty member clinical teaching for each epoch are found in table 3. Reminders were sent monthly (epochs 1 and 2) or at least weekly (epoch 3) in response to delinquent evaluations.
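A companion sketch for the teaching score (again hypothetical Python; the linear rescaling of the 0 to 70 composite onto 0 to 100 is an assumption consistent with the ranges stated above):

```python
# Minimal sketch: seven 0-10 teaching subscores summed to a 0-70 composite,
# then rescaled onto 0-100 (rescaling assumed linear).

def clinical_teaching_score(subscores):
    assert len(subscores) == 7 and all(0 <= s <= 10 for s in subscores)
    return sum(subscores) / 70 * 100

print(clinical_teaching_score([8, 7, 9, 8, 8, 7, 9]))  # 56/70 of the maximum -> 80.0
```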

Macroscopic Assessment of Retaliation

Macroscopic retaliation refers to the process whereby faculty members who assign, on average, lower clinical performance scores to residents receive, in return, lower teaching scores from residents. Detection of macroscopic retaliation amounts to finding lower average teaching scores among faculty members who assign lower resident clinical performance scores. This has been termed the leniency hypothesis.14  For each epoch, our independent measure was the mean resident clinical performance score assigned by each faculty member. This measure determined the level of faculty leniency (a point measure on the hawk–dove continuum11,29 ). We then used the mean teaching score received by each faculty member as our dependent measure of retaliation. Figure 1A displays these interactions. A pairing was null-resident if a faculty member submitted clinical performance scores to a resident and that resident did not submit any clinical teaching scores on that faculty member during an epoch (fig. 1A). A pairing was null-faculty if the resident submitted clinical teaching scores on a faculty member but the faculty member did not submit any clinical performance scores on that resident during an epoch (fig. 1A).

Microscopic Retaliation: Linking Faculty Member and Resident Evaluations to Create Complete Dyads

Microscopic retaliation is a term we use to describe the retaliation effect between a single faculty member and a single resident (a dyad). A pair (dyad) was complete if a faculty member submitted one or more clinical performance scores to a resident and that resident also submitted one or more clinical teaching scores to that faculty member during an epoch. Figure 1B displays these interactions.

Faculty members and residents sometimes evaluated each other more than once in an epoch because they worked together more than once during an epoch. On average, each resident was evaluated twice by a given faculty member during an epoch. On average, each resident evaluated each faculty member once or twice during an epoch. Thus, the most common dyads in an epoch were 1:1 or 2:1. We defined a dyadic interaction within an epoch as the average score that a faculty member assigned to a resident coupled to the average teaching score that the resident assigned to that faculty member.
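The dyad bookkeeping can be made concrete with a small sketch (hypothetical tables and column names, not the study’s actual database): average each direction of evaluation within an epoch, then inner-join on the faculty member–resident pair so that only complete dyads remain.

```python
import pandas as pd

# Tiny hypothetical example; the real data comprised ~27,561 and ~25,058 evaluations.
faculty_evals = pd.DataFrame({
    "faculty": ["A", "A", "B"], "resident": ["r1", "r1", "r2"],
    "epoch": [1, 1, 1], "perf_score": [50.0, 60.0, 40.0]})
resident_evals = pd.DataFrame({
    "faculty": ["A", "B"], "resident": ["r1", "r1"],
    "epoch": [1, 1], "teach_score": [80.0, 90.0]})

keys = ["faculty", "resident", "epoch"]
perf = faculty_evals.groupby(keys, as_index=False)["perf_score"].mean()
teach = resident_evals.groupby(keys, as_index=False)["teach_score"].mean()

# The inner join keeps complete dyads only (here: A-r1, a 2:1 dyad averaged to 55/80);
# an outer join would instead expose the null-resident (B-r2) and null-faculty (B-r1) pairings.
dyads = perf.merge(teach, on=keys, how="inner")
print(dyads)
```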

Timing (Sequencing) of Evaluation Requests and Returns

In order to investigate retaliation, we had to know the sequence of who evaluated whom and when. During all epochs (1, 2, and 3), faculty members were assigned to evaluate residents they had worked with during the previous 7 days. During epochs 1 and 2, after each month-long rotation, residents were assigned to evaluate the faculty members that they had worked with during the previous month. Thus, in epochs 1 and 2, there was a built-in structural delay of at least 2.5 weeks and up to 6.5 weeks (mean, 4.5 weeks) before residents were assigned to evaluate the faculty members they worked with. During epoch 3, residents were assigned to evaluate faculty members they worked with during the previous 7 days. Thus, in epoch 3, all requests to evaluate were essentially synchronized in time. We measured the real delay (in days) for all faculty-based evaluations of resident clinical performance during all three epochs (table 1). We were able to measure the real delay (in days) for all resident-based evaluations of faculty clinical teaching only for epoch 3 (table 1). In epoch 3, residents had a longer delay than faculty members (mean [SD], 23 [32] vs. 18 [13] days; P < 0.001, unpaired Student’s t test). Thus, our system was arranged so that, on average, faculty members evaluated residents before residents evaluated the corresponding faculty member. We were not able to determine the actual timing for each dyad, and thus, we expect that some sequencing was synchronous or even inverted in all epochs but especially in epoch 3.

Different Components of the Evaluation and Feedback Form Were Revealed in Each Epoch

During epoch 1, residents were able to see the entirety of each evaluation form that contained both evaluative scores and formative feedback comments (but not the name of the faculty member who submitted the form). Links to view these completed evaluation forms were emailed to residents every 7 to 10 days during epoch 1, and residents were required to sign that they read them. During epochs 2 and 3, residents were only able to see portfolios of aggregated written feedback comments but not the corresponding evaluative scores or names of the faculty members who submitted the comments (table 1). Links to portfolios were emailed to residents and their mentors every 2 weeks. Both residents and mentors were required to sign that they read the portfolios. Residents were more than 98% compliant with signing that they had reviewed their evaluations (epoch 1) and portfolios (epochs 2 and 3). We uncoupled evaluative scores from formative feedback comments during epochs 2 and 3 due to educational research showing that grades (scores) can reduce the motivation to learn.30–33  Clinical performance scores were processed into Z scores and used by the clinical competency committee to determine resident clinical performance and to identify residents who were particularly in need of improvement.11 

Statistics

In each epoch, we assessed for macroscopic retaliation by performing linear regression between the mean clinical performance score assigned by each faculty member (independent variable) and the mean clinical teaching score received by that faculty member (dependent variable). We used all available data to compute each mean clinical performance score and each mean clinical teaching score. We verified that linear regression was appropriate for our datasets34  by confirming that the population errors of each regression model were normally distributed, as determined by linear normal probability plots, and by assessing the residuals for heteroscedasticity. None of our linear regression analyses displayed significant heteroscedasticity.
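A sketch of this analysis using Python’s statsmodels and scipy (for illustration only; the authors used other software, the data below are synthetic, and the Breusch–Pagan test is one common heteroscedasticity check, not necessarily the one used):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# Hypothetical per-faculty means with no built-in relationship.
rng = np.random.default_rng(0)
mean_perf = rng.normal(50, 8, size=150)    # mean performance score each faculty member assigned
mean_teach = rng.normal(82, 9, size=150)   # mean teaching score each faculty member received

X = sm.add_constant(mean_perf)
fit = sm.OLS(mean_teach, X).fit()
print(fit.params[1], fit.pvalues[1])       # slope near 0 with a large P value

# Diagnostics: normal probability plot of residuals and a heteroscedasticity test.
(_, _), (_, _, r) = stats.probplot(fit.resid)
print(f"normal probability plot correlation: {r:.3f}")   # near 1.0 when errors are normal
print(het_breuschpagan(fit.resid, X))                    # (LM stat, LM P, F stat, F P)
```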

We sought evidence of microscopic retaliation using a mixed effects model restricted to complete dyads. A complete dyad was composed of two individuals: a faculty member and a resident who evaluated each other. Each dyad had two numerical parts: a mean clinical performance score assigned by the faculty member to the resident (primary independent variable of interest) and a mean clinical teaching score assigned to that faculty member by the evaluated resident (dependent variable). Each dyad was composed of a unique faculty member–resident pairing. Our model took into account epoch (structurally different ways to acquire and distribute evaluations), the resident’s time in the program (0 to 36 months), since this has been shown to affect assignment of teaching scores,24,28  the number of evaluations submitted by a resident on a faculty member, and a retaliation effect (how the faculty member’s score of a resident’s clinical performance affected that resident’s clinical teaching score of that same faculty member). A random effects term was included to account for repeated dyads (the same faculty–resident pairing) that occurred in more than one epoch. The mixed effects model coefficients were estimated using restricted maximum likelihood.
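One way to express such a model (a sketch only: statsmodels syntax, synthetic data, and hypothetical column names are ours; per-epoch retaliation coefficients, as reported in table 5, would additionally require a perf_score by epoch interaction term):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic complete dyads, for illustration only.
rng = np.random.default_rng(1)
n = 400
dyads = pd.DataFrame({
    "dyad_id": np.arange(n) % 250,              # some pairings recur across epochs
    "epoch": rng.integers(1, 4, n),
    "perf_score": rng.normal(50, 10, n),        # mean score the faculty member assigned
    "months_in_program": rng.uniform(0, 36, n),
    "n_evals": rng.integers(1, 4, n),
})
dyads["teach_score"] = (80 + 0.1 * dyads["perf_score"]      # built-in "retaliation" slope of 0.1
                        - 0.2 * dyads["months_in_program"]
                        + rng.normal(0, 8, n))

model = smf.mixedlm(
    "teach_score ~ perf_score + months_in_program + n_evals + C(epoch)",
    data=dyads,
    groups=dyads["dyad_id"],   # random intercept for repeated faculty-resident pairings
)
result = model.fit(reml=True)  # restricted maximum likelihood
print(result.summary())        # the perf_score coefficient estimates the retaliation effect
```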

Sex effects on faculty member evaluation of residents and on resident evaluation of faculty members were computed by comparing the means of scores assigned by males and females using unpaired Student’s t tests assuming unequal variances (i.e., Welch’s t test). We chose Student’s t tests to compare means for two reasons. First, with larger datasets (n > 50), the Student’s t test is a robust statistic for both normally and nonnormally distributed datasets34–38  due to the central limit theorem. Second, the Student’s t test provides additional power to detect differences. Thus, if we did not find a difference using a Student’s t test, then we were all but certain not to detect a difference using a nonparametric test such as the Wilcoxon test.
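For example, the unequal-variance comparison is available as Welch’s t test in scipy (a sketch with synthetic scores; the group sizes echo but do not reproduce the study’s data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
scores_male = rng.normal(82, 9, size=105)     # hypothetical mean teaching scores
scores_female = rng.normal(82, 9, size=67)

# equal_var=False requests the unequal-variance (Welch) form of the t test.
t, p = stats.ttest_ind(scores_male, scores_female, equal_var=False)
print(f"t = {t:.2f}, P = {p:.3f}")
```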

The effect of overall resident clinical performance on resident-assigned faculty clinical teaching scores was determined by performing linear regression between the mean overall resident clinical performance score (Zrel score11 ) and the mean teaching score assigned by that resident for each epoch. Mean Zrel scores were computed using all of each resident’s individual relative-to-peers Zrel scores11  during each epoch. This regression analysis met the criteria of normally distributed errors (linear normal probability plots) and no heteroscedasticity.

Statistical results were determined using StatsDirect, Version 2.6.6 (StatsDirect Ltd., United Kingdom), Excel, Version 2003 (Microsoft, USA), Origin, Version 7.5 SR4 (OriginLab, USA), or SPSS, Version 21 (IBM Corporation, USA). Effect sizes were determined by Cohen d and provide a measure of the size of a difference compared to the variation in the data.39,40  Effect sizes are classified as small (Cohen d = 0.2), medium (Cohen d = 0.5), or large (Cohen d = 0.8).39,40 P values are two sided and determined exactly whenever possible. A P < 0.05 was considered statistically significant.
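Under the standard definition (mean difference divided by the pooled SD), Cohen d can be computed as in this sketch:

```python
import numpy as np

def cohen_d(a, b):
    """Cohen d: difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Benchmarks from Cohen: d = 0.2 (small), 0.5 (medium), 0.8 (large).
```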

Faculty Members Who Assigned Lower Resident Clinical Performance Scores Did Not Receive Lower Teaching Scores (No Macroscopic Retaliation Effects)

Our faculty members provide confidential evaluations (scores) and feedback (written comments) to our residents. Our residents receive a large number of evaluations, of which 52%, 69%, and 71% contained written comments in epochs 1, 2, and 3, respectively (table 2). We sought evidence that faculty members who assigned lower average clinical performance scores to residents would receive lower average teaching scores from residents. We found no relationship between the average clinical performance score assigned by a faculty member and the average teaching score that residents assigned to that faculty member in any of the three epochs (P ≥ 0.45; fig. 2 and table 4). In other words, faculty members who assigned lower clinical performance scores to residents did not receive lower clinical teaching scores in return. This broad macroscopic view indicated a lack of retaliation under each of the three different administrative conditions. A post hoc power analysis using data from all three epochs demonstrated that we had more than 80% power (with α = 0.05) to detect a very small retaliation effect (r = 0.2; d = 0.04). Thus, we have essentially ruled out the leniency hypothesis for our program using our confidential evaluation and feedback system.
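One common way to compute power for detecting a correlation of a given size uses the Fisher z approximation; the paper does not state which method was used, so the following is a sketch under that assumption (n = 200 is an illustrative sample size, not the study’s):

```python
import numpy as np
from scipy.stats import norm

def correlation_power(r, n, alpha=0.05):
    """Two-sided power to detect a true correlation r with n pairs (Fisher z approximation)."""
    z_r = np.arctanh(r)                 # Fisher z transform of the target correlation
    se = 1 / np.sqrt(n - 3)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - z_r / se) + norm.cdf(-z_crit - z_r / se)

print(correlation_power(0.2, 200))  # about 0.81, i.e., >80% power for r = 0.2
```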

Analysis of Faculty Member–Resident Pairs (Dyads) Reveals a Very Small Retaliation Effect (Microscopic Retaliation Effects)

Since we did not find a macroscopic retaliation effect, we proceeded with the mixed effects model to evaluate specific interactions. Our mixed effects model detected a very small retaliation effect in each epoch (table 5). In epochs 1, 2, and 3, the retaliation effect amounted to 0.09, 0.05, and 0.11 point changes in the faculty teaching score (on a 0 to 100 scale) for every one-point change in the resident performance score (on a 0 to 100 scale; P < 0.001, P = 0.010, and P < 0.001, respectively). Thus, using our confidential evaluation and feedback system, we found support for a very small retaliation (reciprocity) effect.

In contrast to these very small retaliation effects, the seniority of the resident had a much larger effect on assigned teaching scores in some epochs. For each additional month that a resident was in the program, the average assigned faculty teaching scores decreased by 0.26 points (epoch 1) and 0.30 points (epoch 2) or increased by 0.04 points (epoch 3) on the 0 to 100 scale (P < 0.001, P < 0.001, and P = 0.039, respectively). An additional small effect on teaching scores was found for the number of teaching evaluations that each resident submitted per faculty member in epochs 2 and 3: the teaching score increased by 0.73 points (epoch 2) and 0.86 points (epoch 3) for each additional teaching evaluation returned by the same resident on the same faculty member (P = 0.03 and P < 0.001, respectively).

Faculty Member Sex, Resident Sex, and Resident Clinical Performance Do Not Affect Assigned Scores

Faculty member sex did not influence the assignment of resident clinical performance scores. Male and female faculty members assigned similar resident clinical performance scores across all three epochs (table 6; P ≥ 0.15). Resident sex did not influence the assignment of faculty member clinical teaching scores. Male and female residents assigned similar faculty member clinical teaching scores across all three epochs (table 6; P ≥ 0.16). Since different epochs were not likely to inherently influence sex bias, we increased the power to detect an effect by combining the data from all three epochs. The combined dataset had 172 unique residents (67 females and 105 males) who had submitted at least five evaluations. Using this post hoc dataset, we were not able to detect an effect of resident sex on the teaching scores they assigned (P = 0.13). We were also not able to detect an interaction of sex on assignment of teaching scores. Male residents evaluated male faculty members (n = 105; mean [SD], 84.2 [8.5]) the same as they evaluated female faculty members (n = 98; mean, 80.0 [8.5]; P = 0.83). Female residents evaluated male faculty members (n = 67; mean, 81.8 [9.3]) the same as they evaluated female faculty members (n = 63; mean, 82.5 [9.1]; P = 0.63). Overall resident clinical performance, as determined using mean Zrel scores, did not influence the assignment of faculty member teaching scores across all three epochs (table 6; P ≥ 0.10).

We Detected Either No or Very Small Retaliation Effects Using Our Evaluation and Feedback System

Our main finding was that faculty members who assigned lower clinical performance scores to residents did not receive lower clinical teaching scores from residents. We, thus, reject the leniency hypothesis for our system. Our results were obtained using three different administrative approaches to evaluation and feedback (epochs 1 to 3), and the results were consistent (fig. 2; table 4). Importantly, the evaluative scores that our faculty members assign to residents are converted to Zrel scores, which correct for individual bias and the unique grade range usage of each faculty member.11  Average Zrel scores differentiate residents who deliver lower and higher clinical performance,11  are stable over time, reliably identify low performers, detect improvement in performance when an educational intervention is successful, are related to an external measure of medical knowledge, and identify poor performance due to a wide variety of causes.11  Zrel scores also are related to American Board of Anesthesiology written (part 1) and oral (part 2) examination scores used to determine board certification.41  Thus, scores that faculty members assign, and comments they write, contain diagnostic information about resident performance. Our study demonstrates that this information can be conveyed to a clinical competency committee without important resident retaliation toward faculty clinical teaching scores. Our results contrast with those of Gardner and Scott,15  who found a macroscopic retaliation effect in a system in which the name of the faculty member was known to the resident. We believe that our confidential system eliminates the macroscopic retaliation effect.

Our second finding came from unique faculty member–resident pairings (dyads) in which bidirectional evaluation had occurred; we used these complete dyads in a mixed effects model to look for microscopic retaliation. Our analysis detected a statistically significant but very small retaliation effect in all three epochs. The retaliation effect in epoch 1 was no larger than in any other epoch, even though it was the only epoch in which we provided residents with the entire evaluation form (including scores and written comments). The largest, yet still very small, retaliation effect occurred in epoch 3, when residents and faculty members were evaluating each other in the most synchronized manner. A potential mechanism for this finding is described next.

How Can Residents Retaliate When They Do Not Know Who Evaluated Them?

Our system treats evaluator identity as confidential (knowable but not revealed). Ostensibly, our system should be immune to retaliation since evaluator identity remains confidential. However, if a faculty member revealed enough information in their written feedback comments, then the resident would know who wrote the comments. In addition, if a faculty member displayed anger or frustration toward a resident, then the resident may react in a negative and retaliatory manner.20  Finally, a significant amount of communication is nonverbal,18,19  and negative evaluations may be nonverbally communicated. Recent evidence demonstrated a universal facial expression that communicates negative judgments.18  We speculate that nonverbal communication explains the retaliation effect found in epoch 3 when nearly synchronized bidirectional evaluation was occurring. This mechanism would also explain why all three epochs had similarly sized retaliation effects despite structural differences in the evaluation and feedback process.

Contextualizing the Size of the Retaliation Effects

To place these findings in context, we modeled the effects of retaliation on faculty teaching rankings based on mean teaching scores. We modeled having a faculty member decrease the scores they assigned to residents by one full SD and assumed retaliation on all subsequent teaching evaluations by the amount detected using only complete dyads (a very conservative projection). Our model demonstrated a retaliatory change in teaching scores of 1.4, 0.8, and 1.7 (on a 0 to 100 scale) for epochs 1, 2, and 3, respectively. These changes would translate into effect sizes, d, of 0.10, 0.06, and 0.12, respectively. A change in teaching scores of this magnitude would only slightly change the rank ordering of faculty members. The effects are shown for a high-scoring (5th percentile), average-scoring (50th percentile), and low-scoring faculty member (95th percentile) for each epoch (fig. 3).
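The arithmetic of the projection is simple, as this sketch shows (the retaliation slopes are the per-epoch coefficients reported above; the SD of assigned performance scores is not reported, so the value here is a hypothetical one chosen to approximately reproduce the published changes):

```python
# Projected change in a faculty member's teaching score if they lowered every
# assigned performance score by one SD and every subsequent teaching evaluation
# retaliated by the dyad-level slope.
retaliation_slope = {1: 0.09, 2: 0.05, 3: 0.11}  # teaching points per performance point
sd_perf = 15.5                                   # hypothetical SD of assigned performance scores

for epoch, slope in retaliation_slope.items():
    change = slope * sd_perf
    print(f"epoch {epoch}: projected teaching-score change = {change:.1f} points")
# -> 1.4, 0.8, and 1.7 points, matching the modeled changes reported above.
```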

In contrast to the very small retaliation effects, resident seniority had a much larger effect on teaching scores. For example, a senior resident (36 months in the program) would, on average, assign teaching scores that were 9.3 points lower, 10.7 points lower, or 1.5 points higher (on a 0 to 100 scale) than those assigned by a beginning resident (0 months in the program) in epochs 1 to 3, respectively. These effects translate into effect sizes, d, of 0.64, 0.83, and 0.10, respectively (table 5). The finding that more senior residents render lower faculty clinical teaching scores has been published24,28  and appears to be related to increasing discernment of what constitutes effective clinical teaching as residents advance in residency.24,28  The lack of a negative seniority effect (epoch 3) can occur when clinical teaching improves.28 

Our failure to find a retaliation signal in our macroscopic analysis, despite a positive microscopic retaliation effect, is likely due to incomplete dyads in the macroscopic dataset. In epochs 1, 2, and 3, we had complete dyads for 43.8%, 34.5%, and 54.2% of our evaluations (table 4). The very small retaliation signal was likely confined to these complete dyads and diluted by null-resident and null-faculty evaluations.

It is important to acknowledge that our dataset contains many evaluations per faculty member, which buffer infrequent retaliation events. With a smaller number of evaluations, a single retaliation event would have a larger effect. Prospect theory42  has shown that people respond to gains (a positive evaluation in this case) and losses (a negative evaluation in this case) asymmetrically such that losses are perceived as far more costly than are equally sized gains. This means that very small retaliation effects may have larger psychologic effects than are justified by the numeric size of the effects.

Lack of Sex Effects on Evaluation

We found no effect of sex on resident assignment of teaching scores to faculty. Previous studies have found mixed results, with some demonstrating higher teaching scores for male faculty members22,23  and others demonstrating higher teaching scores for female faculty members.24  We analyzed interactions of sex to see whether male (or female) residents assigned different scores to male or female faculty members and found no interaction for any combination.

Grade Inflation Occurs as Residents Progress through Residency but This Is Not Accompanied by Higher Teaching Scores

Our faculty members inflate clinical performance scores of more senior residents.11  Our relative-to-peers scoring system defines 3 as peer average at all times during residency. Thus, the average assigned score should remain 3 as a resident advances through residency. During epoch 1, residents saw their actual scores. During this epoch, faculty members assigned increasingly inflated scores to more senior residents, while more senior residents were assigning lower clinical teaching scores to the faculty (P < 0.001). Thus, grade inflation did not lead to higher teaching scores.

Limitations of This Study

Our results are from a single residency and may not generalize to other programs. We designed our system to keep evaluator identity confidential; thus, residents did not explicitly know who evaluated them, making retaliation more difficult. Our faculty members assign inflated scores,11  and these scores were seen by residents in epoch 1. Thus, residents may not have perceived a need to retaliate. However, in epochs 2 and 3, we did not reveal scores to residents, and residents still did not retaliate to any important degree. Another limitation is the large number of evaluations we received; consequently, a negative evaluation would be diluted by other evaluations. Our study did not address the difficulties of maintaining confidentiality in a small program. Although we found little evidence of retaliation, we did not address whether grade inflation would be reduced if the faculty had this knowledge. Our study also did not analyze the information contained in the comments that residents and faculty members wrote. Comments are important to the process of evaluation and feedback, and potentially to the process of retaliation; thus, they will need to be studied in the future.

Conclusions and Practical Implications

Our results provide reassurance to medical educators who worry about the consequences of assigning low clinical performance scores to residents. We found no relationship between clinical performance scores that faculty members assigned to residents and clinical teaching scores that residents assigned to faculty members using our confidential system. This means that hawks and doves29  do not receive different teaching scores as a result of their grading characteristics. When we analyzed faculty member–resident dyads using a mixed effects model, we detected only a very small retaliation effect in each epoch. This suggests that programs can use confidential evaluations with written feedback as a strategy to minimize retaliation. The lack of important retaliation documented in our study should encourage faculty members to be more forthright and provide more developmentally useful feedback to residents. Our results should also encourage faculty members to provide appropriate (less grade inflated) evaluations. This will allow the evaluation process to more accurately denote performance level and allow trainees to benefit from more authentic and less inflated evaluations.

The authors thank both the faculty members who spent time and effort evaluating residents’ clinical skills and the residents who spent time and effort evaluating faculty members’ teaching skills. The authors also thank Mary Wright, Ph.D., Director of the Sheridan Center for Teaching and Learning (Providence, Rhode Island) and Adjunct Assistant Professor, Department of Sociology, Brown University (Providence, Rhode Island), for feedback on the manuscript.

Support was provided solely from institutional and/or departmental sources.

The authors declare no competing interests.

References

1. Curran DS, Stalburg CM, Xu X, Dewald SR, Quint EH: Effect of resident evaluations of obstetrics and gynecology faculty on promotion. J Grad Med Educ 2013; 5:620–4
2. Shearin KK: Grade inflation. Science 1976; 191:340
3. Fazio SB, Papp KK, Torre DM, Defer TM: Grade inflation in the internal medicine clerkship: A national survey. Teach Learn Med 2013; 25:71–6
4. Cacamese SM, Elnicki M, Speer AJ: Grade inflation and the internal medicine subinternship: A national survey of clerkship directors. Teach Learn Med 2007; 19:343–6
5. Walsh J: Does high school grade inflation mask a more alarming trend? Science 1979; 203:982
6. Anonymous: Against grade inflation. Nature 2004; 431:723
7. Fighting grade inflation. Science 1994; 264:1255
8. Weaver CS, Humbert AJ, Besinger BR, Graber JA, Brizendine EJ: A more explicit grading scale decreases grade inflation in a clinical clerkship. Acad Emerg Med 2007; 14:283–6
9. Speer AJ, Solomon DJ, Fincher RM: Grade inflation in internal medicine clerkships: Results of a national survey. Teach Learn Med 2000; 12:112–6
10. Love JN, Deiorio NM, Ronan-Bentle S, Howell JM, Doty CI, Lane DR, Hegarty C; SLOR Task Force: Characterization of the Council of Emergency Medicine Residency Directors’ standardized letter of recommendation in 2011–2012. Acad Emerg Med 2013; 20:926–32
11. Baker K: Determining resident clinical performance: Getting beyond the noise. Anesthesiology 2011; 115:862–78
12. Redding RE: Students’ evaluations of teaching fuel grade inflation. Am Psychol 1998; 53:1227–8
13. Maurer TW: Cognitive dissonance or revenge? Student grades and course evaluations. Teaching Psychol 2006; 33:176–9
14. Clayson DE, Frost TF, Sheffet MJ: Grades and the student evaluation of instruction: A test of the reciprocity effect. Acad Manag Learn Educ 2006; 5:52–85
15. Gardner AK, Scott DJ: Repaying in kind: Examination of the reciprocity effect in faculty and resident evaluations. J Surg Educ 2016; 73:e91–e94
16. Shute VJ: Focus on formative feedback. Rev Educ Res 2008; 78:153–89
17. Archer JC: State of the science in health professional education: Effective feedback. Med Educ 2010; 44:101–8
18. Benitez-Quiroz CF, Wilbur RB, Martinez AM: The not face: A grammaticalization of facial expressions of emotion. Cognition 2016; 150:77–84
19. Keating CF: The life and times of nonverbal communication theory and research: Past, present, future, APA Handbook of Nonverbal Communication, 1st edition. Edited by Matsumoto DR, Hwang HS, Frank MG. Washington, DC, American Psychological Association, 2016, pp 17–42
20. Johnson G, Connelly S: Negative emotions in informal feedback: The benefits of disappointment and drawbacks of anger. Hum Rel 2014; 67:1265–90
21. Tarpley JL, Tarpley MJ: The continuing quest for meaningful faculty evaluations of residents. JAMA Surg 2016; 151:31
22. Beckman TJ, Mandrekar JN: The interpersonal, cognitive and efficiency domains of clinical teaching: Construct validity of a multi-dimensional scale. Med Educ 2005; 39:1221–9
23. de Groot J, Brunet A, Kaplan AS, Bagby M: A comparison of evaluations of male and female psychiatry supervisors. Acad Psychiatry 2003; 27:39–43
24. Fluit CR, Feskens R, Bolhuis S, Grol R, Wensing M, Laan R: Understanding resident ratings of teaching in the workplace: A multi-centre study. Adv Health Sci Educ Theory Pract 2015; 20:691–707
25. Watson NC: Likert or not, we are biased. Anesthesiology 2012; 116:1160; author reply 1161–2
26. Norman G: Likert scales, levels of measurement and the “laws” of statistics. Adv Health Sci Educ Theory Pract 2010; 15:625–32
27. Stevens SS: On the theory of scales of measurement. Science 1946; 103:677–80
28. Baker K: Clinical teaching improves with resident evaluation and feedback. Anesthesiology 2010; 113:693–703
29. McManus IC, Thompson M, Mollon J: Assessment of examiner leniency and stringency (‘hawk-dove effect’) in the MRCP(UK) clinical examination (PACES) using multi-facet Rasch modelling. BMC Med Educ 2006; 6:42
30. White CB, Fantone JC: Pass-fail grading: Laying the foundation for self-regulated learning. Adv Health Sci Educ Theory Pract 2010; 15:469–77
31. Dweck CS: Motivational processes affecting learning. Am Psychol 1986; 41:1040–8
32. Dobrow SR, Smith WK, Posner MA: Managing the grading paradox: Leveraging the power of choice in the classroom. Acad Manag Learn Educ 2011; 10:261–76
33. Lipnevich AA, Smith JK: Effects of differential feedback on students’ examination performance. J Exp Psychol Appl 2009; 15:319–33
34. Lumley T, Diehr P, Emerson S, Chen L: The importance of the normality assumption in large public health data sets. Annu Rev Public Health 2002; 23:151–69
35. Barrett JP, Goldsmith L: When is n sufficiently large? Am Stat 1976; 30:67–70
36. Sullivan LM, D’Agostino RB: Robustness of the t test applied to data distorted from normality by floor effects. J Dent Res 1992; 71:1938–43
37. Boneau CA: The effects of violations of assumptions underlying the t test. Psychol Bull 1960; 57:49–64
38. Ratcliffe JF: The effect on the t distribution of non-normality in the sampled population. J R Stat Soc Ser C Appl Stat 1968; 17:42–8
39. Cohen J: Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Hillsdale, New Jersey, L. Erlbaum Associates, 1988
40. Cohen J: A power primer. Psychol Bull 1992; 112:155–9
41. Baker K, Sun H, Harman A, Poon KT, Rathmell JP: Clinical performance scores are independently associated with the American Board of Anesthesiology certification examination scores. Anesth Analg 2016; 122:1992–9
42. Kahneman D: A perspective on judgment and choice: Mapping bounded rationality. Am Psychol 2003; 58:697–720