We sought to determine whether mannequin-based simulation can reliably characterize how board-certified anesthesiologists manage simulated medical emergencies. Our primary focus was to identify gaps in performance and to establish psychometric properties of the assessment methods.
A total of 263 consenting board-certified anesthesiologists participating in existing simulation-based maintenance of certification courses at one of eight simulation centers were video recorded performing simulated emergency scenarios. Each participated in two 20-min, standardized, high-fidelity simulated medical crisis scenarios, once each as primary anesthesiologist and first responder. Via a Delphi technique, an independent panel of expert anesthesiologists identified critical performance elements for each scenario. Trained, blinded anesthesiologists rated video recordings using standardized rating tools. Measures included the percentage of critical performance elements observed and holistic (one to nine ordinal scale) ratings of participants’ technical and nontechnical performance. Raters also judged whether the performance was at a level expected of a board-certified anesthesiologist.
Rater reliability for most measures was good. In 284 simulated emergencies, participants were rated as successfully completing 81% (interquartile range, 75 to 90%) of the critical performance elements. The median rating of both technical and nontechnical holistic performance was five, with ratings distributed across the full nine-point scale. Approximately one-quarter of participants received low holistic ratings (i.e., three or less). Higher-rated performances were associated with younger age but not with previous simulation experience or other individual characteristics. Calling for help was associated with better individual and team performance.
Standardized simulation-based assessment identified performance gaps informing opportunities for improvement. If a substantial proportion of experienced anesthesiologists struggle with managing medical emergencies, continuing medical education activities should be reevaluated.
Written or oral examination performances can be unreliable indicators of the real-world performance of physicians as they practice throughout a long career.
Mannequin-based simulation is used to evaluate the performance of anesthesia trainees in crisis event management.
To assess the technical and behavioral performance of board-certified anesthesiologists, those who were already attending simulation courses for American Board of Anesthesiology Maintenance of Certification participated in standardized study simulation scenarios that were video recorded for later scoring by blinded trained raters.
In simulated emergencies, participants successfully completed approximately 80% of critical performance elements, while approximately 25% received low holistic ratings.
Higher-rated performances were not associated with previous simulation experience.
Human performance is imperfect and, without dedicated periodic practice, typically degrades over time.1–3 To this end, Maintenance of Certification (MOC) programs are intended to facilitate lifelong learning and practice improvement.4–7 Maintenance of Certification in Anesthesiology (MOCA) and other fields has recently been revised in response to concerns about cost, relevance to practice, and inconsistent evidence of effectiveness.5,7 Many physicians believe that their practice is safe and that they are performing optimally.8 The ability of practicing anesthesia professionals to manage perioperative emergencies, like cardiorespiratory arrest, anaphylactic shock, or massive hemorrhage, where deficiencies may have life-or-death consequences, is largely unknown. Identifying the performance gaps of practicing clinicians could lead to more effective graduate medical education, continuing medical education, and practice improvement activities.
Assessing the quality of perioperative event management is difficult. Critical events are uncommon and unpredictable in practice, making prospective studies of their management extremely difficult. Post hoc adverse event reports are typically incomplete, and their analysis has inherent selection and hindsight biases.9 Written or oral examination performances may be unreliable indicators of real-world performance.10,11 Mannequin-based simulation, however, provides a unique window on performance: standardized critical events (of varying levels of urgency) can be simulated with reasonable levels of realism,12–15 and participant performance can be evaluated.16–19
Success in managing medical emergencies depends on both technical (e.g., correct diagnosis and therapy) and behavioral (e.g., leadership, communication, and resource management) skills.20,21 Although medical education has recently incorporated behavioral skills training, it was not explicitly taught at many institutions when a preponderance of currently practicing anesthesiologists underwent their primary training.22 In this study, we sought to quantify the distribution of technical and behavioral performance of board-certified anesthesiologists (BCAs) managing realistic perioperative simulated crises, with the following goals: (1) identifying performance gaps that could be addressed in future educational interventions; (2) investigating the feasibility of conducting simulation-based assessment at multiple sites; and (3) providing evidence to support the psychometric adequacy of the scores.
Materials and Methods
Study Design and Context
We conducted a prospective, nonrandomized, observational study at eight American Society of Anesthesiologists–endorsed simulation network programs.1 The study sites were selected based on their research infrastructure and regular conduct of MOCA courses. Study participants were recruited from BCAs who were already attending scheduled simulation courses that satisfied their MOCA simulation training requirement.4,23 The 6- to 8-h MOCA courses use realistic simulated encounters to foster the reflection of attendees on their care and decision-making during perioperative crises. All of the MOCA course scenarios deal with less common, unexpected clinical events of significant severity (e.g., episodes of severe hypoxia and/or hemodynamic instability) requiring recognition and complex management. Course participants are not informed of or given specific training about the clinical scenarios. Each course attendee is the primary anesthesiologist (referred to colloquially as the hot seat [HS] participant) in at least one 20- to 30-min simulated clinical crisis scenario. Because teamwork is emphasized, a second anesthesiologist (the first responder [FR]), naïve to the transpiring crisis, is sequestered until he/she is called to help. Experienced simulation educators facilitate debriefings after each scenario.
We designed four standardized MOCA-compliant study scenarios that were offered in study site MOCA courses between November 2012 and June 2014. After receiving institutional review board approval, each site enrolled consenting participants and collected demographic information. Each participant performed in at least two standardized study simulation scenarios (once in the HS role and once as FR) that were video recorded for later scoring by trained raters.
Designing Four Standardized Scenarios
Four perioperative crisis scenarios were designed and iteratively piloted to do the following: (1) comply with the course requirements4; (2) elicit relevant technical and behavioral skills; and (3) contain critical performance elements (CPEs) that could be observed and scored. A panel of 10 independent subject matter experts (SMEs) advised the study team in creating the simulation scenarios and rating rubrics (Supplemental Digital Content 1, https://links.lww.com/ALN/B480). SMEs were selected based on their clinical and educational expertise; all participated in the American Board of Anesthesiology examination process either as oral examiners or written examination content developers. Some were simulation instructors, but none were simulation researchers or had leadership involvement in simulation. SMEs reviewed, contributed to, and approved the scenario content and assessment metrics. They also affirmed that the scenario content and management expectations were within a BCA’s scope of practice.
The four scenarios were iteratively developed, with the SMEs and research team reviewing and modifying their content as necessary, pilot-testing new iterations, and further refining the scenarios and corresponding checklists. Scenarios were approved for use by consensus of the research team and the SME panel. The resulting scenarios were as follows: (1) local anesthetic systemic toxicity (LAST) with hemodynamic collapse; (2) hemorrhagic shock from occult retroperitoneal bleeding (hemorrhage); (3) malignant hyperthermia (MH) presenting in the postanesthesia care unit; and (4) acute onset of atrial fibrillation with hemodynamic instability followed by ST elevation myocardial infarction (Afib/MI) (Supplemental Digital Content 2, https://links.lww.com/ALN/B481).
Standardization of Scenario Delivery
To standardize the delivery of the scenarios, detailed scripts and a guidebook of rules for scenario delivery were created. The scenario scripts delineated the contents of the simulated clinical environment (e.g., the equipment and medications available), evolution of the patient’s condition throughout the crises and their responses to interventions, standardized answers to anticipated participant questions, and criteria that defined successful completion of CPEs. Each script also contained the timing and content of key phrases or comments to be made by trained confederates, acting in the roles of anesthesiologists, surgeons, nurses, or the patient during the scenarios. These key phrases provided information or clinical context that otherwise would not be available from the mannequins (e.g., “the patient feels warm to me” in the MH scenario). Scripted verbal prompts from confederates were used when necessary to assure timely progression of the scenarios. Key scripted events and standardized content within the scenarios have been published previously, including the rules for standardized delivery of these scenarios.24 Before enrolling participants, investigators confirmed a site’s ability to deliver the standardized scenarios by reviewing video of its pilot-trial encounters. A central database and custom video review software facilitated data collection and analysis (Supplemental Digital Content 3, https://links.lww.com/ALN/B482).
Rating Rubrics, Metrics, and Procedures
Drawing on the existing literature,15,25–30 the project team and SMEs collaboratively developed rating rubrics and tools. Separate scoring rubrics were created for technical and behavioral performance. Because there are advantages and disadvantages of itemized versus global ratings,29,31–33 we developed both types of rubrics to quantify those skills. Technical performance was measured with the percentage of the scenario’s CPEs completed and holistic ordinal scores of overall technical performance. Behavioral performance was measured with numerical ratings made using behaviorally anchored rating scales (BARS) of four categories of skills: vigilance, communication, decision-making, and teamwork, as well as holistic ordinal scores of overall behavioral performance.26 The BARS and holistic behavioral rating scales have been found to be easier to use and yield scores that are just as reliable as the Anesthetists’ Non-technical Skills system,34 a widely used but complex means of rating anesthesia providers’ behavioral skills.26 Finally, based on all of these ratings and their overall evaluation of the performance, the rater made a summative binary assessment (i.e., yes or no) as to whether the participants’ overall performance was at the level expected of a BCA. The raters were instructed to base their binary decision on the holistic scores for technical and nontechnical ratings. If a participant scored in the “poor” bin (see “Video Rater Training and Rating Procedures” section), the rater was instructed to rate the performance “no.” If the scores were on the cusp of poor and medium performance, the rater was instructed to reconsider the technical and behavioral performance to reach the decision. Details of these metrics and scales are provided in Supplemental Digital Content 4 (https://links.lww.com/ALN/B483).
Through a Delphi Process,35 SMEs reached consensus on 72 CPEs (16 to 20 CPEs per scenario) that represented the essential patient management steps deemed necessary in each scenario. CPEs were defined so that they could be rated as either present (observed) or absent (not observed). The CPEs were not weighted as to their importance.
Participants and Study Procedures
Figure 1 illustrates the study enrollment process. After obtaining informed consent, MOCA course attendees who volunteered for the study were allocated to study scenarios. Allocation was made by chance, although many sites assigned participants to all MOCA course (including study) scenarios that were relevant to their practice (e.g., having a pain specialist perform the LAST scenario). Sites were also free to choose the study scenarios that they wished to conduct.
Participants completed a demographic survey (table 1, see also Supplemental Digital Content 3, https://links.lww.com/ALN/B482) and then participated in a standardized orientation to simulation where they were briefed on relevant mannequin characteristics, ground rules for participating in simulation encounters, and location and uses of medications, clinical equipment, and other resources (Supplemental Digital Content 5, https://links.lww.com/ALN/B484). Participants observed or took part in at least one course scenario before performing their first study encounter.
Generally, participants were studied in pairs, once each as the HS or FR in successive scenarios. To facilitate assessment of teamwork and communication skills, the FR was sequestered alone, unable to observe the evolving emergency, thereby mimicking the typical conditions for a real-world emergency response by an attending anesthesiologist. If the HS requested anesthesiologist assistance, the FR joined the simulation encounter, but not earlier than 9 min after the encounter started. If the HS did not request assistance, the FR entered the encounter 12 min after it commenced. Digital audio/video recordings of each study encounter were made and, along with participant demographics and other study data, saved to the project’s central database (see Supplemental Digital Content 6, https://links.lww.com/ALN/B485, for details about how the encounters were captured for later rating).
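The FR entry rule described above can be sketched in a few lines of Python. This is an illustrative reading of the protocol, not the study's software: the function and constant names are hypothetical, and the sketch assumes that a request made between 9 and 12 min is honored immediately and that a request made after 12 min is moot because the FR has already entered.

```python
from typing import Optional

EARLIEST_ENTRY_MIN = 9.0   # FR may not join before this time, even if called
DEFAULT_ENTRY_MIN = 12.0   # FR is sent in at this time if no call was made


def first_responder_entry_min(call_for_help_min: Optional[float]) -> float:
    """Return the minute at which the FR enters the encounter.

    call_for_help_min is the time of the HS's request for anesthesiologist
    assistance, or None if the HS never requested help.
    """
    if call_for_help_min is None:
        return DEFAULT_ENTRY_MIN
    # Assumption: a call before 9 min is honored at 9 min; a call after
    # 12 min changes nothing because the FR has already been sent in.
    return min(DEFAULT_ENTRY_MIN, max(EARLIEST_ENTRY_MIN, call_for_help_min))
```

Under these assumptions, an HS who calls for help at 4 min is joined at 9 min, and one who never calls is joined at 12 min.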
Video Rater Training and Rating Procedures
Nine academic anesthesiologists, previously unaffiliated with the study, with at least 3 yr of clinical practice after board certification and experience as educators and/or raters of clinical performance were selected as potential raters. A panel of project team members established consensus ratings on 24 exemplar study videos to be used as gold standards for rater training and assessment; these videos demonstrated a range of performances in each of the scenarios. Raters participated in a 2-day in-person training session. They were instructed on the use of the online rating software and practiced viewing and rating the exemplar videos. Project team members mentored the raters, providing one-on-one guidance, first in person and then via videoconference. Rater calibration was assessed regularly during training until the rater CPE ratings matched the consensus ratings exactly, their BARS scores were no more than one point from the consensus rating, and their performance ratings were within the same preliminary bin for the holistic HS and team ratings (see descriptions in the following paragraph and Supplemental Digital Content 4, https://links.lww.com/ALN/B483). Seven raters successfully completed the training and were able to rate performances in all four scenarios consistently. Raters were compensated.
After training, raters rated the randomly assigned videos of each recorded encounter an average of 1 yr after they were performed via a Web-based, secure application that allowed for as much review as needed to apply the scoring metrics (Supplemental Digital Content 7, https://links.lww.com/ALN/B486). The software allowed the reviewer to mark each CPE as it was observed. A CPE was counted as having been performed if the HS, FR, or a confederate under their direction completed it at any time during the encounter. Raters then scored the holistic technical and behavioral performance of the HS and the physician team (i.e., HS and FR working together) by assigning the performance to one of three bins (poor, medium, or excellent) and then choosing one of three levels (low, medium, or high) within that bin (fig. 2). Thus, scores one to three were in the poor bin; four to six in the medium bin; and seven to nine in the excellent bin. This scoring system was chosen over a simple ordinal scale because it simplifies the rating process and improves rater reliability.36 For behavioral ratings, the raters scored participants using the BARS, which is composed of detailed anchoring statements describing expected performance of those falling in the poor and excellent bins for each scale. Raters made a summative, binary (yes/no) assessment of overall performance based on the SME-chosen query: “Did this person [or team] perform at the level of a board-certified anesthesiologist?” The primary (HS) anesthesiologist was rated first, followed by the physician team. The raters also assessed whether the degree of standardization of scenario delivery was sufficient for study inclusion (e.g., were there any scenario deviations serious enough to render the encounter manifestly different than intended).
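The two-step holistic rating (bin first, then level within the bin) maps onto the one to nine scale as sketched below. The function names are hypothetical and are used only for illustration; the bin and level labels are those given in the text.

```python
BINS = ("poor", "medium", "excellent")
LEVELS = ("low", "medium", "high")


def holistic_score(bin_name: str, level_name: str) -> int:
    """Convert a (bin, level) choice into the one to nine holistic score."""
    # Each bin spans three consecutive scores: poor 1-3, medium 4-6,
    # excellent 7-9. The level selects the position within the bin.
    return 3 * BINS.index(bin_name) + LEVELS.index(level_name) + 1


def bin_of(score: int) -> str:
    """Return the bin that a one to nine holistic score falls in."""
    if not 1 <= score <= 9:
        raise ValueError("holistic scores range from 1 to 9")
    return BINS[(score - 1) // 3]
```

For example, a rater who chooses the medium bin and the medium level assigns a score of five, the median rating reported for both technical and behavioral performance.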
Raters received batches of videos in a predetermined, counterbalanced order. The same rater was not assigned multiple encounters conducted at a single site on the same day. The raters were instructed not to score a performance if they recognized a participant.
Data were collected directly into the study database portal via preconfigured data entry forms (Supplemental Digital Content 7, https://links.lww.com/ALN/B486). For logistical reasons (e.g., number of courses, number of participants per course, and efficiency of recruitment), the distribution of participant enrollment was uneven across the sites (Supplemental Digital Content 8, https://links.lww.com/ALN/B487).
Of the 342 BCAs entered into the database as participants, 24 were not an HS participant; these were all in scenarios where an extra FR was needed for an HS doing a second study scenario. For the 318 remaining study encounters, 26 (8.2%) were excluded from the final dataset due to obvious scenario standardization issues (e.g., outright mannequin failure in the middle of the scenario) or inadequate audio/video capture. The raters flagged an additional eight videos as unratable, and these were excluded from the final dataset of 284 encounters (net yield of 89%).
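The enrollment arithmetic in this paragraph can be checked directly. The variable names below are ours; the counts are those reported in the text.

```python
entered_in_database = 342        # BCAs entered as participants
never_in_hot_seat = 24           # served only as an extra FR
excluded_by_investigators = 26   # standardization or audio/video problems
flagged_unratable = 8            # excluded by the raters

hs_encounters = entered_in_database - never_in_hot_seat
final_dataset = hs_encounters - excluded_by_investigators - flagged_unratable

# Percentages are expressed relative to the 318 HS encounters.
exclusion_pct = round(100 * excluded_by_investigators / hs_encounters, 1)
net_yield_pct = round(100 * final_dataset / hs_encounters)

print(hs_encounters, final_dataset, exclusion_pct, net_yield_pct)
```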
Reliability of Scores.
Fifty encounters were scored by more than one rater. To estimate interrater reliability, 39 randomly selected encounters were scored independently by at least two raters. Variance components were calculated by scenario to estimate interrater reliability based on a model where two (of the seven) randomly selected raters provided scores. For the summative binary score, κ was calculated.
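As an illustration of the agreement statistic for the summative binary score, the sketch below computes κ for two raters. It assumes the unweighted Cohen form, which the text does not specify, and the function name and example ratings are hypothetical rather than study data.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Unweighted Cohen's kappa for two raters' yes/no judgments."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same nonempty set")
    n = len(rater_a)
    # Observed proportion of encounters on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal "yes" rate.
    p_a_yes = sum(rater_a) / n
    p_b_yes = sum(rater_b) / n
    expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Made-up ratings for 10 encounters (True = performed at the BCA level).
example_a = [True, True, True, True, False, False, False, False, True, False]
example_b = [True, True, True, False, False, False, False, True, True, False]
example_kappa = cohens_kappa(example_a, example_b)
```

In this made-up example the raters agree on 8 of 10 encounters and chance agreement is 0.5, so κ is approximately 0.6. Because κ discounts chance agreement, it can be modest even when raw agreement is fairly high.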
Association between Participant Characteristics and Performance.
CPE data were summarized as the number and percentage of encounters in which each CPE was observed as present or absent. When an encounter was rated more than once, a CPE was scored as not performed only when all of the raters agreed. Binomial logistic regression and the associated likelihood ratio (LR) tests quantified the associations between the odds of CPE completion and participant demographics, accounting for scenario (table 1).
To derive the HS and team technical and behavioral scores in the 39 double-rated encounters, we averaged the ratings, rounding to the nearest integer. Proportional odds logistic regression and the associated LRs tested the associations between technical and behavioral performance and participant demographics, adjusting for scenario. Although the repeated ratings may be correlated among the 24 participants who performed in the HS in two different scenarios, there was insufficient information in these data to model the correlation directly (e.g., using a mixed-effects regression method). Thus, these ratings were treated as independent encounters.
For the binary score in double-rated encounters, a participant’s performance was only rated as not meeting the board-certified anesthesiologist criteria when all of the raters agreed (i.e., all rated it “no”). Binary logistic regression and the associated LRs tested the associations between the odds of being rated a board-certified anesthesiologist and participant demographics, adjusting for scenario. The effects of each covariate were summarized using odds ratios with Wald-type 95% CI.
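The three rules for combining ratings in multiply rated encounters can be summarized in code. This is a minimal sketch with hypothetical function names. The text does not state how ties were rounded when averaging holistic scores, so the sketch assumes that halves round up.

```python
import math
from typing import Sequence


def cpe_performed(ratings: Sequence[bool]) -> bool:
    """A CPE counts as not performed only if every rater marked it absent."""
    return any(ratings)


def holistic_consensus(scores: Sequence[int]) -> int:
    """Average the raters' one to nine scores and round to an integer."""
    mean = sum(scores) / len(scores)
    # Assumption: ties such as 4.5 round up. Python's built-in round()
    # would instead round halves to the nearest even integer.
    return math.floor(mean + 0.5)


def performed_at_bca_level(ratings: Sequence[bool]) -> bool:
    """The binary score is "no" only if every rater answered "no"."""
    return any(ratings)
```

Both all-raters-must-agree rules resolve disagreement in the participant's favor, so the reported CPE completion and satisfactory-performance rates are, if anything, generous.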
Because the HS and team scores were paired, a McNemar test37 was used when assessing the fraction of technical and behavioral scores that fell in the lowest bin, as well as the fraction of performances that were rated as performing at the BCA level.
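For readers unfamiliar with the McNemar test, the sketch below computes a two-sided exact P value from the two discordant-pair counts (for example, encounters in which the HS score fell in the lowest bin but the team score did not, and the reverse). The text does not say whether an exact or chi-square version was used, so the exact binomial form here is an assumption, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar P value.

    b and c are the counts of the two kinds of discordant pairs. Concordant
    pairs carry no information about a difference and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis each discordant pair is equally likely to
    # fall either way, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With made-up counts of 10 and 1 discordant pairs, the function returns 24/2048, or about 0.012.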
As an exploratory analysis, our assessments of hot seat and team performance were additionally adjusted by whether the provider requested assistance (i.e., “call for help”).
A total of 263 unique HS participants performed in 284 encounters. Table 1 shows demographic information for study sample participants and several sources of data characterizing comparative population-based cohorts. When compared to all BCAs (data provided by the American Board of Anesthesiology) and all physicians billing Medicare who self-identified as anesthesiologists (data provided by the American Society of Anesthesiologists), our study cohort was younger, had proportionately more females, and was more likely to be fellowship trained (all P < 0.001). These differences were less pronounced when the study cohort was compared to all BCAs in the MOCA process. The proportion of the study cohort who self-identified as being board-certified in chronic pain (10.1%) was similar to that of the Medicare billing sample (14.0%). Compared with all BCAs in the MOCA process, our cohort was twice as likely to be board-certified in critical care medicine (16.4 vs. 8.1%, P < 0.001).
Compared with the 3,461 MOCA simulation course participants in calendar years 2013–2014, the study cohort was significantly more likely to report practicing in an academic setting (47.1 vs. 28.0%, P < 0.01). Similarly, the study cohort was significantly less likely to report working in a community practice setting (49.8 vs. 66.0%, P < 0.01).
Interrater reliability for the CPEs (percent of checklist items attained) ranged from 0.77 (myocardial infarction) to 0.93 (malignant hyperthermia) across the four scenarios (mean = 0.85). The average interrater reliabilities across scenarios for HS technical and behavioral ratings were 0.72 and 0.83, and for team ratings they were 0.64 and 0.72, respectively. The interrater reliability for the BARS was 0.66. For the HS summative binary score, κ = 0.48; raters disagreed in 11 of 39 (28.2%) encounters with multiple ratings. For the team summative score, κ = 0.27, with disagreement in 14 (30.4%) of the encounters.
Across all of the encounters, 81% (interquartile range [IQR], 75 to 90%; table 3) of the CPEs were observed, with a range of 42 to 100%. The highest frequency of observed CPEs was in the LAST scenario (85% [IQR, 75 to 85%]) and the lowest was in the hemorrhage scenario (77% [IQR, 71 to 88%]). In 46% of encounters, at least four CPEs were missed. Across all of the scenarios, 93% of participants called for help before the time when the first responder would have been sent into the scenario anyway. The likelihood of CPE completion differed by scenario but not by site. Table 2 provides a representative listing of CPEs by scenario and their incidence of observed performance; for a full list of CPEs, see Supplemental Digital Content 4 (https://links.lww.com/ALN/B483).
Technical and Behavioral Scores
The median technical performance rating of HS participants was five; ratings spanned the full one to nine scale. Performance varied significantly only by scenario (LR test P < 0.001), after adjusting for HS demographic and practice characteristics (table 3). Across all of the scenarios, team technical ratings were higher than HS ratings because the arrival of the FR often improved performance (fig. 3). Overall, 30% of the HS and 21% of team technical scores fell within the lowest performance bin (McNemar test P < 0.001).
Overall BARS performance was 5.4 (IQR, 3.5 to 7.1), spanning the metric range from one to nine (table 3). BARS performance varied significantly by scenario (LR test P < 0.001) and participant age (P = 0.037), after adjusting for HS demographic and practice characteristics. Similarly, the median global behavioral rating was five, spanning the full scoring range, and varied significantly by scenario (LR test P = 0.001). Higher participant age (P = 0.004), but not previous simulation experience (yes or no) or other individual factors, was associated with lower behavioral ratings. Overall, in 25% of encounters, HS behavioral scores fell in the lowest bin. Only 14% of team behavioral scores were in this bin (McNemar test P < 0.001 when compared with the HS scores). As seen in figure 3, the arrival of the first responder more often improved than degraded the behavioral score.
In 70% of encounters, the HS participant was rated as “having performed at the level of a board-certified anesthesiologist.” Performance varied significantly by scenario (LR test P = 0.002), with the worst scores in the LAST scenario (43% unsatisfactory). The arrival of FRs frequently improved low HS performances; 34% of unsatisfactory HS scores were followed by satisfactory team ratings, whereas only one (<1%) satisfactory HS score was associated with an unsatisfactory team score (McNemar P < 0.001). HS participants in the under 40-yr age group were more likely to receive a satisfactory binary rating relative to the 40- to 50-yr (odds ratio = 1.86 [95% CI, 1.17–3.10]) and over 50-yr (odds ratio = 2.70 [95% CI, 1.36–5.35]) age groups. HS binary ratings were not associated with any other participant characteristic.
We created a simulation-based assessment process that was reproducible across testing centers, yielded reasonably reliable assessment scores, and measured the performance of important crisis management skills of board-certified anesthesiologists. Based on multiple metrics, there was appreciable variability in the performance of board-certified anesthesiologists. CPEs were commonly omitted. Approximately 30% of encounters were rated as “poor” for overall individual technical or behavioral performance or as “unsatisfactory” for the binary rating. Arrival of the second physician commonly improved performance ratings.
The gaps in performance documented in this simulation study included four broad areas of crisis management: (1) escalation of therapy where first-line therapy is not working (e.g., using epinephrine or vasopressin when phenylephrine or ephedrine and fluids are not appreciably affecting hypotension); (2) using available resources (e.g., calling for help when conditions have deteriorated appreciably); (3) speaking up or engaging other team members, especially when action by them is required (e.g., asking the surgeon to change the surgical approach when it is essential to effective treatment); and (4) following evidence-based guidelines (e.g., giving dantrolene to a patient with obvious MH).
Age was the only statistically significant predictor of performance. Younger participants received higher ratings than older ones, although few participants were more than 60 yr of age. Our 35 participants who were 50 yr of age or older were demographically similar to the 135 participants who were 40 yr of age or younger (other than years in practice), except that they were less likely to be enrolled in MOCA (91 vs. 99%; P = 0.026) and more likely to practice in an anesthesia team model (97 vs. 75%; P = 0.014). Younger and older physicians may differ in many other ways, including the existence or nature of previous crisis management training, comfort with simulation, or simply time since completion of residency training. Degradation of skills from lack of practice or physiologic aging may explain our finding.38
Compared with all anesthesiologists who bill Medicare, with all board-certified anesthesiologists, and even with all BCAs in the MOCA process, our study cohort was younger and more likely to be female, be fellowship trained, and work in an academic practice. If anything, these factors may be more likely to bias our study sample toward those who were more confident about their abilities, more familiar with crisis management, and/or more comfortable with simulation and/or being assessed. We believe that such individuals would be more likely to perform better than those without these attributes. Thus, these results may well be biased toward better performances (in simulation) than might be seen in a fully representative population of all practicing anesthesiologists.
Relationship of This Study’s Results to Those of Previous Studies
Our study validates and expands on results from other studies17,19,39,40 that have assessed performance of anesthesia professionals (often residents) using simulation. We chose to study experienced anesthesiologists (BCAs) because they are the least-studied population yet provide the most patient care. Our sample of 263 BCAs was more than three times larger than that of Devitt et al.40 (79 anesthesiologists) and eight times larger than that of Henrichs et al.17 (35 anesthesiologists plus 26 certified registered nurse anesthetists). Similar to previous investigations, we assessed the technical (i.e., clinical) responses to simulated uncommon events and found a wide variability in the performance of fully trained anesthesia professionals. Like others, we also documented performance deficits, with a substantial rate (20% or higher) of performances rated as “poor,” including many with omissions, errors, or delays in actions deemed by clinical experts a priori to be critical to successful patient care.
Our study methods and results go well beyond those of previous research. Previous studies concentrated on developing tools to reliably and validly measure the ability of individual clinicians. To generate reproducible scores, participants typically performed a number of short, focused scenarios (e.g., 300 s) with quickly observable and unambiguous signs and symptoms.19 Participants often worked completely alone (i.e., no surgeon, nurse, or help to be called). These types of scenarios are less representative of real clinical situations and, at least from a content perspective, may yield less valid performance metrics. Finally, many previous studies assessed only technical performance, ignoring the important contribution of communication and teamwork in patient care. Our goal was to measure the performance of a large sample of experienced anesthesiologists in single scenarios. Although this strategy cannot yield reliable individual ability estimates, it allowed us to investigate group performance in simulations of higher complexity and ecologic validity.
To achieve our study aims, we designed moderate-length scenarios that had multiple credible diagnoses and treatments, thus replicating typical challenges of real events. Our participants worked in a team with trained confederate clinicians and with a second BCA in the latter half of each scenario. This design provided an environment where we could measure both technical and behavioral performance.
Relevance to Real-world Practice
Some may dismiss the variable and sometimes suboptimal performances observed in our study as the result of the artificiality of a simulated setting and contend that such deficiencies do not occur during patient care. However, we observed a variety of performance deficiencies that have been reported previously in both real and simulated events.41 For example, almost one fifth of participants in the atrial fibrillation/myocardial infarction scenario failed to cardiovert unstable atrial fibrillation, and a similar proportion failed to request that the surgeon open the abdomen in the face of exsanguination in the hemorrhage scenario. Performance gaps observed in these simulations are known to occur during patient care, including deficiencies or delays in the following: (1) transfusing during catastrophic hemorrhage42; (2) cardioversion of unstable arrhythmias43; (3) applying appropriate pharmacologic treatment of significant hypotension44; and (4) effective communication between surgical and anesthesia personnel. Failure to engage the surgeon in a timely and effective fashion, including reluctance to suggest that the surgeon obtain help or use an alternate surgical approach,45 is a well-documented pitfall during both real and simulated cases.42,46,47 That performance gaps identified in this study occur and have been associated with poor outcomes in real cases43,48–50 provides evidence to support the construct validity of our results.
Using comparable high-acuity scenarios, one would expect similar findings among other types of anesthesia professionals, emergency physicians, intensivists, interventional cardiologists, or surgeons. Although many other types of clinicians may only rarely face high-acuity critical events, some type of crisis management is required in nearly every clinical domain. Furthermore, issues of interprofessional communication and teamwork, effectively measured in our simulation scenarios, are important across all areas of health care.
The simulated clinical environment, although realistic, was not identical to the participants’ own practice environments. If faced with similar real emergencies in their familiar clinical setting with an established team of colleagues, these participants would probably perform better. Furthermore, since this study was grafted onto a learning experience, participants may not have been as motivated to perform as well as if it had been a test or a real-world crisis. Yet, many BCAs routinely find themselves in suboptimal, unstandardized, or unfamiliar environments where adaptability is essential to effective performance.
Simulating human pathophysiology is challenging, and imperfect portrayal of clinical signs and symptoms of real patients could have induced omission of correct actions or commission of incorrect ones. To mitigate this, participants were familiarized thoroughly with the mannequin and simulated care environment and were studied after having participated in or seen at least one encounter. Notably, two thirds of participants had previous simulation experience. The scenarios were designed to be realistic and appropriate to assess performance.4 Each one contained multiple reinforcing cues to present unambiguous depictions of key events and to produce a realistic progression. Thousands of board-certified anesthesiologists have judged simulation-based MOCA courses to be effective, realistic, and relevant to their practices.4,51 Furthermore, anesthesiologists have indicated that simulation-based training facilitated meaningful practice improvements that often had impact beyond their own individual practices.51 Nevertheless, it is possible that some participants might have performed better with more practice in the simulation environment. Some participants may not have clinical practices that expose them to the types of cases presented during the course. However, the SMEs felt that these four scenarios typified events that all BCAs should be expected to manage.
All four scenarios were designed to depend on management according to established guidelines (e.g., advanced cardiac life support, MH, LAST). The SMEs established the CPEs for each scenario. We subsequently trained the raters based on these criteria. Many of the actions for which performance gaps were seen are indeed widely accepted as appropriate crisis management practices (see table 2 for examples).
Grafting the study onto the MOCA simulation courses constrained our study design. Course logistics mostly restricted participants from being studied more than twice: once in the HS and once as an FR. Each MOCA encounter was followed by a facilitated peer debriefing, which could have influenced subsequent performances. Querying study participants systematically about why they did what they did might have yielded greater understanding of their performance,52 but it would have adversely affected debriefing quality and course flow for all of the course attendees.
Although raters were well trained, used sophisticated video review software, and provided reasonably reliable ratings, they could have missed subtle aspects of participant performance. Notwithstanding, we sought to measure performance fairly, within the constraints of the study design, to determine an upper bound of participant performance. For example, when more than one rater scored an encounter for the CPEs and binary ratings, we used the most favorable score. Interrater reliability was lowest for the HS and team binary ratings, where raters disagreed in approximately 30% of the encounters. There could be several explanations for why reliability was lower than for global technical and behavioral ratings of the same performances: (1) the raters agreed on the level of performance observed but had different opinions about how to rate it, possibly in part because the binary score was not explicitly defined or anchored; (2) the binary rating was the only metric that combined both technical and behavioral elements, and raters may have disagreed about the relative importance of these two aspects of performance; or (3) the raters weighted different attributes of the performance differently over time. In future research, investigators might use our archive of video recordings to test different approaches to address these limitations of holistic performance ratings.
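The most-favorable-score rule and the disagreement rate described above can be illustrated with a minimal sketch. It assumes that "most favorable" for a binary CPE means the element is credited if any rater scored it as performed; the function names, element names, and all ratings below are hypothetical and are not study data.

```python
from typing import Dict, List, Tuple

def most_favorable_cpe(ratings_by_rater: List[Dict[str, bool]]) -> Dict[str, bool]:
    """Credit a CPE as performed if ANY rater scored it as performed (assumed rule)."""
    elements = ratings_by_rater[0].keys()
    return {e: any(r[e] for r in ratings_by_rater) for e in elements}

def percent_completed(cpe: Dict[str, bool]) -> float:
    """Percentage of CPEs credited as performed in one encounter."""
    return 100.0 * sum(cpe.values()) / len(cpe)

def disagreement_rate(pairs: List[Tuple[bool, bool]]) -> float:
    """Fraction of encounters in which two raters' binary ratings differ."""
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical CPE ratings from two raters for a single encounter.
rater_a = {"call_for_help": True, "cardiovert": False, "give_oxygen": True, "inform_surgeon": False}
rater_b = {"call_for_help": True, "cardiovert": True, "give_oxygen": True, "inform_surgeon": False}

combined = most_favorable_cpe([rater_a, rater_b])
score = percent_completed(combined)  # 3 of 4 elements credited

# Hypothetical binary ratings (rater 1, rater 2) for ten encounters.
binary_pairs = [
    (True, True), (True, False), (False, False), (True, True), (False, True),
    (True, True), (True, True), (False, False), (True, False), (True, True),
]
rate = disagreement_rate(binary_pairs)  # 3 of 10 pairs differ
```

Because any single favorable rating is enough to credit an element, this rule can only raise a score relative to either rater alone, which is consistent with treating the resulting figures as an upper bound on performance.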
Previous simulation experience was not an independent predictor of rated performance. Because this was a yes-or-no question, we do not know how much previous simulation experience each participant had, when it might have occurred, or the type of any such experience (e.g., if it targeted acute event management as did our scenarios). Furthermore, many of our demographic variables are not fully independent; for example, more recently trained BCAs are generally younger and could be expected to have had more (and perhaps different types of) previous simulation-based training.
Significance and Future Directions
Practicing anesthesiologists are expected to be competent, to identify gaps in their knowledge and performance, and to participate in continuing medical education and practice improvement programs to address these gaps.53 In particular, they must be able to detect and manage time-sensitive, potentially lethal events. Yet, the literature suggests that suboptimal individual clinician performance still contributes to adverse events during perioperative care.54–56 The ability of an individual clinician involves a myriad of skills that cannot be captured by any single method of assessment, whether it is written or oral examinations, prospective or retrospective performance reporting, or with simulations. Nonetheless, although performance during simulated crisis events may not exactly reflect actual care, the results of this study indicate that simulation can play a key role as one important component of clinician assessment.
We measured population performance, not individual competence. Performance in a single scenario is an inadequate basis to judge the competence of any individual provider. If simulation were to be considered for use in summative performance assessment of any kind, it is clear that many scenarios would be needed to yield a reliable and valid estimate of ability. However, the data of this study, derived from a large sample of practicing anesthesiologists, provide useful feedback for training programs at all levels, from residency through MOC.
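One common way to reason about how many scenarios a reliable individual estimate might require is the Spearman-Brown formula from classical test theory. The sketch below is illustrative only: the formula was not part of the study's analysis, it assumes the scenarios behave as parallel measurements, and the reliability values used are hypothetical rather than study results.

```latex
% \rho_1 : reliability of a score from a single scenario (hypothetical value)
% \rho_k : reliability of the mean score over k parallel scenarios
\[
  \rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}
  \qquad\Longrightarrow\qquad
  k = \frac{\rho^{*}\,(1-\rho_1)}{\rho_1\,(1-\rho^{*})}
\]
% Hypothetical example: \rho_1 = 0.2 and a target reliability \rho^{*} = 0.8 give
\[
  k = \frac{0.8 \times 0.8}{0.2 \times 0.2} = 16 .
\]
```

Under these assumed values, a single scenario would fall far short of the target, and on the order of 16 scenarios per participant would be needed.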
Continuing medical education and professional development currently rely largely on physicians’ self-assessment of their learning.57 Yet, it is well established that physicians have a limited ability to correctly ascertain their learning needs.58 Furthermore, less competent physicians may be more likely to overestimate their current knowledge and abilities.58 To improve performance, humans require accurate information about specific deficiencies (or gaps) and directed feedback from experts or a peer group to be able to inculcate and then strive, through deliberate practice, to achieve these learning goals.1 Simulation-based training with debriefing, such as that offered as part of MOCA, provides such a structure.
Mannequin-based simulation is well suited for assessing the management of high-acuity rare events and for crisis-resource management.59 Consequential, even potentially lethal, clinical performance gaps identified across our study cohort could be targeted for recurrent interprofessional training of both trainees and experienced personnel. Although dire events are rare, the skills needed in crises (anticipation, prevention, identification, and management of challenging occurrences) are universally important attributes of clinician expertise. Simulation allows for recurrent standardized assessment of individuals and teams, with appropriate retraining as indicated. Simulation-based training, often as part of a multimodal intervention, has been shown to improve patient care.33,60,61
Our findings suggest that the responses of some experienced practicing anesthesiologists during life-threatening, real-world events may be suboptimal. Although we cannot say with certainty whether anesthesiologists who perform well or poorly in simulation will respond similarly during actual events, collective experience and the literature suggest that clinician performance during real-world crises is also variable62,63 and imperfect.64
Implications for Real-world Crisis Management.
If performance in emergencies is suboptimal, why does harm to patients seem rare? First, although serious adverse events are relatively uncommon, when they do occur, failure to rescue may be attributed to patient illness or may go unreported.65,66 Second, individual clinicians may self-select their practice to be specialized or even circumscribed in complexity. Clinicians thought to be lower performing than others may be protected by scheduling simpler cases or other support mechanisms. Third, clinicians rarely work in isolation; they are part of care systems designed in part to reduce the risk of and enhance the recovery from untoward events.67 In some settings, many supporting clinicians can be called in to assist in an emergency, whereas in this study only one responding BCA was provided. The arrival of the second BCA usually improved performance, perhaps more so with lower-performing HS participants. The availability of experienced help in real crises depends on practice setting and time of day; many private-practice MOCA course participants comment that help from other BCAs is rarely available to them. Nevertheless, a cornerstone of safe and effective care systems remains high-performing individual clinicians, working alone and together in teams, during routine, nonroutine, and crisis situations.68,69
Implications of the Performance Gaps Observed.
How might the performance gaps that we observed be addressed? Many parallel strategies are possible; most are commonplace in other industries of high dynamism and high intrinsic hazard, such as aerospace, nuclear power, the military, or the maritime industry. These include, for example, recurrent high-fidelity simulation training of both trainees and experienced physicians, sometimes including other team members, on the recognition and management of specific events and the use of crisis resource management techniques, as well as practice working in clinical teams to manage unfolding adverse events. Another strategy is the regular and uniform use of protocol guidance optimized for real-time use via emergency manuals and other cognitive aids. Other industries conduct regular formative performance assessment of individuals and teams and provide appropriate practice improvement activities, as indicated.
We need to understand more deeply why individual physicians and other clinicians do not always execute the kind of decision-making and action that are expected. We also need to investigate in greater detail the decision-making, event management, and team leadership of experienced physicians in many different simulated situations. This might require a full day of simulation training for each participant. Such programs would be costly, but they are a necessary step toward better understanding how to continue to improve physician performance in the pursuit of patient safety.
The authors acknowledge (in alphabetical order) the substantive contributions (e.g., served as a subject matter expert [SME] or video rater, assisted in scenario development, assisted in manuscript preparation) of: Russ Beebe, B.A. (Center for Research and Innovation in Systems Safety, Vanderbilt University Medical Center, Nashville, Tennessee), Thomas Belda, B.S. (Mayo Clinic Multidisciplinary Simulation Center, Rochester, Minnesota), Edwin A. Bowe, M.D. (University of Kentucky College of Medicine, Lexington, Kentucky), Richard H. Blum, M.D., M.S.E. (Children’s Hospital of Boston, Boston, Massachusetts), Brian Cammarata, M.D. (Old Pueblo Anesthesia, Tucson, Arizona), Douglas B. Coursin, M.D. (University of Wisconsin–Madison School of Medicine and Public Health, Madison, Wisconsin), Gregory J. Crosby, M.D. (Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts), Deborah J. Culley, M.D. (Brigham and Women’s Hospital, Harvard Medical School), Anthony Dancel, B.S. (Massachusetts General Hospital, Center for Medical Simulation, Boston, Massachusetts), Andrew Kline, B.A. (Vanderbilt Comprehensive Care Clinic, Vanderbilt University Medical Center), Jordan Halasz, B.S. (Center for Experiential Learning and Assessment, Vanderbilt University Medical Center), Steven C. Hall, M.D. (Northwestern University Feinberg School of Medicine, Chicago, Illinois), Hans J. Hinssen, Dipl. Ing. (Penn State Hershey Clinical Simulation Center, Hershey, Pennsylvania), Joy Hawkins, M.D. (University of Colorado School of Medicine, Aurora, Colorado), Alan Johnstone, B.S. (Vanderbilt University Medical Center), Stephen J. Kimatian, M.D. (The Cleveland Clinic, Cleveland, Ohio), Jerome Klafta, M.D. (University of Chicago Pritzker School of Medicine, Chicago, Illinois), John Lutz, B.S. (Winter Institute for Simulation Education and Research, Pittsburgh, Pennsylvania), Christie Mulvey, B.S. (Penn State Hershey Clinical Simulation Center), Robert Nadelberg, M.D. (Massachusetts General Hospital, Center for Medical Simulation), Viren Naik, M.D., Med., M.B.A. (University of Ottawa Skills and Simulation Centre, Ottawa, Canada), Edward Nemergut, M.D. (University of Virginia School of Medicine, Charlottesville, Virginia), Eric Porterfield, M.S., M.S.N., R.N., F.N.P.-B.C. (Vanderbilt University Medical Center), Niraja Rajan, M.D. (Penn State Hershey Medical Center, Hershey, Pennsylvania), Lauryn Rochlen, M.D. (University of Michigan School of Medicine, Ann Arbor, Michigan), Ryan Romeo, M.D. (University of Pittsburgh School of Medicine and Winter Institute for Simulation Education and Research, Pittsburgh, Pennsylvania), Michael Seropian, M.D. (Oregon Health and Science University, Portland, Oregon), Ljuba Stojiljkovic, M.D. (Northwestern University Feinberg School of Medicine, Chicago, Illinois), Huaping Sun, Ph.D. (The American Board of Anesthesiology, Raleigh, North Carolina), Jeff Taekman, M.D. (Duke University School of Medicine, Durham, North Carolina), Christina Valle (Center for Medical Simulation), William B. Waldrop, M.D. (Baylor College of Medicine, Houston, Texas), and Cynthia Wong, M.D. (University of Iowa Carver College of Medicine, Iowa City, Iowa).
Supported in part by grants from the Agency for Healthcare Research and Quality (No. R18 HS020415), Rockville, Maryland, and the Anesthesia Patient Safety Foundation, Rochester, Minnesota (to Dr. Weinger), and by a grant from the Foundation for Anesthesia Education and Research, Schaumburg, Illinois (to Dr. Banerjee). The American Society of Anesthesiologists, Schaumburg, Illinois, allowed the project team to use their GoToMeeting teleconferencing account.
The authors declare no competing interests.