“Unless [appropriate] modeling is used, the chance of falsely detecting anesthesiologists as [being below average] can be greater than 50%…”
In this issue of Anesthesiology, Glance et al.1 compare statistical methods for risk-adjusted comparisons among providers (e.g., hospitals and anesthesiologists). They present their findings in the context of hospital versus “physician-based measures for Merit-Based Incentive Payment.”1 There are multiple reasons to evaluate the performance of hospitals and their anesthesia departments as single teams.2 Glance et al.1 summarize the policy options well. In this editorial, we consider the implications of the article for evaluating individual anesthesiologists.
Individuals are hired, credentialed by hospitals, and promoted. Consequently, it is reasonable that there are multiple requirements from accreditation agencies (e.g., The Joint Commission, Oak Brook, Illinois) and corporations (e.g., universities) to evaluate individual anesthesiologists’ clinical performance.
When comparing low-incidence binary data (e.g., patient mortality) among anesthesiologists, one must (1) know patient conditions (risk factors) upon admission, (2) adjust for those risks statistically, and (3) compare among anesthesiologists using hierarchical modeling.1,3,4 Unless risk-adjusted hierarchical modeling is used, the chance of falsely detecting anesthesiologists as having below-average performance can be greater than 50% (i.e., worse than flipping a coin).1
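To see why, consider a deliberately naive comparison. The following simulation sketch (ours, not the simulation of Glance et al.; the provider mix, a uniform 100 cases per provider, and the use of the known base rate in place of a pooled estimate are all assumptions for illustration) flags providers with a one-sided exact binomial test, with no risk adjustment and no hierarchical shrinkage, and tallies how often flagged providers are in fact average:

```python
# Toy simulation (ours): 90 average providers at a 2.7% adverse event rate
# and 10 truly worse providers at a 50% higher rate. Each provider is
# "flagged" by a naive one-sided binomial test against the base rate.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n_cases = 100                      # cases per anesthesiologist (assumed)
base_rate = 0.027                  # low-incidence rate from Glance et al.
n_avg, n_poor = 90, 10             # provider mix (assumed)
rates = np.concatenate([np.full(n_avg, base_rate),
                        np.full(n_poor, 1.5 * base_rate)])

flagged_avg = flagged_poor = 0
for _ in range(1000):              # many simulated evaluation periods
    events = rng.binomial(n_cases, rates)
    p_vals = binom.sf(events - 1, n_cases, base_rate)  # P(X >= observed)
    flagged = p_vals < 0.05        # no shrinkage, no multiplicity control
    flagged_avg += flagged[:n_avg].sum()
    flagged_poor += flagged[n_avg:].sum()

fdr = flagged_avg / (flagged_avg + flagged_poor)
print(f"False discovery rate among flagged providers: {fdr:.0%}")
```

In runs of this toy setup, roughly three of every five flagged providers are in fact average, consistent with the “worse than flipping a coin” characterization above.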
The results of the study by Glance et al.1 are convincing because their findings are (reasonably) biased toward underestimating false discovery rates (i.e., incorrectly reporting average anesthesiologists as low performers). First, their simulations assume that the risk adjustment model and the data collected are both perfect, which, of course, is untrue with real (clinical) data. Second, all providers are assumed to have performed the same number of cases, which, again, will be untrue. With imbalance in case numbers, the 95% CIs calculated by the authors would be less accurate (e.g., greater false discovery rates).5
Collecting patient risk factor data and performing hierarchical logistic regression modeling take substantial resources (e.g., analysts).6 The expertise needed for this, compared with that for a Student’s t test, is analogous to the anesthesia expertise needed for cardiac surgery compared with that for diagnostic colonoscopy. Yet, if your department reports low-incidence adverse events (e.g., at or below the 2.7% incidence simulated by Glance et al.1) by individual anesthesiologist, the results show that your department should use risk-adjusted hierarchical logistic regression modeling.1,7
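For readers unfamiliar with what such modeling entails, here is a minimal sketch (ours, not the authors’ model; the single binary risk factor, effect sizes, and case counts are illustrative assumptions, and real risk adjustment would require many more covariates) of a risk-adjusted random-intercept logistic regression using statsmodels:

```python
# Minimal hierarchical (random-intercept) logistic regression sketch.
# A fixed effect adjusts for a patient risk factor; random intercepts
# shrink each provider's estimate toward the overall mean.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(1)
n_providers, cases_each = 50, 200                      # assumed sizes
provider = np.repeat(np.arange(n_providers), cases_each)
high_risk = rng.binomial(1, 0.3, provider.size)        # patient risk factor
provider_effect = rng.normal(0.0, 0.3, n_providers)    # between-provider spread
logit = -3.6 + 1.0 * high_risk + provider_effect[provider]  # ~2.7% baseline
event = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

df = pd.DataFrame({"event": event, "high_risk": high_risk,
                   "provider": provider})
model = BinomialBayesMixedGLM.from_formula(
    "event ~ high_risk", {"provider": "0 + C(provider)"}, df)
result = model.fit_vb()                                # variational Bayes fit
print(result.summary())
```

Even this stripped-down version presupposes accurate risk factor data on every patient, which is precisely the resource burden noted above.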
In our opinion, hiring analysts for this purpose is not worthwhile. Suppose your department accepts a false discovery rate (see Glance et al.1) of approximately 5%. Then, even with an unrealistically large n = 1,000 patients per anesthesiologist per evaluation period for an endpoint, Glance et al.1 show that there is only 14.2% sensitivity to detect anesthesiologists with adverse outcome rates 50% greater than average. Thus, even for the highest-risk procedures (e.g., cardiac surgery), which typically are a small proportion of the total anesthesia caseload, poorly performing anesthesiologists cannot reliably be identified.1,4,8 The reason is that serious adverse events are simply too infrequent for accurate comparisons of individual anesthesiologists. As Glance et al.1 recommend, public reporting and merit-based payment should be by hospital.
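The arithmetic behind this insensitivity is easy to check (a back-of-envelope sketch, ours, not a reproduction of the authors’ hierarchical analysis):

```python
# With n = 1,000 cases, a 50% elevation of a 2.7% event rate moves the
# expected count from 27 to about 40 events, but both counts have wide,
# overlapping binomial spread.
from math import sqrt

n = 1000
p_avg, p_poor = 0.027, 1.5 * 0.027

mean_avg, mean_poor = n * p_avg, n * p_poor            # 27.0 vs. 40.5 events
sd_avg = sqrt(n * p_avg * (1 - p_avg))                 # ~5.1
sd_poor = sqrt(n * p_poor * (1 - p_poor))              # ~6.2

# Separation of the two observed-count distributions, in combined SDs:
d = (mean_poor - mean_avg) / sqrt(sd_avg**2 + sd_poor**2)
print(f"{mean_avg:.0f} vs. {mean_poor:.1f} events; separation ~{d:.1f} SD")
```

A separation of only about 1.7 combined standard deviations, further eroded by hierarchical shrinkage and by controlling the false discovery rate across many providers, is what leaves sensitivity so low.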
Comparing individual anesthesiologists based on clinical performance measures that occur more frequently has also been fruitless.9–12 For example, pain upon arrival in the postanesthesia care unit needs to be risk adjusted for factors often not known accurately (e.g., the specific postanesthesia care unit nurse obtaining the pain score and patient chronic opioid use).9 When the risk adjustments are made, differences among anesthesiologists are not detected.9 Patient satisfaction with the anesthesiologist lacks face (content) validity because amnesia is a fundamental part of anesthesia.10 After controlling for relevant covariates, including patient waiting from surgical start times, there are no significant differences among individual anesthesiologists.11 Finally, prolonged times to extubation differ substantively among patients but not among anesthesiologists.12 Consequently, in our opinion, rely on the results of the study by Glance et al.1 and previous work.7,12 Do not use risk-adjusted hierarchical logistic regression models with low-incidence clinical outcomes and performance measures to compare individual anesthesiologists.
Acknowledgment
The authors thank Ms. Jennifer Espy, B.A. (University of Iowa, Iowa City, Iowa), for assisting with the editing of this manuscript.
Research Support
Supported by funding from the Department of Anesthesia at the University of Iowa, Iowa City, Iowa.
Competing Interests
The authors are not supported by, nor maintain any financial interest in, any commercial activity that may be associated with the topic of this article.