Optimal risk adjustment is a requisite precondition for monitoring quality of care and interpreting public reports of hospital outcomes. Current risk-adjustment measures have been criticized for including baseline variables that are difficult to obtain and inadequately adjusting for high-risk patients. The authors sought to develop highly predictive risk-adjustment models for 30-day mortality and morbidity based only on a small number of preoperative baseline characteristics. They included the Current Procedural Terminology code corresponding to the patient's primary procedure (American Medical Association), American Society of Anesthesiologists Physical Status, and age (for mortality) or hospitalization (inpatient vs. outpatient, for morbidity).

Data from 635,265 noncardiac surgical patients participating in the American College of Surgeons National Surgical Quality Improvement Program between 2005 and 2008 were analyzed. The authors developed a novel algorithm to aggregate sparsely represented Current Procedural Terminology codes into logical groups and estimated univariable Procedural Severity Scores (one for mortality and one for morbidity) for each aggregated group. These scores were then used as predictors in developing respective risk quantification models. Models were validated with c-statistics, and calibration was assessed using observed-to-expected ratios of event frequencies for clinically relevant strata of risk.

The risk quantification models demonstrated excellent predictive accuracy for 30-day postoperative mortality (c-statistic [95% CI] 0.915 [0.906-0.924]) and morbidity (0.867 [0.858-0.876]). Even in high-risk patients, observed rates calibrated well with estimated probabilities for mortality (observed-to-expected ratio: 0.93 [0.81-1.06]) and morbidity (0.99 [0.93-1.05]).

The authors developed simple risk-adjustment models, each based on three easily obtained variables, that allow for objective quality-of-care monitoring among hospitals.

## What We Already Know about This Topic

Current perioperative risk-adjustment measures have been criticized for including baseline variables that are difficult to obtain and inadequately adjusting for high-risk patients

## What This Article Tells Us That Is New

Novel, highly predictive risk quantification models for 30-day mortality and composite major morbidity were successfully developed that use only three readily available patient and procedural characteristics

Heterogeneity among providers (*i.e.*, clinicians or hospitals) in the quality of health care delivered remains an issue of mounting concern in the United States.1 A key issue in improving overall quality of the healthcare system is identifying this heterogeneity by studying outcomes-based performance measures. But valid outcome-based quality monitoring efforts must properly account for inherent variability in risk associated with differences in baseline characteristics among patients as well as differences in procedural complexity.2,3

The American Society of Anesthesiologists (ASA) Physical Status Classification System has become a routine method of preoperative risk assessment. Although the ASA Physical Status is a baseline risk assessment and thus operative risk is not included in the score, it has been associated with perioperative morbidity and mortality.4–7 Combined with other baseline factors, it may contribute to an accurate characterization of patient risk.

Currently available outcomes-based measures may not adequately adjust for risk, especially in sicker patients.8 To the extent that available indices inadequately account for risk in sicker patients, they encourage clinicians and health systems that must publicly report outcomes to “cherry pick” relatively healthy patients.9

Other scores have been developed for specific risks or for specific patient populations, such as the Goldman Cardiac Risk Index for cardiac risk in noncardiac surgery patients and the European System for Cardiac Operative Risk Evaluation for estimating operative mortality in patients undergoing cardiac surgery.10,11 These scores, although well-established and implemented, were not developed for the purpose of risk-adjusted quality-of-care comparisons among providers with respect to noncardiac surgical patients. More importantly, they require information that usually is not available in administrative databases that most broadly represent the U.S. surgical population.

Recently, Sessler *et al.* introduced a novel Risk Stratification Index based on Medicare Provider Analysis and Review files.12 This index improves greatly on previous models. The Medicare Provider Analysis and Review database contains information from claims for services and relies completely on administrative data, namely International Classification of Disease codes. In contrast, the procedure coding of the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) and several other administrative databases uses the Healthcare Common Procedure Coding System, of which the American Medical Association Current Procedural Terminology (CPT) codes are an integral part.13

The ACS-NSQIP, a national, prospectively implemented program for comparative assessment and enhancement of surgical outcomes among multiple institutions for several surgical subspecialties, incorporates risk-adjustment indices for 30-day mortality and 30-day morbidity.14,15 Although these models are highly predictive, they require collection of detailed information on a large array of patient risk factors (including laboratory values). Generalizable and practical risk-adjustment models would best rely on a limited number of risk factors that are routinely available for most patients. Thus, our goal was to develop practical, highly predictive risk-adjustment models for 30-day mortality and morbidity in the U.S. noncardiac surgical population based only on baseline patient characteristics and CPT codes.

## Materials and Methods

### Data Collection

The ACS-NSQIP is a prospective, outcomes-based registry for comparative assessment and enhancement of the quality of surgical care among multiple institutions for several surgical subspecialties.16,17 More than 250 academic and large community surgical institutions across the United States participate in the ACS-NSQIP. Participating institutions employ full-time surgical clinical reviewers to ensure the integrity of perioperative and 30-day outcomes data. Furthermore, routine audits are performed to monitor the accuracy of the data collection process. Data pertaining to patients undergoing certain low-morbidity, high-volume procedures are limited to avoid overwhelming the registry. Patients included in the ACS-NSQIP registry are enrolled in site-specific, 8-day cycles to avoid systematic bias introduced by weekly patterns in case loads.18 We used perioperative data on 635,265 noncardiac surgery patients treated between 2005 and 2008 at participating ACS-NSQIP surgical centers.

We withheld a randomly selected validation cohort of 50,000 patients to study the predictive accuracy and calibration of our risk models. The remaining patients' data were used for developing our risk models as described below. Although our validation cohort was not external to the set of ACS-NSQIP–participating hospitals, we thought the cohort was adequate to study whether we overfit our model to the training cohort.

The primary outcomes for our study were 30-day postoperative mortality and 30-day postoperative major morbidity, which was defined as an occurrence of one or more of the following postoperative complications: organ space infection, pneumonia, unplanned intubation, pulmonary embolism, ventilator dependence for more than 48 h, acute renal failure, stroke/cerebral vascular accident with neurologic deficit, coma for more than 24 h, cardiac arrest requiring cardiopulmonary resuscitation, myocardial infarction, transfusion of 5 or more units of erythrocytes within 72 h, sepsis/septic shock, and mortality.

### Statistical Analysis

SAS statistical software version 9.2 (SAS Institute, Cary, NC) was used for data management, whereas statistical modeling was performed using R software version 2.8.1 (The R Foundation for Statistical Computing, Vienna, Austria) and, specifically, its Design and Hmisc packages.19

Our overall modeling approach was implemented in two phases. First, for each of our two outcomes (mortality and morbidity) we developed a univariable score measuring procedure-associated risks (hereafter referred to as Procedural Severity Scores or PSS). Second, we developed multivariable logistic regression models for mortality and morbidity, each using the respective PSS and routinely available patient characteristics as predictor variables.

### Univariable Risk Scores Associated with the Primary Procedure

Development of the univariable procedure-associated risk scores was as follows: The ACS-NSQIP database includes procedural data, encoded using the CPT codes.20 Our main predictor variable was based on the primary CPT code associated with each patient's visit. Frequencies of individual (primary) CPT codes in the training data set varied widely, ranging from 1 patient (369 CPT codes) to 39,373 patients (laparoscopic cholecystectomy); among the 2,555 CPT codes present in the training data set, 1,721 codes were represented by 30 or fewer patients.

### Aggregation of Procedures

The Clinical Classifications Software for Services and Procedures (CCS, U.S. Department of Health and Human Services Agency for Healthcare Research and Quality) aggregates each individual CPT code into 1 of 244 mutually exclusive, clinically appropriate categories. Based on these CCS procedure groups, we postulated that the aggregation of many sparsely represented CPT codes would provide increased precision for predicting the primary outcomes of our study. Furthermore, we considered the possibility that even some CCS procedure groups might have similarly low frequencies of cases. On the other hand, certain CPT codes (such as laparoscopic cholecystectomy) were sufficiently common not to require aggregation.

We therefore aggregated procedures represented by fewer than a certain number of patients (which we denote by *N**) into the associated CCS procedure group. Furthermore, we allowed for the possibility that even after this aggregation, certain CCS procedure groups would be sparsely represented. These sparsely represented CCS procedure groups were further aggregated into an all-purpose “other” group using the same rule (aggregate if the number of patients is less than *N**). For 30-day mortality and composite 30-day morbidity separately, we selected *N** objectively such that the aggregation routine yielded maximal predictive accuracy among the training cohort.
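The two-step rule described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation (their analysis used SAS and R). The function name, the `cpt_to_ccs` lookup, and the label format are hypothetical. The sketch also assumes that a CCS group's size in the second step is counted over the patients folded into it during the first step, a detail the text does not specify.

```python
from collections import Counter

def aggregate_procedures(cpt_codes, cpt_to_ccs, n_star):
    """Return one aggregated group label per patient.

    cpt_codes  : list of primary CPT codes, one per patient
    cpt_to_ccs : dict mapping each CPT code to its CCS procedure group
                 (hypothetical lookup table)
    n_star     : minimum number of patients for a group to stand alone
    """
    # Step 1: a CPT code with fewer than n_star patients is replaced
    # by its CCS procedure group; common codes are kept as they are.
    cpt_counts = Counter(cpt_codes)
    level1 = [
        ("CPT", code) if cpt_counts[code] >= n_star
        else ("CCS", cpt_to_ccs[code])
        for code in cpt_codes
    ]

    # Step 2: a CCS group that is still below n_star patients
    # (assumption: counted over the patients folded into it in step 1)
    # is merged into a single all-purpose "other" group.
    level1_counts = Counter(level1)
    return [
        label if label[0] == "CPT" or level1_counts[label] >= n_star
        else ("OTHER", "other")
        for label in level1
    ]
```

With `n_star = 3`, a code seen five times keeps its own label, two sparse codes sharing a CCS group with four patients in total are pooled into that group, and a lone code in its own CCS group falls through to "other".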

### Determination of Minimum CCS Procedure Group Size for Aggregation

We implemented a cross-validation study (for each outcome) as follows: For candidate values of *N** ranging between 10 and 5,000, we randomly partitioned the training cohort into 10 subgroups of equal size. Then, for each subgroup, we used the other 90% of the data to (1) aggregate procedures using the methodology outlined above (with the candidate *N** in question); (2) estimate the incidence of the outcome within each (aggregated) group; and (3) evaluate the predictive accuracy of these incidences for the reserved 10% of the data using the c-statistic (the c-statistic is a quantity ranging between 0.5 and 1.0, for which 0.5 represents no discriminative ability beyond random guessing, whereas 1.0 represents perfect prediction).19 This process was repeated over the 10 randomly partitioned subgroups, and the c-statistics were averaged to obtain an estimate of the expected c-statistic associated with the candidate *N** in question. Smoothing spline regression was then used to find *N** (*i.e.*, the value that maximized the cross-validated c-statistic), as shown in figure 1 for 30-day mortality.
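As a rough sketch of this cross-validation loop for a single candidate *N**, the following Python code computes a 10-fold cross-validated c-statistic. It is an illustration under stated assumptions, not the authors' code: all function and variable names are hypothetical, the c-statistic is computed by brute-force pair counting (adequate only for small data), held-out patients whose code never appears in the training folds are scored with the overall training incidence (an assumption), and folds lacking either events or non-events are skipped. The final smoothing-spline step over candidate values is omitted.

```python
import random
from collections import Counter

def c_statistic(scores, outcomes):
    """Probability that a random event case has a higher score than a
    random non-event case; ties count as one half."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))

def fit_group_map(codes, cpt_to_ccs, n_star):
    """Learn a code -> group mapping on training data (two-step rule)."""
    cpt_counts = Counter(codes)
    level1 = {c: (c if n >= n_star else "CCS:" + cpt_to_ccs[c])
              for c, n in cpt_counts.items()}
    level1_counts = Counter()
    for c, n in cpt_counts.items():
        level1_counts[level1[c]] += n
    return {c: (g if level1_counts[g] >= n_star else "other")
            for c, g in level1.items()}

def cross_validated_c(codes, outcomes, cpt_to_ccs, n_star, k=10, seed=0):
    """Average held-out c-statistic over k random folds."""
    idx = list(range(len(codes)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    c_values = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # (1) aggregate procedures using only the training folds
        group_map = fit_group_map([codes[i] for i in train], cpt_to_ccs, n_star)
        # (2) estimate the outcome incidence within each aggregated group
        events, totals = Counter(), Counter()
        for i in train:
            g = group_map[codes[i]]
            totals[g] += 1
            events[g] += outcomes[i]
        overall = sum(events.values()) / len(train)
        # (3) score the held-out fold with the group incidences
        scores = []
        for i in held_out:
            g = group_map.get(codes[i])
            scores.append(events[g] / totals[g] if g in totals else overall)
        held_y = [outcomes[i] for i in held_out]
        if 0 < sum(held_y) < len(held_y):  # c-statistic must be defined
            c_values.append(c_statistic(scores, held_y))
    return sum(c_values) / len(c_values)
```

Calling `cross_validated_c` once per candidate value and plotting the results against the candidates would give the raw curve that the smoothing step is then applied to.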

### Postaggregation Assignment of PSS

Once N* was selected using the cross-validation routine described in the previous section (again, specific to each outcome), the incidence of the outcome for each aggregated group was estimated using the entire training cohort. Then the PSS (our measure of univariable procedure risk) was defined for 30-day mortality as the twelfth root of these estimated incidences (which we refer to as PSS-mortality). Because these incidence estimates were heavily skewed to the right (*i.e.* , most procedures are associated with low risk of the outcome, but a few procedures are associated with a larger risk), this twelfth-root transformation was warranted to model a linear relationship between PSS mortality and log-odds of mortality. With methodology akin to that used to obtain our PSS mortality score, we defined PSS morbidity using the cube-root transformation of the aggregated group-specific incidence observed among the training cohort.
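The score assignment can be illustrated with a short Python sketch. It assumes group labels have already been assigned to each patient. The function name and arguments are hypothetical; only the roots (12 for mortality, 3 for morbidity) come from the text.

```python
from collections import Counter

def pss_table(groups, outcomes, root):
    """Map each aggregated group to its Procedural Severity Score.

    groups   : aggregated group label for each training patient
    outcomes : 0/1 outcome indicator for each training patient
    root     : 12 for the mortality score, 3 for the morbidity score
    """
    events, totals = Counter(), Counter()
    for g, y in zip(groups, outcomes):
        totals[g] += 1
        events[g] += y
    # PSS = (observed incidence in the group) ** (1 / root)
    return {g: (events[g] / totals[g]) ** (1.0 / root) for g in totals}
```

The root transformation spreads out the many small incidences: with the twelfth root, incidences of 0.1% and 10% map to scores of roughly 0.56 and 0.83, respectively. A group with no observed events receives a score of 0 under this sketch.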

### Multivariable Modeling with PSS and Other Patient Characteristics

With the PSS now defined, we proceeded to build our overall risk models combining procedural and patient characteristics as follows: First, multivariable logistic regression was used with PSS and all baseline risk factors showing potential predictive ability as determined by relatively large values of test statistics for univariable association with the outcome (*e.g.*, univariable chi-square test statistic greater than 1,000 for categorical predictors). In this “full” model, the relationship between PSS and the log-odds of the outcome was modeled linearly, whereas restricted cubic splines with four equally spaced knots were used for all other continuous risk factors to allow for potential nonlinearities in the estimated relationships. C-statistics for this model (and their associated 95% CI, estimated using normal approximation theory for proportions), applied to both the training and validation data sets, were then estimated. As mentioned, the utility of such a full model suffers from difficulties associated with collecting data on many baseline risk factors. Thus, we considered a “reduced” model, which incorporated a limited number of selected risk factors with the goal of minimizing the reduction in the model c-statistics.

### Validation of Fitted Models

Our definition of the composite morbidity outcome differs slightly from that which was used to define the morbidity index currently available in the ACS-NSQIP database (for instance, we include mortality as one of the individual outcomes comprising composite morbidity, whereas the ACS-NSQIP morbidity index does not). Furthermore, the morbidity and mortality indices in the ACS-NSQIP database are estimated for only general and vascular surgery patients, and these indices supplement baseline and procedural risk factors with preoperative laboratory results. Nonetheless, we compared c-statistics arising from our models to those from the ACS-NSQIP indices for general and vascular surgery patients.

Calibration (the agreement between the predicted probabilities from a model and the observed outcome proportions) was assessed for our final models for mortality and morbidity within our validation cohort. First, we divided the validation cohort patients into groups based on their predicted probabilities of the outcome; then, we estimated group-specific ratios of observed-to-expected outcome counts (O/E ratios), using a Poisson log-linear model. Groups were selected to reflect clinically meaningful strata of risk. The expected outcome count within each group was the sum of the predicted probabilities of its members. A global Wald test of all O/E ratios equal to 1 was performed with a significance criterion of 0.05, and Bonferroni-adjusted 95% simultaneous CIs were estimated for each O/E ratio.
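The point estimates of the O/E ratios can be sketched in Python as below. This is a simplified illustration: the function name and the cut points passed to it are hypothetical, and the sketch returns only the ratio of observed events to summed predicted probabilities in each stratum. It does not reproduce the Poisson log-linear model, the global Wald test, or the Bonferroni-adjusted intervals.

```python
from bisect import bisect_left

def observed_to_expected(pred_probs, outcomes, cutpoints):
    """O/E ratio for each risk stratum defined by the cut points.

    A patient whose predicted probability p satisfies
    cutpoints[i-1] < p <= cutpoints[i] falls into stratum i, so the
    last stratum holds patients with p above the largest cut point.
    """
    cutpoints = sorted(cutpoints)
    observed = [0] * (len(cutpoints) + 1)
    expected = [0.0] * (len(cutpoints) + 1)
    for p, y in zip(pred_probs, outcomes):
        s = bisect_left(cutpoints, p)
        observed[s] += y   # observed event count in the stratum
        expected[s] += p   # expected count = sum of predicted probabilities
    # None marks an empty stratum, for which the ratio is undefined.
    return [o / e if e > 0 else None for o, e in zip(observed, expected)]
```

For instance, passing `cutpoints=[0.10]` separates patients with more than 10% predicted probability from the rest; a ratio near 1 in that upper stratum indicates that observed and predicted event counts agree for high-risk patients.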

## Results

### Aggregation of Procedures

The optimal minimum group-specific sample size for the aggregation of procedures into groups was *N** = 77 for mortality (fig. 1) and *N** = 23 for morbidity. The resulting estimates of PSS mortality and PSS morbidity, as well as details about the aggregation of individual CPT codes into CCS procedure groups and of CCS procedure groups into an all-purpose other category, are given in Supplemental Digital Content 1, http://links.lww.com/ALN/A738, for each individual CPT code.

### Discrimination

In general, 30-day mortality was more accurately predicted than composite 30-day morbidity, as evidenced by the higher c-statistics among the randomly determined subset of 50,000 validation cases (table 1).

For mortality, PSS predicted the outcome better than did individual CPT codes (c-statistic [95% CI] 0.867 [0.859–0.876] *vs.* 0.847 [0.838–0.856]). Our reduced model for mortality included the following risk factors: PSS-mortality, ASA Physical Status, and age. This model resulted in a validation data c-statistic of 0.915 (0.906–0.924), whereas the ACS-NSQIP mortality score of Khuri *et al.,* as applied to our validation cohort, gave a c-statistic of 0.941 (0.931–0.950).14

Predictive ability for PSS-morbidity was similar to that observed for CPT codes (table 1). For this outcome, our reduced model incorporated PSS-morbidity, ASA Physical Status, and hospitalization (*i.e.*, inpatient *vs.* outpatient); this model produced a validation data c-statistic of 0.867 (0.858–0.876). No appreciable increase in predictive accuracy was observed when using the comparator ACS-NSQIP morbidity score of Daley *et al.* (C = 0.875 [0.866–0.884]), whose analysis incorporated more risk factors and was based on a subset of the noncardiac surgical population (whereas our models were developed and validated using all patients in the registry).15

Before incorporation of patient risk factors, the procedural information alone predicted both outcomes quite accurately. The PSS developed for 30-day mortality resulted in a univariable c-statistic (95% CI) of 0.867 (0.859–0.876; table 1), whereas the c-statistic for the PSS developed for the composite morbidity outcome was 0.839 (0.830–0.848); these c-statistics were comparable with or slightly higher (more discriminative) than the c-statistics achieved by using the individual CPT codes.

Combining patient demographic and morphometric information with the PSS increased the c-statistic for both outcomes. In the full model for 30-day mortality (which incorporated 24 total predictors), the c-statistic was 0.936 (0.927–0.945), which was comparable with the predictive ability achieved by the ACS-NSQIP mortality score of Khuri *et al.* (as applied to our validation cohort).14 Our reduced model for 30-day mortality (which included only PSS-mortality, ASA Physical Status, and age) predicted the outcome only marginally less accurately than did the models that incorporated many predictors. Similar results were obtained for the composite 30-day postoperative morbidity/mortality outcome; although overall this outcome was predicted less accurately than 30-day mortality, our reduced model performed as accurately as the full model and as accurately as the ACS-NSQIP morbidity score of Daley *et al.*15

### Calibration

Calibration of the two final models is summarized in table 2; the *P* value of the global Wald calibration test was 0.64 for 30-day mortality and 0.40 for 30-day morbidity, meaning that we did not find sufficient evidence indicating lack of agreement between model predictions and the observed outcomes. Specifically, the ratio of observed/expected number of deaths among patients with more than 10% predicted probability based on our reduced model for mortality (*i.e.*, high-risk patients) was 0.93 (0.81–1.06). Likewise, the ratio of observed/expected number of patients experiencing composite major morbidity among patients with more than 20% predicted probability based on our reduced model for morbidity was 0.99 (0.93–1.05).

### Obtaining Risk Estimates

Nomograms (graphic tools for obtaining regression model estimates) are provided for our reduced models (along with instructions) in figures 2 and 3. Within these nomograms, the axis length corresponding to each risk factor represents the relative degree of contribution of that risk factor in the model; thus, PSS mortality and PSS morbidity were the strongest predictors among those chosen for their respective models. ASA Physical Status, the next strongest predictor in each of the reduced models, displayed a relatively larger influence on the prediction for morbidity than it did for mortality.

We have developed an R package that contains functions for obtaining predictions for an entire data set; it is available on our Web site.

## Discussion

The models previously established for risk adjustment within the ACS-NSQIP database proved to be highly discriminative; consequently, the expected risk for participating general and vascular surgery patients is well-defined. Quality-of-care investigations that compare observed outcome rates to (model-based) expected rates among ACS-NSQIP providers appear to adequately account for differences in risk profiles of individual patient populations. However, external implementation of these models is severely hampered by their dependence on a multitude of patient and procedural risk factors, including many that are not generally available for each patient (such as preoperative laboratory measurements like serum creatinine). Our reduced models for 30-day mortality and morbidity each required only three pieces of data to obtain an estimate of risk and achieved excellent predictive accuracy for 30-day postoperative mortality (c-statistic 0.915 [0.906–0.924]) and morbidity (0.867 [0.858–0.876]). This level of discrimination was only marginally lower than that of the far more complicated ACS-NSQIP mortality14 and morbidity15 scores. Even in high-risk patients, observed rates calibrated well with estimated probabilities for mortality (O/E ratio: 0.93 [0.81–1.06]) and morbidity (0.99 [0.93–1.05]).

Quality improvement initiatives face several hurdles in clinical practice. Lack of confidence among physicians and hospitals in the validity of existing outcomes-based quality measures can impede successful clinical quality improvement.21 Specifically, there is a perception that currently available outcomes-based measures underrepresent risk associated with high-risk patients. In our models, the observed mortality rate among high-risk patients (as defined by more than 10% predicted probability of mortality) was between 19% less and 6% greater than that expected by our model, with 95% confidence. Similar results were observed for composite morbidity; using more than 20% predicted probability to characterize high-risk patients for this outcome, the observed morbidity rates were not more than 7% different from those expected by our model.

Our models suggest that ASA Physical Status, in conjunction with other variables, provides an accurate depiction of risk. Although ASA Physical Status previously has been shown to measure risk accurately, its implementation in quality-of-care monitoring has been limited. This is because of a perceived lack of availability and an ongoing concern that the scoring is subjective in nature.22–24 Although ASA Physical Status is not universally available in administrative datasets, it is a powerful parameter that condenses relevant clinical measures of patient risk and acuity into a single variable.

The integration of clinical parameters with administrative variables allows for the greatest opportunity to predict baseline risk in surgical patients. Each source, clinical or administrative, has its benefits and disadvantages. The ACS-NSQIP data set as a clinical registry is highly regarded for its data quality. Critics often point out a lack of detail for intraoperative or anesthesia-related variables, but outcome definitions are highly detailed, universally applied, and rigorously audited. Using data from clinical registries such as the ACS-NSQIP brings certain limitations compared with purely administrative data (mainly based on International Classification of Diseases, Ninth Revision codes) but also offers unique advantages, such as temporal relations. As Orkin states in a recent editorial, clinical data sets can be rich in detail but may lack uniform definitions and miss substantial amounts of data.25 Because administrative data are frequently used for billing purposes, data sets are more complete and uniform. However, diagnostic and procedure codes tend to lack the specificity needed to encode complex clinical scenarios and fail to distinguish temporal relationships.25 We believe one of the strengths of the presented model is achieving an accurate prediction of outcomes with the combined use of clinical and administrative data while using only the most relevant data points.

In addition to use in studies evaluating relative quality of care among providers and health systems, these models may be used to provide specific risk assessments to individual patients. However, we note that physicians should not use our estimates as a basis for decisions between or among procedures in a given patient because our estimates do not represent the (immeasurable) differential risk for individuals among procedures. Instead, they represent the expected risk among large groups of patients with similar characteristics undergoing a given procedure. Patients undergoing a specific procedure will be more homogeneous than a global population of surgical patients. Applying broadly defined quality improvement models to narrowly defined populations likely will result in reduced discriminative ability. Thus, specific risk quantification models for a given disease should be used, when they exist, for predicting individual outcomes. Such models have been developed, for example, for colon resections, esophagectomies in resectable esophageal cancer, and gastric bypass surgery.26–28

We used data on 585,265 patients available from the ACS-NSQIP database at the time of analysis to build our models. It is possible that incremental improvements in predictive accuracy may be achieved by including data from future years because more predictions specific to individual procedures will be possible. That is, more individual procedures will have enough patients to meet the minimum bucket size for PSS estimation, instead of being statistically grouped with other similar procedures. Furthermore, to the extent that providers vary in quality of care, our risk quantification models might be further improved by incorporating, say, the treating institution, as a predictor variable.

Our models displayed a high level of predictive accuracy on a randomly withheld validation cohort of 50,000 patients. But developing predictive models requires external validation as well as internal validation. To the extent that outcomes among participating hospitals differ from outcomes among nonparticipating hospitals at a given level of patient/surgical risk, our model may not calibrate well among the nonparticipating institutions. An external validation with complete 30-day outcomes on all hospitals would address this issue. However, identifying a suitable registry for such external validation will be challenging given the quality of the ACS-NSQIP data set and its specific definitions, outcomes, and scope.

Public reporting of provider outcome statistics is a central tenet of national health care quality improvement, and reported results will have considerable financial impact throughout the healthcare system. An obvious use of our prediction models is to risk-adjust estimates of overall performance indices (*i.e.*, observed-to-expected risk ratios) among institutions. Adjusted performance can then be compared across organizations to assess and improve quality of care.

In summary, we developed risk quantification models for 30-day mortality and composite major morbidity using novel severity scoring methodology for each primary CPT code. These models use just three readily available patient and procedural characteristics without appreciably sacrificing predictive accuracy compared with models that require far more information, including information that generally is not available for most patients. Furthermore, these models remain accurate even for high-risk patients. Thus, they can be used to establish accurately the expected risk in various surgical populations for the purpose of fairly comparing quality-of-care among American healthcare institutions.