Abstract
Risk stratification is essential for both clinical risk prediction and comparative audit. There are a variety of risk stratification tools available for use in major noncardiac surgery, but their discrimination and calibration have not previously been systematically reviewed in heterogeneous patient cohorts.
Embase, MEDLINE, and Web of Science were searched for studies published between January 1, 1980 and August 6, 2011 in adult patients undergoing major noncardiac, nonneurological surgery. Twenty-seven studies evaluating 34 risk stratification tools were identified which met inclusion criteria. The Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality and the Surgical Risk Scale were demonstrated to be the most consistently accurate tools that have been validated in multiple studies; however, both have limitations. Future work should focus on further evaluation of these and other parsimonious risk predictors, including validation in international cohorts. There is also a need for studies examining the impact that the use of these tools has on clinical decision making and patient outcome.
Accurate prediction of perioperative risk is an important goal, both to enable informed consent for patients undergoing surgery and to guide clinical decision making in the perioperative period. In addition, by adjusting for risk, an accurate risk stratification tool enables meaningful comparison of surgical outcomes between providers for service evaluation or clinical audit. Some risk stratification tools have been incorporated into clinical practice, and indeed, have been recommended for these purposes.1
Risk stratification tools may be subdivided into risk scores and risk prediction models. Both are usually developed using multivariable analysis of risk factors for a specific outcome.2 Risk scores assign a weighting to factors identified as independent predictors of an outcome, with the weighting for each factor often determined by the value of the regression coefficient in the multivariable analysis. A higher sum of the weightings in the risk score then reflects higher risk. Risk scores have the advantage that they are simple to use in the clinical setting. However, although they may score a patient on a scale on which other patients may be compared, they do not provide an individualized risk prediction of an adverse outcome.3 Examples of risk scores are the American Society of Anesthesiologists’ Physical Status score (ASA-PS)4 and the Lee Revised Cardiac Risk Index.5
By contrast, risk prediction models estimate an individual probability of risk for a patient by entering the patient’s data into the multivariable risk prediction model. Although risk prediction models may be more accurate predictors of an individual patient’s risk than risk scores, they are more complex to use in the day-to-day clinical setting.
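The distinction between an additive risk score and a risk prediction model can be sketched in a few lines of Python. This is only an illustration: the intercept, the three risk factors, and their coefficients are hypothetical and are not taken from any published tool, and dividing each coefficient by the smallest one is just one common way of deriving integer weightings.

```python
import math

# Hypothetical coefficients for illustration only; they are not taken from
# any published model or from the studies in this review.
INTERCEPT = -5.0
COEFFICIENTS = {
    "age_over_70": 0.8,
    "emergency_surgery": 1.2,
    "heart_failure": 0.4,
}


def points_from_coefficients(coefficients):
    """Turn regression coefficients into integer points for an additive score.

    Each coefficient is divided by the smallest one and rounded, which is a
    common (but not the only) way of deriving score weightings.
    """
    smallest = min(abs(b) for b in coefficients.values())
    return {name: round(b / smallest) for name, b in coefficients.items()}


def risk_score(patient, points):
    """Additive risk score: the sum of points for the factors present."""
    return sum(points[name] for name, present in patient.items() if present)


def predicted_probability(patient, intercept, coefficients):
    """Risk prediction model: individual probability from the logistic equation."""
    logit = intercept + sum(
        coefficients[name] for name, present in patient.items() if present
    )
    return 1.0 / (1.0 + math.exp(-logit))


patient = {"age_over_70": True, "emergency_surgery": True, "heart_failure": False}
points = points_from_coefficients(COEFFICIENTS)
score = risk_score(patient, points)
probability = predicted_probability(patient, INTERCEPT, COEFFICIENTS)
print(f"score = {score}, predicted probability = {probability:.3f}")
```

The score places the patient on an ordinal scale that can be added up at the bedside, whereas the model returns an individual probability but requires the full equation to be evaluated.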
Despite increasing interest in more sophisticated risk prediction methods, such as the measurement of functional capacity by exercise testing,6 risk stratification tools remain the most readily accessible option for this purpose. However, clinical experience tells us that they are not commonly used in everyday practice. Lack of use may be due to poor awareness amongst clinicians of the available options and concerns regarding their complexity and accuracy.7 In other clinical settings, low uptake of risk stratification tools has been ascribed to a lack of clarity on the precision of available tools, resulting from perhaps unnecessary efforts to make minor refinements to existing methods, or to developing novel methods, with the aim of achieving greater predictive accuracy.8
To summarize the available risk stratification tools in perioperative care, and to make recommendations about which methods are appropriate for use both in clinical practice and in research, we undertook a qualitative systematic review of the available evidence. The specific question we sought to answer was “What is the performance of risk stratification tools, validated for morbidity and/or mortality, in heterogeneous cohorts of surgical (noncardiac, nonneurological) patients?” The review had three main objectives: to summarize the available risk prediction methods, to report on their performance, and to comment on their strengths and weaknesses, with particular focus on accuracy and ease of application.
Materials and Methods
Previously published standards for reporting systematic reviews of observational studies were adhered to when undertaking this study.9 A Preferred Reporting Items for Systematic reviews and Meta-analyses checklist10 was used in the preparation of this report (appendix 1).
Definitions for the Purposes of This Study
A “risk stratification tool” was defined as a scoring system or model used to predict or adjust for either mortality or morbidity after surgery, and which contained at least two different risk factors. “Major surgery” was defined as a procedure taking place in an operating theatre and conducted by a surgeon; thus, studies of cohorts of patients undergoing endoscopic, angiographic, dental, and interventional radiological procedures were excluded. A “heterogeneous patient cohort” was defined as a cohort of patients including at least two different surgical specialities. Studies of gastrointestinal surgery, which included hepatobiliary surgery, were included. We excluded studies that consisted entirely of cohorts undergoing ambulatory (day case) surgery and cohorts that included cardiac or neurological surgery.
Search Strategy and Study Eligibility
A search for articles published between January 1, 1980 and August 6, 2011 was undertaken using MEDLINE, Embase, and Web of Science. No language restriction was applied. The search strategy and inclusion and exclusion criteria are detailed in appendix 2. Of note, articles reporting development studies were excluded, unless the article included validation in a separate cohort.
Data Extraction and Quality Assessment of Studies
Data extraction was independently undertaken by Drs. Moonesinghe and Das, using standardized tables relating to the study characteristics, quality, and outcomes. Where there was disagreement in the data extraction between these two authors, Dr. Moonesinghe resolved the query by referring again to the original articles. Study characteristics extracted from each article included the number of patients, the country where the study was conducted, the outcome measures and endpoints of each study, and the risk stratification tools being assessed. Data were also extracted regarding the most detailed description of the types of surgery included in each study cohort reported in the articles. We also extracted clinical outcome data (morbidity and mortality) for the cohorts in each study.
Assessment of study quality was based on the framework for assessing the internal validity of articles dealing with prognosis developed by Altman.11,12 The following criteria were used: the number of patients included in analyses, whether the study was conducted on a single or multiple sites, the timing of data collection (prospective vs. retrospective), whether a description of baseline characteristics for the cohort was included (including comorbidities, type of surgery, and demographic data), and selection criteria for patients included in the study (to assess for selection bias). Selection bias was judged to be present if a study restricted the type of patient who could be enrolled based on age, ethnicity, sex, premorbid condition, urgency of surgery, or postoperative destination (e.g., critical care). In addition, we reported the setting of each validation study—i.e., whether the validation was conducted in a split sample of the original development cohort or whether the validation cohort was entirely different from that in which the tool was developed.13 Finally, as a measure of their clinical usability and reproducibility, we reported whether each risk stratification tool used variables which were objective (e.g., blood results), subjective (e.g., chest radiograph interpretation), or both.14
Data Analysis and Statistical Considerations
The performance of each risk stratification tool was evaluated using measures of discrimination and, where appropriate, calibration. Discrimination (how well a model or score distinguishes patients who experience a particular outcome from those who do not) was reported using either the area under the receiver operating characteristic curve (AUROC) or the concordance (c-) statistic. We considered an AUROC of less than 0.7 to indicate poor performance, 0.7–0.9 to be moderate, and greater than 0.9 to reflect high performance.15 Calibration is defined as how well the prognostic estimation of a model matches the probability of the event of interest across the full range of outcomes in the population being studied. Where reported, either Hosmer–Lemeshow or Pearson chi-square statistics were extracted as an evaluation of calibration; a P value of more than 0.05 was taken to indicate that there was no evidence of lack-of-fit.
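As an illustration of the discrimination measure, the following Python sketch computes the c-statistic directly from its definition and applies the poor, moderate, and high categories described above. The function names and the six-patient data set are made up for the example; this is not code used in the review.

```python
def auroc(predictions, outcomes):
    """c-statistic: the probability that a randomly chosen patient with the
    outcome has a higher prediction than a randomly chosen patient without it.
    Tied predictions count as half."""
    events = [p for p, y in zip(predictions, outcomes) if y == 1]
    non_events = [p for p, y in zip(predictions, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one patient with and one without the outcome")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def discrimination_category(value):
    """Thresholds used in this review: <0.7 poor, 0.7-0.9 moderate, >0.9 high."""
    if value < 0.7:
        return "poor"
    if value <= 0.9:
        return "moderate"
    return "high"


# Made-up example data: predicted risk (or score) and observed outcome (1 = died).
predictions = [0.02, 0.05, 0.10, 0.20, 0.40, 0.60]
outcomes = [0, 0, 1, 0, 1, 1]
value = auroc(predictions, outcomes)
print(f"AUROC = {value:.2f} ({discrimination_category(value)})")
```

Because only the ranking of predictions matters, the same calculation applies to ordinal scores such as the ASA-PS as well as to predicted probabilities.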
Results
Search Results
In the initial search, 139,775 articles on MEDLINE and 71,841 on Embase were listed, and the titles and abstracts of these were screened to identify articles which described risk stratification tools used in any adult noncardiac, nonneurological surgery. Seven hundred fifty-one articles then underwent detailed review. Hand searching of reference lists and citations identified a further 432 studies which were also reviewed in detail.
Three studies were identified that graphically displayed receiver operating characteristic curves in their results but did not report AUROCs.16–18 The authors of these studies were contacted for additional information; none responded, so these studies were excluded from the analysis. Six foreign language studies, which may have been eligible for inclusion based on review of the abstracts, but for which we were unable to obtain translations, were also omitted from the analysis.19–24 The flow chart for the review is detailed in figure 1.
A total of 27 studies evaluating 34 risk stratification tools were included in the analysis. All were cohort studies. Eight tools were validated in multiple studies; the most commonly reported were the ASA-PS (four studies, total number of patients, n = 4,014), the Acute Physiology and Chronic Health Evaluation II (APACHE II) scoring system (four studies, n = 5,897), the Physiological and Operative Severity Score for the enUmeration of Mortality and Morbidity (POSSUM; three studies, n = 2,915), the Portsmouth variation of POSSUM (P-POSSUM; five studies, n = 10,648; mortality model only), the Surgical Risk Scale (three studies, n = 5,244; mortality model only), the Surgical Apgar Score (three studies, n = 10,795), the Charlson Comorbidity Index (two studies, n = 2,463,997), and the Donati Surgical Risk Score (two studies, n = 7,121). The accuracy of a further 26 tools was evaluated in single-validation studies. A comparison of tools that were validated in multiple studies is detailed in tables 1 and 2. The general characteristics of all included studies are summarized in table 3.
Quality Assessment
The quality assessment of included studies is summarized in table 3. Seven studies were multicenter and 21 were single center. The data collection was prospective in 19 studies, retrospective in 7, and based on administrative data in 2 studies. Sixteen studies used mortality as an outcome measure, four used morbidity, and eight used both. The study endpoints included 30-day outcome in 12 articles, hospital discharge in 15 articles, and 3 articles also included shorter or longer follow-up times ranging from 1 day to 1 yr. Nineteen studies of the total 28 reported baseline patient characteristics of physiology or comorbidity, surgery, and demographics; selection bias was evident in 12 studies.
Outcomes Reporting
Outcomes are summarized in table 4. Surgical mortality at 30 days varied between 1.25 and 12.2% and at hospital discharge between 0.8 and 24.7%.
All but one25 of the six studies which separately tested the discrimination of stratification tools for morbidity and mortality reported that morbidity prediction was less accurate. There was considerable heterogeneity in the definition of morbidity in the 12 studies that reported this outcome (see appendix 3 for summary), and in keeping with this, there was wide variation in complication rates in different studies (between 6.726 and 50.4%).25
Calibration
Calibration was poorly reported: 16 studies did not report calibration at all; of the remaining 11 articles, 2 reported only whether the models were of “good fit,” without reporting the appropriate statistics. One article did not report calibration in its results, despite stating in the methods that it would be calculated.27
Risk Stratification Tools Using Preoperative Data Only
Four entirely preoperative risk stratification tools (ASA-PS, Surgical Risk Scale, Surgical Risk Score, and the Charlson Comorbidity Index) were validated in multiple studies. The Surgical Risk Scale and the Surgical Risk Score both contain the ASA-PS, and the urgency and severity of surgery; both have also been multiply validated. The Surgical Risk Score28,29 was developed and originally validated in Italy29 and contains the ASA-PS, a 3-point scale modification of the Johns Hopkins surgical severity criteria and a binary definition of surgical urgency (elective vs. emergency). The only published study evaluating the Surgical Risk Score after its initial validation found it to be poorly predictive of inpatient mortality.28 The Surgical Risk Scale30–32 uses the ASA-PS alongside United Kingdom definitions of operative urgency (a 4-point scale defined by the United Kingdom National Confidential Enquiry into Postoperative Death and Outcome) and severity (the British United Provident Association classification which is used to rank surgical procedures for the purposes of financial billing in the private sector). Both studies validating this system after its initial development found it to be a moderately discriminant tool (AUROC >0.8).30,32
A further 18 different risk stratification tools using solely preoperative data were validated in single publications. Several of these were originally derived and validated for purposes other than the prediction of generic morbidity and mortality: these include cardiac risk prediction scores,27,32,33 measures of nutritional status,34 and frailty indices.27 These tools are described in appendix 4.
Risk Stratification Tools Incorporating Intra- and Postoperative Data
The POSSUM and P-POSSUM scores were the most frequently used tools in heterogeneous surgical cohorts. The POSSUM score was derived by multivariable logistic regression analysis and contains 18 variables, of which 12 were measured preoperatively and 6 at hospital discharge; two separate equations, for morbidity and mortality, were developed and validated.17,35 After recognition that the POSSUM model overpredicted adverse outcome, the Portsmouth variation (P-POSSUM) was developed to predict mortality, using the same composite variables but a different calculation.36 P-POSSUM has been used in a larger number of more recent studies28–30,32,37 than the original POSSUM25,29,30 and has been found to be of moderate to high discriminant accuracy (AUROC varying between 0.68 and 0.92) with the exception of one Australian study.37
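The sense in which P-POSSUM uses the same composite variables but a different calculation can be sketched as below. The equations are not given in this review; the coefficients are those commonly quoted for the original POSSUM mortality equation and for P-POSSUM, and should be treated as an assumption to be checked against the original publications before any use. The function names are ours.

```python
import math


def _logistic(logit):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


def possum_mortality(physiological_score, operative_score):
    # Coefficients as commonly quoted for the original POSSUM mortality
    # equation (an assumption here; not stated in this review).
    return _logistic(-7.04 + 0.13 * physiological_score + 0.16 * operative_score)


def p_possum_mortality(physiological_score, operative_score):
    # Coefficients as commonly quoted for the Portsmouth variation
    # (an assumption here; not stated in this review).
    return _logistic(-9.065 + 0.1692 * physiological_score + 0.1550 * operative_score)


# Lowest possible scores: 12 physiological variables and 6 operative
# variables, each scoring at least 1 point.
lowest_possum = possum_mortality(12, 6)
lowest_p_possum = p_possum_mortality(12, 6)
print(f"POSSUM: {lowest_possum:.2%}, P-POSSUM: {lowest_p_possum:.2%}")
```

Under these assumed coefficients, the original equation cannot return a predicted mortality much below about 1% even for the fittest patient undergoing the most minor procedure, which is consistent with the overprediction in low-risk patients described above.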
Medical Risk Prediction Tools Adapted for Surgical Risk Stratification
Two multiply validated risk stratification tools, APACHE II38 and the Charlson Index,39 were developed for the purposes of risk adjustment and prediction in nonsurgical settings. APACHE II was developed in 1985 as a tool for predicting hospital mortality in patients admitted to critical care; the score consists of 12 physiological variables and an assessment of chronic health status. Its use in surgical patients has face validity, as APACHE II is a summary measure of acute physiology and chronic health, both of which may influence surgical outcome. Only one of the four studies reporting the APACHE II score’s predictive accuracy used it in the way originally intended: by incorporating the most deranged physiological results within 24 h of critical care admission.40
The Charlson comorbidity score was developed to predict 10-yr mortality in medical patients.39 A combined age-comorbidity score was subsequently validated for the prediction of long-term mortality in a population of patients who had essential hypertension or diabetes and were undergoing elective surgery.41 It is the original Charlson score, however, which is used in two studies identified in our search to stratify risk of short-term outcome.42,43 These two studies reported very different predictive accuracy for the Charlson score; however, the largest single study included in this entire review found the Charlson score (measured using administrative data) to be a moderately accurate tool.44
Discussion
The purpose of this systematic review was to identify all risk stratification tools that have been validated in heterogeneous patient cohorts, and to report and summarize their discrimination and calibration. We found a plethora of instruments that have been developed and validated in single studies, which unfortunately limits any assessment of their usefulness and generalizability. A smaller number of tools that could be used universally for perioperative risk prediction have been multiply validated; of these, the P-POSSUM and Surgical Risk Scale have been demonstrated to be the most consistently accurate systems.
Risk Stratification Tools in Practice: Complexity versus Parsimony
There are two key considerations when assessing the clinical utility of the various risk stratification tools reviewed in our study. First, what level of predictive accuracy is fit for the purposes of risk stratification? Second, what is the likelihood that each of the described instruments may be used in everyday practice by clinicians? Although the answer to the first question may be to aim as “high” (accurate) as possible, this must also be balanced against the issues raised by the second question. Risk models incorporating over 30 variables may be highly accurate but are less likely to be routinely incorporated into preoperative assessment processes than scores of similar performance that use only a few data points. Furthermore, clinical experience tells us that the clinician is less likely to use complex mathematical formulae, as opposed to additive scores, when attempting to risk stratify patients at the bedside or in the preoperative clinic.1
P-POSSUM
The P-POSSUM model was developed in the United Kingdom and has since been validated in Japan, Australia, and Italy. Although this is the most frequently and widely validated model identified by our study, it has some limitations. First, it includes both preoperative and intraoperative variables, and therefore cannot be used for preoperative risk prediction. Second, several of the variables are subjective (e.g., chest radiograph interpretation), carrying the risk of measurement error. Third, in common with the original POSSUM, the P-POSSUM tends to overestimate risk in low-risk patients. Fourth, it contains 18 variables, which must be entered into a regression equation to obtain a predicted percentage risk value, and clinicians may not wish to use such a complex system. Finally, the inclusion of intraoperative variables, particularly blood loss, which may be influenced by surgical technique, runs the risk of concealing poor surgical performance, thereby jeopardizing its face validity as a risk adjustment model for comparative audit of surgeons or institutions.
Surgical Risk Scale
The Surgical Risk Scale consists entirely of variables that are available before surgery, making it a useful tool for preoperative risk stratification for the purposes of clinical decision making. However, there are also some limitations. First, it incorporates the ASA-PS, which may be subject to interobserver variability and therefore measurement error.44–46 Second, the surgical severity coding is not intuitive, and some familiarity with the British United Provident Association system would be required for bedside estimation, unless a reference manual was available. Finally, it has only been validated in single-center studies within the United Kingdom; therefore, its generalizability to patient populations in the United States and worldwide is unknown.
Other Options
The ASA-PS is widely used as an indicator of whether or not a patient falls into a high-, medium-, or low-risk population, but it was not originally intended to be used for the prediction of adverse outcome in individual subjects.4 It is perhaps surprising that the ASA-PS was reported as having good discrimination for predicting postoperative mortality, as it is a very simple scoring system, which has been demonstrated to have only moderate to poor interrater reliability.44–47 Nevertheless, the ASA-PS has face validity as an assessment of functional capacity, which is increasingly thought to be a significant predictor of patient outcome, as demonstrated by more sophisticated techniques such as cardiopulmonary exercise testing.48 Although it is possible that this provides some explanation for the high discriminant accuracy for ASA-PS found in this systematic review, it is possible that publication bias, favoring studies with “positive” results, may also be a factor.
The Biochemistry and Hematology Outcome Model is a parsimonious version of POSSUM, which omits the subjective variables such as chest radiography and electrocardiogram results. It also has the advantage of consisting of variables which are all available preoperatively, with the exception of operative severity. Given the Biochemistry and Hematology Outcome Model’s similarity in predictive accuracy to P-POSSUM in the one study we identified that made a direct comparison,32 this system warrants further evaluation. Finally, the Identification of Risk In Surgical patients score was developed in The Netherlands and consists of four variables (age, acuity of admission, acuity of surgery, and severity of surgery). In the study that developed and validated it on separate cohorts, the validation AUROC was 0.92.49 Again, further investigation of this simple system would be useful.
Generalizability of Findings
Clinical and Methodological Heterogeneity.
Clinical heterogeneity (both within- and between-cohort patient heterogeneity) and methodological heterogeneity (between-study differences in the outcome measures used) are both likely to have had a significant influence on some of our findings. For example, between-cohort heterogeneity, and variation in how morbidity is defined (appendix 3), may explain the wide range of morbidity rates reported in different studies. Heterogeneity of morbidity definitions may also in part explain the lower accuracy of models for predicting morbidity compared with mortality. In addition, our study included all populations of patients who were determined to be heterogeneous, using the definitions described in our methods. However, the degree of heterogeneity varied among studies, including whether or not patients of all surgical urgency categories were included, and this may have affected the predictive accuracy of models in different studies.
Objective versus Subjective Variables and Issues Surrounding Data Collection Methodology.
The variables included in risk stratification tools may be classified as objective (e.g., biochemistry and hematology assays), subjective (e.g., interpretation of chest radiographs), and patient-reported (e.g., smoking history). In some clinical settings, the reliability of nonobjective data may be questionable; for example, previous reports have demonstrated significant interrater variation in the interpretation of both chest radiographs50 and electrocardiograms.51 Patients may also under- or overestimate various elements of their clinical or social history when questioned in the hospital setting. Despite these concerns, the discrimination of predictors incorporating patient-reported and subjective variables was high in the studies included. This may be due to publication bias; it may also be explained by the fact that in all of these studies, data were collected prospectively by trained staff. Previous work has demonstrated an association between interobserver variability in the recording of risk and outcome measures, and the level of training that data collection staff have received.52 These caveats are important when considering the generalizability of our findings to the everyday clinical setting, where data reporting and interpretation may be conducted by different types and grades of clinical staff. Finally, concerns have also been raised over the clinical accuracy of administrative data used for case-mix adjustment purposes.53,54 However, one large study included in our review43 showed high discriminant performance when using International Classification of Diseases 9 and 10 administrative coding data to define the Charlson Index variables.
Limitations of This Study
This study has a number of limitations. First, the focus was on studies that measured the discrimination and/or calibration of risk stratification tools in cohorts that were heterogeneous in terms of surgical specialities; therefore, a large number of single-speciality cohort studies identified in the search were excluded from the analysis.
Second, although the inclusion criteria for our review ensured that a standard measure of discrimination was reported (AUROC or c-statistic), many studies did not report measures of calibration. However, in a systematic review such as this, calibration may be seen to be a less important measure of goodness-of-fit than discrimination for a number of reasons. Calibration can only be used as a measure of performance for models that generate an individualized predicted percentage risk of an outcome (e.g., the POSSUM systems) as opposed to summative scores, which use an ordinal scale to indicate increasing risk (e.g., the ASA-PS). Calibration drift is likely to occur over time and will be affected by changes in healthcare delivery; good calibration in a study over 30 yr ago may be unlikely to correspond to good calibration today.55,56 Although such calibration drift may affect the usefulness of a model for predicting an individual patient’s risk of outcome, poorly calibrated but highly discriminant models will still be of value for risk adjustment in comparative audit. Finally, the probability of the Hosmer–Lemeshow statistic being significant (thereby indicating poor calibration) increases with the size of the population being studied.57 This may explain why many of the large high-quality studies we evaluated did not report calibration or reported that calibration was poor.
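The dependence of the Hosmer–Lemeshow statistic on sample size can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the cohort is synthetic, observed risk is set 10% higher (in relative terms) than predicted risk at every level, patients are split into the conventional 10 groups, and the P value uses a closed-form shortcut that is valid only for even degrees of freedom.

```python
import math


def hosmer_lemeshow(predicted, observed, groups=10):
    """Return (statistic, degrees of freedom) for the Hosmer-Lemeshow test.

    Patients are sorted by predicted risk and split into `groups` bins of
    near-equal size; observed and expected event counts are compared per bin.
    """
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        events = sum(y for _, y in chunk)
        mean_p = expected / size
        statistic += (events - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return statistic, groups - 2


def chi_square_p_value_even_df(statistic, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this shortcut only handles positive, even degrees of freedom")
    half = statistic / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


def miscalibrated_cohort(patients_per_level):
    """Synthetic cohort: observed risk is 10% higher (relative) than predicted."""
    predicted, observed = [], []
    for percent in range(5, 55, 5):  # predicted risks of 5%, 10%, ..., 50%
        events = 11 * percent * patients_per_level // 1000
        predicted += [percent / 100.0] * patients_per_level
        observed += [1] * events + [0] * (patients_per_level - events)
    return predicted, observed


# Same degree of miscalibration, two cohort sizes (2,000 and 20,000 patients).
for size in (200, 2000):
    predicted, observed = miscalibrated_cohort(size)
    statistic, df = hosmer_lemeshow(predicted, observed)
    p_value = chi_square_p_value_even_df(statistic, df)
    print(f"n = {10 * size}: statistic = {statistic:.1f}, P = {p_value:.3g}")
```

In this synthetic example the statistic scales with cohort size, so an identical degree of miscalibration gives no evidence of lack-of-fit (P > 0.05) in the smaller cohort but a significant result in the larger one.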
Third, by using the AUROC as the sole measure of discrimination, a number of studies were excluded, particularly earlier articles that used correlation coefficients between risk scores and postoperative outcomes. This was felt to be necessary, as a uniform outcome measure provides clarity to the reader. Fourth, publication bias, where studies are preferentially submitted and accepted for publication if the results are positive, is likely to be a particular problem in cohort studies. Finally, despite an extensive literature search, it is possible that some studies which would have been eligible for inclusion may have been missed. Multiple strategies have been used to prevent this; however, in a review of this size, it is possible that a small number of appropriate articles may have been omitted.
Future Directions
Undertaking clinical risk prediction should be a key tenet of safe high-quality patient care: it facilitates informed consent and enables the perioperative team to plan their clinical management appropriately. Equally, accurate risk adjustment is required to enable meaningful comparative audit between teams and institutions, to facilitate quality improvement for patients and providers. Although we identified dozens of scores and models which have been used to predict or adjust for risk, very few of these achieved the aspiration of being derived from entirely preoperative data, and of being accurate, parsimonious, and simple to implement. The Surgical Risk Scale is the system that comes closest to achieving these goals; the P-POSSUM score is more accurate, but its value is limited by the fact that some of the variables are only available after surgery has been completed. Future work which might be of value would include further comparison of the Surgical Risk Scale, P-POSSUM, and objective models such as the Biochemistry and Hematology Outcome Model in international multicenter cohorts and further investigation of models which combine novel variables such as measures of functional capacity, nutritional status, and frailty.
There is another possible approach. The American College of Surgeons’ National Surgical Quality Improvement Program was created in the 1990s to facilitate risk-adjusted surgical outcomes reporting in Veterans’ Affairs hospitals, and now also includes a number of private sector institutions. Risk adjustment models are produced annually, and observed-to-expected ratios of surgical outcomes are reported back to institutions and surgical teams to facilitate quality improvement. This organization has published a number of risk calculators to help clinicians to provide informed consent and plan perioperative care. However, none of these calculators have been included in our review, as they have all been developed and validated for use either in specific types of surgery (e.g., pancreatectomy,58 bariatric,59,60 or colorectal60 surgery) or for specific outcomes (e.g., cardiac morbidity and mortality).61 A parsimonious, entirely preoperative National Surgical Quality Improvement Program model for predicting mortality in heterogeneous cohorts would be of value in the United States; its validation in international multicenter studies would also be a worthwhile endeavor.
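An observed-to-expected ratio of the kind used in risk-adjusted outcomes reporting can be sketched as follows. This is a minimal illustration with made-up data for two hypothetical providers; it is not the method of any specific program, and the predicted risks are assumed to come from some previously validated model.

```python
def observed_to_expected(outcomes, predicted_risks):
    """O/E ratio: observed events divided by the sum of predicted probabilities."""
    expected = sum(predicted_risks)
    if expected == 0:
        raise ValueError("expected number of events is zero")
    return sum(outcomes) / expected


# Made-up data: both hypothetical providers have 2 deaths in 10 patients
# (the same crude mortality), but provider A treats higher-risk patients.
outcomes_a = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
risks_a = [0.40, 0.10, 0.05, 0.05, 0.50, 0.20, 0.10, 0.20, 0.30, 0.10]
outcomes_b = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
risks_b = [0.10] * 10

ratio_a = observed_to_expected(outcomes_a, risks_a)
ratio_b = observed_to_expected(outcomes_b, risks_b)
print(f"provider A O/E = {ratio_a:.2f}, provider B O/E = {ratio_b:.2f}")
```

A ratio near 1 indicates that outcomes match what the model expects for that case mix; a ratio above 1 indicates more events than expected. The comparison is only as good as the discrimination and calibration of the underlying model.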
Finally, although there are multiple studies aimed at developing and validating risk stratification tools, we do not know how widely such tools are used. Use of mobile technology, such as apps to enable risk calculation using complex equations at the bedside, might increase the use of accurate risk stratification tools in day-to-day practice. Importantly, in surgical outcomes research, there is an absence of impact studies, measuring the effect of using risk stratification tools on clinician behavior, patient outcome, and resource utilization. Randomized, controlled trials to evaluate impact, further validation of existing models across healthcare systems, and establishing the infrastructure required to facilitate such work, including the routine data collection of risk and outcome data, should be of the highest priority in health services research into surgical outcome.62
The authors thank Judith Hulf, F.R.C.A., Past President, Royal College of Anaesthetists, London, United Kingdom.
Appendix 1. Preferred Reporting Items for Systematic reviews and Meta-analyses Checklist12

Appendix 2. Search Strategy
MEDLINE
Risk adjustment.mp. or exp Health Care Reform/or exp Risk Adjustment/or exp “Outcome Assessment (Health Care)”/or exp Models, Statistical/or exp Risk/OR exp Risk Assessment/or risk prediction.mp. or exp Risk/or exp Risk Factors/OR predictive value of tests.mp. or exp “Predictive Value of Tests”/OR exp Prognosis/or risk stratification.mp. OR case mix adjustment.mp. or exp Risk Adjustment/OR severity of illness index.mp. or exp “Severity of Illness Index”/OR scoring system.mp.
Combined with:
Surgical Procedures, Operative/OR surgery.mp. or General Surgery/OR operation.mp. or exp Postoperative Complications/
Combined with:
mortality.mp. or exp Hospital Mortality/or exp Mortality/OR morbidity.mp. or exp Morbidity/OR outcome.mp. or exp Fatal Outcome/or exp “Outcome Assessment (Health Care)”/or exp “Outcome and Process Assessment (Health Care)”/or exp Treatment Outcome/OR postoperative complications.mp. or exp Postoperative Complications/OR intraoperative complications.mp. or exp Intraoperative Complications/OR exp Perioperative Care/or perioperative complications.mp. OR prognosis.mp. or exp Prognosis/.
Embase
Risk Factor/or risk adjust$.mp. OR cardiovascular risk/or high risk patient/or high risk population/or risk assessment/or risk factor OR risk stratification.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR *“Scoring System”/OR “Severity of Illness Index”/OR Multivariate Logistic Regression Analysis/or Logistic Regression Analysis OR logistic models/or risk assessment/or risk factors/OR exp Scoring System OR Prediction/or possum.mp. or Scoring System/OR exp Risk Assessment/or risk stratification.mp. OR predict$.mp. OR exp Quality Indicators, Health Care/OR Risk Adjustment/.
Combined with:
exp Surgery/OR exp Surgical Procedures, Operative/OR specialties, surgical/or surgery/OR surg$.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR peri-operative period.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR perioperative.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR postoperative.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR perioperative care/or intraoperative care/or postoperative care/or preoperative care.
Combined with:
complicat$.mp. [mp=title, abstract, subject headings, heading word, drug trade name, original title, device manufacturer, drug manufacturer] OR adverse outcome/or prediction/or prognosis/OR exp Postoperative Complication/co, di, ep, su, th [Complication, Diagnosis, Epidemiology, Surgery, Therapy] OR exp Perioperative Complication/or exp Perioperative Period/OR exp Mortality/or exp Surgical Mortality/OR exp Morbidity/OR outcome.mp. or “Outcome Assessment (Health Care)”/or “Outcome and Process Assessment (Health Care)” OR treatment outcome/.
Limits
1980 to August 31, 2011
Exclusions:
(“all infant (birth to 23 months)” or “all child (0 to 18 years)” or “newborn infant (birth to 1 month)” or “infant (1 to 23 months)” or “preschool child (2 to 5 years)” or “child (6 to 12 years)” or “adolescent (13 to 18 years)”) or (cats or cattle or chick embryo or dogs or goats or guinea pigs or hamsters or horses or mice or rabbits or rats or sheep or swine) or (communication disorders journals or dentistry journals or “history of medicine journals” or “history of medicine journals non index medicus” or “national aeronautics and space administration (nasa) journals” or reproduction journals) or Angioplasty, Balloon/or Angioplasty, Laser/or Angioplasty/or Angioplasty, Balloon, Laser-Assisted/or Angioplasty, Transluminal, Percutaneous Coronary/or ANGIOPLASTY.mp. OR Eye/or Ophthalmology/or Eye Diseases/or OPTHALMOLOGY.mp. or Hearing Loss OR CARDIAC SURGERY.mp. or HEART SURGERY.mp. or Myocardial Revascularization/or Coronary Artery Bypass/or CORONARY SURGERY.mp. or Coronary Artery Bypass, Off-Pump/.
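The structure of the strategy above can be sketched in code. This is a minimal illustration, assuming that "Combined with" denotes a Boolean AND between the three concept blocks (risk, surgery, outcome) and that the exclusion terms are removed with NOT; neither operator is stated explicitly in the strategy. The function names are hypothetical and the term lists are heavily abbreviated.

```python
def or_block(terms):
    """Join the terms of one concept block with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(concept_blocks, exclusions):
    """AND the concept blocks together, then remove the exclusion block with NOT.

    Assumption: 'Combined with' in the strategy means AND, and the
    exclusions are applied with NOT.
    """
    included = " AND ".join(or_block(block) for block in concept_blocks)
    return f"({included}) NOT {or_block(exclusions)}"


# Abbreviated term lists taken from the MEDLINE strategy (not the full search).
risk_terms = ["risk stratification.mp.", "exp Risk Assessment/"]
surgery_terms = ["surgery.mp.", "operation.mp."]
outcome_terms = ["mortality.mp.", "morbidity.mp."]
exclusion_terms = ["CARDIAC SURGERY.mp.", "Angioplasty/"]

query = build_query([risk_terms, surgery_terms, outcome_terms], exclusion_terms)
print(query)
```

A record is retrieved only if it matches at least one term in every concept block and no term in the exclusion block, which is why broadening a single block with additional OR terms cannot by itself retrieve records that fall outside the other two blocks.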
Hand Searching of Reference Lists
The following keywords were searched separately on MEDLINE, Embase, and ISI Web of Science:
POSSUM + surgery
NSQIP
E-PASS
ACE-27
APACHE
In addition, the original development studies for all risk prediction models identified in the initial search were snowballed by hand searching for citations on MEDLINE, Embase, and ISI Web of Science.
Inclusion/Exclusion Criteria
Studies were eligible if they fulfilled the following criteria:
Studies in adult humans undergoing noncardiac, nonneurological surgery
Study cohorts that included at least two different surgical subspecialties
Studies that described the predictive precision of risk models using analysis of receiver operating characteristic curves
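For readers less familiar with the last criterion, the area under the receiver operating characteristic curve (AUROC) can be read as the probability that a randomly chosen patient who had the adverse outcome was assigned a higher predicted risk than a randomly chosen patient who did not. The sketch below computes it by direct pairwise comparison; the function name and the six-patient data set are hypothetical and purely illustrative, and this is not necessarily the method used in any of the included studies.

```python
def auroc(predicted_risks, outcomes):
    """Area under the ROC curve by pairwise comparison (Mann-Whitney formulation).

    predicted_risks: predicted risk for each patient (higher means riskier).
    outcomes: 1 if the adverse outcome occurred, 0 otherwise.
    """
    events = [r for r, y in zip(predicted_risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(predicted_risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one patient with and one without the outcome")

    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0   # patient with the outcome ranked higher
            elif e == n:
                concordant += 0.5   # tied predictions count as half
    return concordant / (len(events) * len(non_events))


# Hypothetical predicted risks and observed outcomes for six patients.
risks = [0.02, 0.05, 0.10, 0.20, 0.40, 0.60]
died = [0, 0, 1, 0, 1, 1]
print(auroc(risks, died))  # 8 of the 9 event/non-event pairs are concordant
```

A value of 0.5 indicates discrimination no better than chance and 1.0 indicates perfect separation. The AUROC describes discrimination only and says nothing about calibration.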
Studies were excluded on the basis of these criteria:
Cohorts including children (under the age of 14 yr)
Cohorts including patients undergoing cardiac surgery
Cohorts including patients who did not undergo surgery
Single-specialty cohort studies (e.g., vascular, orthopedic)
Studies of ambulatory (day case) surgery
Studies describing the development of a risk prediction model without subsequent validation in a separate cohort (either in the original study or subsequent cohorts), with the exception of studies of data from the American College of Surgeons’ National Surgical Quality Improvement Program
Studies in which the items comprising the risk stratification tool were not disclosed in the study report or available from other sources (such as references)
Studies using outcomes other than morbidity or mortality as their sole outcome measures (e.g., discharge destination, length of stay)
Studies using only a single pathological outcome measure (e.g., reoperation, cardiac morbidity, infectious complications, renal failure).
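The inclusion and exclusion criteria above can be expressed as a simple screening function. The following is only a sketch under stated assumptions: the record fields and function name are hypothetical, each criterion is reduced to a single yes/no or numeric field, and it does not represent the screening procedure actually used by the reviewers.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Study:
    """Screening record for one candidate study (all field names are hypothetical)."""
    youngest_patient_age: int
    cardiac_surgery: bool
    neurosurgery: bool
    nonsurgical_patients: bool
    ambulatory_surgery: bool
    n_specialties: int
    roc_analysis: bool
    validated_in_separate_cohort: bool
    acs_nsqip_data: bool
    tool_items_available: bool
    morbidity_or_mortality_outcome: bool
    single_pathological_outcome: bool


def exclusion_reasons(s: Study) -> List[str]:
    """Return every criterion the study fails; an empty list means eligible."""
    reasons = []
    if s.youngest_patient_age < 14:
        reasons.append("cohort includes children")
    if s.cardiac_surgery or s.neurosurgery:
        reasons.append("cardiac or neurological surgery")
    if s.nonsurgical_patients:
        reasons.append("cohort includes patients who did not undergo surgery")
    if s.ambulatory_surgery:
        reasons.append("ambulatory (day case) surgery")
    if s.n_specialties < 2:
        reasons.append("single-specialty cohort")
    if not s.roc_analysis:
        reasons.append("no receiver operating characteristic analysis")
    # Unvalidated development studies are excluded unless they use ACS NSQIP data.
    if not s.validated_in_separate_cohort and not s.acs_nsqip_data:
        reasons.append("no validation in a separate cohort")
    if not s.tool_items_available:
        reasons.append("items of the tool not disclosed")
    if not s.morbidity_or_mortality_outcome:
        reasons.append("outcome is neither morbidity nor mortality")
    if s.single_pathological_outcome:
        reasons.append("single pathological outcome measure")
    return reasons


# Hypothetical candidate that satisfies every criterion.
candidate = Study(
    youngest_patient_age=18, cardiac_surgery=False, neurosurgery=False,
    nonsurgical_patients=False, ambulatory_surgery=False, n_specialties=3,
    roc_analysis=True, validated_in_separate_cohort=True, acs_nsqip_data=False,
    tool_items_available=True, morbidity_or_mortality_outcome=True,
    single_pathological_outcome=False,
)
print(exclusion_reasons(candidate))

# The same study restricted to one specialty fails a single criterion.
print(exclusion_reasons(replace(candidate, n_specialties=1)))
```

Returning the full list of failed criteria, rather than a single true/false value, makes it possible to tabulate the reasons for exclusion.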