Clinical prediction models in anesthesia and surgery research have many clinical applications including preoperative risk stratification with implications for clinical utility in decision-making, resource utilization, and costs. It is imperative that predictive algorithms and multivariable models are validated in a suitable and comprehensive way in order to establish the robustness of the model in terms of accuracy, predictive ability, reliability, and generalizability. The purpose of this article is to educate anesthesia researchers at an introductory level on important statistical concepts involved with development and validation of multivariable prediction models for a binary outcome. Methods covered include assessments of discrimination and calibration through internal and external validation. An anesthesia research publication is examined to illustrate the process and presentation of multivariable prediction model development and validation for a binary outcome. Properly assessing the statistical and clinical validity of a multivariable prediction model is essential for establishing the generalizability and reproducibility of the published tool.

Statistical techniques pertaining to the development and validation of predictive models, scores, and algorithms are essential in clinical research.^{1 } Predictive models can be developed in a wide range of clinical settings. The term “prediction modeling” refers to determining the association between a set of predictor variables (*i.e.*, risk factors) and an outcome. Predictive models can be useful for predicting resource utilization, costs, and charges, as well as for helping decision-making and informing patients and families of risks. These models cannot be used for understanding causation or treatment efficacy. Model development and validation techniques are widely used in impactful studies published in top peer-reviewed journals.^{2–9 } However, anesthesiologists may not be familiar with the steps involved in developing, internally validating, and externally validating multivariable prediction models.

The aim of this article is to educate anesthesia researchers about key statistical concepts in model development and validation for a multivariable prediction model for a binary outcome. Key concepts include discrimination, receiver operating characteristics curve analysis, calibration, internal validation, and external validation. An article previously published in Anesthesiology will be used to illustrate the model development and validation process, with focus on binary outcome variables. Properly applying the statistical techniques for model development and validation will improve the generalizability and reproducibility of published predictive algorithms and models.^{10,11 }

## Key Concepts for Validation

We will start by providing an overview of important statistical concepts and terminology that are integral to prediction model development, assessment, and validation (box 1). A glossary of statistical terms is provided in the appendix.

### Model Development

Once a researcher has identified a clinical outcome of interest and a set of candidate predictor variables, model development can begin. A model must be trained using a cleaned dataset with predictor variables that will be readily available at the time of implementation, using information only available in the training set, and using a set of predictors that do not represent the same information as the outcome (these general situations are examples of leakage in model development and should be avoided). Traditionally, logistic regression modeling is used to analyze the independent associations between a set of predictor variables and a dichotomous outcome. Clinical expertise is needed to help select variables to be considered for inclusion in the multivariable modeling analysis. The full regression model can be applied with continuous and categorical predictor variables to obtain predicted probabilities. Presentation of the full model allows for predictions for individuals, and the full model can lead to a clinically useful nomogram.^{12 } When prediction models are simplified by dichotomization of variables or truncating coefficients, there is most often a loss in model performance.^{13 } Internal and external validation statistics are important for assessing the performance of a prediction model. A model may demonstrate a certain level of performance with regard to discrimination and calibration, as described in the sections below. A careful interpretation of all evidence from internal and external assessment needs to be made to determine the level of the statistical and clinical validity of a prediction model.
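To make the link between a fitted logistic regression model and an individual predicted probability concrete, the sketch below computes the probability from an intercept and coefficients. The coefficients and predictors here are entirely hypothetical, not taken from any published model.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Predicted probability from a fitted logistic regression model:
    p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical fitted model: intercept plus coefficients (log-odds scale)
# for age in years and an indicator for one binary risk factor.
b0 = -5.0
coefs = [0.04, 1.2]
p = predicted_probability(b0, coefs, [65, 1])  # 65-yr-old with the risk factor
print(round(p, 3))  # → 0.231
```

In practice the coefficients come from fitting the full multivariable model in statistical software; this function only illustrates how a nomogram or risk calculator converts a patient's predictor values into a predicted probability.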

### Types of Validation

Model validation refers to the statistical assessment of the performance of a prediction model. Evaluating the performance of a prediction model using the same dataset that was used to create the model is referred to as internal validation. Internal split-sample validation is performed by randomly splitting the sample into two parts: a training set with which the model is built, and a test set for assessing validity of the model. Internal split-sample validation requires a relatively large sample size and number of events observed in order to perform this partitioning. If the discrimination or calibration of the model in the test set is noticeably worse than the performance in the training set, then a new model may be sought because this would indicate unacceptable predictive performance. This approach is also useful for determining whether the original model is overfit. For a split-sample internal validation approach, rather than randomly splitting the dataset into a training set and a test set, it is better to divide the sample by time period where the model is developed in the earlier time period and then it is validated using data from a later time period. Alternatively, the data can be split by hospital or center, and the model is developed using data from one center and tested using data from a different center.^{14 } Internal validation with splitting by hospital or splitting time period should be performed whenever possible rather than random internal splitting. This will lead to a better assessment of generalizability of the prediction model. These methods are steps in the direction toward external validation, and they are the most robust forms of split-sample internal validation. Internal validity can also be assessed using bootstrapping techniques, where the performance of the prediction model is examined across a specified number of datasets created by bootstrap resampling based on the original data (box 2). 
This procedure involves fitting prediction models separately within each bootstrap sample and measuring the predictive performance of this model within the corresponding bootstrap sample (using the statistics described below for model discrimination and calibration). Then, the model that was developed based on the bootstrap sample is applied to the original dataset, and the same model performance statistics are obtained. The performance within the bootstrap sample and the performance in the original dataset are compared to determine optimism-adjusted estimates for the discrimination and calibration statistics. Often, researchers choose to perform 1,000 bootstrap resamples for evaluating model performance. Internal validation with bootstrap resampling is targeted at evaluating the model in new datasets that are generated by sampling with replacement from the original dataset. However, internal validation methods cannot take the place of a well-performed external validation.

Bootstrapping, or bootstrap resampling, is the process of generating new datasets by performing repeated random sampling with replacement from an existing sample of patients. The sample size and number of these new datasets can be specified, and often researchers choose at least 1,000 resamples to evaluate model performance. The purpose of bootstrapping is to make more informed inferences about a population parameter or validation statistic by repeatedly sampling with replacement from the same distribution. In the validation of prediction models, models are fit separately within each bootstrap sample and statistics measuring the predictive ability of this model within the corresponding bootstrap sample are calculated. Then, the model that was developed in the bootstrap sample is applied to the original dataset and the same model performance statistics are obtained. The performance within the bootstrap sample and the performance of the same model in the original dataset are compared to determine optimism-adjusted estimates for the discrimination (AUC) and calibration (*e.g*., Brier score) statistics.
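The optimism-correction loop described above can be sketched in pure Python. This is a deliberately simplified illustration on simulated data: the "model fit" is a crude difference-in-means linear score standing in for a real logistic regression fit, the AUC is computed by rank comparison, and all data and parameter choices are hypothetical.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: probability that a randomly chosen event case
    scores higher than a randomly chosen non-event case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit(X, y):
    """Toy linear score: weight each predictor by the difference in group
    means -- a stand-in for fitting a real logistic regression model."""
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    cases = [x for x, yi in zip(X, y) if yi == 1]
    ctrls = [x for x, yi in zip(X, y) if yi == 0]
    return [mean(cases, j) - mean(ctrls, j) for j in range(len(X[0]))]

def score(w, X):
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X]

def optimism_corrected_auc(X, y, n_boot=200, seed=1):
    # Apparent performance: model developed and evaluated on the same data.
    apparent = auc(score(fit(X, y), X), y)
    optimism, n, rng = 0.0, len(y), random.Random(seed)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:                          # need both outcome classes
            continue
        wb = fit(Xb, yb)
        # Optimism = (performance in bootstrap sample) - (performance of the
        # same bootstrap model applied back to the original dataset).
        optimism += auc(score(wb, Xb), yb) - auc(score(wb, X), y)
    return apparent - optimism / n_boot

# Simulated data: two predictors, each higher on average among cases.
rng = random.Random(0)
y = [1 if rng.random() < 0.3 else 0 for _ in range(300)]
X = [[yi + rng.gauss(0, 1), 0.5 * yi + rng.gauss(0, 1)] for yi in y]
corrected = optimism_corrected_auc(X, y)
print(round(corrected, 3))
```

The same loop applies unchanged to the calibration statistics: replace the AUC computation with, for example, the Brier score and subtract the average optimism in the same way.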

External validation is the process of using an outside or external dataset to evaluate the model. External validation is powerful as it can provide evidence of model performance in the context of generalizability and transportability to other settings. External temporal validation can be performed in which an entirely new external dataset is generated at a later time frame to evaluate the performance of a previously developed model. In order to have strong predictive performance, a model needs to demonstrate evidence of excellent discrimination as well as excellent calibration. The level of validity of a model depends on careful internal and external assessment of the statistics for discrimination and calibration. Weak model performance with respect to discrimination or calibration may lead the investigators to consider an entirely new model. It must be emphasized that both of these features are important, and excellent discrimination and excellent calibration are needed to have strong model performance. A general framework for the statistics involved in internal and external validation, and four scenarios regarding internal and external validation, are presented in figure 1 (infographic) and figure 2.

### Assessment of Discrimination

The first assessment of model performance is to determine its discriminatory ability. This is done to determine how well the model can differentiate between patients with and without the outcome of interest. This can also be thought of as the predictive ability of the model in terms of risk-stratifying patients into high or low risk. Model discrimination can be evaluated within the derivation set, by internal validation, and by external validation.

#### Receiver Operating Characteristics Curves

Receiver operating characteristics curve analysis is traditionally used to assess the discriminatory ability of the prediction model. Receiver operating characteristics curves are constructed by plotting the sensitivity and 1 – specificity corresponding to every possible cutoff for the predicted probabilities from a prediction model in a sample. The area under the receiver operating characteristics curve (AUC) or c-statistic from a regression model is calculated as a summary measure of the prognostic performance of the model.^{15 } An AUC of 0.500 represents no predictive ability of the prediction model beyond chance. Traditionally, an AUC value of 0.700 to 0.799 is indicative of good discrimination, a value of 0.800 to 0.899 indicates very good discrimination, and a value greater than or equal to 0.900 indicates excellent discrimination^{16 } (fig. 3). The interpretation of AUC is often situation-specific and study-dependent. Based on the particular outcome of interest and implications of prediction, careful consideration needs to be made regarding the clinical context to determine the discriminatory performance of a model. It is important to include a 95% CI around the AUC as a measure of precision to help guide interpretation.
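The construction just described, one (1 − specificity, sensitivity) point per distinct cutoff with the AUC obtained as the area under the resulting curve, can be sketched as follows. This is a minimal pure-Python illustration, not a replacement for standard statistical software.

```python
def roc_points(probs, labels):
    """(1 - specificity, sensitivity) at every distinct cutoff,
    sweeping from the most to the least stringent cutoff."""
    pairs = sorted(zip(probs, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (p, y) in enumerate(pairs):
        tp += y
        fp += 1 - y
        # Emit a point only after the last of any tied probabilities.
        if i + 1 == len(pairs) or pairs[i + 1][0] != p:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
print(round(auc_trapezoid(pts), 3))  # → 0.667
```

A 95% CI around the AUC, as recommended above, would in practice come from the software's analytic or bootstrap interval rather than from this sketch.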

Receiver operating characteristics analysis is also useful for determining the optimal cutpoint for a continuous variable, or for a risk score derived from a model. The sensitivity of a certain cutoff is the proportion of patients who “test positive” among the group of patients who experience the outcome, where a patient is classified as “test positive” if they have a value greater than the cutoff value being considered. The specificity of a predictor for a particular cutoff is defined as the proportion of patients who “test negative” among the group of patients who do not experience the outcome, where a patient is classified as “test negative” if they have an observed value lower than the cutoff value being considered. Youden’s J index is defined as the sum of sensitivity and specificity minus 1, and it can be used to identify the cutpoint that maximizes the sum of sensitivity and specificity. However, it is advisable to identify the best cutpoint as one that maximizes both sensitivity and specificity to provide reasonable performance on both metrics and to be careful of the scenario where Youden’s J index suggests that the best cutpoint is one with either very high sensitivity or very high specificity, but with a very low value for the other.^{17 } This issue may arise when the predictor is not a good discriminator of outcomes. Furthermore, in certain scenarios, the investigator may wish to choose a cutpoint for a predictor that provides higher sensitivity for ruling in the outcome given testing positive or higher specificity for ruling out the outcome given testing negative. There has been discussion of alternatives to Youden’s J index for situations of modest discrimination.^{18 }
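A minimal sketch of the Youden's J search is given below. The scores and labels are hypothetical, and this version classifies a patient as "test positive" when the score is at or above the candidate cutoff; exact tie-handling conventions vary across software.

```python
def youden_cutpoint(scores, labels):
    """Scan every observed score as a candidate cutoff and return the
    one maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < c and y == 0)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best  # (J, cutpoint, sensitivity, specificity)

scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
j, cut, sens, spec = youden_cutpoint(scores, labels)
print(cut, round(j, 2))  # → 0.4 0.75
```

As the text cautions, the returned sensitivity and specificity should both be inspected rather than relying on J alone, since a high J can hide a very low value on one of the two metrics.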

### Assessment of Calibration

Calibration of a model refers to how well the predicted probabilities of the prediction model align or calibrate with the observed proportion with the outcome in the data. Establishing good calibration is important for providing evidence of reliability and consistency between predicted and observed probabilities of the binary outcome in the original dataset (the training set) and in an external validation cohort.

Several statistics are valuable for examining calibration. A calibration plot (or alternatively a calibration table) can be created showing the observed (empirical) rates *versus* expected (based on the model) probabilities of the outcome. This depicts how closely the model-based predicted probabilities align with the empirical data. The calibration plot can be accompanied with the slope and intercept of the calibration curve, the concordance calibration coefficient (closer to 1 is better), the Brier score^{19 } (closer to 0 is better, and a value greater than 0.3 would suggest poor calibration), a pseudo R-squared such as Nagelkerke’s R-squared^{20 } (larger is better), and Somers’ D (larger is better). In the training dataset, the Hosmer–Lemeshow goodness-of-fit test can be used to assess whether there is evidence of lack of good calibration with a *P* value less than 0.05. It is done by comparing observed and predicted probabilities by deciles, or another selected quantile. Some researchers discourage using the Hosmer–Lemeshow test, especially in the validation set, since it is often based on an arbitrary number of quantiles, it is underpowered in small samples, and it results only in a *P* value.^{21 }
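Two of these calibration statistics, the Brier score and the observed-versus-expected quantile table underlying a calibration plot, can be sketched directly. The probabilities and labels below are hypothetical; the quantile grouping here uses equal-count groups by predicted risk, as the decile approach described above does.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1
    outcome; 0 is perfect, and values above ~0.3 suggest poor calibration."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def calibration_table(probs, labels, groups=10):
    """Mean predicted (expected) probability vs. observed event rate within
    quantile groups of predicted risk (deciles by default)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    rows = []
    for g in range(groups):
        idx = order[g * len(order) // groups:(g + 1) * len(order) // groups]
        if not idx:
            continue
        expected = sum(probs[i] for i in idx) / len(idx)
        observed = sum(labels[i] for i in idx) / len(idx)
        rows.append((expected, observed))
    return rows

probs = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
print(round(brier_score(probs, labels), 3))  # → 0.025
for expected, observed in calibration_table(probs, labels, groups=2):
    print(round(expected, 2), observed)
```

A calibration plot is simply these (expected, observed) pairs plotted against the 45-degree line of perfect calibration, optionally with a fitted slope and intercept as described above.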

## Illustrative Example

To provide an illustrative example in which internal bootstrap validation and external validation were performed, we will use the study published by Nasr *et al.* in Anesthesiology entitled “Pediatric Risk Stratification Is Improved by Integrating both Patient Comorbidities and Intrinsic Surgical Risk.”^{22 } In this study, a prediction model for 30-day mortality was constructed using the 2012 to 2016 National Surgical Quality Improvement Program (NSQIP)–Pediatric database (N = 367,065; 1,252 mortalities) using six variables identified as significant independent predictors: weight less than 5 kg, American Society of Anesthesiologists Physical Status III or higher, sepsis, inotropic support, ventilator dependence, and high intrinsic surgical risk. A simplified risk score was created based on the count of the number of comorbidity risk factors present in a given patient, ranging from 0 to 5, and stratified by low *versus* high intrinsic surgical risk. Among patients undergoing a low intrinsic surgical risk procedure, the model-based risk of 30-day mortality ranged from 0.00% (95% CI, 0.00 to 0.01%) when no comorbidities were present to 4.74% (95% CI, 3.17 to 7.03%) when all comorbidities were present. For a procedure with high intrinsic surgical risk, the probability of 30-day mortality ranged from 0.07% (95% CI, 0.05 to 0.09%) when no comorbidities were present to 46.72% (95% CI, 43.03 to 50.44%) when all comorbidities were present. It must be noted that the full prediction model may be presented as a nomogram; however, it was decided to present a simplified model in this illustrative example for ease of clinical use with the emphasis on intrinsic surgical risk.

### Internal Bootstrap Validation in the Illustrative Example

Internal bootstrap validation was applied to the 2012 to 2016 NSQIP-Pediatric data to evaluate the predictive score internally using 1,000 bootstrap resamples. Bootstrap resampling using the original dataset can be used to validate a model in the case where it may not be feasible to acquire an external data source to perform external validation. However, internal bootstrap validation cannot take the place of an external validation because it does not provide evidence of generalizability in external settings. For the study being considered, internal validation was performed looking at the full prediction model. The internal validation based on bootstrap resampling suggested that the model had excellent internal validity regarding discrimination and calibration. There was strong model discrimination with an AUC of 0.961, and strong calibration with a bias-corrected Somers’ D rank correlation of 0.922 and a Brier quadratic probability score of 0.003. Additional statistics included the Nagelkerke’s R-squared measure of 0.395, and the intercept and slope of an overall logistic calibration equation equal to 0.020 and 1.007, respectively.

### External Validation in the Illustrative Example

External validation of the full prediction risk model derived using the 2012 to 2016 NSQIP-Pediatric database was performed using the 2017 NSQIP-Pediatric data (N = 110,474). This was a completely new and independent dataset that had not been analyzed in the context of this prediction model, and therefore it was used as a data source for performing external validation. The 2017 NSQIP-Pediatric database was utilized because it provided a large sample size of high-quality data from multiple participating centers with a similar patient mix, and it contained the variables needed to validate the model. External validation was done in order to assess the model performance (discrimination and calibration) in this outside data in order to establish generalizability of the model for 30-day mortality.

The result of the external validation in this illustrative example showed good external model calibration as seen in the observed *versus* fitted (expected) probabilities of 30-day mortality (table 1). This showed good alignment and proximity between the empirical percentages and predicted probabilities of mortality, with Hosmer–Lemeshow goodness-of-fit *P* equal to 0.116. Model discrimination in the external validation dataset was excellent, with an AUC equal to 0.953 (95% CI, 0.944 to 0.961; fig. 4).

## Discussion

Internal and external validation techniques are essential for assessing the performance of a multivariable prediction model. Establishing good model performance is crucial for creating a reliable, reproducible, and generalizable model. External validation is the most rigorous form of model validation, and it should be done whenever it is possible.

Internal validation using bootstrap resampling was illustrated above; however, internal validation may also be performed using a split-sample technique. In this approach, the original dataset is split into a derivation (training) set and a validation (test) set, using a 60%/40% or similar allocation ratio. The model is developed and built using the training set, and model performance is assessed using the test set. The split-sample internal validation exposes the model to a new dataset; however, it is a random sample of the same population that was used to build the model. A model is not truly validated until an external validation is done, as an external validation provides evidence regarding whether the model will be generalizable and perform well in future practice with data resembling the external validation dataset.
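The random split described above can be sketched as follows; the 60%/40% ratio is illustrative, and as noted earlier, splitting by time period or center is preferable when feasible.

```python
import random

def split_sample(data, train_fraction=0.6, seed=42):
    """Randomly split rows into a training set (for model building)
    and a test set (for assessing validity)."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_train = int(round(train_fraction * len(rows)))
    return rows[:n_train], rows[n_train:]

# Hypothetical usage with 100 patient records (here just row indices).
train, test = split_sample(range(100))
print(len(train), len(test))  # → 60 40
```

The model would then be fit using only `train`, and the discrimination and calibration statistics described earlier would be computed on `test`.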

The overall process of assessing the internal and external validity of a prediction model includes the concepts of all common types of validity found most often in survey research^{23 }: face validity, content validity, construct validity, criterion validity, and concurrent validity.^{24–26 } Face validity refers to how valid or intuitive the results appear on the surface. Content validity refers to the coverage of the range of an underlying construct. Construct validity refers to the ability of measuring an underlying idea or construct. Criterion validity refers to the predictive ability of an item to a construct. Concurrent validity refers to the consistency and correlation of one measurement to another, often previously validated tool. In the process of internal and external validation of a prediction model, these five validity concepts are involved. For example, face validity is involved in understanding if the risk factors are sensible on the surface, and concurrent validity may be examined by comparing a prediction tool to other existing tools for predicting the same outcome. It is important to understand the distinctions and overlaps between internal and external validity *versus* these other validity terms and concepts.

In our primer, we focus on the validation of predictive models for dichotomous outcomes; however, an investigator may wish to study predictors of other types of outcomes such as continuous outcomes or time-to-event endpoints.^{27–29 } While the general regression modeling approaches such as multivariable linear and Cox proportional hazards regressions may be utilized to analyze these different types of data, the statistics that can be used for model assessment are different for these types of outcomes. It is beyond the scope of our primer to discuss other types of outcome data, and we focus on the scenario of binary events. Also beyond the scope of this article are validation techniques in machine learning and artificial intelligence, such as k-fold cross-validation, which may be of potential utility for anesthesiologists with large databases as a method used to partition their original data into k equally sized subsets to validate model performance.^{30 }
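Although a full treatment of k-fold cross-validation is beyond this primer's scope, the partitioning step it rests on is simple to sketch. Only the index bookkeeping is shown; the actual model fitting and evaluation would happen inside the loop.

```python
import random

def k_folds(indices, k=5, seed=0):
    """Partition row indices into k roughly equal folds; each fold serves
    once as the validation set while the remaining k-1 folds train the model."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, folds[held_out]

# Hypothetical usage with 10 patient records (row indices only).
for train, val in k_folds(range(10), k=5):
    print(len(train), len(val))  # each iteration: 8 training, 2 validation
```

Averaging the validation-fold performance statistics across the k iterations gives the cross-validated estimate of model performance.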

Several checklists exist for different types of research studies that can be implemented and referenced as guidelines for achieving good reporting. For reporting a multivariable prediction model, the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement^{31 } can be used. The TRIPOD statement falls under the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) Network of checklists. This network includes other useful tools such as the Consolidated Standards of Reporting Trials (CONSORT) Statement,^{32 } the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement,^{33 } and the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) Statement,^{34 } among others. Our current article can be used in conjunction with these checklists and statements to improve understanding and provide education regarding the items within each. A list of other resources is also provided (box 3).

Altman DG, Vergouwe Y, Royston P, Moons KG: Prognosis and prognostic research: Validating a prognostic model. BMJ 2009; 338:b605.

Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW: Assessing the performance of prediction models: A framework for some traditional and novel measures. Epidemiology 2010; 21:128–38.

Steyerberg EW, Vergouwe Y: Towards better clinical prediction models: Seven steps for development and an ABCD for validation. Eur Heart J 2014; 35:1925–31.

### Conclusions

Clinical prediction models in anesthesia have many applications. Prediction models cannot be used to understand causation or therapy efficacy; however, they can help inform physicians, patients, and families of predicted risks associated with surgical procedures, and they can also inform anesthesiologists and operative teams regarding risk stratifications with implications for resource utilization and costs. Since prediction models have a variety of important uses in clinical care, it is imperative that these predictive algorithms are validated in a comprehensive way in order to establish robustness of the model in terms of accuracy, predictive ability, reliability, and generalizability.

### Research Support

Support was provided solely from institutional and/or departmental sources.

### Competing Interests

The authors declare no competing interests.

## References

### Appendix: Glossary of Statistical Terms

**Area under the curve:** The area under the receiver operating characteristics curve (AUC) is used to measure discrimination of a prediction model. AUC values from 0.700 to 0.799 are usually interpreted as good, values from 0.800 to 0.899 are usually interpreted as very good, and values greater than or equal to 0.900 are usually interpreted as excellent.

**Bootstrap resampling internal validation:** The process of internal validation in which the original dataset is used to generate new datasets (bootstrapped datasets) in which the prediction model is evaluated for discrimination and calibration.

**Brier score:** The Brier quadratic probability score is a statistic used for evaluating calibration of the prediction model, with values ranging from 0 to 1, and values closer to 0 representing better calibration between the observed and predicted values.

**Calibration:** A measure of the performance of a prediction model used in the validation process that refers to the model-fit to the data. In other words, calibration refers to the alignment and consistency between model-based (expected) probabilities and empirical (observed) proportions. Calibration tables and plots, the Brier score, the Hosmer–Lemeshow goodness-of-fit test, and Somers’ D are examples of statistical methods for assessing calibration.

**Discrimination:** A measure of the performance of a prediction model used in the validation process that refers to the ability of the prediction model to discriminate (*i.e.*, distinguish between) patients with *versus* without the outcome of interest. Receiver operating characteristics curves and the AUC are commonly used to assess discrimination.

**External validation:** The process of performing validation using an outside or external data source or an independent cohort of patients that is newly applied to the prediction model in order to establish generalizability.

**Generalizability:** The generalizability of a prediction model refers to the ability of the model to perform in the real world or in an external environment. Generalizability can be established *via* external validation techniques.

**Hosmer–Lemeshow test:** The Hosmer–Lemeshow goodness-of-fit test compares observed and expected probabilities by dividing the data into a certain number of evenly sized quantiles. A nonsignificant *P* value (*P* > 0.05) indicates a good result—that is, that there is no evidence of deviation from good calibration.

**Internal validation:** The process of performing validation using the original data that were used to construct the prediction model. Most often, internal validation is done using bootstrap resampling, or a split-sample validation.

**k-fold cross-validation:** An internal validation procedure in which the original data are partitioned into k equal-size subsets and the model is iteratively trained and validated.

**Logistic regression:** The regression modeling technique that is used for determining the association between a set of predictor variables and a binary outcome. Logistic regression is often used in the process of developing a prediction model along with receiver operating characteristics analysis.

**Pseudo R-squared:** Traditional R-squared is applied in linear regression, but pseudo R-squared statistics, such as Nagelkerke’s R-squared, can be applied to evaluate the calibration of a logistic regression model to the observed data.

**Receiver operating characteristics analysis:** Receiver operating characteristics analysis is performed by calculating the sensitivity and specificity corresponding to each cutpoint of a continuous or ordinal variable in predicting a binary outcome. It can be used to evaluate the discriminatory ability of a model for predicting a binary outcome *via* the AUC, and it can be used to identify an optimal cutpoint for the predictor variable in discriminating between patients with and without the outcome of interest.

**Somers’ D:** The Somers’ rank correlation D is a measure of calibration between the observed and model-based data using the prediction model, with values closer to 1 indicating better concordance between the model and the data.

**Split-sample internal validation:** The process of internal validation in which the original dataset is split into two parts: a training or model development set, and a test or validation set. The model is derived using the training set, and then it is evaluated using the test set, using methods analogous to external validation.

**Test set:** During split-sample validation, the test set (the validation set) is the portion of the data allocated for evaluating the prediction model. Often, less than half of the dataset (*e.g.*, 40%) is used for model testing.

**Training set:** During split-sample validation, the training set (the development set) is the portion of the data allocated for building and creating the prediction model. Often, more than half of the dataset (*e.g.*, 60%) is used for model development.

**Transportability**: A property of a prediction model that may be applied in an external setting or other studies as demonstrated by external validation.

**Validation:** The statistical process of evaluating the performance of a prediction model in order to establish good reliability, consistency, calibration, and discrimination.

**Youden’s J index:** In receiver operating characteristics analysis, Youden’s J index is calculated as sensitivity + specificity – 1, and it can be used to identify the cutpoint that corresponds to the best combination of sensitivity and specificity. Investigators may wish to use Youden’s J index, or can choose a different cutpoint that puts more importance on sensitivity or specificity.