Editor’s Perspective
What We Already Know about This Topic
  • Robust predictions are required to compare perioperative mortality among hospitals

  • Deep neural network systems, a type of machine learning, can be used to develop highly nonlinear prediction models

What This Article Tells Us That Is New
  • The authors’ neural network model was comparable in accuracy to, but potentially more efficient at feature selection than, logistic regression models

  • Deep neural network–based machine learning provides an alternative to conventional multivariate regression

Background

The authors tested the hypothesis that deep neural networks trained on intraoperative features can predict postoperative in-hospital mortality.

Methods

The data used to train and validate the algorithm consisted of 59,985 patients with 87 features extracted at the end of surgery. Feed-forward networks with a logistic output were trained using stochastic gradient descent with momentum. The deep neural networks were trained on 80% of the data, with 20% reserved for testing. The authors assessed whether adding the American Society of Anesthesiologists (ASA) Physical Status Classification improved the deep neural network and whether the deep neural network was robust to a reduced feature set. The networks were then compared to ASA Physical Status, logistic regression, and other published clinical scores, including the Surgical Apgar, Preoperative Score to Predict Postoperative Mortality, Risk Quantification Index, and Risk Stratification Index.

Results

In-hospital mortality in the training and test sets was 0.81% and 0.73%, respectively. The deep neural network with a reduced feature set and ASA Physical Status classification had the highest area under the receiver operating characteristics curve, 0.91 (95% CI, 0.88 to 0.93). The highest logistic regression area under the curve was found with a reduced feature set and ASA Physical Status (0.90; 95% CI, 0.87 to 0.93). The Risk Stratification Index had the highest area under the receiver operating characteristics curve overall, at 0.97 (95% CI, 0.94 to 0.99).

Conclusions

Deep neural networks can predict in-hospital mortality based on automatically extractable intraoperative data, but are not (yet) superior to existing methods.

About 230 million surgeries are performed annually worldwide.1  While overall postoperative mortality is low (less than 2%), the approximately 12% of patients who make up the high-risk surgical group account for 80% of postoperative deaths.2,3  To assist in guiding clinical decisions and the prioritization of care, several perioperative clinical and administrative risk scores have been proposed.

The goal of perioperative clinical risk scores is to help guide care in individual patients by planning clinical management and allocating resources. The goal of perioperative administrative risk scores (based on diagnoses and procedures) is to help compare hospitals. In the perioperative setting, frequently used risk scores include the American Society of Anesthesiologists (ASA) Physical Status Classification (a preoperative score) and the Surgical Apgar score.4,5  The ASA score was developed in 1963 and remains widely used.4  Its main limitations are that it is subjective, presents with high inter- and intrarater variability, cannot be automated, and relies on clinicians’ experience. The Surgical Apgar score (an intraoperative score) uses three variables: (1) estimated blood loss, (2) lowest mean arterial pressure, and (3) lowest heart rate during surgery to predict major postoperative complications.5  Favored for its simplicity, the Surgical Apgar score has an area under the receiver operating characteristics curve ranging from 0.6 to 0.8 for major complications or death, varying with surgical subspecialty.6–9  In addition, the Surgical Apgar score has been shown not to substantially improve mortality risk stratification when combined with preoperative scores.9  In response to these limitations, there has been work to create more objective and accurate scores. The most popular method used to develop new scoring systems is logistic regression, as for the Preoperative Score to Predict Postoperative Mortality.10  To make these scores accessible in clinical practice, the logistic regression coefficients are normalized to easily summed values that are interpreted as a score, rather than using the direct logistic regression output. Besides the aforementioned clinical risk scores, other recent perioperative administrative risk scores are the Risk Stratification Index (published initially in 201011  and validated in 2017 on nearly 40 million patients12 ) and the Risk Quantification Index.13 

In recent years, and although they are not new,11  neural networks and deep neural networks, known as “deep learning,” have been used to tackle a variety of problems, ranging from computer vision12–17  and gaming18–20  to high-energy physics,21,22  chemistry,23–25  and biology.26–28  While there have been studies using other machine-learning methods for clinical applications such as predicting cardiorespiratory instability29,30  and 30-day readmission,31,32  the use of deep neural networks in medicine remains relatively limited.33–36 

In this manuscript, we present the development and validation of a deep neural network model based upon intraoperative clinical features to predict postoperative in-hospital mortality in patients undergoing surgery under general anesthesia. Its performance is presented alongside other published clinical and administrative risk scores, as well as a logistic regression model using the same intraoperative features as the deep neural network. We also assessed whether the deep neural networks could leverage preoperative information by adding the ASA score or the Preoperative Score to Predict Postoperative Mortality as features.

This manuscript follows the “Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View.”37 

Materials and Methods

Electronic Medical Record Data Extraction

All data for this study were extracted from the Perioperative Data Warehouse, a custom-built, robust data warehouse containing all patients who have undergone surgery at the University of California Los Angeles (Los Angeles, California) since the implementation of the electronic medical record (EPIC Systems, USA) on March 17, 2013. The construction of the Perioperative Data Warehouse has been described previously.38  Briefly, the Perioperative Data Warehouse has a two-stage design. In the first stage, data are extracted from EPIC’s Clarity database into 26 tables organized around three distinct concepts: patients, surgical procedures, and health system encounters. These data are then used to populate a series of 800 distinct measures and metrics, such as procedure duration, readmissions, and admission International Classification of Diseases (ICD) codes. All data used for this study were obtained from this data warehouse, and institutional review board approval (No. 15-000518) was obtained for this retrospective review.

A list of all surgical cases performed between March 17, 2013, and July 16, 2016, was extracted from the Perioperative Data Warehouse. The University of California Los Angeles Health System includes two inpatient medical centers and three ambulatory surgical centers; however, only cases performed in one of the two inpatient hospitals (including operating room and “off-site” locations) under general anesthesia were included in this analysis. Cases on patients younger than 18 yr of age or older than 89 yr of age were excluded. If more than one procedure was performed during a given health system encounter, only the first case was included.
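
As an illustrative sketch only, this selection logic could be written in pandas; the column names below (surgery_date, facility, anesthesia_type, age_years, encounter_id) are hypothetical placeholders, not the actual warehouse schema.

```python
# Hypothetical sketch of the cohort selection; column names are illustrative.
import pandas as pd

cases = pd.read_csv("cases.csv", parse_dates=["surgery_date"])  # hypothetical extract

cohort = cases[
    cases["surgery_date"].between("2013-03-17", "2016-07-16")
    & cases["facility"].isin(["Inpatient Hospital A", "Inpatient Hospital B"])
    & cases["anesthesia_type"].eq("General")
    & cases["age_years"].between(18, 89)          # excludes <18 and >89 yr
]

# If an encounter had more than one procedure, keep only the first case
cohort = (cohort.sort_values("surgery_date")
                .drop_duplicates("encounter_id", keep="first"))
```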

Model Endpoint Definition

The occurrence of in-hospital mortality was extracted as a binary event (0, 1) based upon either the presence of a “mortality date” in the electronic medical record between surgery time and discharge, or a discharge disposition of expired combined with a note associated with the death (e.g., death summary, death note). The definition of in-hospital mortality was independent of length of stay in the hospital.
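
A minimal pandas sketch of this endpoint definition, again with hypothetical column names (mortality_date, surgery_end, discharge_time, disposition, note_type) rather than the actual schema:

```python
# Sketch of the binary in-hospital mortality endpoint described above.
import pandas as pd

def in_hospital_mortality(df: pd.DataFrame) -> pd.Series:
    # Mortality date recorded between surgery time and discharge
    died_in_house = (
        df["mortality_date"].notna()
        & (df["mortality_date"] >= df["surgery_end"])
        & (df["mortality_date"] <= df["discharge_time"])
    )
    # Discharge disposition of expired combined with a death-related note
    expired_with_note = (
        df["disposition"].eq("Expired")
        & df["note_type"].isin(["death summary", "death note"])
    )
    # Binary endpoint (0/1), independent of length of stay
    return (died_in_house | expired_with_note).astype(int)
```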

Model Input Features

Each surgical record corresponded to a unique hospital admission and contained 87 features calculated or extracted at the end of surgery (table 1). These features were considered to be potentially predictive of in-hospital mortality by clinicians’ consensus (I.H., M.C., E.G.) and included descriptive intraoperative vital signs, such as minimum and maximum blood pressure values; summary of drug and fluid interventions, such as total blood infused and total vasopressin administered; and patient anesthesia descriptions, such as presence of an arterial line and type of anesthesia (all features are detailed in table 1).

Table 1.

Eighty-seven Features Used in Models with Description and Applied Maximum Possible Values as Defined by Domain Experts

Data Preprocessing

Before model development, missing values were filled with the mean value of the respective feature. In addition, to account for observations where a value was clinically out of range, values greater than a clinically normal maximum were set to a maximum possible value (table 1). These out-of-range values were due to data artifacts in the raw electronic medical record data. For example, a systolic blood pressure of 400 mmHg is not clinically possible; however, it may be recorded as the maximum systolic blood pressure for the case during electronic medical record extraction. The data were then randomly divided into training (80%) and test (20%) data sets with an equal percent occurrence of in-hospital mortality. Training data were rescaled to have a mean of 0 and SD of 1 per feature. Test data were rescaled with the training data mean and SD.
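
A minimal sketch of these preprocessing steps, assuming a NumPy feature matrix X, a label vector y, and a hypothetical max_values array holding the table 1 per-feature maxima:

```python
# Sketch of the preprocessing pipeline described above.
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess(X: np.ndarray, y: np.ndarray, max_values: np.ndarray):
    # 1) Fill missing values with the per-feature mean
    col_means = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_means, X)
    # 2) Cap clinically impossible values (e.g., SBP of 400 mmHg) at the
    #    domain-expert maximum for each feature
    X = np.minimum(X, max_values)
    # 3) 80/20 split, stratified so both sets share the mortality rate
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    # 4) Standardize using training-set statistics only
    mean, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
    return (X_tr - mean) / sd, (X_te - mean) / sd, y_tr, y_te
```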

Development of the Model

In this work, we were interested in classifying patients at risk of in-hospital mortality using deep neural networks, also referred to as deep learning. A deep neural network contains many model parameters that must be optimized during training: they are first initialized and then iteratively adjusted to decrease the error of the model’s output in classifying in-hospital mortality. This error is quantified by a loss function. The type of deep neural network used in this study is a feedforward network with fully connected layers and a logistic output. “Fully connected” refers to the fact that all neurons in adjacent layers are pairwise connected. A logistic output was chosen so that the output of the model could be interpreted as a probability of in-hospital mortality (0 to 1). Developing a deep neural network also requires tuning the hyperparameters and the architecture. We utilized stochastic gradient descent with momentum values of 0.8, 0.85, 0.9, 0.95, and 0.99; initial learning rates of 0.01, 0.1, and 0.5; and a batch size of 200. We also assessed deep neural network architectures of one to five hidden layers with 10 to 300 neurons per layer, and rectified linear unit and hyperbolic tangent activation functions. The loss function was cross entropy. We utilized fivefold cross-validation within the training set (80%) to select the best hyperparameters and architecture based on mean cross-validation performance. The best hyperparameters and architecture were then used to train a model on the entire training set (80%) before final model performance was measured on the separate test set (20%).
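
The search procedure might look like the following sketch (tf.keras and scikit-learn). The neuron grid points, the epoch count, and the use of area under the curve as the selection metric are illustrative assumptions, and X_train/y_train are assumed from the preprocessing step; the full grid is large and is shown only for structure.

```python
# Sketch of the fivefold cross-validated hyperparameter/architecture search.
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

def build_model(n_features, n_layers, n_units, activation, lr, momentum):
    model = Sequential()
    model.add(Dense(n_units, activation=activation, input_shape=(n_features,)))
    for _ in range(n_layers - 1):
        model.add(Dense(n_units, activation=activation))
    model.add(Dense(1, activation="sigmoid"))       # logistic output
    model.compile(optimizer=SGD(learning_rate=lr, momentum=momentum),
                  loss="binary_crossentropy")       # cross-entropy loss
    return model

grid = itertools.product(
    range(1, 6),                    # one to five hidden layers
    [10, 50, 100, 200, 300],        # neurons per layer (illustrative points)
    ["relu", "tanh"],               # activation functions
    [0.01, 0.1, 0.5],               # initial learning rates
    [0.8, 0.85, 0.9, 0.95, 0.99],   # momentum values
)

best_auc, best_cfg = 0.0, None
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for cfg in grid:
    n_layers, n_units, act, lr, mom = cfg
    fold_aucs = []
    for tr, va in skf.split(X_train, y_train):
        model = build_model(X_train.shape[1], n_layers, n_units, act, lr, mom)
        model.fit(X_train[tr], y_train[tr], batch_size=200, epochs=30, verbose=0)
        fold_aucs.append(roc_auc_score(y_train[va],
                                       model.predict(X_train[va]).ravel()))
    if np.mean(fold_aucs) > best_auc:               # mean CV performance
        best_auc, best_cfg = np.mean(fold_aucs), cfg
```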

Overfitting.

Overfitting was a major concern in the development of our model. While approximately 50,000 patients is large for clinical data, it is small relative to data sets typically found in deep learning tasks such as vision and speech recognition, where millions of samples are available. Thus, regularization was critical. To address this, we utilized three methods: (1) early stopping, (2) L2 weight decay, and (3) dropout. Early stopping halts model training when the loss on a separate early-stopping validation set starts to increase relative to the training loss, indicating overfitting. This early-stopping validation set was taken as a random 20% of the training set, and a patience of 10 epochs was utilized. L2 weight decay limits the magnitude of every weight: the standard L2 penalty adds an extra term to the loss function that penalizes the squared weights, keeping the weights small unless the error derivative is large. We utilized an L2 weight penalty of 0.0001. Dropout is a method in which neurons are randomly removed from the network with a specified probability during training, to prevent co-adaptation of neurons.39–41  Dropout was applied to all layers with a probability of 0.5.
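
The three regularizers can be sketched as follows (tf.keras). The layer sizes are placeholders, while the dropout probability (0.5), L2 penalty (0.0001), patience (10 epochs), and the internal 20% early-stopping split follow the text; X_train/y_train are assumed from earlier steps.

```python
# Sketch of the three regularization methods described above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Dense(300, activation="relu", kernel_regularizer=l2(1e-4),
          input_shape=(87,)),
    Dropout(0.5),                                   # drop neurons with p = 0.5
    Dense(300, activation="relu", kernel_regularizer=l2(1e-4)),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# Halt training when validation loss stops improving for 10 epochs
early_stop = EarlyStopping(monitor="val_loss", patience=10)
model.fit(X_train, y_train, validation_split=0.2,   # 20% early-stopping set
          batch_size=200, epochs=500, callbacks=[early_stop], verbose=0)
```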

Data Augmentation.

The goal of training was to optimize model parameters to decrease classification error for in-hospital mortality. However, the actual percent occurrence of in-hospital mortality in the data was low (less than 1% in the training data set), so the class distribution was heavily skewed. To address this skewed distribution, training data were augmented by taking only the observations positive for in-hospital mortality and adding Gaussian noise: a random number drawn from a Gaussian distribution with an SD of 0.0001 was added to each feature’s value. This essentially duplicated the in-hospital mortality observations with a slight perturbation. The in-hospital mortality observations in the training data set were augmented by this method to approximately 45% occurrence before training. During cross-validation, only the training folds were augmented; the validation fold was not.
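
A sketch of this augmentation, assuming standardized NumPy arrays; the target fraction (0.45) and noise SD (0.0001) follow the text.

```python
# Sketch of Gaussian-noise augmentation of the positive (mortality) class.
import numpy as np

def augment_positives(X, y, target_pos_frac=0.45, noise_sd=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    pos = X[y == 1]
    n_pos, n_neg = len(pos), int((y == 0).sum())
    # Number of noisy copies needed for positives to reach ~45% of the data
    n_needed = int(target_pos_frac / (1 - target_pos_frac) * n_neg) - n_pos
    idx = rng.integers(0, n_pos, size=n_needed)
    noisy = pos[idx] + rng.normal(0.0, noise_sd, (n_needed, X.shape[1]))
    X_aug = np.vstack([X, noisy])                 # originals plus noisy copies
    y_aug = np.concatenate([y, np.ones(n_needed)])
    perm = rng.permutation(len(y_aug))            # shuffle before training
    return X_aug[perm], y_aug[perm]
```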

Feature Reduction and Preoperative Feature Experiments

We also conducted experiments to assess the impact of (1) reducing the number of features from the clinician-chosen 87 to 45, and (2) adding the ASA score or the Preoperative Score to Predict Postoperative Mortality as a feature. The reduced 45-feature set was created by excluding all “derived” features, specifically the average, median, SD, and last-10-min versions of the surgical case features (table 1).

After choosing the best performing deep neural network architecture and hyperparameters with the complete 87 features data set, five additional deep neural networks were each trained with the following: (1) the addition of ASA score as a model feature (88 features); (2) the addition of Preoperative Score to Predict Postoperative Mortality as a model feature (88 features); (3) a reduced model feature set (45 features); (4) the addition of ASA score to the reduced feature set (46 features); and (5) the addition of Preoperative Score to Predict Postoperative Mortality to the reduced feature set (46 features).

Model Performance

All model performances were assessed on the 20% of the data held out from training as a test set. Model performance was compared to the ASA score, Surgical Apgar, Risk Quantification Index, Risk Stratification Index, Preoperative Score to Predict Postoperative Mortality, and a standard logistic regression model using the same combination of features as in the deep neural network. ASA score was extracted from the University of California Los Angeles preoperative assessment record. Surgical Apgar was calculated as described by Gawande et al.5  Risk Quantification Index could not be calculated using the downloadable R package from the Cleveland Clinic’s Web site (http://my.clevelandclinic.org/departments/anesthesiology/depts/outcomes-research; accessed October 16, 2017) due to technical issues with the R version, so Risk Quantification Index log probability and score were calculated from the equations provided in Sigakis et al.42  Uncalibrated Risk Stratification Index was calculated using coefficients provided by the original authors (Supplemental Digital Content, https://links.lww.com/ALN/B681).43  To calculate the Risk Stratification Index, all International Classification of Diseases, Ninth Revision (ICD-9) diagnosis codes for each patient were matched with a Risk Stratification Index coefficient, and the coefficients were then summed. Preoperative Score to Predict Postoperative Mortality scores were extracted from the Perioperative Data Warehouse, where they were calculated as described by Le Manach et al.10  Each of the diseases described by Le Manach et al.10  was extracted as a binary endpoint from the admission ICD codes for the relevant hospital admission. In addition to assigning points based on patient comorbidities, the Preoperative Score to Predict Postoperative Mortality also assigns points for the type of surgery performed. These points were assigned based on the primary surgical service for the given procedure.
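
The Risk Stratification Index summation can be sketched as a coefficient lookup and a per-patient sum; the toy codes and coefficient values below are placeholders, not the published coefficients.

```python
# Sketch of the RSI calculation: match each ICD-9 diagnosis code to its
# published coefficient and sum the matched coefficients per patient.
import pandas as pd

icd9 = pd.DataFrame({"patient_id": [1, 1, 2],
                     "icd9_code": ["428.0", "584.9", "401.9"]})   # toy data
coefs = pd.DataFrame({"icd9_code": ["428.0", "584.9", "401.9"],
                      "coefficient": [0.8, 0.5, 0.1]})            # placeholders

rsi = (icd9.merge(coefs, on="icd9_code")          # match codes to coefficients
            .groupby("patient_id")["coefficient"]
            .sum())                               # uncalibrated RSI per patient
```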

Area under the Receiver Operating Characteristics Curves.

Model performance was assessed using area under the receiver operating characteristics curve, and 95% CIs for the area under the receiver operating characteristics curve were calculated using bootstrapping with 1,000 samples.
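
A sketch of this bootstrap; the percentile method for the 95% CI is an assumption, and resamples containing no deaths are redrawn so that the area under the curve remains defined.

```python
# Sketch of bootstrapped 95% CIs for the AUC (expects NumPy arrays).
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, aucs = len(y_true), []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)          # resample with replacement
        if y_true[idx].sum() == 0:                # need at least one death
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [2.5, 97.5])       # percentile-method 95% CI
```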

Choosing a Threshold.

The F1 score, sensitivity, and specificity were calculated at different thresholds for the deep neural network models, logistic regression model, ASA score, and Preoperative Score to Predict Postoperative Mortality. The F1 score is a combined measure of precision and recall, ranging from 0 to 1. It is calculated as F1 = 2 × (precision × recall)/(precision + recall), where precision is (true positives/predicted positives) and recall is equivalent to sensitivity. Two different threshold methods were assessed: (1) a threshold that optimized the observed in-hospital mortality rate, and (2) a threshold based on the highest F1 score. The numbers of true positives, true negatives, false positives, and false negatives were then assessed at each threshold to assess differences in the number of patients correctly predicted by each model.
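
A sketch of the threshold sweep behind tables 4 and 5:

```python
# At each candidate threshold, compute sensitivity, specificity, and
# F1 = 2 * precision * recall / (precision + recall), plus the confusion counts.
import numpy as np
from sklearn.metrics import confusion_matrix

def threshold_metrics(y_true, y_prob, thresholds):
    rows = []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        sens = tp / (tp + fn)                        # recall
        spec = tn / (tn + fp)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        f1 = 2 * prec * sens / (prec + sens) if (prec + sens) else 0.0
        rows.append((t, f1, sens, spec, tp, tn, fp, fn))
    return rows
```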

Calibration.

Calibration was performed to account for the use of data augmentation on the training data set. This augmentation balanced the training classes to approximately 45% mortality, versus the true distribution of mortality (less than 1%), and therefore skewed the predicted probabilities upward relative to the expected probability under the true distribution. We therefore performed calibration after finalizing the model; calibration was performed only on the test data set. The deep neural network’s predicted probability output was calibrated using the standard prior-correction formula for class rebalancing:

P_calibrated(1) = [P(1) × (π1/ρ1)] / [P(1) × (π1/ρ1) + P(0) × (π0/ρ0)]

where P(1) is the model’s predicted probability of in-hospital mortality, P(0) = 1 − P(1), π1 is the expected (true) mortality rate with π0 = 1 − π1, and ρ1 is the augmented training mortality rate with ρ0 = 1 − ρ1. This calibration formula is monotonic, so it maintains the rank of predicted probabilities and thus does not change any model performance metrics (area under the receiver operating characteristics curve, sensitivity, specificity, or F1 score). Additionally, calibration plots and Brier scores were used to assess calibration of predictions.
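
A sketch of this calibration as reconstructed above, with pi1 set to the training mortality rate (0.81%) and rho1 to the augmented rate (45%) for illustration; the Brier score used to assess calibration is included.

```python
# Prior-correction calibration after class rebalancing (monotonic mapping,
# so AUC, sensitivity, specificity, and F1 are unchanged).
import numpy as np

def calibrate(p, pi1=0.0081, rho1=0.45):
    w1 = pi1 / rho1                     # reweight the positive class
    w0 = (1 - pi1) / (1 - rho1)         # reweight the negative class
    return (p * w1) / (p * w1 + (1 - p) * w0)

def brier(y_true, p):
    # Mean squared difference between predicted probability and outcome
    return np.mean((p - y_true) ** 2)
```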

Feature Importance.

To assess which features are the most predictive in the deep neural network, we performed a feature ablation analysis. This analysis consisted of removing model features grouped by type of clinical feature and then retraining a deep neural network with the same final architecture and hyperparameters on the remaining features. The change in area under the receiver operating characteristics curve after removal of each group was then used to evaluate the importance of that group of features. To assess which features are the most predictive in the logistic regression model, we assessed which features corresponded to the largest weights.
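
A sketch of the ablation loop, where feature_groups maps group names to column indices and train_final_model stands in for retraining with the fixed final architecture and hyperparameters:

```python
# Drop one clinical feature group at a time, retrain, and record the AUC drop.
import numpy as np
from sklearn.metrics import roc_auc_score

def ablation(X_tr, y_tr, X_te, y_te, feature_groups, train_final_model):
    base_model = train_final_model(X_tr, y_tr)
    base_auc = roc_auc_score(y_te, base_model.predict(X_te).ravel())
    drops = {}
    for name, cols in feature_groups.items():
        keep = np.setdiff1d(np.arange(X_tr.shape[1]), cols)
        model = train_final_model(X_tr[:, keep], y_tr)
        auc = roc_auc_score(y_te, model.predict(X_te[:, keep]).ravel())
        drops[name] = base_auc - auc     # larger drop = more important group
    return drops
```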

All deep neural network models were developed and applied using Keras.44  Logistic regression models and performance metrics were calculated with scikit-learn.45 

Results

Patient Characteristics

The data consisted of 59,985 surgical records. Patient demographics and characteristics of the training and test data sets are summarized in table 2. The in-hospital mortality rate of both the training and test sets was less than 1%. The presence of invasive lines was also similar in both sets (26.5% in training; 26.7% in test). The most prevalent ASA score was III, at 49.9% in both sets.

Table 2.

Training and Test Data Set Patient Characteristics Reported as Number of Patients (%) or Mean ± SD

Development of the Model

The final deep neural network architecture consists of four hidden layers of 300 neurons per layer with rectified linear unit activations and a logistic output (fig. 1). The deep neural network was trained with a dropout probability of 0.5 between all layers, L2 weight decay of 0.0001, a learning rate of 0.01, and a momentum of 0.9.
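
This final configuration can be rendered in a few lines of tf.keras (a sketch, not the authors' code; the input dimension of 87 corresponds to the full feature set, with 46 for the reduced set plus ASA score):

```python
# Final architecture: four hidden layers of 300 ReLU units, dropout 0.5,
# L2 weight decay 0.0001, SGD with learning rate 0.01 and momentum 0.9.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD

model = Sequential([Dense(300, activation="relu",
                          kernel_regularizer=l2(1e-4), input_shape=(87,)),
                    Dropout(0.5)])
for _ in range(3):                       # three more hidden layers
    model.add(Dense(300, activation="relu", kernel_regularizer=l2(1e-4)))
    model.add(Dropout(0.5))
model.add(Dense(1, activation="sigmoid"))  # logistic output: P(mortality)
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
              loss="binary_crossentropy")
```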

Fig. 1.

Summary visualization of the deep neural network. The input layer (blue) of features feeds into the first hidden layer of 300 neurons with rectified linear unit activations (grey). All activations of the neurons in the first hidden layer feed into each neuron in the second hidden layer, the second into the third, and the third into the fourth. All activations of the neurons in the fourth hidden layer are then fed into a logistic output layer to produce a probability of in-hospital mortality between 0 and 1.

Model Performance

All performance metrics reported below refer to the test data set (n = 11,997).

Area under the Receiver Operating Characteristics Curves.

Receiver operating characteristics curves and area under the receiver operating characteristics curve results are shown in figure 2 and table 3. All logistic regression models and all deep neural networks had higher areas under the receiver operating characteristics curve than the Preoperative Score to Predict Postoperative Mortality (0.74 [95% CI, 0.68 to 0.79]) and Surgical Apgar (0.58 [95% CI, 0.52 to 0.64]) for predicting in-hospital mortality (fig. 2, table 3). All deep neural networks had higher areas under the receiver operating characteristics curve than the logistic regressions for each combination of features, except for the reduced feature set with Preoperative Score to Predict Postoperative Mortality (logistic regression 0.90 [95% CI, 0.86 to 0.93] vs. deep neural network 0.90 [95% CI, 0.87 to 0.93]). In addition, reducing the feature set from 87 to 45 features did not reduce the deep neural network area under the receiver operating characteristics curve, and the addition of ASA score or Preoperative Score to Predict Postoperative Mortality as a feature modestly improved the areas under the curve of both the full and reduced feature set deep neural network models. The highest deep neural network area under the receiver operating characteristics curve was achieved by the deep neural network with reduced feature set and ASA score (0.91 [95% CI, 0.88 to 0.93]). The highest risk score area under the receiver operating characteristics curve was that of the Risk Stratification Index (0.97 [95% CI, 0.94 to 0.99]), and the highest logistic regression areas under the curve were achieved by the logistic regression with reduced feature set and ASA score (0.90 [95% CI, 0.87 to 0.93]) and the logistic regression with reduced feature set and Preoperative Score to Predict Postoperative Mortality (0.90 [95% CI, 0.86 to 0.93]).

Table 3.

Area under the Receiver Operating Characteristic Curve Results with 95% CIs for the Test Set (N = 11,997)

Fig. 2.

Receiver operating characteristic curves to predict postoperative in-hospital mortality.

Choosing a Threshold.

For comparison of F1 scores, sensitivity, and specificity at different thresholds, the deep neural network with the original 87 features (DNN), the deep neural network with a reduced feature set and Preoperative Score to Predict Postoperative Mortality (DNNrfsPOSPOM), and the deep neural network with a reduced feature set and ASA score (DNNrfsASA) were compared to the ASA score, the Preoperative Score to Predict Postoperative Mortality, logistic regression with the original 87 features, logistic regression with a reduced feature set and Preoperative Score to Predict Postoperative Mortality (LRrfsPOSPOM), and logistic regression with a reduced feature set and ASA score (LRrfsASA; table 4). To compare the number of patients correctly predicted by the deep neural networks at different thresholds, the numbers of correctly and incorrectly classified patients are shown for all models at different thresholds for all test patients (n = 11,997; table 5).

Table 4.

Percentage of Observed Mortality Patients Correctly Identified, F1 Score, Sensitivity, and Specificity Performance of ASA Score; POSPOM; Logistic Regression Model and DNN Model with 87 Features; Logistic Regression Model and DNN Model with Reduced Feature Set and ASA Score; and Logistic Regression Model and DNN Model with Reduced Feature Set and POSPOM at Different Thresholds

Table 5.

Number of Correctly and Incorrectly Classified Patients for ASA Score; POSPOM; Logistic Regression Model and DNN Model with 87 Features; Logistic Regression Model and DNN Model with Reduced Feature Set and ASA; and Logistic Regression Model and DNN Model with Reduced Feature Set and POSPOM at Different Thresholds

If we choose a threshold that optimizes the observed in-hospital mortality rate, the thresholds (% observed mortality) for the Preoperative Score to Predict Postoperative Mortality, ASA score, logistic regression, LRrfsPOSPOM, and LRrfsASA are 10 (93.1%), 3 (97.7%), 0.00015 (98.9%), 0.002 (97.7%), and 0.0034 (96.66%), respectively (table 4). The thresholds for DNN, DNNrfsPOSPOM, and DNNrfsASA are 0.05 (98.9%), 0.2 (96.6%), and 0.22 (96.6%), respectively. At these thresholds, the Preoperative Score to Predict Postoperative Mortality, ASA score, logistic regression, LRrfsPOSPOM, LRrfsASA, DNN, DNNrfsPOSPOM, and DNNrfsASA all have high and comparable sensitivities. The deep neural network with the highest area under the receiver operating characteristics curve, DNNrfsASA, had a sensitivity of 0.97 (95% CI, 0.92 to 1) and specificity of 0.64 (95% CI, 0.64 to 0.65), and the logistic regression with the highest area under the receiver operating characteristics curve, LRrfsASA, had a sensitivity of 0.97 (95% CI, 0.92 to 1) and specificity of 0.64 (95% CI, 0.63 to 0.65). However, all deep neural networks reduced false positives while maintaining the same or a similar number of false negatives (table 5). The deep neural network with all 87 original features decreased the number of false positives compared to logistic regression, from 11,873 to 9,169 patients. DNNrfsASA decreased the number of false positives compared to LRrfsASA, from 4,332 to 4,241 patients, and also had fewer false positives than the Preoperative Score to Predict Postoperative Mortality (9,169 patients) and ASA score (6,666 patients).

If we choose a threshold that optimizes precision and recall via the F1 score, the thresholds for the Preoperative Score to Predict Postoperative Mortality, ASA score, logistic regression, LRrfsPOSPOM, and LRrfsASA are higher, at 20, 5, 0.1, 0.1, and 0.1, respectively (table 4). The thresholds for DNN, DNNrfsPOSPOM, and DNNrfsASA also increased, to 0.3, 0.4, and 0.3, respectively. The highest F1 scores were comparable for ASA score, LRrfsASA, and DNNrfsASA, at 0.24 (95% CI, 0.14 to 0.35), 0.26 (95% CI, 0.18 to 0.33), and 0.22 (95% CI, 0.12 to 0.30), respectively. However, DNNrfsASA had a lower number of false positives, at 35 patients, compared to LRrfsASA at 115 patients (table 5).

Calibration.

For comparison of calibration, Brier scores and calibration plots were assessed for logistic regression, DNNrfsASA, and calibrated DNNrfsASA. DNNrfsASA had the worst Brier score of 0.0352, and logistic regression had the best score of 0.0065 (fig. 3). However, the calibrated DNNrfsASA had a comparable Brier score of 0.0071. Calibration of DNNrfsASA shifted the best thresholds for observed mortality optimization and F1 optimization from 0.2 and 0.4 to 0.0018 and 0.0048, respectively.

Fig. 3.

Calibration plot with mean predicted probability versus true positive frequency (number of true positives/number of samples) per probability value bin in the test data set (N = 11,997) for logistic regression, deep neural network (DNN) with reduced feature set and American Society of Anesthesiologists (ASA) score, and calibrated DNN with reduced feature set and ASA score. Bins of predicted probability were at intervals of 0.1: (0 to 0.1), (0.1 to 0.2)…(0.9 to 1.0).

Feature Importance.

To assess feature importance in the deep neural network, we assessed the decrease in area under the receiver operating characteristics curve for the removal of groups of features from the best deep neural network (DNNrfsasa; table 6; fig. 4). For the analysis, 13 groups were used (age, anesthesia, ASA score, input, blood pressure, output, vasopressor, vasodilator, labs, heart rate, invasive line, inotrope, and pulse oximetry). To assess feature importance, we assessed the weights for the logistic regression model (LRrfsASA; fig. 5). The top five deep neural network features groups were: labs, ASA score, anesthesia, blood pressure, and vasopressor administration. The top logistic regression feature was ASA score. In addition, similar to the deep neural network, vasopressin administration, hemoglobin, presence of arterial or pulmonary arterial line, and sevoflurane administration are found in the top 10 weights.

Table 6.

Features Removed with Each Group during Each Step of the Feature Ablation Analysis for the DNN

Fig. 4.

Decrease in area under the receiver operating characteristic curve (AUC) performance for each feature group removed during feature ablation analysis for deep neural network with reduced feature set and American Society of Anesthesiologists (ASA) score. BP, blood pressure; DNN, deep neural network; HR, heart rate.

Fig. 5.

Logistic regression model weight assigned to each feature in the logistic regression model with reduced feature set and American Society of Anesthesiologists (ASA) score.

We have developed a Web site application that performs predictions for DNNrfsASA and DNNrfs on a given data set. The application and a downloadable model package are available at http://risknet.ics.uci.edu.

Discussion

The results of this study demonstrate that deep neural networks can be utilized to predict in-hospital mortality based on automatically extractable and objective intraoperative data. In addition, these predictions are further improved by the addition of preoperative information, as summarized in a patient’s ASA score or Preoperative Score to Predict Postoperative Mortality. The area under the receiver operating characteristics curve of the “best” deep neural network model, with a reduced feature set and ASA score (DNNrfsASA), also outperformed the Surgical Apgar, the Preoperative Score to Predict Postoperative Mortality, and the ASA score. When thresholds are optimized to capture the most observed mortality patients, in other words optimizing for sensitivity, DNNrfsASA has higher sensitivity than the Preoperative Score to Predict Postoperative Mortality but is comparable to ASA score, LRrfsASA, and LRrfsPOSPOM. This is expected, as ASA score is a feature in this deep neural network model. Most notable, however, is that DNNrfsASA reduces the number of false positives compared to the Preoperative Score to Predict Postoperative Mortality and ASA score by 54% and 36%, respectively. DNNrfsASA also reduced the number of false positives relative to the most comparably performing logistic regression model, LRrfsASA, by 2%. In addition, it should be noted that for each feature set combination (all 87 features, 87 features with ASA score, 87 features with Preoperative Score to Predict Postoperative Mortality, reduced features, reduced features with ASA score, and reduced features with Preoperative Score to Predict Postoperative Mortality), the deep neural network slightly outperformed logistic regression, with the exception of the reduced feature set with Preoperative Score to Predict Postoperative Mortality. However, adding the Preoperative Score to Predict Postoperative Mortality amounts to adding a logistic regression model output as a feature to another logistic regression model, which can be thought of as adding one hidden layer to a neural network with a logistic output. While the area under the receiver operating characteristics curve of logistic regression with the same reduced feature set and ASA score (LRrfsASA) was not significantly lower than that of DNNrfsASA, the deep neural network with all 87 original features outperformed logistic regression with the same 87 features in area under the receiver operating characteristics curve and significantly decreased the number of false positives, by 2,377 patients (20%). This suggests that without careful feature selection to reduce the number of features, as well as the addition of preoperative information, logistic regression did not perform comparably to a deep neural network. Logistic regression can be thought of as a neural network with no hidden layers. When complexity is preserved, such as when careful feature selection or more rigorous preprocessing is not performed, neural networks with many hidden layers are able to perform well, and in some cases better than logistic regression.

Due to the low number of mortality cases (n = 87), the numbers of false negatives are hard to compare in this very small mortality population. This small number of mortality patients also affects the interpretation of the calibration results. Extensive data augmentation was used so that the deep neural network was trained on balanced classes, which shifted the predicted probabilities upward. The deep neural network’s predicted probability was calibrated to the expected probability of mortality (less than 1%), so that all predicted probabilities were shifted down to values well below 0.01, reflecting the percent occurrence of in-hospital mortality while maintaining all performance metrics. After calibration, the calibrated DNNrfsASA had a better Brier score, closer to that of logistic regression, and the optimal mortality threshold for DNNrfsASA shifted down from 0.2 to 0.0018, a more reasonable threshold considering the low percent occurrence of mortality. For direct comparison in the calibration plot, the same probability bins at intervals of 0.1 were chosen for the calibrated and uncalibrated DNNrfsASA as well as logistic regression. A limitation of the calibration plot is that it is highly dependent on the choice of bins. This limitation is reflected in the resulting calibration plot for the calibrated DNNrfsASA, where 86 mortality patients were predicted in the bin (0 to 0.1) and one patient was predicted in the bin (0.9 to 1). Thus, the interpretation of these results is limited by the small number of true positives.

While the Risk Quantification Index had a high and comparable area under the receiver operating characteristics curve to DNNrfsASA, it could only be calculated for 47% of the test patients because one of its inputs, the Procedural Severity Score, was available for only a limited number of Current Procedural Terminology codes. The Risk Stratification Index had the highest area under the receiver operating characteristics curve, at 0.97, and, unlike the Risk Quantification Index, could be calculated for the vast majority of patients; it requires ICD-9 procedural and diagnosis codes. There are important distinctions to be made between risk scores based on clinical data (ASA score, Surgical Apgar, Preoperative Score to Predict Postoperative Mortality, and the logistic regression and deep neural network models reported here) and those based on administrative data (Risk Stratification Index, Risk Quantification Index). First, present-on-admission diagnoses and planned procedures (i.e., ICD-9 and ICD-10 codes) are theoretically available preoperatively, but in practice the coding is done after discharge and therefore is not actually available preoperatively to guide clinical care. This makes scores such as the Risk Stratification Index appropriate for their intended purpose, comparing hospitals, but not for individual patient care. Second, point-of-care clinical data contain more information about specific patients than models based only on diagnosis and procedure codes, and therefore should be more specific and useful for guiding the care of individual patients. These distinctions should not be seen as “one is better than another” so much as a matter of selecting the right model for a particular purpose.

Perhaps the most attractive feature of this mortality model is that it provides a fully automated and highly accurate way to estimate the mortality risk of a patient at the end of surgery. All data contained in the risk score are easily obtained from the electronic medical record and could be loaded into a model automatically. While the ASA score is subjective, presents with high inter- and intrarater variability, and requires input from the anesthesiologist into the electronic medical record, this input is common practice as part of the preoperative assessment. In addition, we have also trained a deep neural network model using the Preoperative Score to Predict Postoperative Mortality with comparable performance metrics. Thus, if the clinical need is to be completely objective, the DNNrfsPOSPOM model would be the most automatic and objective, as the Preoperative Score to Predict Postoperative Mortality is based on the presence of key patient comorbidities and could be obtained automatically from the electronic medical record.

The input into this mortality model is based heavily on intraoperative data available at the end of surgery. There are 45 intraoperative features in the reduced feature set, and one preoperative feature was added to leverage preoperative information. The ability of the intraoperative-only mortality models (the deep neural network and the deep neural network with reduced feature set) to maintain high performance without any preoperative features further supports the idea that intraoperative events and management may have a significant effect on postoperative outcomes.

By definition, any screening score must trade off sensitivity (capturing all patients with the condition) against specificity (not capturing those who do not have the condition). As a result, clinically, we generally discuss the number needed to treat, that is, the number of “false positives” that must be treated to capture one true positive. Our deep neural network model not only had the highest area under the receiver operating characteristics curve but also reduced the number of false positives, thereby reducing the number needed to treat. Given the current transition toward value-based care, this has some appeal.

Another key advantage of a deep neural network model is its ability to account for the relationships between various clinical factors. For example, in a logistic regression model, excess estimated blood loss might be assigned one weight and hypotension another, so their contributions combine additively without any interaction. A deep neural network model, on the other hand, can account for linear or nonlinear associations of hypotension in a minimal blood loss versus a significant blood loss case. While a feature could be created to reflect this relationship between hypotension and blood loss and used as an input to a logistic regression model, a deep neural network model avoids this need for careful feature engineering and is able to create such features on its own. Eventually, integration of deep neural network models into electronic medical records could result in more accurate risk scores generated automatically for each patient, thereby providing real-time assistance in the triaging of patients.

Study Limitations

There are several limitations to this study. Perhaps most significantly, this study is from a single center and has a somewhat limited sample size. As mentioned above, deep learning models in other fields have included millions of samples. To address this limitation and avoid overfitting, we chose a limited number of features and implemented regularization techniques commonly used in deep learning. In addition, there were only 87 mortality patients in the test data set. Thus, it is possible that the results generated here are not fully generalizable to other institutions, and they will need to be validated on other data sets.

Conclusions

To the best of our knowledge, this study is the first to demonstrate the ability to use deep learning to predict postoperative in-hospital mortality based on intraoperative electronic medical record data. The deep learning model presented in this study is robust, shows improved or comparable discrimination to other risk scores, can be calculated automatically at the end of surgery, and does not rely on any administrative inputs.

Support was provided solely from institutional and/or departmental sources.

Dr. Lee is an Edwards Lifesciences (Irvine, California) employee, but this work was done independently from this position and as part of her Ph.D. Dr. Cannesson has ownership interest in Sironis, a company developing closed-loop systems, and does consulting for Edwards Lifesciences and Masimo Corp. (Irvine, California). Dr. Cannesson has received research support from Edwards Lifesciences through his department and National Institutes of Health (Bethesda, Maryland) grant Nos. R01 GM117622 (“Machine Learning of Physiological Variables to Predict Diagnose and Treat Cardiorespiratory Instability”) and R01 NR013912 (“Predicting Patient Instability Noninvasively for Nursing Care-Two [PPINNC-2]”). The other authors declare no competing interests.

References

1. Weiser TG, Regenbogen SE, Thompson KD, Haynes AB, Lipsitz SR, Berry WR, Gawande AA: An estimation of the global volume of surgery: A modelling strategy based on available data. Lancet 2008; 372:139–44
2. Pearse RM, Harrison DA, James P, Watson D, Hinds C, Rhodes A, Grounds RM, Bennett ED: Identification and characterisation of the high-risk surgical population in the United Kingdom. Crit Care 2006; 10:R81
3. Pearse RM, Moreno RP, Bauer P, Pelosi P, Metnitz P, Spies C, Vallet B, Vincent J-L, Hoeft A, Rhodes A; European Surgical Outcomes Study (EuSOS) group for the Trials groups of the European Society of Intensive Care Medicine and the European Society of Anaesthesiology: Mortality after surgery in Europe: A 7 day cohort study. Lancet 2012; 380:1059–65
4. American Society of Anesthesiologists: New classification of physical status. Anesthesiology 1963; 24:111
5. Gawande AA, Kwaan MR, Regenbogen SE, Lipsitz SA, Zinner MJ: An Apgar score for surgery. J Am Coll Surg 2007; 204:201–8
6. Reynolds PQ, Sanders NW, Schildcrout JS, Mercaldo ND, St Jacques PJ: Expansion of the surgical Apgar score across all surgical subspecialties as a means to predict postoperative mortality. Anesthesiology 2011; 114:1305–12
7. Haynes AB, Regenbogen SE, Weiser TG, Lipsitz SR, Dziekan G, Berry WR, Gawande AA: Surgical outcome measurement for a global patient population: Validation of the Surgical Apgar Score in 8 countries. Surgery 2011; 149:519–24
8. Regenbogen SE, Ehrenfeld JM, Lipsitz SR, Greenberg CC, Hutter MM, Gawande AA: Utility of the surgical Apgar score: Validation in 4119 patients. Arch Surg 2009; 144:30–6; discussion 37
9. Terekhov MA, Ehrenfeld JM, Wanderer JP: Preoperative surgical risk predictions are not meaningfully improved by including the Surgical Apgar score: An analysis of the Risk Quantification Index and present-on-admission risk models. Anesthesiology 2015; 123:1059–66
10. Le Manach Y, Collins G, Rodseth R, Le Bihan-Benjamin C, Biccard B, Riou B, Devereaux PJ, Landais P: Preoperative Score to Predict Postoperative Mortality (POSPOM): Derivation and validation. Anesthesiology 2016; 124:570–9
11. Schmidhuber J: Deep learning in neural networks: An overview. Neural Networks 2015; 61:85–117
12. Le Cun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD: Handwritten digit recognition with a back-propagation network. Morgan Kaufmann, 1990
13. Baldi P, Chauvin Y: Neural networks for fingerprint recognition. Neural Computation 1993; 5
14. Krizhevsky A, Sutskever I, Hinton GE: ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 2012
15. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A: Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015
16. Srivastava RK, Greff K, Schmidhuber J: Training very deep networks. Advances in Neural Information Processing Systems 2015
17. He K, Zhang X, Ren S, Sun J: Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2016
18. Wu L, Baldi P: Learning to play Go using recursive neural networks. Neural Netw 2008; 21:1392–400
19. Wu L, Baldi P: A scalable machine learning approach to GO. Advances in Neural Information Processing Systems 2007; 19
20. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D: Mastering the game of Go with deep neural networks and tree search. Nature 2016; 529:484–9
21. Baldi P, Sadowski P, Whiteson D: Searching for exotic particles in high-energy physics with deep learning. Nat Commun 2014; 5:4308
22. Sadowski PJ, Collado J, Whiteson D, Baldi P: Deep learning, dark knowledge, and dark matter. Journal of Machine Learning Research, Workshop and Conference Proceedings 2015; 42
23. Kayala MA, Azencott CA, Chen JH, Baldi P: Learning to predict chemical reactions. J Chem Inf Model 2011; 51:2209–22
24. Kayala MA, Baldi P: ReactionPredictor: Prediction of complex chemical reactions at the mechanistic level using machine learning. J Chem Inf Model 2012; 52:2526–40
25. Lusci A, Pollastri G, Baldi P: Deep architectures and deep learning in chemoinformatics: The prediction of aqueous solubility for drug-like molecules. J Chem Inf Model 2013; 53:1563–75
26. Di Lena P, Nagata K, Baldi P: Deep architectures for protein contact map prediction. Bioinformatics 2012; 28:2449–57
27. Baldi P, Pollastri G: The principled design of large-scale recursive neural network architectures: DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research 2003; 4
28. Zhou J, Troyanskaya OG: Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 2015; 12:931–4
29. Guillame-Bert M, Dubrawski A, Wang D, Hravnak M, Clermont G, Pinsky MR: Learning temporal rules to forecast instability in continuously monitored patients. J Am Med Inform Assoc 2017; 24:47–53
30. Chen L, Dubrawski A, Clermont G, Hravnak M, Pinsky M: Modelling risk of cardio-respiratory instability as a heterogeneous process. AMIA Annual Symposium Proceedings 2015
31. Frizzell JD, Liang L, Schulte PJ, Yancy CW, Heidenreich PA, Hernandez AF, Bhatt DL, Fonarow GC, Laskey WK: Prediction of 30-day all-cause readmissions in patients hospitalized for heart failure: Comparison of machine learning and other statistical approaches. JAMA Cardiol 2017; 2:204–9
32. Shadmi E, Flaks-Manov N, Hoshen M, Goldman O, Bitterman H, Balicer RD: Predicting 30-day readmissions with preadmission electronic health record data. Med Care 2015; 53:283–9
33. Nguyen P, Tran T, Wickramasinghe N: Deepr: A convolutional net for medical records. arXiv 2016; 1607.07519v1
34. Lipton Z, Kale D, Elkan C, Wetzel R: Learning to diagnose with LSTM recurrent neural networks, International Conference on Learning Representations, 2016
35. Razavian N, Sontag D: Temporal convolutional neural networks for diagnosis from lab tests. arXiv 2016; 1511.07938v4
36. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, Venugopalan S, Widner K, Madams T, Cuadros J, Kim R, Raman R, Nelson PC, Mega JL, Webster DR: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016; 316:2402–10
37. Luo W, Phung D, Tran T, Gupta S, Rana S, Karmakar C, Shilton A, Yearwood J, Dimitrova N, Ho TB, Venkatesh S, Berk M: Guidelines for developing and reporting machine learning predictive models in biomedical research: A multidisciplinary view. J Med Internet Res 2016; 18:e323
38. Hofer IS, Gabel E, Pfeffer M, Mahbouba M, Mahajan A: A systematic approach to creation of a perioperative data warehouse. Anesth Analg 2016; 122:1880–4
39. Baldi P, Sadowski P: The dropout learning algorithm. Artificial Intelligence 2014; 210:78–122
40. Srivastava N: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 2014; 15
41. Hinton GE, Srivastava N, Krizhevsky A: Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012
42. Sigakis MJ, Bittner EA, Wanderer JP: Validation of a risk stratification index and risk quantification index for predicting patient outcomes: In-hospital mortality, 30-day mortality, 1-year mortality, and length-of-stay. Anesthesiology 2013; 119:525–40
43. Sessler DI, Sigl JC, Manberg PJ, Kelley SD, Schubert A, Chamoun NG: Broadly applicable risk stratification system for predicting duration of hospitalization and mortality. Anesthesiology 2010; 113:1026–37
44. Chollet F: Keras. Available at: https://github.com/fchollet/keras. Accessed October 16, 2017
45. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 2011; 12