Background

Accurate estimation of surgical transfusion risk is essential for efficient allocation of blood bank resources and for other aspects of anesthetic planning. This study hypothesized that a machine learning model incorporating both surgery- and patient-specific variables would outperform the traditional approach that uses only procedure-specific information, allowing for more efficient allocation of preoperative type and screen orders.

Methods

The American College of Surgeons National Surgical Quality Improvement Program Participant Use File was used to train four machine learning models to predict the likelihood of red cell transfusion using surgery-specific and patient-specific variables. A baseline model using only procedure-specific information was created for comparison. The models were trained on surgical encounters that occurred at 722 hospitals in 2016 through 2018. The models were internally validated on surgical cases that occurred at 719 hospitals in 2019. Generalizability of the best-performing model was assessed by external validation on surgical cases occurring at a single institution in 2020.

Results

Transfusion prevalence was 2.4% (73,313 of 3,049,617), 2.2% (23,205 of 1,076,441), and 6.7% (1,104 of 16,053) across the training, internal validation, and external validation cohorts, respectively. The gradient boosting machine outperformed the baseline model and was the best-performing model. At a fixed 96% sensitivity, this model had a positive predictive value of 0.06 and 0.21 and recommended type and screens for 36% and 30% of the patients in internal and external validation, respectively. By comparison, the baseline model at the same sensitivity had a positive predictive value of 0.04 and 0.14 and recommended type and screens for 57% and 45% of the patients in internal and external validation, respectively. The most important predictor variables were overall procedure-specific transfusion rate and preoperative hematocrit.

Conclusions

A personalized transfusion risk prediction model was created using both surgery- and patient-specific variables to guide preoperative type and screen orders and showed better performance compared to the traditional procedure-centric approach.

Editor’s Perspective
• Accurate surgical transfusion risk assessment helps strike the correct balance of blood bank resource utilization and blood product availability

• The most widely used risk assessment methods focus on historical procedure-specific transfusion rate and do not incorporate patient-specific factors

• A machine learning–based prediction algorithm to identify patients was derived and validated using more than 4 million national surgical registry records

• The algorithm demonstrated that the inclusion of patient factors decreased the number of recommended type and screen orders from 57% to 36% of cases while maintaining 96% sensitivity

• When validated using data from a single center, the algorithm reduced the number of recommended type and screen orders from 46% to 31%

• The most important variables for model prediction included procedure-specific transfusion rate, preoperative hematocrit, age, and laboratory indicators of coagulopathy

Blood transfusion can be a lifesaving therapy in the perioperative setting. Ideally, patients with nontrivial risk of transfusion should receive blood typing and antibody screening preprocedurally to ensure the availability of compatible blood products.1  Conversely, patients with low risk of transfusion should be spared the discomfort and cost of an unnecessary laboratory test.1  Information on transfusion risk is also useful for anesthetic planning and decision-making, including aiding in decisions about the need for additional intravenous access or invasive monitoring. Therefore, accurate estimation of a patient’s likelihood of transfusion has implications for both patient safety and cost.

A common approach for estimating a patient’s likelihood of transfusion is based on surgical characteristics such as the historical percentage of patients undergoing that procedure who require transfusion,2–5 referred to as the procedure-specific transfusion rate. For example, previous studies proposed that patients undergoing surgery with a transfusion rate less than 5% and average blood loss less than 50 ml could omit a type and screen.4,5 However, in previous guidelines, patient-specific factors were not considered, even though preoperative anemia, renal dysfunction, patient age, size, and sex all have been associated with an increased likelihood of surgical transfusion.6–8 We hypothesized that a machine learning model incorporating both patient- and surgery-specific variables would provide better discrimination of transfusion risk as compared to current guidelines; such models at the point of care could allow for more efficient decision-making regarding preoperative type and screen orders.

Machine learning techniques rely on computer-based algorithms to identify patterns in large data sets and make predictions by learning from examples.9  Machine learning has shown promise for clinical prediction in many domains of medicine,10  including for transfusion risk prediction,7,8,11–16  but previous work has typically focused on modeling a limited subset of procedures and did not incorporate practical considerations necessary to guide preoperative blood ordering practice, such as the asymmetric harms of false-positive and false-negative predictions. In this study, we have the following research objectives: (1) develop, evaluate, and validate the performance of multiple machine learning models to estimate surgical red cell transfusion risk using both surgery- and patient-specific variables; (2) compare machine learning model performance to a baseline model that uses only the procedure-specific transfusion rate, the current standard of care; and (3) evaluate the performance of transfusion risk prediction models when tailored specifically for the decision to order a type and screen.

The protocol for this retrospective observational study was approved by the institutional review board of Washington University (St. Louis, Missouri; approval No. 202102003) with a waiver of informed consent. This study is reported according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis guidelines.17 An overview of the experimental design is shown in figure 1. The analysis plan was written after the data were accessed.

Fig. 1.

Diagram of experimental design. Models were trained exclusively on the National Surgical Quality Improvement Program Participant Use File surgical case cohort from 2016 to 2018, which was split, with 80% to be used for model training and hyperparameter tuning and 20% used for model evaluation, early stopping, and selection of the final model. Once the final model in each model category was chosen based on the training data, all model parameters were fixed, and the models were evaluated on the internal validation data, which contained cases performed in 2019 from the same national database, and external validation data, which contained surgical cases performed at a single academic institution in 2020.


### Data Sources

This study included all surgical cases submitted to the American College of Surgeons (Chicago, Illinois) National Surgical Quality Improvement Program18  for the time period spanning January 1, 2016, to December 31, 2019. No formal power calculation was performed to determine data set size. The National Surgical Quality Improvement Program Participant Use File contains information on surgical procedures performed on adult patients sampled from academic and community hospitals across the United States. In this database, only the occurrence of red cell transfusion is captured; the details regarding the quantity of red cell transfusion or the transfusion of other blood products such as fresh frozen plasma or platelets are not available.

We also extracted information on surgical procedures performed between January 1, 2020, and December 31, 2020, on adult patients at Barnes-Jewish Hospital, a large academic medical center, from the electronic health record (Epic Systems, USA). In addition, transfusion data were collected for cases performed at the same institution between January 1, 2019, and December 31, 2019, to estimate the procedure-specific transfusion rate needed to evaluate model performance for the 2020 data set, as will be described in the section “Computing Historical Procedure-specific Transfusion Rates.” We were unable to use transfusion data from earlier than 2019 to estimate procedure-specific transfusion rates for the external validation data set due to a transition in the electronic health record system implemented in 2018.

The data sources were split into subsets that are referred to as training (for National Surgical Quality Improvement Program data collected between January 1, 2016, and December 31, 2018), internal validation (for National Surgical Quality Improvement Program data collected between January 1, 2019, and December 31, 2019), and external validation (for Barnes-Jewish Hospital data collected between January 1, 2020, and December 31, 2020).

Our goal was to create a model with generalizable performance across a diverse set of surgical procedures. Toward that end, all procedures occurring in an operating room were included. However, two groups of surgical procedures—ophthalmologic surgery and obstetric surgery—have special considerations related to blood ordering and transfusion19  for which a general surgical model may not apply. These two groups are not present in the National Surgical Quality Improvement Program database and therefore were excluded from the Barnes-Jewish Hospital cohort to match.

### Variable Selection and Extraction

The outcome variable for our models was the presence of red cell transfusion on the day of surgery as a binary outcome. For the training and internal validation data, this was transfusion in either the intraoperative or postoperative period on the day of surgery. For the external validation data set, this included transfusion during the intraoperative period only, due to lack of postoperative transfusion data.

Predictor variables were selected based on data availability, previous literature, and ease of retrieval from an electronic health record. Our goal was to create a model that can be implemented directly within an electronic health record.

The following patient-specific variables were included: patient demographics (age, height, weight, sex), patient comorbidities (history of hypertension, congestive heart failure, smoking, chronic obstructive lung disease, dialysis, diabetes), and patient preoperative laboratory values (hematocrit, platelet count, international normalized ratio, partial thromboplastin time, creatinine, sodium, albumin, and bilirubin). American Society of Anesthesiologists (Schaumburg, Illinois) Physical Status was not included due to its history of poor interrater reliability20,21  and its potential lack of availability in the preoperative setting.

The surgery-specific variables included in the model were elective surgery status (i.e., whether the patient arrived for surgery from home) and the historical procedure-specific transfusion rate, which was precomputed as described in the section “Computing Historical Procedure-specific Transfusion Rates.” Whether each variable was considered binary, ordinal, or continuous is indicated in table 1. Binary variables were coded 0 or 1. Ordinal variables were coded 0 through N based on the number of levels N.

Table 1.

Demographic Characteristics of the Training, Internal Validation, and External Validation Data Sets

For training and internal validation data, each selected variable corresponded to a variable available in the National Surgical Quality Improvement Program Participant Use File. Patient comorbidities were abstracted from medical records by trained data experts at each participating hospital according to detailed criteria.22  Laboratory values represented the most recent laboratory value drawn within 90 days of surgery date.

For the external validation data, patient comorbidities were extracted from structured preoperative assessment notes (Supplemental Digital Content 1, fig. S1, http://links.lww.com/ALN/C792). Patient demographics and preoperative laboratory values were extracted from the electronic health record using the same criteria as used for the training and internal validation data.

### Computing Historical Procedure-specific Transfusion Rates

The computation of the procedure-specific transfusion rate (i.e., the historical frequency of transfusion for each surgery) differed between the training, internal validation, and external validation data sets. For the training data set, the prevalence of transfusion for each unique primary Current Procedural Terminology code was computed across the entire training data set, and the resulting surgery-specific transfusion frequency table was mapped to each case based on its primary Current Procedural Terminology code. The Current Procedural Terminology codes were otherwise not used in the models. For example, 80 of 127,315 (0.06%) laparoscopic appendectomies (Current Procedural Terminology code 44970) in the training data required red cell transfusion on the day of surgery. Therefore, all laparoscopic appendectomies in the training and internal validation data were annotated with the procedure-specific transfusion rate of 0.06%.

To simulate prospective implementation of the models developed on the training data for the internal validation data, the same surgery-specific transfusion frequency table from the 2016 through 2018 training data was used to annotate surgeries performed in the 2019 internal validation data set; that is, the actual transfusion prevalence for each surgery in 2019 was not used to avoid label leakage. New primary Current Procedural Terminology codes that only occurred in 2019 were assigned missing values for surgery-specific transfusion frequency. There were 2,796 procedure types included in the internal validation analysis (Supplemental Digital Content 2, appendix 1, http://links.lww.com/ALN/C793).
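As an illustration of the annotation step described above, the following minimal sketch builds a procedure-specific transfusion frequency table from training cases and maps it onto later cases, leaving codes that were never seen in training as missing. The function names, the second procedure code with its counts, and the unseen code `99999` are hypothetical; only the 80 of 127,315 appendectomy example comes from the text.

```python
from collections import defaultdict

def build_rate_table(cases):
    """Return {procedure_code: transfusion rate in percent} from (code, transfused) pairs."""
    counts = defaultdict(lambda: [0, 0])  # code -> [n_transfused, n_cases]
    for code, transfused in cases:
        counts[code][0] += int(transfused)
        counts[code][1] += 1
    return {code: 100.0 * t / n for code, (t, n) in counts.items()}

def annotate(codes, rate_table):
    """Look up each case's historical rate; codes absent from training map to None (missing)."""
    return [rate_table.get(code) for code in codes]

# Training cases: code 44970 mirrors the example in the text (80 of 127,315 transfused).
training_cases = [("44970", True)] * 80 + [("44970", False)] * (127_315 - 80)
# Hypothetical second procedure with made-up counts (3 of 100 transfused).
training_cases += [("27130", True)] * 3 + [("27130", False)] * 97

rate_table = build_rate_table(training_cases)

# Later cases are annotated from the training table only, never from their own outcomes.
validation_rates = annotate(["44970", "27130", "99999"], rate_table)
print(validation_rates)
```

Because the lookup table is frozen after training, outcome information from the validation year cannot leak into the predictor, and the missing entries can then be handled by the same imputation used for other variables.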

A similar process was performed for the external validation data set; however, the historical transfusion rate was not matched to the national database but computed specifically for surgeries occurring at Barnes-Jewish Hospital in 2019 grouped by the preprocedural primary procedure name used for the case booking. In other words, for the external validation analysis, we computed a set of procedure-specific transfusion rates specific to Barnes-Jewish Hospital. We chose not to use Current Procedural Terminology codes to group procedures for the external validation data set as these codes are often not available in the preoperative setting and would thus limit the practical translation and implementation of our models. This preprocedural text was not available for the training and internal validation data.

To ensure reliability of the historical procedure-specific transfusion rates used for external validation, only primary procedures that occurred at least 50 times in 2019 were included in the external validation data set. This cutoff was chosen such that the estimated 95% CI for transfusion frequency would be no worse than approximately ±5%, calculated using Clopper–Pearson CI sampled from the binomial distribution. Of the 30,114 surgical procedures performed at Barnes-Jewish Hospital in 2020, 16,053 (53.3%) were booked with a primary procedure that occurred at least 50 times in 2019 and thus were included in the analysis. There were 171 primary procedures included in the external validation analysis (Supplemental Digital Content 3, appendix 2, http://links.lww.com/ALN/C794).
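To make the interval calculation concrete, the sketch below computes an exact Clopper–Pearson interval for an observed transfusion count using only the Python standard library. It is an assumption-laden illustration, not the authors' code: the function names are hypothetical, and the bounds are found by bisection on the binomial distribution rather than through a statistics package.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k events in n cases."""
    def solve(f):
        # f must be increasing in p with f(0) < 0 < f(1); find its root by bisection.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

# Intervals for a procedure performed 50 times with 0, 3, or 25 transfusions.
intervals = {k: clopper_pearson(k, 50) for k in (0, 3, 25)}
for k, (lo, hi) in intervals.items():
    print(f"{k}/50 transfused: 95% CI {100 * lo:.1f}% to {100 * hi:.1f}%")
```

For a fixed number of cases, the interval is narrowest when the observed rate is near 0% or 100% and widest near 50%, so the precision achieved by a case-count cutoff depends on how often a procedure is actually transfused.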

### Data Preprocessing

Missing values were present in the training, internal validation, and external validation data sets with frequencies indicated in Supplemental Digital Content 1 (table S1, http://links.lww.com/ALN/C792). Missing values were replaced by median imputation using the data distribution of the training data for all three data sets. This is likely a reasonable imputation approach as missingness for laboratory values, which were most commonly missing, is influenced by low clinician suspicion of abnormality.23  The per-variable median values used for imputation are shown in Supplemental Digital Content 1 (table S2, http://links.lww.com/ALN/C792). No further preprocessing, such as nonlinear scaling of continuous variables, was performed.

To facilitate speed of model training, all data were normalized by mean subtraction and scaling to unit variance, using only the data distribution for the training data for all three data sets. The per-variable mean and variance used for normalization are shown in Supplemental Digital Content 1 (table S2, http://links.lww.com/ALN/C792).
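The two preprocessing steps above can be sketched as follows: statistics are fitted on the training data only and then reused unchanged for every data set. This is a plain-Python illustration with hypothetical function names and toy hematocrit values; it also assumes that the mean and variance are computed after imputation, which the text does not specify.

```python
from statistics import mean, median, pstdev

def fit_preprocessor(train_column):
    """Learn the imputation median and normalization statistics from training values only."""
    observed = [v for v in train_column if v is not None]
    med = median(observed)
    imputed = [med if v is None else v for v in train_column]
    # Assumption: mean and standard deviation are taken after median imputation.
    return {"median": med, "mean": mean(imputed), "std": pstdev(imputed)}

def transform(column, params):
    """Impute missing values with the training median, then center and scale."""
    filled = [params["median"] if v is None else v for v in column]
    return [(v - params["mean"]) / params["std"] for v in filled]

# Toy preoperative hematocrit values; None marks a missing laboratory result.
train_hct = [30.0, 40.0, None, 50.0]
params = fit_preprocessor(train_hct)

# Validation data reuse the training parameters rather than their own statistics.
validation_hct = [None, 45.0, 28.0]
print(transform(validation_hct, params))
```

Reusing the training parameters keeps the validation cohorts strictly held out, at the cost that a shifted validation distribution will no longer have exactly zero mean and unit variance after scaling.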

### Model Training

The training data were split 80% for model training and hyperparameter tuning and 20% for model testing and early stopping (fig. 1). Four supervised machine learning models were trained using the selected predictor variables to predict the binary transfusion outcome.

The following models were constructed: penalized logistic regression24 with tuning of lasso and ridge parameters, decision tree25 with tuning of tree depth, random forest26 with tuning of tree depth and number of features considered at each split, and gradient boosting machines, implemented using XGBoost27 1.2.0, with tuning of tree depth, node purity precluding a split, and feature subsampling. Early stopping was used to determine the number of boosting rounds for XGBoost using average precision achieved on the 20% test split of the training data. All models were implemented in scikit-learn28 0.22.1 and Python 3.7.6 with hyperparameter tuning determined using fivefold cross-validation on the 80% training split to optimize average precision.

To facilitate comparison between our models and previous methods to determine need for preoperative type and screen orders,4  we also created a baseline model using a single variable: the procedure-specific transfusion rate.4  We were unable to include estimated blood loss as it was not reported in the National Surgical Quality Improvement Program database, and it was poorly documented in the anesthetic records for the external validation data set.

### Model Evaluation

After model training on the training data, all model parameters were fixed, and the models were evaluated on the internal validation data. The best-performing model was then evaluated on the external validation data set.

Overall model discrimination was evaluated with area under the receiver operating characteristic curve (i.e., C-statistic) and area under the precision-recall curve (i.e., average precision). Average precision was chosen because it measures model discrimination specifically for the positive class, which is the more relevant class given the relative rarity of surgical transfusion. Calibration of model-predicted transfusion risk is also an important measure of model performance,29 especially if model predictions are to be useful to guide other aspects of anesthetic care. Calibration of model-predicted probabilities was assessed using a calibration curve. Net benefit analysis was also performed to assess the relative value of the models across a range of prediction thresholds30 using the Decision Curve Analysis R package.31
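For readers less familiar with the two discrimination metrics, the following sketch computes them from first principles on a four-case toy example. The function names are hypothetical, and the average precision here uses the step-wise definition (the sum of precision weighted by each increase in recall), which is an assumption about the exact variant used.

```python
def c_statistic(y_true, y_score):
    """Probability that a random positive case is scored above a random negative case."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count as half
    return wins / (len(pos) * len(neg))

def average_precision(y_true, y_score):
    """Sum of precision at each distinct threshold, weighted by the gain in recall."""
    ranked = sorted(zip(y_score, y_true), reverse=True)
    n_pos = sum(y_true)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Process all cases sharing the same score as one threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / (tp + fp))
        prev_recall = recall
        i = j
    return ap

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(c_statistic(y_true, y_score))        # 0.75
print(average_precision(y_true, y_score))  # 0.8333...
```

Unlike the C-statistic, average precision never counts true negatives, so with a rare outcome a model cannot score well simply by ranking the many non-transfused cases correctly.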

For the specific use case of developing a transfusion risk prediction model to guide preoperative type and screen orders, a model decision threshold should balance the relative harms of false negatives (model predicts no transfusion, no type and screen is ordered, but patient subsequently requires transfusion) and false positives (model predicts transfusion, type and screen ordered, but no transfusion is required). Given that the potential patient safety harms of false negatives are far greater than the mostly monetary harms of false positives, we decided to set all model thresholds to achieve 96% sensitivity on the training data. Thresholds chosen on the training data were carried over to the internal validation data. Thresholds were readjusted on the external validation data set due to the higher prevalence of transfusion in that data set.
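The threshold selection described above can be sketched in a few lines: choose the highest score cutoff that still flags the required share of transfused patients, then report how many orders that cutoff generates. The function names and toy scores are hypothetical, and the small tolerance in the rounding step is an implementation choice for this sketch.

```python
import math

def threshold_for_sensitivity(y_true, y_score, target=0.96):
    """Highest threshold t such that flagging score >= t captures at least `target` of positives."""
    pos_scores = sorted((s for y, s in zip(y_true, y_score) if y == 1), reverse=True)
    # Number of transfused cases that must be flagged; the tolerance guards against
    # floating-point error when target * n is mathematically an integer.
    needed = math.ceil(target * len(pos_scores) - 1e-9)
    return pos_scores[needed - 1]

def evaluate_at_threshold(y_true, y_score, threshold):
    """Sensitivity, positive predictive value, and fraction of cases flagged at a threshold."""
    flagged = [s >= threshold for s in y_score]
    tp = sum(f and y == 1 for f, y in zip(flagged, y_true))
    n_flagged = sum(flagged)
    return {
        "sensitivity": tp / sum(y_true),
        "ppv": tp / n_flagged,
        "fraction_flagged": n_flagged / len(y_true),
    }

# Toy cohort: five transfused (1) and five non-transfused (0) cases with model scores.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.2, 0.5, 0.3, 0.1, 0.05, 0.01]

threshold = threshold_for_sensitivity(y_true, y_score, target=0.96)
metrics = evaluate_at_threshold(y_true, y_score, threshold)
print(threshold, metrics)
```

In this toy cohort, capturing the last low-scoring transfused case forces the threshold down far enough to flag three non-transfused cases as well, which is the trade-off between sensitivity and excess orders that the fixed-sensitivity comparison is designed to expose.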

Then, model performance was evaluated in terms of the positive predictive value and overall frequency of positive predictions, both metrics for excess type and screen orders (i.e., type and screens recommended for patients who did not actually require transfusion). The potential cost savings for each model in reducing excess type and screen orders was evaluated in comparison to the baseline model using the 2020 Medicare Clinical Laboratory Fee Schedule.32 To estimate cost savings, the difference in excess type and screen orders between models was multiplied by the Medicare reimbursement rate for a type and screen ($15.75).

### Model Explanation

Machine learning predictions are typically more trusted when explanations for why the model makes a particular prediction are available.33 We used Shapley values,34 a coalitional game theory approach to interpretation of machine learning predictions, to measure overall variable importance and explain individual patient predictions for the best-performing gradient boosting model, implemented with SHAP34 0.37.0. For this model, Shapley values for each variable are represented in logit space, similar to the coefficients for a logistic regression model.34 For a particular value of a variable, a high-magnitude (i.e., absolute value) Shapley value indicates that the variable caused a large change in the model’s predicted risk; a negative Shapley value implies that the value for that variable decreased risk, and a positive value implies increased risk. Shapley values were computed for individual patients to explain individual model predictions. To measure overall variable importance for a cohort, Shapley values were computed for all of the patients in the cohort and illustrated using a beeswarm plot. They were also summarized across the cohort using the mean absolute value of all of the Shapley values for each variable, indicating the overall extent to which the variable contributed to the model’s predictions.

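To illustrate what a Shapley value in logit space represents, the sketch below computes exact Shapley values for a small hypothetical model by averaging each variable's marginal contribution over all orderings, and then summarizes them as mean absolute values across patients. This is not the tree-specific algorithm in the SHAP package: the toy model, its coefficients, the single all-zero reference patient, and the function names are assumptions made for illustration.

```python
from itertools import permutations
from math import factorial

def shapley_values(predict_logit, x, reference):
    """Exact Shapley values for one patient, relative to a single reference patient."""
    n = len(x)
    totals = [0.0] * n
    for order in permutations(range(n)):
        current = list(reference)
        previous = predict_logit(current)
        for i in order:
            current[i] = x[i]  # switch variable i from its reference value to the patient's value
            updated = predict_logit(current)
            totals[i] += updated - previous
            previous = updated
    return [t / factorial(n) for t in totals]

def mean_abs_shapley(rows):
    """Cohort-level importance: mean absolute Shapley value per variable."""
    return [sum(abs(row[j]) for row in rows) / len(rows) for j in range(len(rows[0]))]

def toy_logit(v):
    """Hypothetical model on standardized inputs: [procedure rate, hematocrit, age]."""
    rate, hct, age = v
    return -3.5 + 1.2 * rate - 0.8 * hct + 0.2 * age - 0.3 * rate * hct

reference = [0.0, 0.0, 0.0]
patients = [[1.0, -1.5, 0.5], [-0.5, 1.0, -1.0]]
per_patient = [shapley_values(toy_logit, p, reference) for p in patients]
importance = mean_abs_shapley(per_patient)
print(per_patient)
print(importance)
```

For each patient, the Shapley values sum to the difference between that patient's predicted logit and the reference logit, which is what allows an individual prediction to be decomposed into per-variable contributions.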
All computer codes necessary for model training, evaluation, and explanation are available at https://github.com/sslou/publications/tree/main/2021_blood_product. Example code is also provided for generating model predictions and explanations on new data.

### Statistical Analysis

Differences in variable distribution between the training, internal validation, and external validation populations were explored with descriptive statistics, the Mann–Whitney U test for continuous variables, and the chi-square test for categorical variables. Two-tailed tests were employed throughout. CIs for model performance metrics were generated by bootstrap resampling of each data set. Pairwise comparisons of model performance were assessed using McNemar’s test statistic,35 which is a chi-square test statistic evaluating the comparative accuracy of two models. A Bonferroni-corrected P value of 0.05 was used to determine statistical significance. Statistical analysis was conducted using Python 3.7.6.

### Cohort Characteristics

The models were trained on a cohort of 3,049,617 surgical encounters that occurred at 722 hospitals across the United States during 2016 through 2018, internally validated on a cohort of 1,076,441 surgical encounters occurring at 719 hospitals in 2019, and externally validated on a cohort of 16,053 surgical encounters that occurred at a single institution in 2020. Demographic characteristics and distribution of variables used in the models for the three cohorts are shown in table 1. Overall, data distribution was similar between the training and internal validation cohorts, but the external validation cohort was less healthy (table 1). Of surgical encounters, 2.4% required transfusion in the training data, while 2.2% required transfusion in the internal validation data, and 6.7% required transfusion in the external validation data.

### Performance of Machine Learning Models

For comparison with widely accepted guidelines for preoperative blood typing and antibody screening,4 we constructed a baseline model that reports transfusion probability simply as the historical procedure-specific transfusion rate for each procedure (table 2). This baseline model achieves a C-statistic of 0.888 (95% CI, 0.881 to 0.894) and an average precision of 0.215 (95% CI, 0.197 to 0.235).

Table 2.

Model Performance on the Internal Validation Data

We constructed four machine learning models: penalized logistic regression, decision tree, random forest, and gradient boosting machine. Model discrimination metrics for the internal validation data are shown in table 2. The gradient boosting machine outperformed the other models (pairwise McNemar’s test P values <0.001), achieving a C-statistic of 0.924 (95% CI, 0.919 to 0.929) and an average precision of 0.292 (95% CI, 0.273 to 0.314). Calibration plots for all the described models are shown in Supplemental Digital Content 1 (fig. S2, http://links.lww.com/ALN/C792).

For the specific use case of guiding preoperative type and screen orders, we set model discrimination thresholds to achieve 96% sensitivity, given the asymmetric harms of false-positive (i.e., patient has type and screen but does not require transfusion) and false-negative (i.e., patient requires transfusion but has no type and screen) predictions. Sensitivity can also be characterized as the percentage of patients requiring transfusion who had a preoperative type and screen recommended by each model. For reference, the 5% procedure-specific transfusion rate threshold previously described to guide type and screen decisions4,5 only achieved a sensitivity of 83.7% (95% CI, 82.1 to 85.2%).

At 96% sensitivity, the best-performing gradient boosting model made a positive prediction (i.e., recommended a type and screen) for 36.2% (95% CI, 35.9 to 36.5%) of the cases in the internal validation cohort with a positive predictive value of 0.058 (95% CI, 0.055 to 0.060). In contrast, the baseline model recommended type and screens for 57.0% (95% CI, 56.7 to 57.3%) with a positive predictive value of 0.037 (95% CI, 0.035 to 0.038). In other words, for the same sensitivity or false-negative rate, the gradient boosting model required one-third fewer type and screen orders as compared with the baseline or penalized logistic regression models.

### Model Generalizability to External Validation

The best-performing gradient boosting model was also evaluated on an independent hold-out external validation cohort in comparison to the baseline model (table 3). With the threshold set to achieve 96% sensitivity, the gradient boosting model recommended type and screens for 31.0% (95% CI, 30.4 to 31.6%) of cases with a positive predictive value of 0.213 (95% CI, 0.203 to 0.223). The baseline model was less efficient, recommending type and screens for 45.7% (95% CI, 45.0 to 46.4%) of cases with a positive predictive value of 0.144 (95% CI, 0.137 to 0.151). The gradient boosting model failed to recommend a type and screen for 45 (95% CI, 34 to 57) patients who subsequently required transfusion in this cohort (0.28%), whereas the baseline model failed to recommend a type and screen for 47 (95% CI, 35 to 59) patients who subsequently required transfusion in this cohort (0.29%). Given that the gradient boosting model ordered 2,360 fewer type and screens than the baseline model for this cohort of 16,053 patients, the estimated cost savings for implementing the personalized model was $37,167 ($2.32 per patient) for this cohort, as calculated using the Medicare reimbursement price of $15.75 for a type and screen.
The gradient boosting model also had higher net benefit (Supplemental Digital Content 1, fig. S3, http://links.lww.com/ALN/C792).
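The cost estimate reported above follows directly from the rounded percentages in the text, as the short calculation below shows. The function name is hypothetical; the inputs are the reported cohort size, the fractions of cases flagged by each model, and the Medicare price per type and screen.

```python
def cost_savings(n_patients, frac_baseline, frac_model, price_per_test):
    """Difference in recommended orders between two models, priced per test."""
    fewer_orders = (frac_baseline - frac_model) * n_patients
    total = fewer_orders * price_per_test
    return fewer_orders, total, total / n_patients

fewer_orders, total, per_patient = cost_savings(
    n_patients=16_053, frac_baseline=0.457, frac_model=0.310, price_per_test=15.75
)
print(round(fewer_orders))    # 2360 fewer type and screen orders
print(round(total))           # 37167 dollars
print(round(per_patient, 2))  # 2.32 dollars per patient
```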

Table 3.

Discrimination of the Best-performing Gradient Boosting Model on the External Validation Data

### Interpretation of Model Predictions

Explanations for how the gradient boosting model arrived at a prediction for an individual patient can be computed and illustrated (fig. 2). For this representative patient having a robotic-assisted partial nephrectomy, the predicted transfusion risk was decreased from the population average by the low historical frequency of transfusion for this procedure in addition to their high starting hematocrit and above-average weight. However, their transfusion risk was also revised upwards due to decreased renal function, mildly abnormal international normalized ratio, and age.

Fig. 2.

Explanation of model prediction for an individual patient. Explanation of the gradient boosting model’s transfusion risk prediction for an example patient in the external validation cohort. This patient was undergoing a laparoscopic robotic-assisted partial nephrectomy, a surgery that had an overall 1.3% rate of transfusion. After adjusting for patient factors, the model predicted a 0.7% risk of transfusion, now below the model threshold for recommending a type and screen. This patient did not require transfusion. INR, international normalized ratio; PTT, partial thromboplastin time.


Such individual patient explanations were summarized over a cohort to explain model predictions in aggregate (fig. 3). Across the cohort, the procedure-specific transfusion rate and preoperative hematocrit had the largest impact on model predictions; patients having procedures with high procedure-specific rates of transfusion had higher transfusion risk, and patients with low preoperative hematocrit had higher transfusion risk. Although international normalized ratio, platelet count, creatinine, weight, and albumin levels had low average impact on model predictions, these variables had larger impact when they were very abnormal.

Fig. 3.

Relative variable importance for model predictions. Beeswarm plots demonstrating the relative importance of the top 20 variables to all model predictions for the internal validation cohort. Each value for each variable observed (i.e., patient) in the cohort is shown as a single dot, colored by value and with a position on the x axis indicating the impact that that value of the variable had on the model’s prediction for that patient in logit space (i.e., Shapley value). Variables with wide spread have a large effect on model predictions. Color indicates whether low or high values of each variable impact risk and in which direction. For example, pink colors to the right of the midline (i.e., impact on model output greater than 0) suggest that high values of the variable increase the model’s predicted risk of transfusion. For categorical variables such as patient comorbidities, pink indicates the presence of that variable, and blue is the absence. Variables with blue or pink colors on both sides of midline indicate that that variable can either increase or decrease transfusion risk, depending on interactions with other variables. For example, patients with low platelet count are shown with blue dots that appear both to the left and right of midline, indicating that low platelet count can either increase or decrease predicted transfusion risk depending on each patient’s other characteristics. Average impact on model prediction is shown on the right and indicates overall variable importance. This is computed as the mean absolute value of all Shapley values observed for that variable in the cohort. COPD, chronic obstructive pulmonary disease; INR, international normalized ratio; PTT, partial thromboplastin time.

In this study, we developed a machine learning model trained on a large national cohort of surgical cases to predict transfusion risk using both surgery- and patient-specific variables. Compared with a baseline model, which used only the historic procedure-specific transfusion rate as is recommended by current guidelines,1,3,4  the gradient boosting machine model demonstrated the best discriminative performance (tables 2 and 3), with the highest C-statistic and average precision. When tailored specifically to guide preoperative type and screen decision-making, the gradient boosting model required one-third fewer type and screen orders, while maintaining 96% sensitivity to detect transfusion, compared to the baseline model in both internal and external validation data sets. These findings highlight the considerable potential for utilizing these models to estimate transfusion risk and guide preoperative type and screen ordering decisions.
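Fixing a model at a target sensitivity, then reading off positive predictive value and the fraction of patients flagged for a type and screen, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names and toy data are hypothetical, and the toy example uses a target of 0.8 because five transfused cases cannot resolve 96%.

```python
import math

def threshold_for_sensitivity(scores, labels, target=0.96):
    """Return the highest score cutoff whose sensitivity is at least `target`."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no transfused cases in the cohort")
    # Number of transfused cases that must be flagged; the small tolerance
    # guards against floating-point round-up in the product.
    needed = math.ceil(target * len(positives) - 1e-9)
    return positives[needed - 1]

def summarize(scores, labels, cutoff):
    """Sensitivity, PPV, and fraction flagged when flagging scores >= cutoff."""
    flagged = [y for s, y in zip(scores, labels) if s >= cutoff]
    true_positives = sum(flagged)
    return {
        "sensitivity": true_positives / sum(labels),
        "ppv": true_positives / len(flagged),
        "fraction_flagged": len(flagged) / len(labels),
    }

# Toy predicted risks and outcomes (1 = transfused); values are hypothetical
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
cutoff = threshold_for_sensitivity(scores, labels, target=0.8)
result = summarize(scores, labels, cutoff)
```

Two models compared at the same sensitivity differ only in how many patients they must flag to reach it, which is the quantity the type and screen comparison rests on.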

Current guidelines for preoperative type and screen decision-making1,3,4  (i.e., the maximum surgical blood ordering schedule)2  have focused on surgical characteristics such as the transfusion rate for each surgery. Consistent with this previous work, our baseline model with knowledge only of historical procedure-specific transfusion rates performed reasonably well (table 2), and this procedure-specific transfusion rate variable was by far the most important variable in our best-performing multivariable gradient boosting model (fig. 3). However, the 5% procedure-specific transfusion risk threshold previously described to guide type and screen orders3,4  achieved a sensitivity of only 0.837 on the internal validation cohort, suggesting that it would miss over 16% of patients who actually required transfusion. In practice, most institutions incorporate clinician opinion or other procedure-specific variables into their maximum surgical blood ordering schedule such that type and screen ordering practice is much more conservative.4  Nonetheless, we were able to substantially improve predictive performance over the baseline model by incorporating patient-specific variables (table 2), demonstrating the importance of personalized patient-specific risk prediction. In contrast to previous transfusion risk-prediction models,7,8,11–16,36  our models incorporated both surgery- and patient-specific variables, used only variables available in the preoperative setting and readily extractable from a patient’s health record, made predictions across a diverse range of surgical procedures, and explicitly considered a decision threshold appropriate for a preoperative type and screen decision-making use case.

One key innovation of our approach is the use of the surgery-specific transfusion rate as a form of transfer learning, allowing our model to be generalized across hospital systems. By training our model on a large multiinstitutional database with diverse transfusion practices, the model learned how patient comorbidities and preoperative laboratory tests typically transform the procedure-specific transfusion risk to a personalized transfusion risk. For application in new settings, any hospital can then specify their historical transfusion rates for each procedure, and the model can apply its knowledge of how patient-specific variables modify that baseline risk. We demonstrated the feasibility of this transfer learning approach by generalizing model performance to an independent cohort of surgical patients at a single academic medical center as a proof of principle (table 3). Importantly, we demonstrated generalizability of model performance despite this external validation data set containing information retrieved from an electronic health record and therefore not being as well curated as the training data derived from a national quality registry.
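The idea of supplying a hospital's own historical procedure-specific rates as a model input can be sketched as below. This is a minimal illustration, assuming local history is available as (procedure code, transfused) pairs; the procedure codes, field names, and function names are hypothetical, and the trained model itself is not shown.

```python
from collections import defaultdict

def procedure_rates(history):
    """Estimate procedure-specific transfusion rates from one hospital's history.

    history: iterable of (procedure_code, transfused) pairs.
    Returns {procedure_code: (rate, n_cases)}.
    """
    counts = defaultdict(lambda: [0, 0])  # code -> [transfused, total]
    for code, transfused in history:
        counts[code][0] += int(transfused)
        counts[code][1] += 1
    return {code: (t / n, n) for code, (t, n) in counts.items()}

def build_features(patient, rates):
    """Pair the local procedure rate with patient-specific variables."""
    rate, _ = rates[patient["procedure"]]  # KeyError if the procedure has no history
    return {"procedure_rate": rate, "hematocrit": patient["hematocrit"]}

# Toy local history with hypothetical procedure codes "A" and "B"
history = [("A", 1), ("A", 0), ("A", 0), ("A", 0), ("B", 0), ("B", 0)]
rates = procedure_rates(history)
features = build_features({"procedure": "A", "hematocrit": 31.0}, rates)
```

Only the rate lookup is site-specific in this sketch; the mapping from features to risk would come from the model trained on the multiinstitutional data.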

The second innovation of our approach is the display of individualized explanations of model predictions (fig. 2). Interpretable machine learning is critical for building clinician trust in model outputs.33  Visualization of model explanations also provides a margin of safety to protect against model failures because the reason for nonsensical predictions can be easily identified and addressed. The most important variables for our best-performing model (fig. 3) match with clinical intuition and previous literature: the procedure-specific transfusion rate, preoperative hematocrit, age, and laboratory indicators of coagulopathy.

We propose that our model predictions and visualizations can be implemented as point-of-care clinical decision support within an electronic health record to guide preoperative type and screen ordering practice for any hospital. To implement our model at a new hospital, only the historical procedure-specific transfusion rates are required; these are the same data that are used to create a conventional maximum surgical blood ordering schedule. Then, for each new patient, our model can predict their transfusion risk and recommend whether to order a type and screen given their planned surgery and patient-specific characteristics. We intentionally chose patient-specific variables that would be easy to abstract from an electronic health record to aid in the ease of model implementation.

However, our model likely requires further validation before it should be widely implemented to guide preoperative type and screen ordering practice. Although we trained our models on a large multihospital cohort of surgical cases and demonstrated model generalizability to a single hospital using a transfer learning approach, model performance may vary at other hospitals with different transfusion practices or patient populations. In addition, our external validation analysis was limited by the lack of postoperative transfusion data. Prospective validation of both the model and the type and screen decision threshold is needed before implementation. We do not claim that our model accounts for all possible variables that contribute to surgical transfusion risk or the decision to order a type and screen; for example, preoperative anticoagulation medications were not included in our models.

Another limitation is that our model adjusts predicted risk substantially based on the planned procedure and available laboratory values (fig. 3), which could change between the preoperative visit (where the model might be implemented) and the day of surgery. For example, a common preoperative workflow is to order the preoperative laboratory tests at the same time as the type and screen; thus, results of the newly ordered laboratory tests may not yet be available by the time the decision to obtain a type and screen needs to be made. Our model is tolerant of the absence of laboratory values, as is commonly the case (Supplemental Digital Content 1, table S1, http://links.lww.com/ALN/C792) when unnecessary laboratory tests are not ordered for relatively healthy patients. Several practical solutions to the problem of changing information are possible; for example, the model could be rerun as new information becomes available, and updated recommendations could be made. Additional research is needed to explore the pragmatic integration of our model into clinical workflow, situated within local practice settings.

Three years of historical transfusion data were used to estimate procedure-specific transfusion rates for the internal validation data. Because only 1 yr of historical data was available for the external validation data set, procedure-specific transfusion rates could be estimated with reasonable precision for only 53.3% of the surgical cases that occurred at the external validation institution in 2020, and thus model predictions were only made for this subset (table 3). We expect other institutions with longer durations of historical transfusion data to be able to apply our model to a larger fraction of their surgical cases. Model predictions were inaccurate if low-precision procedure-specific transfusion rates—such as those estimated for uncommon procedures—were provided as model input (Supplemental Digital Content 1, table S3, http://links.lww.com/ALN/C792). This limitation regarding uncommon procedures, which in aggregate may comprise a meaningful fraction of surgical case volume, is also present for existing maximum surgical blood ordering schedule strategies.
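One way to decide whether a procedure-specific rate is estimated precisely enough to use as model input is to check the width of a confidence interval around it. The sketch below uses the Wilson score interval; the study's actual precision criterion is not stated in this section, so the interval method, the `max_width` cutoff, and the function names are assumptions for illustration only.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def is_precise(events, n, max_width=0.05):
    """True when the interval is no wider than max_width (hypothetical cutoff)."""
    low, high = wilson_interval(events, n)
    return (high - low) <= max_width

# A common procedure (20 transfusions in 1,000 cases) versus an uncommon one
common_ok = is_precise(20, 1000)
rare_ok = is_precise(1, 20)
```

With more years of history, the case count for each procedure grows and the interval narrows, so more procedures would pass such a check.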

All models can experience decay of prediction performance over time37 due to changes in transfusion practice, case mix, or surgical technique. In our case, this can potentially be ameliorated by continuously updating the historical procedure-specific transfusion rate using our transfer learning technique, although this has yet to be demonstrated to be effective. Conversely, inaccuracies in procedure-specific transfusion rates or changes in procedure naming that disrupt matching of cases to their historical transfusion rates can magnify prediction error (Supplemental Digital Content 1, table S3, http://links.lww.com/ALN/C792).

Finally, although we chose to fix our models at 96% sensitivity, some may disagree regarding the optimal sensitivity threshold at which the cost of false-positive predictions and the risk of false-negative predictions are balanced. Our threshold can be considered aggressive, although administration of unmatched emergency release blood in the setting of unexpected transfusion has generally been demonstrated to be safe.38,39 Better quantification of the patient safety and systems costs of unnecessary type and screen orders could allow the selection of a better decision threshold using utility analysis.40 Although we modeled cost savings using the Medicare reimbursement rate for type and screens, this may not be representative of all costs; for example, the organizational costs of unnecessary sample acquisition, storage, and processing, and potential patient harm induced by unnecessary needle sticks and phlebotomy were not included.
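The trade-off between false-positive and false-negative predictions can be made explicit with the net benefit measure from decision curve analysis, which weights false positives by the odds of a chosen threshold probability. The sketch below is illustrative only: the counts and the 5% threshold probability are hypothetical and are not results from this study.

```python
def net_benefit(true_positives, false_positives, n, threshold_probability):
    """Net benefit: TP/n minus FP/n weighted by the threshold odds."""
    if not 0 < threshold_probability < 1:
        raise ValueError("threshold_probability must be between 0 and 1")
    weight = threshold_probability / (1 - threshold_probability)
    return true_positives / n - (false_positives / n) * weight

# Hypothetical cohort of 1,000 patients, 22 of whom are transfused
n = 1000
# A model that flags 320 patients and captures 20 of the 22 transfusions
nb_model = net_benefit(20, 300, n, threshold_probability=0.05)
# Ordering a type and screen for every patient
nb_treat_all = net_benefit(22, 978, n, threshold_probability=0.05)
```

A lower threshold probability expresses a greater willingness to accept unnecessary type and screens in order to avoid a missed transfusion, so the preferred strategy can change as that value changes.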

In summary, we developed a gradient boosting machine model to predict surgical transfusion risk using both patient- and surgery-specific variables and tailored it to guide decision-making regarding preoperative type and screen orders. Our model outperformed a baseline model that used surgical information alone, as is recommended by current blood ordering guidelines. Although our model requires further prospective validation and implementation, it is an important first step toward personalized surgical blood orders and has the potential to improve patient safety and reduce healthcare costs.

### Acknowledgments

The authors thank Derek Harford, B.A., and Alex Kronzer, B.A. (Washington University School of Medicine, St. Louis, Missouri), and Kevin Heard, B.A. (BJC Healthcare, St. Louis, Missouri), for their assistance obtaining the data used for this study.

### Research Support

Supported in part by National Institutes of Health (Bethesda, Maryland) grant No. 5T32GM108539-07 (to Dr. Lou).

### Competing Interests

Dr. Hall is consulting director of the American College of Surgeons National Surgical Quality Improvement Program (Chicago, Illinois). Dr. Kannampallil has consulting relationships with Pfizer Inc. (New York, New York) and Elsevier (Amsterdam, The Netherlands) that are unrelated to this work. The other authors declare no competing interests.

### References

1. American Society of Anesthesiologists Task Force on Perioperative Blood Management: Practice guidelines for perioperative blood management: An updated report by the American Society of Anesthesiologists Task Force on Perioperative Blood Management. Anesthesiology 2015; 122:241–75
2. Friedman BA: An analysis of surgical blood use in United States hospitals with application to the maximum surgical blood order schedule. Transfusion 1979; 19:268–78
3. Dexter F, Ledolter J, Davis E, Witkowski TA, Herman JH, Epstein RH: Systematic criteria for type and screen based on procedure’s probability of erythrocyte transfusion. Anesthesiology 2012; 116:768–78
4. Frank SM, Rothschild JA, Masear CG, Rivers RJ, Merritt WT, Savage WJ, Ness PM: Optimizing preoperative blood ordering with data acquired from an anesthesia information management system. Anesthesiology 2013; 118:1286–97
5. Woodrum CL, Wisniewski M, Triulzi DJ, Waters JH, Alarcon LH, Yazer MH: The effects of a data driven maximum surgical blood ordering schedule on preoperative blood ordering practices. Hematology 2017; 22:571–7
6. Geißler RG, Franz D, Buddendick H, Krakowitzky P, Bunzemeier H, Roeder N, Van Aken H, Kessler T, Berdel W, Sibrowski W, Schlenke P: Retrospective analysis of the blood component utilization in a university hospital of maximum medical care. Transfus Med Hemother 2012; 39:129–38
7. Frisch NB, Wessell NM, Charters MA, Yu S, Jeffries JJ, Silverton CD: Predictors and complications of blood transfusion in total hip and knee arthroplasty. J Arthroplasty 2014; 29:189–92
8. Hayn D, Kreiner K, Ebner H, Kastner P, Breznik N, Rzepka A, Hofmann A, Gombotz H, Schreier G: Development of multivariable models to predict and benchmark transfusion in elective surgery supporting patient blood management. Appl Clin Inform 2017; 8:617–31
9. Mathis MR, Kheterpal S, Najarian K: Artificial intelligence for anesthesia: What the practicing clinician needs to know: More than black magic for the art of the dark. Anesthesiology 2018; 129:619–22
10. Jalilian L, Cannesson M: Precision medicine in anesthesiology. Int Anesthesiol Clin 2020; 58:17–22
11. Nuttall GA, Santrach PJ, Oliver WC Jr, Ereth MH, Horlocker TT, Cabanela ME, Trousdale RT, Bryant S, Currie TW: A prospective randomized trial of the surgical blood order equation for ordering red cells for total hip arthroplasty patients. Transfusion 1998; 38:828–33
12. van Klei WA, Moons KG, Leyssius AT, Knape JT, Rutten CL, Grobbee DE: A reduction in type and screen: Preoperative prediction of RBC transfusions in surgery procedures with intermediate transfusion risks. Br J Anaesth 2001; 87:250–7
13. Palmer T, Wahr JA, O’Reilly M, Greenfield ML: Reducing unnecessary cross-matching: A patient-specific blood ordering system is more accurate in predicting who will receive a blood transfusion than the maximum blood ordering system. Anesth Analg 2003; 96:369–75
14. Mitterecker A, Hofmann A, Trentino KM, Lloyd A, Leahy MF, Schwarzbauer K, Tschoellitsch T, Böck C, Hochreiter S, Meier J: Machine learning-based prediction of transfusion. Transfusion 2020; 60:1977–86
15. Walczak S, Velanovich V: Prediction of perioperative transfusions using an artificial neural network. PLoS One 2020; 15:e0229450
16. Jalali A, Lonsdale H, Zamora LV, L, Nguyen ATH, Rehman M, Fackler J, Stricker PA, Fernandez AM; Pediatric Craniofacial Collaborative Group: Machine learning applied to registry data: Development of a patient-specific prediction model for blood transfusion requirements during craniofacial surgery using the pediatric craniofacial perioperative registry dataset. Anesth Analg 2021; 132:160–71
17. Collins GS, Reitsma JB, Altman DG, Moons KG: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): The TRIPOD statement. Ann Intern Med 2015; 162:55–63
18. Shiloach M, Frencher SK Jr, Steeger JE, Rowell KS, Bartzokis K, Tomeh MG, Richards KE, Ko CY, Hall BL: Toward robust information: Data quality and inter-rater reliability in the American College of Surgeons National Surgical Quality Improvement Program. J Am Coll Surg 2010; 210:6–16
19. Frank SM, Oleyar MJ, Ness PM, Tobian AA: Reducing unnecessary preoperative blood orders and costs by implementing an updated institution-specific maximum surgical blood order schedule and a remote electronic blood release system. Anesthesiology 2014; 121:501–9
20. Mak PH, Campbell RC, Irwin MG; American Society of Anesthesiologists: The ASA Physical Status classification: Inter-observer consistency. Anaesth Intensive Care 2002; 30:633–40
21. Sankar A, Johnson SR, Beattie WS, Tait G, Wijeysundera DN: Reliability of the American Society of Anesthesiologists Physical Status scale in clinical practice. Br J Anaesth 2014; 113:424–32
22. Hall BL, Hamilton BH, Richards K, Bilimoria KY, Cohen ME, Ko CY: Does surgical quality improve in the American College of Surgeons National Surgical Quality Improvement Program: An evaluation of all participating hospitals. Ann Surg 2009; 250:363–76
23. Hamilton BH, Ko CY, Richards K, Hall BL: Missing data in the American College of Surgeons National Surgical Quality Improvement Program are not missing at random: Implications and potential impact on quality assessments. J Am Coll Surg 2010; 210:125–139.e2
24. Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 2005; 67:301–20
25. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. Monterey, CA, Wadsworth & Brooks/Cole Advanced Books & Software, 1984. Available at: https://catalogue.library.cern/literature/1fzjd-7yq74. Accessed April 23, 2021.
26. Breiman L: Random forests. Mach Learn 2001; 45:5–32
27. Chen T, Guestrin C: XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, Association for Computing Machinery, 2016, pp 785–94
28. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É: Scikit-learn: Machine learning in Python. J Mach Learn Res 2011; 12:2825–30
29. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW; Topic Group “Evaluating diagnostic tests and prediction models” of the STRATOS Initiative: Calibration: The Achilles heel of predictive analytics. BMC Med 2019; 17:230
30. Vickers AJ, Van Calster B, Steyerberg EW: Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 2016; 352:i6
31. Vickers AJ: Decision curve analysis. 2015. Available at: www.decisioncurveanalysis.org. Accessed August 21, 2021.
32. Centers for Medicare and Medicaid Services: Clinical laboratory fee schedule. 2020.
33. Diprose WK, Buist N, Hua N, Thurier Q, Shand G, Robinson R: Physician understanding, explainability, and trust in a hypothetical machine learning risk calculator. J Am Med Inform Assoc 2020; 27:592–600
34. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, Katz R, Himmelfarb J, Bansal N, Lee SI: From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020; 2:56–67
35. Dietterich TG: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 1998; 10:1895–923
36. Pempe C, Werdehausen R, Pieroh P, Federbusch M, Petros S, Henschler R, Roth A, Pfrepper C: Predictors for blood loss and transfusion frequency to guide blood saving programs in primary knee- and hip-arthroplasty. Sci Rep 2021; 11:4386
37. Nestor B, McDermott MBA, Boag W, Berner G, Naumann T, Hughes MC, Goldenberg A, Ghassemi M: Feature robustness in non-stationary health records: Caveats to deployable model performance in common clinical machine learning tasks. Proc Mach Learn Res 2019; 106:381–405
38. Dutton RP, Shih D, Edelman BB, Hess J, Scalea TM: Safety of uncrossmatched type-O red cells for resuscitation from hemorrhagic shock. J Trauma 2005; 59:1445–9
39. Napolitano LM, Kurek S, Luchette FA, Anderson GL, Bard MR, Bromberg W, Chiu WC, Cipolle MD, Clancy KD, Diebel L, Hoff WS, Hughes KM, Munshi I, Nayduch D, Sandhu R, Yelon JA, Corwin HL, Barie PS, Tisherman SA, Hebert PC; EAST Practice Management Workgroup; American College of Critical Care Medicine (ACCM) Taskforce of the Society of Critical Care Medicine (SCCM): Clinical practice guideline: Red blood cell transfusion in adult trauma and critical care. J Trauma 2009; 67:1439–42
40. Vickers AJ, Elkin EB: Decision curve analysis: A novel method for evaluating prediction models. Med Decis Making 2006; 26:565–74