Interest in developing and using novel biomarkers in critical care and perioperative medicine is increasing. Biomarker studies are often presented with flaws in the statistical analysis that preclude them from providing a scientifically valid and clinically relevant message for clinicians. To improve scientific rigor, the proper application and reporting of traditional and emerging statistical methods (e.g., machine learning) in biomarker studies is required. This Readers’ Toolbox article aims to be a starting point for nonexpert readers and investigators to understand traditional and emerging research methods to assess biomarkers in critical care and perioperative medicine.
Biomarkers are increasingly used as personalized markers of diagnosis, in the assessment of disease severity or risk, and to prognosticate and guide clinical decisions.1,2 Biomarkers exploring the cardiovascular system, the kidneys, and inflammation have proliferated in critical care and perioperative medicine. While existing guidelines identify key information to report in a biomarker study, they do not explicitly provide guidance on appropriate statistical methods.3–5 The use of inappropriate statistical methods for assessing the clinical value of biomarkers obfuscates any meaningful interpretation and usability of the study findings for clinicians.1,2
This article does not aim to be an exhaustive review of biostatistical and methodologic issues, but rather, to be a starting point for nonexpert readers and investigators to understand traditional and emerging research methods used to assess biomarkers in critical care and perioperative medicine. We provide toolboxes with reporting checklists to assist authors and readers in the use of these statistical methods.
Different Biomarker Development Phases
A biomarker may have several roles in clinical practice. It may provide a diagnosis, have a prognostic role, be used to assess treatment responsiveness, or guide the use of pharmaceuticals in treatment. Biomarkers have also been proposed to identify critically ill patients’ molecular subphenotypes, regardless of outcome. Examples of biomarkers with different roles used in critical care and perioperative medicine are presented in table 1.6–14
Appropriate statistical analysis plan established before biomarker analysis
Valid methods based on the clinical question/hypothesis, biomarker phase of development (e.g., discovery, evaluation of accuracy, and assessment of incremental value), and weaknesses of the statistical methods
Avoidance of common pitfalls in biomarker studies (e.g., not considering properties of a biomarker assay, biomarker kinetics, imperfect accepted standard methods, and different populations)
Inappropriate use of machine learning algorithms that results in overfitting models, lack of independent validation, and lack of comparison with simpler modeling approaches
Biomarker development is a multiphased process requiring different statistical methods to accomplish the various objectives. The three phases of biomarker development, in chronological order, include (1) discovery; (2) evaluation of predictive (or diagnostic) accuracy; and (3) assessment of incremental value when added to existing clinical prediction (or diagnostic) tools.15
Statistical Methods to Evaluate Biomarkers
In the early phase of biomarker development, the association between a biomarker and the outcome is often assessed using regression models and the reporting of odds/hazard ratios or estimates of relative risks to quantify this association, preferably including an assessment of their value over established biomarkers or clinical characteristics. A prospective design is preferable as it facilitates clear inclusion criteria, data collection procedures (minimizing missing data), and standardization of measurements, and ensures all relevant clinical information is measured. Registering a protocol and prespecifying study objectives, biomarkers of interest, and statistical methods will reduce publication bias and selective reporting.1
A commonly used approach to estimate biomarker discrimination and the incremental value of a biomarker is to calculate the area under the receiver operating characteristic curve (AUC).15 The receiver operating characteristic curve is formed by plotting false positive rates (1 − specificity) on the x-axis and the true positive rates (sensitivity) on the y-axis. The AUC quantifies the discriminative ability of the biomarker ranging from 0.5 (i.e., no better than flipping a coin) to 1 (i.e., perfect discrimination). Discrimination is the ability of the biomarker to differentiate those with and without the event (e.g., quantifying whether those with the event tend to have higher biomarker values compared to those who do not).
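The receiver operating characteristic construction above can be illustrated with simulated biomarker values (scikit-learn is assumed; the group means and sample sizes are invustrative inventions):

```python
# Sketch: receiver operating characteristic curve and AUC for one biomarker.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Patients with the event tend to have higher biomarker values
y = np.concatenate([np.zeros(300), np.ones(100)]).astype(int)
values = np.concatenate([rng.normal(5.0, 1.5, 300),    # no event
                         rng.normal(7.0, 1.5, 100)])   # event

fpr, tpr, thresholds = roc_curve(y, values)  # x: 1 - specificity, y: sensitivity
auc = roc_auc_score(y, values)
# auc near 0.5 = no better than chance; near 1.0 = perfect discrimination
```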
So-called “optimal” biomarker thresholds are often determined by maximizing the Youden index (maximum [sensitivity + specificity − 1]).1 The Youden index identifies the value of the biomarker that maximizes the sum of sensitivity and specificity. However, such an approach is problematic if the biomarker is used to either rule out (high sensitivity) or confirm (high specificity) a diagnosis, in which case negative and positive likelihood ratios can be used to select thresholds. The 95% CI around any “optimal” cutoff should be reported (e.g., obtained by bootstrap resampling).1,16 Furthermore, dichotomization (and indeed, categorization) of a biomarker is biologically implausible, as no threshold of a biomarker exists that causes a sudden change in risk (e.g., there is typically no reason why a person’s risk on either side of a cut-point will be dramatically different).
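A minimal sketch of the Youden index threshold with a bootstrap 95% CI, as suggested above (simulated data; 500 bootstrap replicates is an arbitrary illustrative choice):

```python
# Sketch: "optimal" cutoff by Youden index, with bootstrap uncertainty.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(300), np.ones(100)]).astype(int)
x = np.concatenate([rng.normal(5.0, 1.5, 300), rng.normal(7.0, 1.5, 100)])

def youden_threshold(y, x):
    fpr, tpr, thr = roc_curve(y, x)
    return thr[np.argmax(tpr - fpr)]  # maximizes sensitivity + specificity - 1

cutoff = youden_threshold(y, x)

# Bootstrap resampling to express uncertainty around the "optimal" cutoff
boots = []
for _ in range(500):
    idx = rng.integers(0, len(y), len(y))
    boots.append(youden_threshold(y[idx], x[idx]))
ci_low, ci_high = np.percentile(boots, [2.5, 97.5])
```

The width of the bootstrap interval makes visible how unstable a single “optimal” cutoff can be across samples.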
Categorization (including dichotomization) of a continuous measurement (e.g., biomarkers) should therefore be avoided during statistical analysis, as it will result in a loss of information and negatively impact predictive accuracy.17–19 The statistical analysis should ideally retain continuous measurements on their original scale, allowing for nonlinear relationships to be considered (using restricted cubic splines or fractional polynomials).20
To assess the incremental value of a novel biomarker when added to a clinical model or a standard biomarker, the difference in AUC between two prediction models (improvement in discrimination) is often used.21 Methods such as the DeLong nonparametric test and the Hanley and McNeil method are then used to compare the AUC of the biomarker under investigation against that of an already established biomarker or clinical model assessed in the same set of individuals.22,23 The main limitation of comparing AUCs is that a relatively large “independent” association is needed for the new biomarker to yield a meaningfully larger AUC. In response to the insensitivity of comparing AUCs, reclassification methods (e.g., net reclassification index, integrated discrimination index) have been proposed and are described in table 2.8 However, despite their popularity, it has since been shown that these approaches offer little more than existing approaches and can be unreliable in certain situations.24 Reclassification methods have been shown to have inflated false positive rates when testing the improved predictive performance of a novel biomarker.25,26 Approaches based on net benefit using decision analytic methods are now widely recommended, as they allow for meaningful assessment of a new biomarker against an established biomarker or combination of biomarkers by weighing the benefit of correct decisions (true positives) against their relative harms (false positives).21,27,28 The comparison is made across all (or a range of) thresholds to evaluate whether the new biomarker has added clinical utility.
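The net benefit calculation underlying decision-analytic methods is simple enough to sketch directly; here two hypothetical models (an established one and one that adds the new biomarker) are compared against a treat-all strategy on simulated data (all risks and noise levels are invented):

```python
# Sketch: net benefit across a range of risk thresholds (decision curve idea).
import numpy as np

def net_benefit(y, p, threshold):
    """Net benefit = TP/n - FP/n * pt/(1 - pt) at risk threshold pt."""
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    n = len(y)
    return tp / n - fp / n * threshold / (1 - threshold)

rng = np.random.default_rng(3)
n = 1000
risk_true = rng.beta(2, 8, n)                  # each patient's underlying risk
y = rng.binomial(1, risk_true)
p_old = np.clip(risk_true + rng.normal(0, 0.15, n), 0.01, 0.99)  # established model
p_new = np.clip(risk_true + rng.normal(0, 0.05, n), 0.01, 0.99)  # model + biomarker

thresholds = np.arange(0.05, 0.50, 0.05)
nb_old = [net_benefit(y, p_old, t) for t in thresholds]
nb_new = [net_benefit(y, p_new, t) for t in thresholds]
nb_all = [np.mean(y) - (1 - np.mean(y)) * t / (1 - t) for t in thresholds]  # treat all
```

Plotting nb_old, nb_new, and nb_all against the thresholds gives a decision curve; the model with the biomarker has added clinical utility where its curve sits highest.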
Looney SW, Hagan JL. Analysis of biomarker data: A practical guide. Hoboken, New Jersey, John Wiley & Sons, Inc., 2015
An introduction to biomarker analysis that includes the principles of good research study design; also contains SAS and R-based statistical packages
Rabbee N. Biomarker analysis in clinical trials with R. New York, Chapman and Hall/CRC, 2020
Describes the design and the statistical analysis plan of biomarker trials and covers the topic of combining multiple biomarkers to predict drug response/outcome using machine learning; reproducible codes and examples are provided in R
Clinical Risk Prediction Models Using Biomarkers
Clinical prediction models are typically developed using regression models (e.g., logistic regression or Cox regression). Logistic regression is mainly used for short-term binary outcomes (e.g., mortality, postoperative myocardial infarction), while survival methods (such as Cox regression) are used for time-to-event outcomes and allow for censoring. Methods for handling missing data should be considered before analysis (e.g., multiple imputation).29 Predictors with a high amount of missing data can be problematic, indicating the measurement is infrequently performed in daily practice and potentially limiting a biomarker model’s usefulness. The choice of which variables to include in a model needs consideration: variables should have clinical relevance and be readily available at the intended moment of use of the model. The functional form of any continuous variables (e.g., biomarkers) should be appropriately investigated using fractional polynomials or restricted cubic splines to fully capture any nonlinearity in the association of the continuous variables with the outcome.17,20 The number of candidate predictors to consider in multivariable modeling has historically been constrained relative to the number of outcome events, a concept called events-per-variable, to minimize the risk of overfitting (a condition where a statistical model describes random variation in the data rather than the true underlying relationship).30 It was widely recommended that studies should only be carried out when the events-per-variable exceeds 10.
However, the events-per-variable concept has recently been refuted as having no strong scientific grounds.31,32 More recently, context-specific sample size formulae have been developed to minimize the potential for overfitting; the required sample size depends not only on the number of events relative to the number of candidate predictors (i.e., those considered for inclusion, not necessarily those that end up in the final model), but also on the total number of participants, the outcome proportion, and the expected predictive performance.33
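As a hedged illustration, one shrinkage-based criterion from this literature (ref. 33) for binary-outcome models can be coded as follows; the target shrinkage of 0.9 and the anticipated Cox-Snell R² are assumptions the investigator must supply from prior studies:

```python
# Sketch of one shrinkage-based minimum sample size criterion for developing
# a binary-outcome prediction model (one of several criteria in ref. 33):
# choose n so that the expected uniform shrinkage S stays near 1 (here 0.9).
import numpy as np

def min_sample_size(n_candidate_predictors, r2_cox_snell, shrinkage=0.9):
    """n >= p / ((S - 1) * ln(1 - R2_CS / S)); p counts candidate predictors."""
    p = n_candidate_predictors
    return int(np.ceil(p / ((shrinkage - 1) * np.log(1 - r2_cox_snell / shrinkage))))

# e.g., 10 candidate predictors (biomarkers + clinical variables) and an
# anticipated Cox-Snell R^2 of 0.1 taken from prior literature:
n_required = min_sample_size(10, 0.10)
```

Note how the answer scales with the number of candidate predictors, not with the variables that survive into the final model.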
The biomarkers and other explanatory variables should not be highly correlated to avoid redundancy and collinearity (e.g., blood urea nitrogen and creatinine levels in acute kidney injury).49
The selection of predictor variables should depend on clinical relevance in addition to statistical results. Adjustment for covariates that influence the pharmacokinetics of a biomarker (e.g., timing, age, renal function) should be performed.49
Dichotomization or categorization of continuous biomarkers should be avoided. It is biologically implausible and will result in a loss of information and a negative impact on predictive accuracy.18
The necessary sample size should be calculated a priori.19
Penalized regression methods (e.g., ridge regression, least absolute shrinkage and selection operator, elastic net) should be considered when developing prediction models for low dimensional data with few events to minimize overfitting.35-37
Both discrimination and calibration should be used to assess the accuracy of biomarker regression models.4
Methods for handling missing data should be considered (e.g., multiple imputation).29
External validation (e.g., assessing model performance in other participant data than was used for the model development) is necessary for determining generalizability.4,49 Interlaboratory biomarker assay reproducibility needs to be considered.
The use of penalized regression methods (e.g., least absolute shrinkage and selection operator, ridge regression, elastic net) can be considered, since they facilitate the choice of variables to be included in the model while minimizing overfitting (table 2).34–36 However, it has been reported that penalized approaches do not necessarily solve problems associated with small sample sizes.37 General and biomarker-specific considerations for developing multivariable prediction models are summarized in Box 3.
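A sketch of lasso-penalized logistic regression on a hypothetical panel of 20 candidate biomarkers, of which only three carry signal (scikit-learn; the penalty strength C is an arbitrary illustrative choice that would normally be tuned by cross-validation):

```python
# Sketch: L1 (lasso) penalized logistic regression shrinks some coefficients
# exactly to zero, performing variable selection while limiting overfitting.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 300, 20                       # 20 candidate biomarkers, few informative
X = rng.normal(size=(n, p))
logit = 1.0 * X[:, 0] + 0.8 * X[:, 1] - 0.9 * X[:, 2]   # only 3 true signals
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

Xs = StandardScaler().fit_transform(X)   # penalization requires comparable scales
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xs, y)

n_selected = np.sum(lasso.coef_ != 0)    # biomarkers retained by the lasso
```

Standardizing the predictors before penalization matters: otherwise the penalty falls unevenly on biomarkers measured on different scales.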
More recently, machine learning methods have been gaining interest as an alternative approach to regression-based models in critical care and perioperative medicine.38–40 Algorithms that improve the clinical use of biomarkers have been developed with machine learning.41,42 A practical definition of machine learning is that it uses algorithms that automatically learn (i.e., are trained) from data, contrary to clinical prediction models, which are based on prespecifying predictors and their functional forms. These algorithms are divided into two categories: supervised and unsupervised. Supervised machine learning algorithms are used to uncover the relationship between a set of clinical features and biomarkers and known outcomes (predictive and prognostic models; Supplemental Digital Content 1, https://links.lww.com/ALN/C503).34 The main supervised learning algorithms (e.g., artificial neural network, tree-based methods, support vector machines) are described in table 2. Supervised conventional statistical modeling (e.g., logistic regression) and supervised machine learning should be considered complementary rather than mutually exclusive.43,44 Marafino et al. used a set of vital signs and biologic data from the first 24 h of admission for more than 100,000 unique intensive care unit (ICU) patients in a supervised machine learning algorithm, incorporating measures of clinical trajectory to develop and validate ICU mortality prediction models. The developed prediction model for mortality risk, leveraging serial data points for each predictor variable, exhibited discrimination comparable to classical mortality scores (e.g., Simplified Acute Physiology Score III and Acute Physiologic Assessment and Chronic Health Evaluation IV scores).41 In another example, Zhang et al. developed a prediction machine learning model that was used to differentiate between volume-responsive and volume-unresponsive acute kidney injury (AKI) in 6,682 critically ill patients. 
The extreme gradient boosting combined with a decision tree model was reported to outperform the traditional logistic regression model in differentiating the two groups.42
Machine learning is often claimed to have superior performance in high-dimensional settings (i.e., with a large number of explanatory variables). However, there is limited evidence to support this claim in fair and meaningful comparisons with regression-based approaches, as observed in a recent systematic review that showed no performance benefit in clinical studies.45 While machine learning algorithms are often declared to perform well, they require very large datasets, massive computations, and sufficient expertise.46 As such, they should not be considered as an “easy path to perfect prediction.” Limitations include overfitting, which captures random errors in the training dataset and makes the algorithm not generalizable to future predictions.47 Approaches to control for overfitting should be adapted from the established clinical prediction model literature to provide an unbiased assessment of predictive accuracy. The other disadvantage of supervised machine learning algorithms is that the underlying association between covariates and outcome cannot be fully understood by clinicians (“black box” models).48 Conversely, in logistic regression models, the regression coefficient of each covariate can be easily interpreted as the odds ratio (exponentiation of the regression coefficient), which reflects the magnitude of the association with the outcome. A causal interpretation of any association in a prediction model should be avoided, as the aim of a prediction model is to predict and not attribute causality.49 The interpretation of a model that includes biomarkers reflecting distinct pathophysiological pathways (e.g., myocardial injury, endothelial dysfunction) and their associations with outcome is more intuitive for clinicians when using classical regression models than machine learning algorithms.
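Such a fair, like-for-like comparison can be sketched by evaluating both model classes on the same data with cross-validated AUC (simulated data with a deliberately linear signal, a setting in which the simpler model should be competitive):

```python
# Sketch: benchmarking a tree-based machine learning method against plain
# logistic regression with 5-fold cross-validated AUC on identical data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 600
X = rng.normal(size=(n, 5))                    # hypothetical biomarkers/vitals
logit = 0.9 * X[:, 0] - 0.7 * X[:, 1] + 0.5 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # linear truth: regression suffices

auc_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc").mean()
auc_gb = cross_val_score(GradientBoostingClassifier(random_state=0), X, y,
                         cv=5, scoring="roc_auc").mean()
# With a linear data-generating mechanism, the simpler model is competitive.
```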
Regardless of whether more traditional regression-based approaches or modern machine learning methods have been used to develop a prediction model, predictive accuracy can be assessed with several metrics. The two widely recommended measures are calibration and discrimination.4,49 Calibration assesses how well the risk predicted from the model agrees with the actual observed risk. Calibration can be assessed graphically by plotting the observed risk of the outcome (e.g., mortality, postoperative AKI) against the predicted risk.50 Discrimination is a measure of how well the biomarker model can separate those who have and those who do not have the outcome of interest (mainly evaluated by the AUC). Another measure of predictive accuracy is the Brier score (the squared difference between patient outcome and predicted risk), which combines aspects of discrimination and calibration. However, it has been suggested that the Brier score does not appropriately evaluate the clinical utility of diagnostic tests or prediction models.51 In practice, no one measure is enough, and the use of multiple metrics characterizing different components of prediction accuracy is required.52
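Both metrics can be computed in a few lines (scikit-learn; the simulated model here is well calibrated by construction, so the grouped observed risks should sit near the diagonal):

```python
# Sketch: calibration (observed vs. predicted risk by decile of prediction)
# and the Brier score for a set of predicted risks. Simulated data.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(6)
n = 2000
p_true = rng.uniform(0.05, 0.6, n)             # each patient's true risk
y = rng.binomial(1, p_true)
p_model = np.clip(p_true + rng.normal(0, 0.03, n), 0.01, 0.99)  # near-perfect model

# Grouped observed vs. predicted risk; points near the diagonal = good calibration
obs, pred = calibration_curve(y, p_model, n_bins=10)
brier = brier_score_loss(y, p_model)           # mean squared error of predicted risk
```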
Assessing model performance is a vital step. During the development of a prediction model, internal validation, using cross-validation or bootstrapping, which mimics the uncertainty in the model-building process and uses only the original study sample to assess model performance, should be carried out.4,49 The reason to carry out an internal validation is to obtain a bias-corrected estimate of model performance; for regression-based models, the regression coefficients can subsequently be shrunk to correct for overfitting.54 A stronger test of a model is to carry out an external validation, which consists of assessing the performance (discrimination and calibration) of the prediction model in different participant data than was used for the model development (typically collected from different institutions).4,49 It is often expected that upon external validation, the calibration of the model will be poorer, and methods to recalibrate the model should be considered.55
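Bootstrap internal validation with optimism correction can be sketched as follows: refit the model on each bootstrap sample, measure how much better it looks on its own bootstrap data than on the original data, and subtract that average optimism from the apparent AUC (simulated data; 100 replicates is kept small purely for illustration):

```python
# Sketch: bootstrap optimism-corrected AUC, a standard internal validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n, p = 150, 10
X = rng.normal(size=(n, p))                    # small sample, many predictors
logit = 0.8 * X[:, 0] - 0.6 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(max_iter=1000).fit(X, y)
auc_apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])  # optimistic

optimism = []
for _ in range(100):
    idx = rng.integers(0, n, n)
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)       # how much the refit flatters itself

auc_corrected = auc_apparent - np.mean(optimism)  # bias-corrected estimate
```

The gap between the apparent and corrected AUC quantifies the overfitting that a naive in-sample evaluation would hide.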
Phenotyping and Clustering Using Biomarkers
Unsupervised machine learning algorithms are used to identify naturally occurring clusters or subphenotypes of patients who have similar clinical or biologic/molecular features without targeting a specific outcome (Supplemental Digital Content 2, https://links.lww.com/ALN/C504).55,56 Several popular unsupervised learning algorithms (e.g., latent class analysis, cluster analysis) are described in table 2.
An example of using this method in critical care is in personalized medicine research. Patients sharing the same clinical/biologic characteristics are more likely to respond to targeted treatments (e.g., ventilation strategy, fluid administration strategy, statins).13,14,57 For example, Calfee et al. identified two different subphenotypes in acute respiratory distress syndrome (ARDS) patients using latent class analysis (mainly based on clinical data and inflammatory biomarkers) with a different response to a positive end-expiratory pressure (PEEP) strategy.13 The same group identified two different subphenotypes of ARDS in the Hydroxymethylglutaryl-CoA Reductase Inhibition with Simvastatin in Acute Lung Injury to Reduce Pulmonary Dysfunction cohort, with distinct clinical and biologic features (cytokines) and different clinical outcomes. The hyperinflammatory subphenotype had improved survival with simvastatin compared with placebo.57 Finally, Seymour et al. retrospectively identified four different phenotypes (mainly based on markers of inflammation, coagulation, and renal injury) in sepsis with different responses to early goal-directed therapy.14
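Latent class analysis itself is usually run in specialized software, but the underlying model-based clustering idea can be sketched with a Gaussian mixture, including BIC-based selection of the number of classes (two simulated subphenotypes with shifted biomarker profiles; all values are invented):

```python
# Sketch: unsupervised subphenotyping from biomarker profiles via a Gaussian
# mixture model, with the number of classes chosen by BIC (as is common in
# latent class analysis).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
# Two simulated subphenotypes with shifted inflammatory-biomarker profiles
group_a = rng.normal(loc=[5.0, 10.0, 2.0], scale=1.0, size=(200, 3))
group_b = rng.normal(loc=[9.0, 14.0, 5.0], scale=1.0, size=(100, 3))
X = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

# Fit mixtures with 1-4 classes and keep the one with the lowest BIC
bics = {k: GaussianMixture(k, random_state=0).fit(X).bic(X) for k in (1, 2, 3, 4)}
best_k = min(bics, key=bics.get)
labels = GaussianMixture(best_k, random_state=0).fit_predict(X)
```

No outcome variable appears anywhere in the fit; the classes are defined purely by the biomarker profiles, which is the defining feature of unsupervised analysis.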
Challenges and Common Pitfalls in Studies Evaluating Biomarkers
Properties of Biomarker Assay
The precision of the measurement of a biomarker should be assessed. Along this line, the biologic assay and its measurement errors should be reported. The biomarker assay should be sensitive, detecting low concentrations of the biomarker, and specific, in that it is not affected by other molecules. Interlaboratory biomarker assay reproducibility should be considered when assessing a biomarker model performance in a cohort collected from a different institution (external validation).
Another potential issue is that the same biomarker can be produced by different cells with a different pathway mechanism. For example, urinary kidney injury molecule-1 (a biomarker of kidney injury) can also be produced by kidney cancer cells in the absence of kidney injury.58,59 This point is difficult to control when analyzing data, as the physiology of a novel biomarker is often incompletely known.
Role of Time and Biomarker Kinetics
The timing of biomarker measurement is important to consider. For example, optimal information needed for the diagnosis of myocardial infarction in the postoperative period is obtained at the peak of troponin I concentrations (~24 h).
In major surgery and critical care, biomarkers of interest such as troponin T, N-terminal pro-B-type natriuretic peptide, and C-reactive protein may have completely different kinetics.60 The main issue in these conditions is the timing of biomarker measurement, which has to take into consideration not only the biomarker kinetics, but also the time of onset of various pathophysiological processes (e.g., major surgery with a secondary onset of sepsis). Correlations between repeated measurements of the biomarker within an individual should also be considered during analysis. The use of mixed models instead of repeated-measures analysis of variance offers distinct advantages in many instances.61
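A random-intercept mixed model for repeated biomarker measurements can be sketched with statsmodels (simulated longitudinal C-reactive protein-like data; the variable names and effect sizes are invented):

```python
# Sketch: linear mixed model with a random intercept per patient, accounting
# for the correlation between repeated biomarker measurements.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n_patients, n_times = 50, 4
patient = np.repeat(np.arange(n_patients), n_times)
time = np.tile(np.arange(n_times, dtype=float), n_patients)
intercepts = rng.normal(0, 2.0, n_patients)            # between-patient variation
crp = 10.0 + 1.5 * time + intercepts[patient] + rng.normal(0, 1.0, len(time))

df = pd.DataFrame({"patient": patient, "time": time, "crp": crp})
fit = smf.mixedlm("crp ~ time", df, groups=df["patient"]).fit()
slope = fit.params["time"]   # estimated change in the biomarker per time unit
```

Unlike repeated-measures analysis of variance, this model tolerates missing visits and unequally spaced measurements without discarding patients.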
Another issue is that renal or hepatic function could influence the elimination of a biomarker and thus its diagnostic properties. This point is important to consider in elderly patients (with chronic organ dysfunction), as well as in major surgery and critical care patients, who are more likely to present with organ failure.
Along this line, the choice of the “optimal” biomarker measurement timing and adjustment for covariates (e.g., age, renal function) are a real challenge when including them in regression models and machine learning algorithms with clinical parameters gathered in real time.
Imperfect Accepted Standard Methods
The choice of the reference test used to define diseased and nondiseased patients (e.g., postoperative AKI, postoperative myocardial infarction) should be carefully considered. Novel biomarkers are frequently evaluated against accepted standards that are assumed to classify patients with perfect accuracy according to the presence or absence of disease. In practice, reference tests are rarely unerring predictors of disease and tend to misclassify patients. In the case of an imperfect accepted standard (e.g., delayed increase in serum creatinine in the case of AKI62), patient misclassification introduces biases into the sensitivity and specificity estimates of the new biomarker. One of the main methods suggested to improve an “imperfect” reference standard is the use of composite reference standards. The rationale is that combining results of different imperfect tests leads to a more accurate reference test. Nevertheless, the accuracy of this approach has been questioned.63
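The bias introduced by an imperfect reference standard is easy to demonstrate by simulation; in the sketch below the new biomarker is perfect by construction, yet its apparent sensitivity and specificity fall below 1.0 purely because the reference misclassifies patients (the 90%/95% reference accuracy figures are arbitrary assumptions):

```python
# Sketch: how an imperfect reference standard biases the apparent accuracy
# of a new test, even when the new test is flawless.
import numpy as np

rng = np.random.default_rng(10)
n = 10000
disease = rng.binomial(1, 0.2, n)            # true (unobservable) disease status

biomarker_pos = disease.copy()               # new biomarker: perfect by construction

# Imperfect reference standard: 90% sensitivity, 95% specificity
ref = np.where(disease == 1,
               rng.binomial(1, 0.90, n),
               rng.binomial(1, 0.05, n))

# Apparent performance, computed against the imperfect reference
apparent_sens = np.mean(biomarker_pos[ref == 1])
apparent_spec = np.mean(1 - biomarker_pos[ref == 0])
# Both are biased below the true value of 1.0
```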
There are some situations in which the outcome is not dichotomous (diseased or nondiseased patients) but continuous (e.g., creatinine level variation) or ordinal (e.g., AKI network stages). In this case, a nonparametric estimator of the novel biomarker diagnostic accuracy with an interpretation analogous to the AUC can be applied.64
The studied population could greatly influence the diagnostic and prognostic performance of a test. For example, there are different cutoff points of cardiac troponin I to diagnose postoperative myocardial infarction in noncardiac versus cardiac surgery, or even in cardiac surgery patients with different procedures (coronary artery bypass graft vs. valve surgery).65 Diagnostic test results may also vary in populations with different demographic characteristics and chronic illnesses (e.g., age, chronic kidney disease). Therefore, authors should describe the exact population about which they want to make inferences. Adjustment for covariates (external influences) is a major point when including biomarkers in regression models.
Associated Clinical Predictors or Multiple Biomarkers
To assess associated clinical predictors or multiple biomarkers regarding an outcome, a risk prediction model could be developed using logistic regression or Cox regression. Two models are then built, the first with the usual predictors and the second with the usual predictors plus the novel biomarkers, and compared based on the difference in AUC or in the Harrell C-statistic.49,66 A multiple biomarker approach could also be applied. For example, stratification of long-term outcome is improved when adding several novel biomarkers of cardiac (N-terminal pro-B-type natriuretic peptide and soluble ST2) and vascular failure (bioactive adrenomedullin) to the multivariable clinical model.67
Conceptual issues related to the planning and analysis of biomarker performance are presented in Box 4. This methodologic approach could reduce bias and thus yield a more reliable estimation of the biomarker performance. A summary of the most common avoidable pitfalls is presented in Box 5.
Severity assessment and risk stratification
Prediction of treatment effects or therapeutic monitoring1
Ideally: Prospective, multicenter study
Sample size consideration
Population: Clear description, sufficient number of events19
Receiver operating characteristic curve analysis (area under the curve [95% CI], sensitivity and specificity at multiple thresholds)19
Comparison with established biomarkers or clinical parameters using decision-analytic methods (e.g., net benefit)21,27,28
If multivariable regression model, assessment of collinearity of factors and biomarkers49
Describe any variable selection procedures
If multiple biomarkers
Biomarker evaluations need a rigorously documented statistical analysis plan, which should be set up before analysis. Investigators need to choose methods based on the clinical question/hypothesis, biomarker phase of development (i.e., discovery, evaluation of accuracy, assessment of incremental value), and weaknesses of the statistical methods. Biomarker studies are often presented with statistical analysis pitfalls (e.g., not considering properties of a biomarker assay, biomarker kinetics, imperfect accepted standard methods, and different populations) that preclude them from providing a pragmatic scientific message for anesthesiologists and intensivists. Therefore, the tables and toolboxes provided in this article could be used in addition to existing guidelines by investigators, editors, and reviewers to ensure the publication of high-quality biomarker studies for informed readers.
Furthermore, novel biostatistical techniques (e.g., machine learning) are increasingly used in critical care and perioperative medicine research. Machine learning is a promising tool to improve outcome prediction and patient subphenotyping to personalize treatments in critically ill patients. However, we believe that there is a real need for further research to better evaluate the role of machine learning in predicting pathology or response to treatment. Implementing machine learning directly in clinical decision making without proper validation is as deleterious for patients as a poorly implemented statistical approach. Tables are provided in this article to help the reader better understand machine learning techniques applied in health care and to avoid their misuse (e.g., overfitting, lack of independent validation, lack of comparison with simpler modeling approaches).
The precision of measurement of the biomarker should be assessed and reported.1
The measurement of the biomarker should be sensitive (detects low concentration) and specific (without interferences with other molecules) and be reported.
Interlaboratory biomarker assay reproducibility should be considered when assessing a biomarker model performance in a cohort collected from a different institution (external validation).1
The timing of biomarker measurement is important to consider. For example, in the postoperative period, the maximum amount of information to diagnose myocardial infarction is obtained at the peak of troponin I (~24 h).60
Correlations between repeated measurements of the biomarker within an individual should also be considered during analysis. The use of mixed models instead of repeated-measures ANOVA offers distinct advantages in many instances.61
Renal or hepatic dysfunction in critically ill patients could influence the elimination of a biomarker and thus its diagnostic properties. In this condition, adjustment for these covariates should be performed.1
The choice of the reference test used to define diseased and nondiseased patients should be carefully considered.62
In the case of an imperfect accepted standard (e.g., delayed increase in serum creatinine in the case of acute kidney injury), the classification potential of the new biomarker could be falsely decreased.
The studied population could greatly influence the diagnostic and prognostic performance of a test—for example, cardiac troponin I to diagnose postoperative myocardial infarction in noncardiac versus cardiac surgery.65
Authors should describe the exact studied population about which they want to make inference.
Support was provided solely from institutional and/or departmental sources.
Dr. Jüni has received honoraria to the institution for participation in advisory boards from Amgen; has received research grants to the institution from AstraZeneca (Cambridge, United Kingdom), Biotronik (Berlin, Germany), Biosensors International (Singapore), Eli Lilly (Indianapolis, Indiana), and The Medicines Company (Parsippany-Troy Hills, New Jersey); and serves as an unpaid member of the steering groups of trials funded by AstraZeneca (Cambridge, United Kingdom), Biotronik (Berlin, Germany), Biosensors (Singapore), St. Jude Medical (St. Paul, Minnesota), and The Medicines Company. Dr. Mebazaa has received speaker’s honoraria from Abbott (Chicago, Illinois), Orion (Auckland, New Zealand), Roche (Basel, Switzerland), and Servier (Suresnes, France); and fees as a member of the advisory boards and/or steering committees and/or research grants from BMS (New York, New York), Adrenomed (Hennigsdorf, Germany), Neurotronik (Durham, North Carolina), Roche (Basel, Switzerland), Sanofi (Paris, France), Sphyngotec (Hennigsdorf, Germany), Novartis (Basel, Switzerland), Otsuka (Chiyoda City, Tokyo, Japan), Philips (Amsterdam, Netherlands) and 4TEEN4 (Hennigsdorf, Germany). Dr. Gayat received fees as a member of the advisory boards and/or steering committees and/or from research grants from Magnisense (Paris, France), Adrenomed (Hennigsdorf, Germany), and Deltex Medical (Chichester, United Kingdom). The remaining authors declare no competing interests.