An appropriate measure of performance is needed to identify anesthetic depth indicators that are promising for use in clinical monitoring. To avoid misleading results, the measure must take into account both desired indicator performance and the nature of available performance data. Ideally, anesthetic depth indicator value should correlate perfectly with anesthetic depth along a lighter-deeper anesthesia continuum. Experimentally, however, a candidate anesthetic depth indicator is judged against a "gold standard" indicator that provides only quantal observations of anesthetic depth. The standard anesthetic depth indicator is the patient's response to a specified stimulus. The resulting observed anesthetic depth scale may consist only of patient "response" versus "no response," or it may have multiple levels. The measurement scales for both the candidate anesthetic depth indicator and observed anesthetic depth are no more than ordinal; that is, only the relative rankings of values on these scales are meaningful.

Criteria were established for a measure of anesthetic depth indicator performance, and the performance measure that best met these criteria was identified.

The performance measure recommended by the authors is prediction probability P_K, a rescaled variant of Kim's d_{y·x} measure of association. This performance measure shows the correlation between anesthetic depth indicator value and observed anesthetic depth, taking into account both desired performance and the limitations of the data. Prediction probability has a value of 1 when the indicator predicts observed anesthetic depth perfectly, and a value of 0.5 when the indicator predicts no better than a 50:50 chance. Prediction probability avoids the shortcomings of other measures. For example, as a nonparametric measure, P_K is independent of scale units and does not require knowledge of underlying distributions or efforts to linearize or to otherwise transform scales. Furthermore, P_K can be computed for any degree of coarseness or fineness of the scales for anesthetic depth indicator value and observed anesthetic depth; thus, P_K fully uses the available data without imposing additional arbitrary constraints, such as the dichotomization of either scale. And finally, P_K can be used to perform both grouped- and paired-data statistical comparisons of anesthetic depth indicator performance. Data for comparing depth indicators, however, must be gathered via the same response-to-stimulus test procedure and over the same distribution of anesthetic depths.

Prediction probability P_K is an appropriate measure for evaluating and comparing the performance of anesthetic depth indicators.

A monitor of anesthetic depth [1–3] during general anesthesia would be useful for assessing a patient's response to anesthetic agents and for titrating administration of the agents. Several anesthetic depth indicators have been suggested for monitoring purposes, including those based on hemodynamics, [4–6],* spontaneous electroencephalogram, [4,5,7],*,** auditory evoked potential, [8,9] spontaneous electromyogram, [10,11] esophageal contractility, [12,13] pupillary reflex, [4,14,15] skin conductivity, [16] and anesthetic concentration [4,17] and delivery rate. [18] Combinations of variables, using discriminant analysis, [19,20],*** multivariate logistic regression, [21,22] and neural networks, [20],*** also are being considered. An appropriate performance measure is needed to evaluate and compare candidate indicators of anesthetic depth. Such a measure could help identify indicators that merit further investigation.

A wide variety of possible performance measures are available. Among these are measures of difference or separation, such as the familiar parametric t and F statistics and the nonparametric Mann-Whitney statistic [23]; classification measures, such as sensitivity, specificity, and percent correct [24,25]; receiver operating characteristic (ROC) measures, such as parametric and nonparametric ROC area [25–27]; quantal response curve measures, such as the likelihood value and the slope parameter [28–31]; and parametric and nonparametric correlation measures, such as the Pearson product-moment correlation, [23] the biserial, [32] and the Spearman correlation coefficients, [23] and a variety of measures of association. [33–39]

A performance measure for assessing proposed anesthetic depth indicators must be selected with care to avoid false conclusions about the potential utility of an indicator or how it compares with other indicators. The measure should have a meaningful interpretation and permit statistical comparisons of performance. More specifically, the measure should take into account both the relationship desired between anesthetic depth indicator value and anesthetic depth and the nature of the available experimental data. In doing so, it should not require unjustified assumptions or ignore available information in the data. Previous measures have not adequately met these requirements.

We now present a new measure, prediction probability (P_K), for evaluating and comparing anesthetic depth indicators. This article is divided into three main sections. In the first section, we characterize the problem of assessing anesthetic depth indicators to identify the specific criteria that a performance measure should meet. In the second section, we present P_K as a measure that meets these criteria. Specifically, we define and interpret P_K, and describe how it is computed and used. We illustrate the application of P_K both on hypothetical data and on the experimental data of Leslie et al., [4] the first investigators to use this measure to compare candidate indicators of anesthetic depth. In the third section, we contrast the proposed measure to a variety of alternative measures.

## Background: Assessing Anesthetic Depth Indicator Performance

### Measurement Scales

Observed Versus Underlying Anesthetic Depth Scale. Anesthetic depth often is defined in terms of a response-to-stimulus test, such as whether or not the patient moves to skin incision [40,41] or responds to a voice command. [42,43] Such a "gold standard" anesthetic depth indicator defines a dichotomous scale of observed anesthetic depths having the two levels "response" and "no response." We assume that this observation scale is a coarse lumping of an underlying anesthetic depth continuum. The test stimulus defines a critical threshold point, dividing the underlying continuum into the two observed levels. A patient at a depth of anesthesia less than this critical point moves when stimulated, whereas a patient at a depth of anesthesia greater than this point does not. A stimulus of a different strength can define a different threshold point on the underlying depth continuum. [7,17,44,45] For example, the stimulus of suturing during skin closure defines a lesser threshold point than that from a 25-cm surgical incision. [17]

The scale of observed anesthetic depth can have more than two levels. One way to define a multilevel scale is to apply a graded sequence of stimuli. For example, the application of two stimuli, the first being weak and the second strong, could define the three ordered levels, "response to weak stimulus," "response to strong but not to weak stimulus," and "no response." A multilevel observed depth scale also can be created for a single-strength stimulus. For example, patient response to a single stimulus could be graded more finely into the categories "unequivocal move," "equivocal move," and "no move." [8] Or, for a repeated stimulus, the scale could be more finely graded by measuring the percentage of the patient's appropriate responses to the stimulus. [9]

The level of measurement of the observed anesthetic depth scale is no more than ordinal, in contrast to interval or ratio. That is, anesthetic levels can be rank ordered along the scale, but neither the sizes of intervals nor ratios of levels are meaningful in the context of depth of anesthesia. For example, consider the three-level observed depth scale, "response to weak stimulus," "response to strong but not to weak stimulus," and "no response." These levels can be rank ordered, so the scale is ordinal, but it is not an interval scale, because we cannot say how the difference between the levels of "response to weak stimulus" and "response to strong but not to weak stimulus" compares with the difference between the levels of "response to strong but not to weak stimulus" and "no response."

Anesthetic Depth Indicator Scale. The anesthetic depth indicator scale also is ordinal. Usually, the anesthetic depth indicator scale is finely graded compared with the observed depth scale, but it can be coarse and even dichotomous. As a specific example of a possible anesthetic depth indicator, consider the amplitude of the pupillary reflex to a light flash. [4,14,15] Theoretically, reflex amplitude is a continuous variable, although experimentally, it is made discrete by the limits of resolution for processing, display, or data storage. Reflex amplitudes can be rank ordered, so the anesthetic depth indicator scale is at least ordinal. One might argue that the reflex amplitude scale is interval or ratio, because the scale has units of length, and differences and ratios of length have meaning. However, within the context of predicting anesthetic depth, there is no uniform meaning over the range of reflex amplitudes of a given difference between, or ratio of, reflex amplitudes. Furthermore, because we do not know the form of the bivariate distribution of anesthetic depth and reflex amplitude, or even have as yet a meaningful interval or ratio scale for anesthetic depth, we do not know how to transform reflex amplitude into a variable for which intervals or ratios are meaningful.

It may be argued that the measurement scale for anesthetic concentration goes beyond ordinal, to interval or ratio, especially in view of minimum alveolar concentration additivity. [46] However, if an anesthetic depth indicator that has an interval or ratio output scale is to be compared with an indicator that has an output scale that is no more than ordinal, the same measure of performance must be used for both, namely, a performance measure applicable to ordinal scales.

### Ideal Versus Actual Performance of an Anesthetic Depth Indicator

Ideal Performance. Ideally, an anesthetic depth indicator would change in a continuous fashion to show the changes in a patient's underlying anesthetic depth, as illustrated by the hypothetical curve in Figure 1(a). In the figure, the horizontal axis shows anesthetic depth indicator value, x, increasing to the right, and the vertical axis shows the underlying anesthetic depth, y_u, increasing downward. These directional conventions are a compromise between those commonly used for graphs and for tables; we use the same conventions later for tables. We use x for anesthetic depth indicator value and y_u for anesthetic depth to denote that during clinical monitoring we want the known indicator value, x, to predict the unknown anesthetic depth, y_u.

Mathematically, anesthetic depth y_u in Figure 1(a) is a monotonically increasing function of anesthetic depth indicator value x. This type of relationship is ideal, because it ensures that anesthetic depth indicator value can predict anesthetic depth perfectly and that a change in y_u is reflected by a change in the same direction in x. To say that y_u is a mathematical function of x means that there is one and only one y_u for each value of x; we want this mathematical relationship because clinically we want to use x to predict y_u. The statement "y_u is a function of x" is not intended to suggest that anesthetic depth y_u is caused by indicator x. "Monotonically increasing" means that y_u increases as x increases; that is, the slope of y_u versus x is positive (recall that y_u increases downward in Figure 1). If desired, a nonlinear transformation could be used to straighten the curve.

Experimentally, values of observed, not underlying, anesthetic depth are available. Thus, anesthetic depth indicator performance can be assessed only in terms of how well indicator value correlates with or predicts observed anesthetic depth. Figure 1(b) illustrates the effect of the quantal nature of the observed anesthetic depth scale on the hypothetical ideal relationship in Figure 1(a). For generality, we show two threshold points on the continuum, y_{u1} and y_{u2}, which divide the continuum into a three-level observed depth scale, with quantal levels y_1, y_2, and y_3. As a result, the smoothly changing curve in Figure 1(a) is reduced to a series of steps in Figure 1(b). Mathematically, observed anesthetic depth y in Figure 1(b) is a monotonically nondecreasing function of anesthetic depth indicator x. That is, the mathematical curve of y versus x can have zero and positive, but not negative, slope (again, recall that y increases downward in the figure). Given this relationship, anesthetic depth indicator value can predict observed anesthetic depth perfectly.

Actual Performance. Now consider an actual, non-ideal anesthetic depth indicator used on a population of patients. At each point on the underlying anesthetic depth continuum, there is a distribution or spread of indicator values. Consequently, each indicator value corresponds to a distribution or spread of underlying anesthetic depths. Figure 2 illustrates a population distribution of hypothetical data points, each point consisting of an anesthetic depth indicator value and an underlying anesthetic depth. To create the figure, the x-y_u plane from Figure 1(a) was rotated and tilted, and a z direction was added. The x axis is west-to-east, the y_u axis is north-to-south, and the z axis elevates upward out of the plane. The curved ridge or hill elevated above the plane represents the population distribution. For comparison, the dashed curve on the x-y_u plane shows the original hypothetical ideal curve from Figure 1(a). The west-east dark line at anesthetic depth y_{u1} illustrates the spread in anesthetic depth indicator values at a given point on the depth continuum. Similarly, the north-south dark line at indicator value x_1 illustrates the spread in anesthetic depth corresponding to a given anesthetic depth indicator value.

The population distribution in Figure 2 degrades the indicator's ability to predict observed, as well as underlying, anesthetic depth. The effect of this distribution on the graph of observed anesthetic depth versus indicator value in Figure 1(b) would be to increase the lengths of the line segments at the different levels of observed depth, resulting in horizontal overlapping of these segments. A given indicator value then would correspond to more than one observed anesthetic depth. That is, observed anesthetic depth no longer would be a monotonically nondecreasing function of anesthetic depth indicator value, and indicator value no longer could predict observed depth perfectly.

### Performance Measurement Criteria

In summary, to assess anesthetic depth indicator performance, we want a measure of the correlation or association between indicator value and underlying anesthetic depth. Experimentally, however, we can measure only observed, not underlying, anesthetic depth. In view of this limitation, we define ideal indicator performance to mean the perfect prediction by indicator value x of observed anesthetic depth y. Anesthetic depth indicator value and observed anesthetic depth are ordinal variables. Therefore, ideal indicator performance is achieved when observed anesthetic depth y is mathematically a monotonically nondecreasing function of indicator value x. We want a performance measure that shows how close the relationship between observed anesthetic depth and indicator value is to this goal. The performance measure should use fully the experimental data for any degree of coarseness or fineness of the indicator and observed anesthetic depth scales. In addition, as stated earlier, the measure should have a meaningful interpretation and permit statistical comparisons of performance.

## Recommended Performance Measure: Prediction Probability P_K

The performance measure we recommend to meet these criteria is prediction probability P_K, a type of nonparametric correlation known as a measure of association. Measures of association are attractive for our application because they are suited to ordinal variables and can accommodate variable scales having any degree of coarseness or fineness. Measures of association are not new, but there are many to choose from, and insight is still being developed as to their properties and appropriate use. Relatively recently, Freeman identified which measure of association was appropriate for each of several models of ideal relationship between the variables. [47] Our ideal is for observed anesthetic depth y to be a monotonically nondecreasing mathematical function of anesthetic depth indicator x. For this ideal, Kim's d_{y·x} [37] is the appropriate measure of association. [47] We have rescaled d_{y·x} to create P_K, a measure having an interpretation that is simpler and more meaningful for our application. We use the subscript K in P_K to denote its close relationship to Kim's measure.

Measures of association, which apply to cross-tabulations of ordinal variables, are not widely familiar. Therefore, before proceeding with P_K, we show that experimental data on anesthetic depth indicators can be expressed in a tabular format. We also review the terminology of ordinal relationships and restate the performance desired of an anesthetic depth indicator in this terminology.

### Tabular Presentation of Anesthetic Depth Indicator Data

Because the scales for both anesthetic depth indicator value and observed anesthetic depth are no more than ordinal, the information in experimental data resides in the rank order of the data points along these scales, not their specific numeric values and units. Thus, without any loss of information, we can present experimental data in table form, with column index increasing to the right and showing increasing rank of indicator values, and row index increasing downward and showing increasing rank of observed anesthetic depth.

To express the experimental data in table form, we do not need to "bin" the data. Instead, on each scale, we let each distinct experimental value define a distinct rank; that is, each distinct indicator value in the data sample defines a column, and each distinct observed anesthetic depth value in the data sample defines a row. Data for continuous, as well as discrete, scales can be put in table form in this way.
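This tabulation can be sketched in code. The sketch below is ours, not part of the original study's software, and the function name is illustrative; it simply lets each distinct indicator value define a column and each distinct observed depth define a row, with cell entries counting data points.

```python
# Illustrative sketch: cross-tabulating (indicator value, observed depth)
# data without binning. Each distinct x value defines a column and each
# distinct y value a row; cells count co-occurring data points.
from collections import Counter

def cross_tabulate(xs, ys):
    cols = sorted(set(xs))   # distinct indicator values, in rank order
    rows = sorted(set(ys))   # distinct observed depths, in rank order
    counts = Counter(zip(xs, ys))
    return [[counts[(x, y)] for x in cols] for y in rows]
```

For example, cross_tabulate([1, 2, 2, 3], [1, 1, 2, 2]) yields [[1, 1, 0], [0, 1, 1]], a two-row, three-column table of counts.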

As an example, consider the six hypothetical data points shown on the ideal relationship between underlying anesthetic depth and indicator value in Figure 1(a) and replotted on the three-level observed anesthetic depth scale in Figure 1(b). Table 1 shows these same six data points. Each cell entry in the table shows the number of data points that have that particular combination of indicator value (column) and observed anesthetic depth (row). Because observed anesthetic depth y in Figure 1(b) is a monotonically nondecreasing function of indicator value x, all of the data points in each succeeding row of Table 1 lie to the right of all of the data points in previous rows. An anesthetic depth indicator that generated these data should be judged to perform perfectly.

Table 2 shows a more general set of hypothetical data points, assumed to be a sampling from some unknown bivariate distribution of underlying anesthetic depth and indicator value, such as that in Figure 2. Observed anesthetic depth again has three levels. Table 2 shows nonideal indicator performance, because there is overlap from row to row.

### Desired Performance Expressed in the Terminology of Ordinal Relationships

The relationship between ordinal variables x (indicator value) and y (observed anesthetic depth) is described in terms of the rank ordering of the x and y values for pairs of data points. A concordance occurs when the x values and y values for a pair of data points are rank ordered in the same direction. For example, in Table 2, the point in cell B2 and a point in cell A1 or in cell C3 compose a concordance. A discordance occurs when the x and y values are rank ordered in opposite directions (e.g., the point in cell B2 and a point in cell C1). An indicator-only, or x-only, tie (tie in indicator value but not in observed depth) occurs when the x values are tied but the y values are not (e.g., the point in cell B2 and a point in cell B1). Similarly, a depth-only, or y-only, tie occurs when the x values are not tied but the y values are (e.g., the point in cell B2 and a point in cell D2). Finally, a joint tie occurs when there are ties in both x and in y (e.g., any two data points in cell D2).

Our ideal is for indicator value x to predict observed anesthetic depth y perfectly. Therefore, concordances are desired, because the rank order of the x values correctly predicts the rank order of the y values. Discordances are undesirable, because x order incorrectly predicts y order. Indicator-only ties also are undesirable, because the values of x provide no predictive information about the rank order of the y values; the order of the y values is then only a guess, with a 50:50 chance of being correct. Ties in y (both y-only and joint ties) are not caused by the indicator, but by the experimental limitations that result in a quantal scale of observed anesthetic depth, so ties in y should not be considered in the evaluation of indicator performance.

Thus, an ideal relationship between indicator value x and observed anesthetic depth y consists of concordances, with no discordances or x-only ties, and with ties in y tolerated. A measure of anesthetic depth indicator performance should reward concordances, penalize discordances and x-only ties, and ignore ties in y (both y-only and joint ties).
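These pair types can be made concrete with a short sketch (ours, with illustrative names, not part of the original study's software) that classifies any pair of data points:

```python
# Illustrative sketch: classify a pair of data points, each an (x, y)
# tuple with x = indicator value and y = observed anesthetic depth.
def pair_type(p, q):
    (x1, y1), (x2, y2) = p, q
    if y1 == y2:
        return "y-tie"          # y-only or joint tie; ignored by P_K
    if x1 == x2:
        return "x-only tie"     # indicator offers no ordering information
    # both x and y differ: same direction is a concordance
    return "concordance" if (x1 - x2) * (y1 - y2) > 0 else "discordance"
```

For instance, points with x and y rank ordered in the same direction classify as a concordance, and points tied in x but not in y classify as an x-only tie.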

### Definition and Interpretation of P_K

Prediction probability P_K is a variant of Kim's d_{y·x} [37] measure of association. Kim's d_{y·x} is defined for ordinal variables x and y in terms of the types of pairs of data points just described. Let P_c, P_d, and P_{tx} be the respective probabilities that two data points drawn at random, independently and with replacement, from the population are a concordance, a discordance, or an x-only tie. The only other possibility is that the two data points are tied in observed depth y; therefore, the sum of P_c, P_d, and P_{tx} is the probability that the two data points have distinct values of observed anesthetic depth, that is, that they are not tied in y.

Kim's d_{y·x} is defined to be

d_{y·x} = (P_c - P_d) / (P_c + P_d + P_{tx}).   (1)

Alternatively, we define prediction probability P_K to be

P_K = (d_{y·x} + 1) / 2,   (2)

which, by inserting (1) into (2), becomes

P_K = (P_c + (1/2) P_{tx}) / (P_c + P_d + P_{tx}).   (3)

Thus, P_K and Kim's d_{y·x} differ in scale and range of values but convey the same information. As desired, both Kim's d_{y·x} and P_K reward concordances, penalize discordances and indicator-only ties, and ignore ties in observed depth y. The range for Kim's d_{y·x} is from -1 to +1, while that for P_K is from 0 to 1. When the probabilities of discordance and indicator-only tie are both zero, d_{y·x} and P_K both equal 1. When the probability of discordance equals that of concordance, d_{y·x} = 0 and P_K = 0.5. A negative value of d_{y·x}, or a value of P_K less than 0.5, means that discordances are more likely than concordances.

The advantage of prediction probability P_K over d_{y·x} is its simple interpretation as a probability that directly relates to the goal of using indicator value to predict observed anesthetic depth. Specifically, given two randomly selected data points with distinct observed anesthetic depths, P_K is the probability that the indicator values of the data points predict correctly which of the data points is the lighter (or deeper). Appendix 1 supports this interpretation. A value of P_K = 0.5 means that the indicator correctly predicts the anesthetic depths only 50% of the time, i.e., no better than a 50:50 chance. A value of P_K = 1 means that the indicator predicts the anesthetic depths correctly 100% of the time.

In contrast, though Kim's d_{y·x} embodies the same information as P_K, its interpretation is a more abstract and cumbersome difference between two probabilities. Specifically, given two randomly selected data points with distinct observed anesthetic depths, Kim's d_{y·x} is the probability that the two data points are concordant, minus the probability that the two data points are discordant.
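To make the definition concrete, the following is a minimal sketch (ours, not the authors' PKMACRO; names are illustrative) that estimates P_K from a data sample by direct pair counting: concordances receive full credit, x-only ties receive half credit (a 50:50 guess), and ties in y are ignored.

```python
# Minimal sketch: sample estimate of P_K by counting concordances (pc),
# discordances (pd), and x-only ties (ptx) over all pairs of data points.
from itertools import combinations

def prediction_probability(xs, ys):
    pc = pd = ptx = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if y1 == y2:
            continue                          # ties in y are ignored
        if x1 == x2:
            ptx += 1                          # x-only tie: 50:50 guess
        elif (x1 - x2) * (y1 - y2) > 0:
            pc += 1                           # concordance
        else:
            pd += 1                           # discordance
    return (pc + 0.5 * ptx) / (pc + pd + ptx)
```

For perfectly ordered data the function returns 1; when discordances exactly balance concordances it returns 0.5.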

### Estimation of P_K

Prediction probability P_K is computed from sample data by replacing the probabilities in Equation 3 with sample estimates. Table 1 and Table 2 are examples of sample data. Estimates of P_K and its standard error (SE) can be obtained by using the jackknife method or, more traditionally, by using the relationship of P_K to other measures of association. Calculations can be performed by using a custom spreadsheet macro, PKMACRO,**** or commercial statistical software. The program PKMACRO provides both jackknife and the more traditional estimates of P_K and its SE.

We recommend using the jackknife method [48,49] to estimate P_K and its SE. An advantage of the jackknife method is that sampling variability can be approximated by the Student's t distribution, [48] thus taking into account sample size. Another advantage is that the jackknife method makes possible paired-data, as well as grouped-data, statistical comparisons of P_K values. Finally, the jackknife method reduces bias in the estimation of P_K, although, as we shall see, bias may not be a significant concern. The jackknife method assumes independent data points.

At the outset, the application of the jackknife method to P_K appears computationally forbidding. For a sample of n data points, the method requires the computation of n + 1 estimates of P_K, one on the entire n-sample and n more on the (n - 1)-size samples obtained by deleting each of the n data points one at a time. Fortunately, it is possible to take advantage of the mathematical structure of P_K to speed up these computations. As a result, PKMACRO performs the jackknife computations rapidly.
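A brute-force version of this jackknife procedure can be sketched as follows (our illustration, without the speedups used by PKMACRO; it assumes independent data points and that every leave-one-out sample retains at least two distinct observed depths):

```python
# Brute-force jackknife estimate of P_K and its SE: the n leave-one-out
# estimates are converted to pseudovalues, whose mean and spread give
# the jackknife point estimate and standard error.
from itertools import combinations
from math import sqrt

def pk(xs, ys):
    pc = pd = ptx = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if y1 == y2:
            continue
        if x1 == x2:
            ptx += 1
        elif (x1 - x2) * (y1 - y2) > 0:
            pc += 1
        else:
            pd += 1
    return (pc + 0.5 * ptx) / (pc + pd + ptx)

def pk_jackknife(xs, ys):
    n = len(xs)
    full = pk(xs, ys)
    loo = [pk(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) for i in range(n)]
    pseudo = [n * full - (n - 1) * v for v in loo]   # jackknife pseudovalues
    est = sum(pseudo) / n
    se = sqrt(sum((p - est) ** 2 for p in pseudo) / (n * (n - 1)))
    return est, se
```

For ideal data, every leave-one-out estimate equals 1, so the jackknife estimate is 1 with an SE of 0, matching the values reported below for Table 1.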

For the nonideal data in Table 2, the sample estimate of prediction probability computed using Equation 3 is P_K = 0.867. The jackknife method gives nearly the same value, P_{Kjack} = 0.866, suggesting that there is little or no bias in P_K. The jackknife SE estimate is sigma_{P_Kjack} = 0.070. Alternatively, for the ideal data in Table 1, P_K = P_{Kjack} = 1 with a jackknife SE of 0. These values were computed using PKMACRO.

A more traditional approach to computing P_K and its SE is to use the close relationship between Kim's d_{y·x} [37] and an older measure of association, Somers' d_{yx}. [35] This approach also assumes independent data points. It permits grouped-data, but not paired-data, comparisons of P_K values. An advantage of this approach is that computations can be performed using commercial statistical programs for Somers' measure. Previous work on Somers' measure [39] can be used to show that P_K is asymptotically unbiased and Gaussian. There are no clear guidelines, however, on the minimum sample size required to assume this asymptotic distribution.

As reviewed in Appendix 2, Kim's d_{y·x} is structurally equal to Somers' d_{xy} (note the reversed subscripts), and Somers' d_{xy} can be found using commercial software programs, such as BMDP [50] (BMDP Statistical Software, Los Angeles, CA), SPSS [51] (SPSS, Chicago, IL), and SAS [52] (SAS Institute, Cary, NC). Thus, P_K can be computed by inserting d_{y·x} = d_{xy} into Equation 2. The commercial programs compute Goodman and Kruskal's [39] approximate standard error (ASE) of d_{xy}, denoting it S_1, or approximate standard error 1. By Equation 2, the corresponding SE of P_K, sigma_{P_K1}, equals one half of S_1. For the nonideal data in Table 2, BMDP program 4F [50] gives d_{xy} = 0.734 and S_1 = 0.132. Therefore, by Equation 2, P_K = (1 + 0.734)/2 = 0.867, the same value for P_K obtained previously, and sigma_{P_K1} = 0.132/2 = 0.066, which is close to the jackknife SE shown previously. For the ideal data in Table 1, the more traditional approach results in P_K = 1 with an SE of 0, the same values obtained using the jackknife method.

Brown and Benedetti [53] recommended an adjusted version of S_1, denoted S_0, when using a sparse data table to test the null hypothesis of no association at a small level of significance (e.g., 0.01). The BMDP and SPSS software packages provide this alternative SE estimate for Somers' measure as a t value equal to d_{xy}/S_0. The corresponding SE estimate for P_K is sigma_{P_K0} = S_0/2; in terms of P_K, the associated t value equals (P_K - 0.5)/sigma_{P_K0}. For Table 2, BMDP gives t = 5.643, showing that there is statistically significant association at P = 0.0001. Correspondingly, S_0 = 0.130 and sigma_{P_K0} = 0.130/2 = 0.065, which is close to the previous values of sigma_{P_K1} and sigma_{P_Kjack}. The program PKMACRO computes both sigma_{P_K1} and sigma_{P_K0}.

Leslie et al. [4] applied P_K to experimental data for ten candidate indicators for monitoring anesthetic depth. In their study of human volunteers given propofol and nitrous oxide, observations of anesthetic depth consisted of noting whether or not a volunteer moved in response to electrical stimulation. This stimulus corresponds to a depth threshold point, such as y_{u1} in our Figure 2, dividing the depth scale into two observed levels, "move" and "no move."

Jackknife estimates of P_K on the 130-stimulus data for the ten candidates (Table 2 of Leslie et al. [4]) ranged from 0.736 for propofol blood concentration to 0.864 for the Bispectral Index of the electroencephalogram. The jackknife estimates were close to the traditional estimates. Specifically, the values of P_{Kjack} and P_K agreed out to at least 13 decimal places, again suggesting negligible bias in P_K. The values of sigma_{P_Kjack} were greater than those of sigma_{P_K1} by an average of 1.5% and were less than those of sigma_{P_K0} by an average of 4.9%.

### Hypothesis Tests on P_K

Test of an Indicator's Ability to Predict Observed Anesthetic Depth. One useful statistical test of indicator performance is whether P_K is different from 0.5, the value for an indicator that has no predictive power. For an n-sample, this test can be performed by using the t statistic (P_{Kjack} - 0.5)/sigma_{P_Kjack} with n - 1 degrees of freedom. For our Table 2, this statistic has the value 5.227. This value is consistent with the previously mentioned t statistic associated with sigma_{P_K0}, again showing the presence of statistically significant predictive ability at P = 0.0001.
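Numerically, the test is simple; the sketch below is our illustration, using the rounded Table 2 jackknife values reported earlier in the text (0.866 and 0.070), which reproduce the statistic to about two decimal places.

```python
# Sketch: t statistic (n - 1 degrees of freedom) for testing whether
# P_K differs from 0.5, the value for an indicator with no predictive power.
def pk_t_statistic(pk_jack, se_jack):
    return (pk_jack - 0.5) / se_jack

# Rounded Table 2 jackknife values from the text:
t = pk_t_statistic(0.866, 0.070)   # close to the reported 5.227
```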

For the ten candidate indicators investigated by Leslie et al., [4] the 130-stimulus t statistic values for jackknife hypothesis tests of the predictive ability of each variable ranged from 4.87 for propofol blood concentration to 11.05 for the Bispectral Index. Even with a Bonferroni correction for multiple comparisons, [54] all ten indicators showed statistically significant abilities to predict observed anesthetic depth at P = 0.001.

Test to Compare the Performance of Two Indicators. Another useful test is a comparison of the performance of two anesthetic depth indicators. If the data are collected independently on the two indicators, the test statistic is the difference between the two sample values of P_K, divided by an estimated SE of this difference. If jackknife results are available, a t test can be performed. The choice of method for determining the estimated SE of the difference and the number of degrees of freedom for this test depends on whether or not the variances of the jackknife pseudovalues [48,49] for the two indicators are equal. [23] Alternatively, if the sampling variabilities for the two estimates of P_K are assumed to be Gaussian, then the estimated SE of the difference is the square root of the sum of the squares of the estimated SEs of the two sample P_K values, and the test statistic is Gaussian.
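Under the Gaussian assumption just described, the grouped-data comparison reduces to a z statistic; the following is our illustrative sketch, with hypothetical input values.

```python
# Sketch: z statistic for comparing two independently estimated P_K
# values, assuming Gaussian sampling variability for each estimate.
from math import sqrt

def pk_difference_z(pk1, se1, pk2, se2):
    return (pk1 - pk2) / sqrt(se1 ** 2 + se2 ** 2)
```

For example, hypothetical indicators with P_K estimates of 0.90 (SE 0.03) and 0.80 (SE 0.04) give z = 0.10/0.05 = 2.0.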

If paired data on the two indicators are collected and the indicators are positively correlated, we can avail ourselves of the greater statistical power of a paired-data method of comparing indicator performance. Paired-data comparisons can be performed using PKMACRO. Leslie et al. [4] used the jackknife method to make paired-data comparisons of the P_K values for alternative depth indicators.
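A minimal sketch of a paired-data jackknife comparison, along the lines of the method used by Leslie et al. [4] (our own simplified implementation with hypothetical data; PKMACRO itself is an Excel macro):

```python
import math

def pk(x, y):
    """Prediction probability P_K by direct pair counting."""
    conc = disc = tie_x = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if y[i] == y[j]:
                continue
            if x[i] == x[j]:
                tie_x += 1
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

def paired_pk_t(xa, xb, y):
    """Paired-data comparison of two indicators recorded on the same
    cases: t test on the differences of jackknife pseudovalues
    (n - 1 degrees of freedom)."""
    n = len(y)
    ta, tb = pk(xa, y), pk(xb, y)
    diffs = []
    for i in range(n):
        yi = y[:i] + y[i + 1:]
        pa = n * ta - (n - 1) * pk(xa[:i] + xa[i + 1:], yi)
        pb = n * tb - (n - 1) * pk(xb[:i] + xb[i + 1:], yi)
        diffs.append(pa - pb)
    dbar = sum(diffs) / n
    sd = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))
    return dbar, dbar / (sd / math.sqrt(n))

# hypothetical paired recordings of two indicators on the same 12 cases
y  = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
xa = [0.2, 0.5, 0.9, 1.1, 0.7, 1.4, 1.2, 1.8, 2.1, 1.6, 2.4, 1.0]
xb = [1.1, 0.3, 1.5, 0.8, 1.9, 1.0, 0.9, 1.3, 0.6, 2.0, 1.2, 1.7]

dbar, t_stat = paired_pk_t(xa, xb, y)
```

Here indicator A has the higher sample P_K (about 0.92 versus 0.61), but with only 12 cases the paired t statistic (about 1.4 on 11 degrees of freedom) does not reach significance, underscoring why realistic sample sizes are needed.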

Prediction probability is useful for comparing anesthetic depth indicators because it does not depend on distributional assumptions, the particular type or units of an indicator variable's scale, or the choice of a particular variable threshold value, and because its expected value is asymptotically independent of the number of experimental data points. When comparing indicators, however, it is necessary to gather data using the same stimulus procedure and over the same distribution of anesthetic depths. A good way to ensure appropriate conditions for comparing the performance of two anesthetic depth indicators is to measure the indicator values simultaneously for the same subjects--hence the importance of being able to carry out paired-data comparisons. The sensitivity of P_K to data range is analogous to that of the more familiar Pearson product-moment correlation coefficient, r, when r is used to measure the degree of linear relationship between two interval variables. [55]

## P sub K Versus Alternative Performance Measures

### Separation Measures

One approach to measuring the performance of an anesthetic depth indicator is to determine the difference, or separation, between the populations of indicator value x that correspond to different values of observed anesthetic depth y. Consider first dichotomous observed depth y, with values R for "response" and N for "no response" such that N > R, and let x_R and x_N denote the corresponding values of x.

Perhaps the most familiar separation measure is the t statistic (or its equivalent in this situation, the F statistic), which evaluates separation between the means of two populations. [23] A drawback of this parametric statistic is that its proper interpretation requires the distributions of x_R and x_N to be simultaneously Gaussian with equal variance. Analysis of variance [23] also assumes Gaussian distributions with equal variance. Another drawback of these statistics is that their expected values change with the number of experimental data points, reducing their usefulness for comparisons. In contrast, prediction probability P_K does not require distributional assumptions, and its expected value is asymptotically independent of sample size.

Nonparametric alternatives to the t statistic, such as the Mann-Whitney U statistic, or, equivalently, Kendall's S or the Wilcoxon W, have the advantage that they can be used with ordinal variables. [23,34] These measures are appropriate for testing the hypothesis that the x_R and x_N populations are not separated. The chi-square statistic offers an even less restrictive test of the independence of the x_R and x_N populations. [23] Unlike the value of P_K, however, with its meaningful interpretation as a probability of prediction, the numeric values of these statistics lack intrinsic meanings, and again they vary with sample size.

Prediction probability P_K applies, and retains its meaning, when y has multiple levels. In contrast, while the separation measures can be generalized to test the mutual separation or independence of the distributions of x for multiple y values, they cannot show the presence or degree of the relationship of interest, namely, that x and y increase together.

### Classification and Receiver Operating Characteristic Measures

Another approach to measuring performance is to determine how successfully indicator value x classifies experimental data points according to observed anesthetic depth. Again consider first dichotomous y with values R and N. The x scale then is dichotomized by selecting a threshold, and x less than or greater than the threshold is used to predict that the value of observed depth y is, respectively, R or N. Sensitivity, specificity, and percent correct are common classification measures.

As discussed by Swets and Pickett, [25] these and other related measures have shortcomings. Sensitivity or specificity alone is an incomplete performance measure, because each takes into account only one of the two possible types of classification errors. In contrast, P_K and percent correct take both types of errors into account. The values of sensitivity, specificity, and percent correct depend on the threshold used to dichotomize x. This dependence on threshold is undesirable, because there is no universal criterion for choosing a threshold value for a given anesthetic depth indicator, nor is there a clear way to select equivalent threshold values for alternative indicators. In contrast, P_K takes into account all possible threshold values along the x scale, no matter how coarsely or finely graded the scale is. Another drawback of sensitivity and specificity is that, unlike P_K, they cannot be used beyond dichotomous y.
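The threshold dependence is easy to demonstrate. In the hypothetical example below, the indicator separates the two groups perfectly (P_K = 1), yet sensitivity, specificity, and percent correct change as the arbitrary threshold moves:

```python
def pk(x, y):
    """Prediction probability P_K by direct pair counting."""
    conc = disc = tie_x = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if y[i] == y[j]:
                continue
            if x[i] == x[j]:
                tie_x += 1
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

def classify_stats(x, y, thr, response=1):
    """Sensitivity, specificity, and percent correct when x > thr is used
    to predict "no response" (y != response)."""
    tp = sum(1 for xi, yi in zip(x, y) if xi > thr and yi != response)
    fn = sum(1 for xi, yi in zip(x, y) if xi <= thr and yi != response)
    tn = sum(1 for xi, yi in zip(x, y) if xi <= thr and yi == response)
    fp = sum(1 for xi, yi in zip(x, y) if xi > thr and yi == response)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(x)

# hypothetical perfectly separating indicator (y: 1 = response, 2 = no response)
x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [1, 1, 1, 1, 2, 2, 2, 2]

stats_mid = classify_stats(x, y, 0.5)   # threshold between the groups
stats_low = classify_stats(x, y, 0.35)  # threshold moved into the response group
```

At threshold 0.5 all three classification statistics equal 1, but at threshold 0.35 specificity drops to 0.75 and percent correct to 0.875, while P_K remains 1 regardless of any threshold choice.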

For dichotomous y, parametric [27] and nonparametric [26] ROC areas, like prediction probability P_K, take into account both sensitivity and specificity for all possible threshold values of x. Parametric ROC area has the drawback that, unlike P_K, it requires distributional assumptions--typically, that the x scale is transformable so that x_R and x_N are simultaneously Gaussian, though perhaps with different variances. Like P_K, nonparametric ROC area can be applied to ordinal data, but, unlike P_K, it cannot be used beyond dichotomous y.
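For dichotomous y, the numerical identity between P_K and nonparametric (trapezoidal) ROC area can be checked directly; an indicator-only tie contributes a diagonal ROC segment, which the trapezoidal rule credits with exactly the half weight that P_K assigns it. A small self-contained check:

```python
def pk(x, y):
    """Prediction probability P_K by direct pair counting."""
    conc = disc = tie_x = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if y[i] == y[j]:
                continue
            if x[i] == x[j]:
                tie_x += 1
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

def roc_area(x, y, response=1):
    """Nonparametric ROC area: sweep a threshold over the observed x
    values and integrate the empirical ROC curve by the trapezoidal rule."""
    pos = [v for v, yi in zip(x, y) if yi != response]   # "no response" cases
    neg = [v for v, yi in zip(x, y) if yi == response]   # "response" cases
    pts = [(0.0, 0.0), (1.0, 1.0)]
    for t in sorted(set(x)):
        tpr = sum(v >= t for v in pos) / len(pos)
        fpr = sum(v >= t for v in neg) / len(neg)
        pts.append((fpr, tpr))
    pts.sort()
    return sum((f1 - f0) * (t0 + t1) / 2.0
               for (f0, t0), (f1, t1) in zip(pts, pts[1:]))

# hypothetical data with one indicator-only tie (x = 2 in both groups)
x = [1, 2, 2, 3, 4]
y = [1, 1, 2, 2, 2]
```

For these data P_K = (5 concordances + 0.5 tie)/6 = 11/12, and the trapezoidal ROC area gives the same value.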

Moreover, the statistical assumptions on which ROC area is based do not apply generally to anesthetic depth indicator data. Receiver operating characteristic analysis assumes that x is measured conditioned on y for each of the two values of y. [26] In contrast, and consistent with P_K, data for assessing anesthetic depth indicators usually consist of joint measurements of indicator value x and observed depth y, neither of which is known with certainty beforehand.

### Quantal Response Curves

When observed anesthetic depth y is dichotomous, with values R and N, and x is a suitably transformed indicator value, a simple, two-parameter function, such as the logistic or probit curve, can be fitted to a set of sample data to show the percent probability, versus x, that y = N. [28–31] The appeal of such a "quantal response curve" is that, for the distribution of anesthetic depths on which it is based, it shows directly how to achieve a particular probability of no response to stimulus. Common quantal response curve parameters are the value of x at the 50% probability of no patient response, x_50, and the steepness or slope of the curve, which is reciprocally related to the spread in x of the derivative of the curve.
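As an illustration, the following sketch fits a two-parameter logistic quantal response curve by Newton-Raphson maximum likelihood and reports x_50 and the slope at x_50 (which equals b1/4 for the logistic). The data are hypothetical and deliberately overlapping so that the maximum-likelihood estimates are finite:

```python
import math

def fit_logistic(x, y_n, iters=25):
    """Newton-Raphson maximum-likelihood fit of the logistic quantal
    response curve p(no response | x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_n):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p                # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                    # observed information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# hypothetical transformed indicator values and quantal outcomes
# (1 = no response); the outcomes overlap so the estimates stay finite
x   = [1, 2, 3, 4, 5, 6, 7, 8]
y_n = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(x, y_n)
x50 = -b0 / b1              # x at 50% probability of no response
slope_at_x50 = b1 / 4.0     # steepness of the fitted curve at x50
```

Because these toy data are symmetric about x = 4.5, the fitted x_50 falls at 4.5; the slope parameter summarizes how sharply the probability of no response rises with x.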

Quantal response curves can be used to compare the performance of two anesthetic depth indicators. If the indicators have a common scale, then the better of the two indicators has the quantal response curve with the steeper slope. Another approach to comparing two indicators, say indicator A and indicator B, is to use stepwise regression and a measure such as the likelihood (i.e., the maximized likelihood value). [29–31,50–52] The likelihood is the probability of occurrence of the sample results for the given quantal response curve parameter values; the likelihood increases to 1 when the quantal response curve fits the sample data perfectly.

An example of this approach would be to fit individual logistic curves to the suitably transformed indicators A and B, and then to use bivariate regression to fit a logistic curve to the linear combination of the transformed indicators A and B. If the likelihood value for the bivariate regression is statistically better than the univariate value for indicator A, but not better than the univariate value for indicator B, then indicator B is the better one.

A drawback to using quantal response curves to assess indicator performance is that, unlike P_K, they are limited to dichotomous observed anesthetic depth y. Also unlike P_K, the measure of indicator quality obtained using a quantal response curve depends on whatever nonlinear transformation is first applied to x. We expand on this issue in a later section. If slope parameters are compared, then the transformation used to put both indicators onto a common scale also affects the comparison. As with P_K, sample quantal response curves depend on the distribution of anesthetic depths, so data on indicators to be compared need to be gathered over the same depth distributions.

### Correlation Measures

Prediction probability P_K is a type of correlation measure. The familiar Pearson product-moment correlation coefficient, rho or r, has the drawback that its usual statistical interpretation assumes that x and y are jointly Gaussian. [23] Similarly, the biserial correlation coefficient [32] assumes that y is a dichotomization of an underlying variable that is jointly Gaussian with x. The Spearman rank correlation coefficient, [23] a nonparametric alternative to r, has the advantage of avoiding distributional assumptions. In its basic form, however, it has the drawback of assuming that there are no tied values of x or y, whereas the coarse nature of the observed anesthetic depth y scale typically results in many ties. Although procedures exist to correct for ties, [34] the numeric value of the Spearman correlation coefficient still lacks an intrinsic meaning, in contrast to the interpretation of P_K as a probability of correct prediction.

Prediction probability P_K is a variation of Kim's d_y·x measure of association, which is equivalent to Morris' gamma_k. [36] We mentioned previously that Kim's d_y·x is closely related to Somers' d_yx [35] but that the latter is not appropriate here; as Appendix 2 shows, Somers' d_yx inappropriately ignores indicator-only ties and penalizes depth-only ties. There are several other closely related measures of association, including Kendall's tau_b, [33] Stuart's tau_c, [33] Goodman and Kruskal's gamma, [39] and Wilson's e. [38] These measures also are inappropriate, because the ways they treat ties in the data are inconsistent with the model we established of ideal indicator performance, namely, that observed depth y should be a monotonically nondecreasing function of indicator value x. Kendall's and Stuart's measures have the additional drawback that their numeric values lack an intrinsic meaning. [56] As desired, Goodman and Kruskal's gamma ignores ties in y, but it inappropriately ignores indicator-only ties as well. These ties, in which the same indicator value x occurs for two different values of observed depth y, degrade the predictive power of x and should be penalized. Wilson's e is not appropriate, because it penalizes not only indicator-only ties but also depth-only ties. These latter ties occur because of the experimentally coarse observed depth scale and should not count against indicator performance.
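The differing treatments of ties are easy to verify from pair counts. In the hypothetical sample below there are concordances, indicator-only ties, and depth-only ties, but no discordances; gamma misleadingly reports perfect association, Wilson's e penalizes the depth-only ties, and only Kim's d_y·x (and hence P_K) penalizes exactly the indicator-only ties:

```python
def pair_counts(x, y):
    """Classify every pair of cases as concordant, discordant,
    indicator-only tied, or depth-only tied."""
    c = d = tx = ty = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if y[i] == y[j]:
                if x[i] != x[j]:
                    ty += 1                  # depth-only tie
            elif x[i] == x[j]:
                tx += 1                      # indicator-only tie
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                c += 1
            else:
                d += 1
    return c, d, tx, ty

# hypothetical sample with both kinds of ties and no discordances
x = [1, 1, 2, 3, 3]
y = [1, 2, 2, 2, 3]
c, d, tx, ty = pair_counts(x, y)

kim_dyx    = (c - d) / (c + d + tx)        # penalizes indicator-only ties
somers_dyx = (c - d) / (c + d + ty)        # penalizes depth-only ties instead
gamma      = (c - d) / (c + d)             # ignores ties entirely
wilson_e   = (c - d) / (c + d + tx + ty)   # penalizes both kinds of tie
p_k        = (kim_dyx + 1) / 2
```

For these five cases the counts are c = 5, d = 0, tx = 2, ty = 3, giving gamma = 1, Wilson's e = 0.5, Somers' d_yx = 0.625, Kim's d_y·x = 5/7, and P_K = 6/7, so the choice of measure matters even on tiny samples.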

### P sub K Versus Parametric Performance Measures that Require Model Fitting

An alternative to assuming ordinality and using a nonparametric performance measure such as P_K is to attempt to transform the data so that they fit the requirements of a parametric performance measure. An example of such a transformation is the use of the logarithm of anesthetic concentration to improve the fit of the logistic or probit parametric model of the quantal response curve for dichotomous y. [17,18,21,57,58] (The sigmoidal quantal response equations used by Ausems et al. [17] and by Vuyk et al. [58] are equivalent to fitting logistic curves to the logarithm of concentration. [29])

Ultimately, the clinical user of an anesthetic depth indicator needs to know how to interpret its output in the context of anesthesia administration. Model fitting may be a necessary step in achieving this level of understanding. However, when seeking promising indicators, there are drawbacks to the use of a model-dependent performance measure. The difficulty is that such a measure reflects not only the indicator's inherent potential but also how well the model fits. The process of model fitting is iterative, with no guarantee of finding the best data transformation, [59–61] and there are multiple, at times contradictory, measures of goodness of fit. [30,31,50–52]

We now illustrate how P_K compares with a model-based performance measure, specifically, a measure based on logistic regression, using the 130-stimulus data for the ten candidate indicators of Leslie et al. [4] The measure of logistic curve fit we choose is the SPSS "significance" value for "−2 Log Likelihood." [51] This significance value is obtained from a test of the null hypothesis that the likelihood equals 1, that is, that the given logistic curve explains the dichotomous quantal data perfectly. Thus, a larger significance value corresponds to a better fit of the logistic curve to the quantal data. To investigate the effect of distributional shape, we apply both of the performance measures, P_K and logistic significance, before and after logarithmic transformations of the variables.

Figure 3 shows logistic significance values for the ten candidate indicators versus the corresponding sample estimates of P_K. The open symbols in Figure 3 are for the original indicators, and the filled symbols are for the logarithmic transformations of these indicators. The effect of logarithmic transformation was greatest on the percent beta power of the electroencephalographic spectrum (BETA). Before the transformation, logistic significance ranked BETA as only the fifth best of the candidates; afterward, the ranking of BETA by logistic significance improved to second best. In contrast, P_K ranked BETA as second best without the need for a transformation. Because P_K is a nonparametric measure, its values were unaffected by the transformation, that is, by the shape of the data distribution.
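This invariance is simple to verify: because P_K depends only on rankings, any strictly increasing transformation, such as the logarithm, leaves it unchanged, whereas a parametric quantity such as Pearson's r shifts. A sketch with hypothetical skewed data:

```python
import math

def pk(x, y):
    """Prediction probability P_K by direct pair counting."""
    conc = disc = tie_x = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if y[i] == y[j]:
                continue
            if x[i] == x[j]:
                tie_x += 1
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

def pearson_r(a, b):
    """Pearson product-moment correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    da = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    db = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return num / (da * db)

# hypothetical positively skewed indicator (y: 1 = response, 2 = no response)
x = [1.0, 10.0, 3.0, 100.0, 20.0, 500.0, 40.0, 900.0]
y = [1, 1, 1, 1, 2, 2, 2, 2]
x_log = [math.log(v) for v in x]

pk_raw, pk_log = pk(x, y), pk(x_log, y)               # identical
r_raw, r_log = pearson_r(x, y), pearson_r(x_log, y)   # not identical
```

Here P_K is 0.875 both before and after the logarithm, while the Pearson coefficient changes noticeably with the transformation.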

When using a model-based performance measure, it may not be clear whether the data need to be transformed and, if so, what transformation is best. The presence of positive skew in histograms of BETA for "move" and "no move" suggested that a logarithmic transformation might improve the logistic fit. [59] Figure 4 shows this skew (Figure 4a) and its reduction after the transformation (Figure 4b). Some other transformation, however, might have resulted in an even better logistic fit. Histograms, as well as previous propofol studies, [18,58] suggested that logarithmic transformations also would improve the logistic curve fits for blood and model-estimated effect-site propofol concentrations. The increases in logistic significance values for these variables in Figure 3 confirm these expectations.

For some other indicators, however, the histograms were misleading. A logarithmic transformation of electroencephalographic percent delta power seemed to increase, rather than reduce, distributional skew, yet Figure 3 shows that the transformation slightly improved the fit of the logistic curve. Also, although the skew for median frequency F50 resembled that shown in Figure 4a for BETA, Figure 3 shows that the logarithmic transformation of the data worsened the fit of the logistic curve. Thus, for the logistic performance measure, an approach more sophisticated than simply viewing histograms is required in the search for appropriate data transformations.

We believe that P_K shows the potential performance of an indicator of anesthetic depth, whereas this potential may be revealed by a model-based measure only after a suitable transformation of the indicator has been found. The results in Figure 3 support our view. There is a strong correlation between sample values of logistic significance and P_K after the indicators were transformed to improve the logistic curve fit. Specifically, using the higher of each pair of significance values shown, the sample correlation coefficient between logistic significance and P_K in Figure 3 is 0.971.

In summary, prediction probability P_K is a performance measure particularly suited to evaluating and comparing anesthetic depth indicators. This measure is the probability that an indicator can correctly predict the rank order of an arbitrary pair of distinct observed anesthetic depths. The measure has the value 0.5 when the indicator has no useful predictive power and the value 1 when the indicator predicts perfectly. Sample prediction probability P_K and its estimated SE can be computed for any degree of coarseness or fineness of the anesthetic depth indicator and observed anesthetic depth scales, ranging from dichotomous to continuous. Confidence intervals can be determined, and grouped-data and paired-data statistical comparisons of anesthetic depth indicator performance can be made. Prediction probability P_K is convenient for making comparisons because it does not depend on linear or nonlinear scaling of x, or on the choice of some x threshold, and because its expected value is asymptotically independent of the number of experimental data points. Data for comparisons, however, must be gathered using the same stimulus procedure and for the same distribution of anesthetic depths.

The authors thank Kate Leslie, M.B.B.S., F.A.N.Z.C.A., Daniel I. Sessler, M.D., and the ANESTHESIOLOGY reviewers for many helpful suggestions; Mehernoor F. Watcha, M.D., for providing information on biserial correlation; and Paul Manberg, Ph.D., for discussions on logistic curve analysis.

## Appendix 1. Probabilistic Interpretation of P sub K

Consider two cases drawn randomly, but with distinct values of observed anesthetic depth y. Assume that we use anesthetic depth indicator value x for the two cases to predict which case is deeper according to the following rule: (A) if the two indicator values are distinct, predict that anesthesia is deeper for the case with the greater indicator value; (B) if the indicator values are equal, randomly guess which case is deeper. Then P_K is the probability of correctly predicting the deeper case from the indicator values.

We now prove this interpretation of P_K as a prediction probability. The probability of a correct prediction of the deeper case by part A of the prediction rule equals the probability that the two cases are concordant. This probability, conditioned on the assumption of distinct observed depths, is Equation 4. Part B of the prediction rule applies if the indicator values are equal. The probability of tied indicator values, again conditioned on distinct observed depths, is Equation 5.

According to the prediction rule, if the indicator values for the two cases are equal, then we randomly guess which is the deeper case. Because there is a 50:50 chance that the guess is correct, the probability of a correct prediction by part B of the prediction rule is one half of Equation 5. The overall probability of a correct prediction of the deeper case, conditioned on distinct observed depths, is Equation 4 plus one half of Equation 5, which is the right-hand side of Equation 3, the definition of P_K.
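The proof can be checked by simulation: drawing random pairs of cases with distinct observed depths and applying parts A and B of the prediction rule reproduces the value of P_K computed by pair counting. A small Monte Carlo sketch with hypothetical data:

```python
import random

def pk(x, y):
    """Prediction probability P_K by direct pair counting."""
    conc = disc = tie_x = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if y[i] == y[j]:
                continue
            if x[i] == x[j]:
                tie_x += 1
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                conc += 1
            else:
                disc += 1
    return (conc + 0.5 * tie_x) / (conc + disc + tie_x)

# hypothetical data with one indicator-only tie
x = [1, 2, 2, 3]
y = [1, 1, 2, 2]
target = pk(x, y)      # (3 concordances + 0.5 * 1 tie) / 4 pairs = 0.875

random.seed(12345)
trials, correct, done = 100_000, 0, 0
while done < trials:
    i, j = random.randrange(len(x)), random.randrange(len(x))
    if y[i] == y[j]:
        continue                                  # rule needs distinct depths
    done += 1
    if x[i] == x[j]:
        correct += random.random() < 0.5          # part B: random guess
    elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
        correct += 1                              # part A: correct prediction
freq = correct / trials
```

The empirical frequency of correct predictions converges on the pair-counting value of P_K, here 0.875.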

## Appendix 2. The Relationship between Kim's d sub y·x and Somers' d sub yx

Kim's d_y·x and Somers' d_yx are both "asymmetric" measures of association, in that they are intended to measure prediction accuracy in just one direction, specifically, how well x predicts y. The order of the x and y subscripts for these measures denotes this particular direction of prediction.

The mathematical expression for Somers' d_yx has the same form as that in Equation 1 for Kim's d_y·x, except for one critical difference. To create the expression for Somers' d_yx, the probability P_tx of an indicator-only (x-only) tie in the denominator of Equation 1 for Kim's d_y·x is replaced by the probability P_ty of a depth-only (y-only) tie. That is, both Kim's d_y·x and Somers' d_yx reward concordances and penalize discordances. However, whereas Kim's d_y·x penalizes indicator-only ties and ignores ties in y, Somers' d_yx penalizes depth-only ties and ignores ties in x. Because of this critical difference, Somers' d_yx is not an appropriate predictive measure for our application.

Kim's d_y·x, however, is structurally identical to Somers' d_xy (note the reversal of subscript order). Thus, using Equation 2, we can apply theory developed for Somers' d_xy to P_K. Also, we can use commercial statistical programs that compute Somers' measure, such as the BMDP program 4F [50] (frequency tables), the SPSS command CROSSTABS, [51] and the SAS procedure FREQ, [52] to compute Kim's d_y·x and, therefore, P_K.

These programs print out estimates of Somers' measure for both directions of prediction, that is, d_yx for x predicting y and d_xy for y predicting x. To show the direction of prediction, the programs label the predicting variable as "independent" and the predicted variable as "dependent." We want to use indicator value x (the independent variable) to predict observed depth y (the dependent variable). To obtain a value for Kim's d_y·x, we must select Somers' d_xy (note the reversed subscripts) from the computer output, as though our goal were to use observed depth y to predict indicator value x, that is, as though y were the independent variable and x the dependent variable.

*Dutton RC, Smith WD, Smith NT: Does the EEG predict anesthetic depth better than cardiovascular variables?(abstract). ANESTHESIOLOGY 1990; 73:A532.

**Dutton RC, Smith WD, Smith NT: EEG prediction of arousal during anesthesia with combinations of isoflurane, fentanyl, and N2O (abstract). ANESTHESIOLOGY 1991; 75:A448.

***Watt RC, Samuelson H, Navabi MJ: A comparison of artificial neural networks and classical statistical analysis. ANESTHESIOLOGY 1991; 75:A451.

****The command macro, PKMACRO, written in Microsoft Excel 4.0 (Microsoft, Redmond, WA) for the Macintosh computer (Apple, Cupertino, CA), is available from the authors.