In this package there are currently two functions that provide robust
alternatives to the t_TOST function.
The Wilcoxon group of tests (which includes the Mann-Whitney U-test)
provides a non-parametric test of differences between groups, or within
samples, based on ranks. This is a test of location shift, that is, a
difference in the center of the distribution (in parametric tests, the
location is the mean). With TOST, there are two separate tests of
directional location shift to determine whether the location shift is
within (equivalence) or outside (minimal effect) the specified bounds.
The exact calculations can be explored via the documentation of the
wilcox.test function.
In the TOSTER package, we accomplish this with the wilcox_TOST
function. Overall, this function operates very similarly to the t_TOST
function. However, the standardized mean difference (SMD) is not
calculated. Instead, the rank-biserial correlation is calculated for all
types of comparisons (e.g., two sample, one sample, and paired samples).
Also, there is no plotting capability at this time for the output of
this function.
As an example, we can use the sleep data to make a non-parametric comparison of equivalence.
data('sleep')
library(TOSTER)
test1 = wilcox_TOST(formula = extra ~ group,
                    data = sleep,
                    paired = FALSE,
                    eqb = .5)
print(test1)
##
## Wilcoxon rank sum test with continuity correction
##
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## Test Statistic p.value
## NHST 25.5 0.069
## TOST Lower 34.0 0.894
## TOST Upper 20.0 0.013
##
## Effect Sizes
## Estimate C.I. Conf. Level
## Median of Differences -1.346 [-3.4, -0.1] 0.9
## Rank-Biserial Correlation -0.490 [-0.7493, -0.1005] 0.9
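The two one-sided tests in the output above can be approximated with base R alone. The following is a minimal sketch, not the package's internal code: it assumes each TOST component is a one-sided wilcox.test with the null location shift (mu) set to one of the equivalence bounds, and it requests the normal approximation (exact = FALSE), which is an assumption made here so that the p-values are computed with a continuity correction.

```r
# Split R's built-in sleep data into the two independent groups
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]
eqb <- 0.5  # symmetric equivalence bound, as in the example above

# One-sided test against the lower bound: is the shift greater than -eqb?
# exact = FALSE (an assumption here) requests the normal approximation
lower <- wilcox.test(x, y, mu = -eqb, alternative = "greater", exact = FALSE)

# One-sided test against the upper bound: is the shift less than +eqb?
upper <- wilcox.test(x, y, mu = eqb, alternative = "less", exact = FALSE)

# Equivalence is only concluded when BOTH tests reject, so the overall
# TOST p-value is the larger of the two one-sided p-values
p_tost <- max(lower$p.value, upper$p.value)
c(lower = lower$p.value, upper = upper$p.value, tost = p_tost)
```

Under these assumptions the two statistics are W = 34 and W = 20, and the larger p-value (the lower-bound test) is the one that prevents a conclusion of equivalence.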
The standardized effect size reported for the wilcox_TOST procedure is
the rank-biserial correlation. This is a fairly intuitive measure of
effect size which has the same interpretation as the common language
effect size (Kerby 2014). However, instead of assuming normality and
equal variances, the rank-biserial correlation is calculated from the
number of favorable (positive) and unfavorable (negative) pairs based on
their respective ranks.
For the two sample case, the correlation is calculated as the proportion of favorable pairs minus the proportion of unfavorable pairs.
\[ r_{biserial} = f_{pairs} - u_{pairs} \]
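As a rough illustration of this definition, the pair proportions can be counted directly in base R. This sketch assumes a "favorable" pair is one where the observation from the first group exceeds the observation from the second group, and that tied pairs count toward neither proportion.

```r
# The two independent groups from R's built-in sleep data
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]

# All pairwise differences between an observation in x and one in y
d <- outer(x, y, "-")

f_pairs <- mean(d > 0)  # proportion of favorable pairs (x > y)
u_pairs <- mean(d < 0)  # proportion of unfavorable pairs (x < y)
# Tied pairs (d == 0) count toward neither proportion

r_biserial <- f_pairs - u_pairs
r_biserial
```

For the sleep data, 24 of the 100 pairs are favorable and 73 are unfavorable, which gives -0.49, the estimate printed in the output above.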
For the one sample or paired samples cases, the correlation is calculated with ties (values equal to zero) not being dropped. This provides a conservative estimate of the rank-biserial correlation.
It is calculated in the following steps, wherein \(z\) represents the values or the differences between paired observations:
1. Calculate the signed ranks:
\[ r_i = -1 \cdot sign(z_i) \cdot rank(|z_i|) \]
2. Calculate the positive and negative sums:
\[ R_{+} = \sum_{1\le i \le n, \space z_i > 0}r_i \]
\[ R_{-} = \sum_{1\le i \le n, \space z_i < 0}r_i \]
3. Determine the smaller of the two rank sums:
\[ T = min(R_{+}, \space R_{-}) \]
\[ S = \begin{cases} -4 & R_{+} \ge R_{-} \\ 4 & R_{+} < R_{-} \end{cases} \]
4. Calculate rank-biserial correlation:
\[ r_{biserial} = S \cdot | \frac{\frac{T - \frac{(R_{+} + R_{-})}{2}}{n}}{n + 1} | \]
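The steps above can be sketched in base R on a small made-up vector of differences. This is only one possible reading, not the package's implementation: it assumes that the positive and negative sums are taken over the positive and negative signed ranks themselves, and that both sums are used as magnitudes. The data values are hypothetical.

```r
# Hypothetical paired differences (illustrative only); the zero is kept
z <- c(1.5, -0.5, 2.0, 3.0, 0)
n <- length(z)

# Step 1: signed ranks, with the sign flipped as in the formula above
r <- -1 * sign(z) * rank(abs(z))

# Step 2 (assumed reading): sums of the positive and of the negative
# signed ranks, both taken as magnitudes
R_plus  <- sum(r[r > 0])
R_minus <- sum(abs(r[r < 0]))

# Step 3: smaller rank sum, and the constant that sets sign and scale
T_min <- min(R_plus, R_minus)
S <- if (R_plus >= R_minus) -4 else 4

# Step 4: rank-biserial correlation
r_biserial <- S * abs((T_min - (R_plus + R_minus) / 2) / n / (n + 1))
r_biserial
```

With these values the rank sums are 2 and 12, so the result is 2/3. Because the zero difference keeps its rank but contributes to neither sum, the estimate cannot reach 1 in magnitude, which is the sense in which retaining zeros is conservative.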
The Fisher approximation is used to calculate the confidence intervals.
For paired samples, or one sample, the standard error is calculated as the following:
\[ SE_r = \sqrt{ \frac {(2 \cdot nd^3 + 3 \cdot nd^2 + nd) / 6} {(nd^2 + nd) / 2} } \]
wherein, nd represents the total number of observations (or pairs).
For independent samples, the standard error is calculated as the following:
\[ SE_r = \sqrt{\frac {(n1 + n2 + 1)} { (3 \cdot n1 \cdot n2)}} \]
The confidence intervals can then be calculated by transforming the estimate.
\[ r_z = atanh(r_{biserial}) \]
Then the confidence interval can be calculated and back transformed.
\[ r_{CI} = tanh(r_z \pm Z_{(1 - \alpha / 2)} \cdot SE_r) \]
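As a worked example of the independent-samples case, the interval for the sleep data can be reproduced by hand. The sketch below assumes the estimate of -0.49 and the group sizes of 10 from the example above, and a 90% confidence level as shown in the printed output.

```r
r_biserial <- -0.49   # two-sample estimate for the sleep data
n1 <- 10              # observations in group 1
n2 <- 10              # observations in group 2
conf_level <- 0.90    # level shown in the printed output

# Standard error for independent samples
se_r <- sqrt((n1 + n2 + 1) / (3 * n1 * n2))

# Fisher transformation of the estimate
r_z <- atanh(r_biserial)

# Interval on the transformed scale, then back-transformed with tanh
z_crit <- qnorm(1 - (1 - conf_level) / 2)
ci <- tanh(r_z + c(-1, 1) * z_crit * se_r)
round(ci, 4)
```

The result is approximately [-0.749, -0.101], in line with the interval reported by wilcox_TOST above.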
Two other effect sizes can be calculated for non-parametric tests. First, there is the concordance probability, which is also known as the c-statistic, c-index, or probability of superiority [1]. The c-statistic is converted from the correlation using the following formula:
\[ c = \frac{(r_{biserial} + 1)}{2} \]
The Wilcoxon-Mann-Whitney odds (O’Brien and Castelloe 2006), also known as the “Generalized Odds Ratio” (Agresti 1980), is calculated by converting the c-statistic using the following formula:
\[ WMW_{odds} = e^{logit(c)} \]
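Both conversions are one-liners in base R. The sketch below starts from the two-sample rank-biserial estimate of -0.49 for the sleep data and uses qlogis(), which is base R's logit function.

```r
r_biserial <- -0.49  # two-sample estimate for the sleep data

# Concordance probability (c-statistic)
c_stat <- (r_biserial + 1) / 2

# Wilcoxon-Mann-Whitney odds; exp(logit(c)) is the same as c / (1 - c)
wmw_odds <- exp(qlogis(c_stat))

c(concordance = c_stat, wmw_odds = wmw_odds)
```

This gives a concordance of 0.255 and odds of about 0.342.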
Any of these effect sizes is available by simply modifying the ses
argument for the wilcox_TOST function.
# Rank biserial
wilcox_TOST(formula = extra ~ group,
data = sleep,
paired = FALSE,
ses = "r",
eqb = .5)
##
## Wilcoxon rank sum test with continuity correction
##
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## Test Statistic p.value
## NHST 25.5 0.069
## TOST Lower 34.0 0.894
## TOST Upper 20.0 0.013
##
## Effect Sizes
## Estimate C.I. Conf. Level
## Median of Differences -1.346 [-3.4, -0.1] 0.9
## Rank-Biserial Correlation -0.490 [-0.7493, -0.1005] 0.9
# Odds
wilcox_TOST(formula = extra ~ group,
data = sleep,
paired = FALSE,
ses = "o",
eqb = .5)
##
## Wilcoxon rank sum test with continuity correction
##
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## Test Statistic p.value
## NHST 25.5 0.069
## TOST Lower 34.0 0.894
## TOST Upper 20.0 0.013
##
## Effect Sizes
## Estimate C.I. Conf. Level
## Median of Differences -1.3464 [-3.4, -0.1] 0.9
## WMW Odds 0.3423 [0.1433, 0.8173] 0.9
# Concordance
wilcox_TOST(formula = extra ~ group,
data = sleep,
paired = FALSE,
ses = "c",
eqb = .5)
##
## Wilcoxon rank sum test with continuity correction
##
## The equivalence test was non-significant W = 20.000, p = 8.94e-01
## The null hypothesis test was non-significant W = 25.500, p = 6.93e-02
## NHST: don't reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## Test Statistic p.value
## NHST 25.5 0.069
## TOST Lower 34.0 0.894
## TOST Upper 20.0 0.013
##
## Effect Sizes
## Estimate C.I. Conf. Level
## Median of Differences -1.346 [-3.4, -0.1] 0.9
## Concordance 0.255 [0.1254, 0.4497] 0.9
The bootstrap is a simulation-based technique, derived from
re-sampling with replacement, designed for statistical estimation and
inference. Bootstrapping techniques are very useful because they are
considered somewhat robust to violations of the assumptions of a simple
t-test. Therefore, we added a bootstrap option, boot_t_TOST, to the
package to provide another robust alternative to the t_TOST function.
In this function we provide a percentile bootstrap solution outlined
by Efron and Tibshirani (1993) (see chapter 16, page 220). The
bootstrapped p-values are derived from the “studentized” version of a
test of mean differences outlined by Efron and Tibshirani (1993).
Overall, the results should be similar to the results of t_TOST.
However, for paired samples, the Cohen’s d(rm) effect size cannot be
calculated at this time.
1. Form B bootstrap data sets from x* and y*, wherein x* is sampled with replacement from \(\tilde x_1,\tilde x_2, ... \tilde x_n\) and y* is sampled with replacement from \(\tilde y_1,\tilde y_2, ... \tilde y_n\).
2. t is then evaluated on each sample, but the mean of each sample (y or x) and the overall average (z) are subtracted from each:
\[ t(z^{*b}) = \frac {(\bar x^*-\bar x - \bar z) - (\bar y^*-\bar y - \bar z)}{\sqrt {(sd_y^*)^2/n_y + (sd_x^*)^2/n_x}} \]
3. The bootstrap p-value is then the proportion of the B bootstrap statistics that are at least as large as the statistic observed in the sample:
\[ p_{boot} = \frac {\#\{t(z^{*b}) \ge t_{sample}\}}{B} \]
The same process is completed for the one sample case but with the one sample solution for the equation outlined by \(t(z^{*b})\). The paired sample case in this bootstrap procedure is equivalent to the one sample solution because the test is based on the difference scores.
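The two-sample procedure can be sketched in a few lines of base R. This is a simplified illustration in the spirit of the studentized bootstrap described above, not the code used by boot_t_TOST. It assumes that the recentred values are each observation minus its group mean plus the pooled mean, it tests only a zero difference in one direction, and the seed and number of resamples are arbitrary choices.

```r
set.seed(2023)  # arbitrary seed, only so the sketch is reproducible

x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]

# Studentized (Welch-type) statistic for a difference in means
welch_t <- function(a, b) {
  (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))
}
t_sample <- welch_t(x, y)

# Recentre both samples on the pooled mean so the null hypothesis holds
z_bar <- mean(c(x, y))
x_tilde <- x - mean(x) + z_bar
y_tilde <- y - mean(y) + z_bar

# Evaluate the statistic on B resamples drawn with replacement
B <- 999
t_boot <- replicate(B, welch_t(sample(x_tilde, replace = TRUE),
                               sample(y_tilde, replace = TRUE)))

# Proportion of bootstrap statistics at least as large as the observed one
p_boot <- mean(t_boot >= t_sample)
p_boot
```

Because the observed statistic is negative here, this upper-tail proportion is large. A test in the other direction would count the bootstrap statistics that are less than or equal to the observed one.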
Again, we can use the sleep data to see the bootstrapped results. Notice that the plots show how the re-sampling via bootstrapping indicates the instability of Hedges’ g(z).
data('sleep')
test1 = boot_t_TOST(formula = extra ~ group,
                    data = sleep,
                    paired = TRUE,
                    eqb = .5,
                    R = 999)
print(test1)
##
## Bootstrapped Paired t-test
##
## The equivalence test was non-significant, t(9) = -2.777, p = 1e+00
## The null hypothesis test was significant, t(9) = -4.062, p = 0e+00
## NHST: reject null significance hypothesis that the effect is equal to zero
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## t df p.value
## t-test -4.062 9 < 0.001
## TOST Lower -2.777 9 1
## TOST Upper -5.348 9 < 0.001
##
## Effect Sizes
## Estimate SE C.I. Conf. Level
## Raw -1.580 0.3885 [-2.321, -1.059] 0.9
## Hedges's g(z) -1.174 0.6698 [-2.7323, -0.9414] 0.9
## Note: percentile bootstrap method utilized.
plot(test1)
In many bioequivalence studies, the differences between drugs are compared on the log scale (He et al. 2022). The log scale allows researchers to compare the ratio of two means.
\[ log ( \frac{y}{x} ) = log(y) - log(x) \]
The United States Food and Drug Administration (FDA) [2] has stated a rationale for using the log transformed values:
Using logarithmic transformation, the general linear statistical model employed in the analysis of BE data allows inferences about the difference between the two means on the log scale, which can then be retransformed into inferences about the ratio of the two averages (means or medians) on the original scale. Logarithmic transformation thus achieves a general comparison based on the ratio rather than the differences.
In addition, the FDA considers two drugs bioequivalent when the ratio between x and y is less than 1.25 and greater than 0.8 (1/1.25), which is the default equivalence bound for the log functions.
For example, we could compare whether cars with different
transmissions are “equivalent” with regard to gas mileage. We can use
the default equivalence bounds (eqb = 1.25).
log_TOST(
  mpg ~ am,
  data = mtcars
)
##
## Log-transformed Welch Two Sample t-test
##
## The equivalence test was non-significant, t(23.96) = -1.363, p = 9.07e-01
## The null hypothesis test was significant, t(23.96) = -3.826, p = 8.19e-04
## NHST: reject null significance hypothesis that the effect is equal to one
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## t df p.value
## t-test -3.826 23.96 < 0.001
## TOST Lower -1.363 23.96 0.907
## TOST Upper -6.288 23.96 < 0.001
##
## Effect Sizes
## Estimate SE C.I. Conf. Level
## log(Means Ratio) -0.3466 0.09061 [-0.5017, -0.1916] 0.9
## Means Ratio 0.7071 NA [0.6055, 0.8256] 0.9
Note that the function produces t-tests similar to the t_TOST
function, but provides two effect sizes: the means ratio on the log
scale (the scale of the test statistics) and the means ratio. The means
ratio has no standard error because its estimate and confidence interval
are simply the log scale results exponentiated.
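The relationship between the two effect sizes can be checked with base R. The sketch below is not the package's code: it assumes the log-scale estimate is the difference in mean log(mpg) between automatic (am = 0) and manual (am = 1) cars, and it uses a Welch t.test at the 90% level to match the printed output.

```r
log_auto   <- log(mtcars$mpg[mtcars$am == 0])  # automatic transmission
log_manual <- log(mtcars$mpg[mtcars$am == 1])  # manual transmission

# Welch t-test on the log scale with a 90% confidence interval
fit <- t.test(log_auto, log_manual, conf.level = 0.90)
log_ratio <- unname(fit$estimate[1] - fit$estimate[2])
log_ci <- as.numeric(fit$conf.int)

# Exponentiating moves the estimate and interval to the ratio scale
means_ratio <- exp(log_ratio)
ratio_ci <- exp(log_ci)

# Equivalence requires the whole interval to sit inside [1/1.25, 1.25]
within_bounds <- ratio_ci[1] > 1 / 1.25 && ratio_ci[2] < 1.25

round(c(log_ratio = log_ratio, means_ratio = means_ratio), 4)
round(ratio_ci, 4)
within_bounds
```

The lower limit of the ratio interval (about 0.61) falls below 0.8, so the interval is not contained in the bounds, in line with the non-significant equivalence test above.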
However, it has been noted in the statistics literature that t-tests
on the logarithmic scale can be biased, and it is recommended that
bootstrapped tests be utilized instead. Therefore, the boot_log_TOST
function can be utilized to perform a more precise test.
boot_log_TOST(
  mpg ~ am,
  data = mtcars,
  R = 499
)
##
## Bootstrapped Log Welch Two Sample t-test
##
## The equivalence test was non-significant, t(23.96) = -1.363, p = 9.58e-01
## The null hypothesis test was significant, t(23.96) = -3.826, p = 0e+00
## NHST: reject null significance hypothesis that the effect is equal to 1
## TOST: don't reject null equivalence hypothesis
##
## TOST Results
## t df p.value
## t-test -3.826 23.96 < 0.001
## TOST Lower -1.363 23.96 0.958
## TOST Upper -6.288 23.96 < 0.001
##
## Effect Sizes
## Estimate SE C.I. Conf. Level
## log(Means Ratio) -0.3466 0.08823 [-0.4877, -0.1958] 0.9
## Means Ratio 0.7071 0.06298 [0.614, 0.8222] 0.9
## Note: percentile bootstrap method utilized.
It was requested that a function be provided that only calculates an
effect size. Therefore, I created the ses_calc function as a robust
effect size calculation. The interface is almost the same as
wilcox_TOST, but you don’t set an equivalence bound.
ses_calc(formula = extra ~ group,
data = sleep,
paired = TRUE,
ses = "r")
## estimate lower.ci upper.ci conf.level
## Rank-Biserial Correlation 0.9818182 0.928369 0.9954785 0.95
[1] Directly inspired by this blog post from Professor Frank Harrell: https://hbiostat.org/blog/post/wpo/
[2] Food and Drug Administration (2014). Bioavailability and Bioequivalence Studies Submitted in NDAs or INDs — General Considerations. Center for Drug Evaluation and Research. Docket: FDA-2014-D-0204