Statistical tests for the differences between standardized accuracies (staccuracies)

Because the distribution of staccuracies is uncertain (and indeed, different staccuracies likely have different distributions), bootstrapping is used to estimate the distributions empirically and to calculate the p-values. See the Value section for details on what the function returns.

Usage

sa_diff(
  actual,
  preds,
  ...,
  na.rm = FALSE,
  sa = NULL,
  pct = c(0.01, 0.02, 0.03, 0.04, 0.05),
  boot_alpha = 0.05,
  boot_it = 1000,
  seed = 0
)

Arguments

actual: numeric vector. The actual (true) labels.
preds: named list of at least two numeric vectors. Each element is a vector of the same length as actual with predictions for each row corresponding to each element of actual. The names of the list elements should be the names of the models that produced each respective prediction; these names will be used to distinguish the results.
...: not used. Forces explicit naming of subsequent arguments.
na.rm: See documentation for staccuracy()
sa: list of functions. Each element is the unquoted name of a valid staccuracy function (see staccuracy() for the required function signature). If an element is named, the name will be displayed as the value of the sa column of the result. Otherwise, the function name will be displayed. If NULL (default), staccuracy functions will be automatically selected based on the data types of actual and preds.
pct: numeric vector with values in the open interval (0, 1). The threshold values (e.g., 0.01 for 1%) against which the difference in staccuracies will be tested.
boot_alpha: numeric(1) from 0 to 1. Alpha for the percentile-based confidence interval of the bootstrapped means; the confidence interval bounds will be the boot_alpha / 2 and 1 - boot_alpha / 2 percentiles. For example, if boot_alpha = 0.05 (default), the bounds will be at the 2.5 and 97.5 percentiles.
boot_it: positive integer(1). The number of bootstrap iterations.
seed: integer(1). Random seed for the bootstrap sampling. Supply the same value across runs to ensure identical results.
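
As a rough illustration of how boot_alpha maps to percentile bounds, the following base R sketch computes lo, mean, and hi from a vector of simulated values. The vector boot_values is a hypothetical stand-in for bootstrapped staccuracies; it is not produced by sa_diff(), and the function's internal computation may differ.

```r
# Hypothetical stand-in for a vector of bootstrapped staccuracies
# (assumption: simulated here, not taken from sa_diff())
set.seed(0)
boot_values <- rnorm(1000, mean = 0.7, sd = 0.03)

boot_alpha <- 0.05

# Percentile positions implied by boot_alpha: 0.025 and 0.975
probs <- c(lo = boot_alpha / 2, hi = 1 - boot_alpha / 2)

# Percentile-based bounds plus the bootstrap mean
bounds <- quantile(boot_values, probs = probs, names = FALSE)
ci <- c(lo = bounds[1], mean = mean(boot_values), hi = bounds[2])
ci
```

A smaller boot_alpha widens the interval; for instance, boot_alpha = 0.01 would place the bounds at the 0.5 and 99.5 percentiles.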

Value

tibble with staccuracy difference results:

staccuracy: name of staccuracy measure
pred: Each named element (model name) in the input preds. The row values give the staccuracy for that prediction. When pred is NA, the row represents the difference between prediction staccuracies (diff) instead of staccuracies themselves.
diff: When diff takes the form 'model1-model2', then the row values give the difference in staccuracies between two named elements (model names) in the input preds. When diff is NA, the row instead represents the staccuracy of a specific model prediction (pred).
lo, mean, hi: The lower bound, mean, and upper bound of the bootstrapped staccuracy. The lower and upper bounds delimit the confidence interval specified by the input boot_alpha.
p__: p-values for the hypothesis that the difference in staccuracies is at least the specified amount. E.g., for the default input pct = c(0.01, 0.02, 0.03, 0.04, 0.05), these columns would be p01, p02, p03, p04, and p05. As they apply only to differences between staccuracies, they are provided only for diff rows and are NA for pred rows. As an example of their meaning, if the mean difference for 'model1-model2' is 0.0832 with p01 of 0.012 and p02 of 0.035, then 1.2% of bootstrapped model1 - model2 differences were less than 0.01 and 3.5% were less than 0.02. (That is, 98.8% of differences were at least 0.01 and 96.5% were at least 0.02.)
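
The interpretation of the p__ columns can be sketched in base R as the share of bootstrapped differences that fall below each threshold in pct. Everything in this sketch is an assumption made for illustration: the simulated data, the names pred_1 and pred_2, and the stand-in score() function, which is not necessarily a staccuracy formula. The exact counting rule used by sa_diff() may also differ.

```r
set.seed(0)
n <- 200
actual <- rnorm(n)
pred_1 <- actual + rnorm(n, sd = 0.5)  # hypothetical, more accurate model
pred_2 <- actual + rnorm(n, sd = 1.0)  # hypothetical, less accurate model

# Stand-in accuracy score (assumption): 1 - MAE / MAD
score <- function(a, p) 1 - mean(abs(a - p)) / mean(abs(a - mean(a)))

# Resample rows with replacement and record the score difference each time
boot_it <- 500
boot_diff <- replicate(boot_it, {
  i <- sample.int(n, replace = TRUE)
  score(actual[i], pred_1[i]) - score(actual[i], pred_2[i])
})

# Share of bootstrapped differences below each threshold
pct <- c(0.01, 0.02, 0.03, 0.04, 0.05)
p_values <- vapply(pct, function(x) mean(boot_diff < x), numeric(1))
names(p_values) <- sprintf("p%02d", as.integer(round(pct * 100)))
p_values
```

Because the thresholds increase, the resulting values can never decrease from p01 to p05, which matches the pattern in the diff rows of the example output.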

Examples

lm_attitude_all <- lm(rating ~ ., data = attitude)
lm_attitude__a <- lm(rating ~ . - advance, data = attitude)
lm_attitude__c <- lm(rating ~ . - complaints, data = attitude)

sdf <- sa_diff(
  attitude$rating,
  list(
    all = predict(lm_attitude_all),
    madv = predict(lm_attitude__a),
    mcmp = predict(lm_attitude__c)
  ),
  boot_it = 10
)
sdf
#> # A tibble: 12 × 11
#>    staccuracy    pred  diff            lo    mean     hi     p01     p02     p03
#>    <chr>         <chr> <chr>        <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
#>  1 WinMAE on MAD all   NA         0.672   0.719   0.776  NA      NA      NA     
#>  2 WinMAE on MAD madv  NA         0.640   0.705   0.767  NA      NA      NA     
#>  3 WinMAE on MAD mcmp  NA         0.586   0.635   0.692  NA      NA      NA     
#>  4 WinMAE on MAD NA    all-madv  -0.00660 0.0139  0.0369  0.455   0.727   0.818 
#>  5 WinMAE on MAD NA    all-mcmp   0.0440  0.0840  0.133   0.0909  0.0909  0.0909
#>  6 WinMAE on MAD NA    madv-mcmp  0.0291  0.0702  0.122   0.0909  0.0909  0.182 
#>  7 WinRMSE on SD all   NA         0.684   0.737   0.781  NA      NA      NA     
#>  8 WinRMSE on SD madv  NA         0.670   0.732   0.782  NA      NA      NA     
#>  9 WinRMSE on SD mcmp  NA         0.616   0.670   0.723  NA      NA      NA     
#> 10 WinRMSE on SD NA    all-madv  -0.00781 0.00529 0.0272  0.636   0.909   0.909 
#> 11 WinRMSE on SD NA    all-mcmp   0.0335  0.0666  0.107   0.0909  0.0909  0.182 
#> 12 WinRMSE on SD NA    madv-mcmp  0.0273  0.0613  0.108   0.0909  0.0909  0.182 
#> # ℹ 2 more variables: p04 <dbl>, p05 <dbl>