Area under the ROC curve — aucroc • staccuracy

Returns the area under the ROC curve, computed by comparing the predicted scores with the actual binary values. Tied predictions are handled by calculating both the optimistic AUC (positive cases sorted first, resulting in a higher AUC) and the pessimistic AUC (positive cases sorted last, resulting in a lower AUC) and then returning the average of the two. For the ROC, a "tie" means at least one pair of predictions in pred that have identical values but different corresponding values of actual. (Identical predictions whose values of actual are the same are unproblematic and are not considered "ties".)
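The tie definition above can be checked directly in base R. This is a minimal sketch rather than staccuracy's own code: has_roc_ties is a hypothetical helper name, and it assumes actual is logical with no missing values.

```r
# Hypothetical helper, not part of staccuracy.
# A "tie" exists when some prediction value is shared by both classes.
has_roc_ties <- function(actual, pred) {
  # For each distinct prediction value, count the distinct actual values
  n_classes <- tapply(actual, pred, function(v) length(unique(v)))
  any(n_classes > 1)
}

# Two predictions of 0.5 with different actual values: a tie
has_roc_ties(c(TRUE, FALSE, TRUE), c(0.5, 0.5, 0.9))   # TRUE

# Two predictions of 0.5 with the same actual value: not a tie
has_roc_ties(c(TRUE, TRUE, FALSE), c(0.5, 0.5, 0.1))   # FALSE
```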

Usage

aucroc(
  actual,
  pred,
  na.rm = FALSE,
  binary_true_value = NULL,
  sample_size = 10000,
  seed = 0
)

Arguments

actual: any atomic vector. Actual label values from a dataset. They must be binary; that is, there must be exactly two distinct values (other than missing values, which are allowed). The "true" or "positive" class is determined by coercing actual to logical TRUE and FALSE following the rules of as.logical(). If this is not the intended meaning of "positive", then specify which of the two values should be considered TRUE with the argument binary_true_value.
pred: numeric vector. Predictions corresponding to each respective element in actual. Any numeric value (not only probabilities) is permissible.
na.rm: logical(1). TRUE if missing values should be removed; FALSE if they should be retained. If TRUE, then whenever an element of either actual or pred is missing, its paired element is also removed.
binary_true_value: any single atomic value. The value of actual that is considered TRUE; any other value of actual is considered FALSE. For example, if 2 means TRUE and 1 means FALSE, then set binary_true_value = 2.
sample_size: single positive integer. To keep the computation relatively rapid, when actual and pred have more than sample_size elements, a random sample of sample_size paired elements of actual and pred is selected, and the ROC and AUC are calculated on this sample. To disable random sampling for long inputs, set sample_size = NA.
seed: numeric(1). Random seed used only if length(actual) > sample_size.
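To make the interaction of na.rm, binary_true_value, sample_size, and seed concrete, the following base-R sketch mimics the preprocessing described above. It is an illustration under stated assumptions, not staccuracy's implementation: prep_inputs is a hypothetical helper, and the way the seed is applied (a call to set.seed() before sampling) is assumed.

```r
# Hypothetical helper illustrating the documented preprocessing; not staccuracy code.
prep_inputs <- function(actual, pred, na.rm = FALSE, binary_true_value = NULL,
                        sample_size = 10000, seed = 0) {
  if (na.rm) {
    # Drop a pair whenever either member is missing
    keep <- !is.na(actual) & !is.na(pred)
    actual <- actual[keep]
    pred <- pred[keep]
  }

  # Map the two labels to logical TRUE/FALSE
  actual <- if (is.null(binary_true_value)) {
    as.logical(actual)
  } else {
    actual == binary_true_value
  }

  # Subsample long inputs; sample_size = NA turns sampling off
  if (!is.na(sample_size) && length(actual) > sample_size) {
    set.seed(seed)  # assumption: note that this resets the global RNG state
    idx <- sample.int(length(actual), sample_size)
    actual <- actual[idx]
    pred <- pred[idx]
  }

  list(actual = actual, pred = pred)
}

# Labels coded as 2 (positive) and 1 (negative), with one missing value in each vector
x <- prep_inputs(c(2, 1, NA, 2), c(0.8, 0.3, 0.6, NA),
                 na.rm = TRUE, binary_true_value = 2)
x$actual  # TRUE FALSE
x$pred    # 0.8 0.3
```

Sampling indices rather than values keeps each element of actual paired with its prediction.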

Value

List with the following elements:

roc_opt: tibble with optimistic ROC data. "Optimistic" means that when predictions are tied, the TRUE/positive actual values are ordered before the FALSE/negative ones.
roc_pess: tibble with pessimistic ROC data. "Pessimistic" means that when predictions are tied, the FALSE/negative actual values are ordered before the TRUE/positive ones. Note that this difference is not merely in the sort order: when there are ties, the way that true positives, true negatives, etc. are counted is different for optimistic and pessimistic approaches. If there are no tied predictions, then roc_opt and roc_pess are identical.
auc_opt: area under the ROC curve for optimistic ROC.
auc_pess: area under the ROC curve for pessimistic ROC.
auc: mean of auc_opt and auc_pess. If there are no tied predictions, then auc_opt, auc_pess, and auc are identical.
ties: TRUE if there are two or more tied predictions; FALSE if there are no ties.
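The relationship between auc_opt, auc_pess, and auc can be illustrated with a small base-R sketch. It shows one straightforward reading of "positives ordered first" versus "negatives ordered first" among tied predictions. The helper auc_sorted is hypothetical, and staccuracy's internal counting may differ in detail. The sketch assumes actual is logical with no missing values and contains both classes.

```r
# Hypothetical helper, not part of staccuracy: step-wise AUC for one tie-breaking order.
auc_sorted <- function(actual, pred, positives_first = TRUE) {
  # Sort by descending prediction; within ties, FALSE keys sort before TRUE keys
  tie_key <- if (positives_first) !actual else actual
  ord <- order(-pred, tie_key)
  a <- actual[ord]

  # True positive rate after each observation in the sorted order
  tpr <- cumsum(a) / sum(a)

  # Each negative adds a horizontal step of width 1 / n_neg at the current TPR
  sum(tpr[!a]) / sum(!a)
}

actual <- c(TRUE, FALSE, TRUE, FALSE)
pred   <- c(0.9, 0.5, 0.5, 0.1)   # the two 0.5 predictions are tied

auc_opt  <- auc_sorted(actual, pred, positives_first = TRUE)   # 1
auc_pess <- auc_sorted(actual, pred, positives_first = FALSE)  # 0.75
mean(c(auc_opt, auc_pess))                                     # 0.875
```

When no predictions are tied, the tie-breaking key never matters, so both orderings give the same value.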

Examples

set.seed(0)
# Generate some simulated "actual" data
a <- sample(c(TRUE, FALSE), 50, replace = TRUE)

# Generate some simulated predictions
p <- runif(50) |> round(2)
p[c(7, 8, 22, 35, 40, 41)] <- 0.5

# Calculate AUCROC with its components
ar <- aucroc(a, p)
ar$auc
#> [1] 0.46875
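
The other elements of the returned list can be inspected in the same way. The sketch below regenerates the same simulated data and uses only the element names listed under Value; outputs are omitted because they have not been verified here, and the call is guarded so that the snippet also runs where staccuracy is not installed.

```r
set.seed(0)
a <- sample(c(TRUE, FALSE), 50, replace = TRUE)
p <- round(runif(50), 2)
p[c(7, 8, 22, 35, 40, 41)] <- 0.5

# Run only if staccuracy is available
if (requireNamespace("staccuracy", quietly = TRUE)) {
  ar <- staccuracy::aucroc(a, p)

  # Were any identical predictions paired with different actual values?
  print(ar$ties)

  # Optimistic, pessimistic, and averaged AUC side by side
  print(c(opt = ar$auc_opt, pess = ar$auc_pess, mean = ar$auc))

  # First rows of the optimistic ROC data
  print(head(ar$roc_opt))
}
```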