Area under the ROC curve for regression target outcomes

Area under the ROC curve (AUCROC) is a classification measure. By dichotomizing the range of actual values, reg_aucroc() turns regression evaluation into classification evaluation for any regression model. The predictions are assumed to come from a regression model, but pred accepts any numeric input, so the function does not check what kind of model produced it.
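The dichotomization idea can be illustrated with a minimal base R sketch. This is not the implementation of reg_aucroc() or aucroc(): the helper auc_rank() is hypothetical, it computes AUC with the rank-based (Mann-Whitney) formula, and the choice of `actual > cut` as the positive class is an assumption made only for illustration.

```r
# Hypothetical helper (not part of the package): rank-based AUC,
# i.e. the share of (positive, negative) pairs ranked correctly by score.
auc_rank <- function(is_positive, score) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  r <- rank(score)  # average ranks, so ties count as half
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy regression target and imperfect predictions
actual <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
pred   <- c(2.0, 1.0, 5.5, 3.0, 4.0, 7.5, 6.0, 9.0, 6.5, 10.0)

# Cut points taken at the quartiles of actual
cut_points <- quantile(actual, probs = c(0.25, 0.5, 0.75))

# One binary classification problem per cut point
# (assumption: values above the cut form the positive class)
auc_by_cut <- vapply(
  cut_points,
  function(cut) auc_rank(actual > cut, pred),
  numeric(1)
)
auc_by_cut
```

Each element of auc_by_cut answers a different question: how well do the predictions separate values above a given cut point from values below it?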

Usage

reg_aucroc(
  actual,
  pred,
  num_quants = 100,
  ...,
  cuts = NULL,
  imbalance = 0.05,
  na.rm = FALSE,
  sample_size = 10000,
  seed = 0
)

Arguments

actual: numeric vector. Actual label values from a dataset. They must be numeric.
pred: numeric vector. Predictions corresponding to each respective element in actual.
num_quants: scalar positive integer. If cuts is NULL (default), actual will be dichotomized into num_quants quantiles and that many ROCs will be returned in the rocs element. However, if cuts is specified, then num_quants is ignored.
...: Not used. Forces explicit naming of the arguments that follow.
cuts: numeric vector. If cuts is provided, it overrides num_quants to specify the cut points for dichotomization of actual for the creation of cuts + 1 ROCs.
imbalance: numeric(1) in (0, 0.5]. The result element mean_auc averages the AUCs over three regions (see the Value section). imbalance is the assumed proportion of the less frequent class in the data. If not provided, defaults to 0.05 (5%).
na.rm: See documentation for aucroc()
sample_size: See documentation for aucroc(). In addition to those notes, for reg_aucroc(), any sampling is conducted before the dichotomization of actual so that all classification ROCs are based on identical data.
seed: See documentation for aucroc()
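The note on sample_size can be pictured with a short base R sketch. It only illustrates the order of operations described above (sample once, then dichotomize), not the package's actual sampling code. The use of set.seed() and sample.int(), the toy data, and the three cut points are all assumptions.

```r
set.seed(0)  # assumption: a fixed seed makes the subsample reproducible

# Toy data larger than the sampling limit
n      <- 50000
actual <- rnorm(n)
pred   <- actual + rnorm(n)
sample_size <- 10000

# Draw the row indices once, before any dichotomization
idx <- if (n > sample_size) sample.int(n, sample_size) else seq_len(n)
actual_s <- actual[idx]
pred_s   <- pred[idx]

# Dichotomize the same subsample at several cut points
cut_points    <- quantile(actual_s, probs = c(0.1, 0.5, 0.9))
labels_by_cut <- lapply(cut_points, function(cut) actual_s > cut)

# Every binary problem is built from the same rows
vapply(labels_by_cut, length, integer(1))
```

Because the subsample is fixed before the cuts are applied, differences between the resulting AUCs reflect the cut point rather than sampling noise.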

Value

List with the following elements:

rocs: List of results for aucroc() for each dichotomized segment of actual.
auc: named numeric vector of AUC extracted from each element of rocs. Named by the percentile that the AUC represents.
mean_auc: named numeric(3). The average AUC over the low, middle, and high quantiles of dichotomization:
lo: average AUC for dichotomizations with imbalance (e.g., 5%) or less of the actual target values;
mid: average AUC for dichotomizations between lo and hi;
hi: average AUC for dichotomizations with 1 - imbalance (e.g., 95%) or more of the actual target values.
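The three-region average can be sketched in base R as follows. The AUC values and the probs vector are made-up illustrative numbers, and the boundary handling (`<=` and `>=`) is an assumption, not a statement of how reg_aucroc() assigns cut points to regions.

```r
# Hypothetical per-cut AUCs, with the quantile level of each cut point
probs <- c(0.01, 0.03, 0.25, 0.50, 0.75, 0.97, 0.99)
auc   <- c(0.80, 0.84, 0.90, 0.92, 0.94, 0.97, 0.99)
imbalance <- 0.05

# Assign each cut point to a region (boundary handling is an assumption)
region <- ifelse(
  probs <= imbalance, "lo",
  ifelse(probs >= 1 - imbalance, "hi", "mid")
)
region <- factor(region, levels = c("lo", "mid", "hi"))

# Average the AUCs within each region
mean_auc <- vapply(split(auc, region), mean, numeric(1))
mean_auc
```

Reporting the three regions separately shows whether a model separates the extremes of the target (lo and hi) as well as it separates the bulk of the distribution (mid).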

Details

The ROC data and AUCROC values are calculated with aucroc().

Examples

# Remove rows with missing values from airquality dataset
airq <- airquality |>
  na.omit()

# Create binary version where the target variable 'Ozone' is dichotomized based on its median
airq_bin <- airq
airq_bin$Ozone <- airq_bin$Ozone >= median(airq_bin$Ozone)

# Create a generic regression model; use autogam
req_aq   <- autogam::autogam(airq, 'Ozone', family = gaussian())
#> Warning: basis dimension, k, increased to minimum possible
req_aq$perf$sa_wmae_mad  # Standardized accuracy for regression
#> NULL

# Create a generic classification model; use autogam
class_aq <- autogam::autogam(airq_bin, 'Ozone', family = binomial())
#> Warning: basis dimension, k, increased to minimum possible
class_aq$perf$auc  # AUC (standardized accuracy for classification)
#> NULL

# Compute AUC for regression predictions
reg_auc_aq <- reg_aucroc(
  airq$Ozone,
  predict(req_aq)
)

# Average AUC over the lo, mid, and hi quantiles of dichotomization:
reg_auc_aq$mean_auc
#>        lo       mid        hi 
#> 0.8541380 0.9398248 0.9876410