
Random variable distributions of ALE statistics for generating p-values
Source: R/ALEpDist.R
ALEpDist.Rd
ALE statistics are accompanied by two indicators of the confidence of their values. First, bootstrapping creates confidence intervals for ALE effects and ALE statistics to give a range of the possible ALE values. Second, we calculate p-values, an indicator of the probability that a given ALE statistic is random. An ALEpDist S7 object contains the distribution data necessary for generating such p-values.
Usage
ALEpDist(
  model,
  data = NULL,
  ...,
  y_col = NULL,
  rand_it = NULL,
  surrogate = FALSE,
  parallel = "all",
  model_packages = NULL,
  random_model_call_string = NULL,
  random_model_call_string_vars = character(),
  positive = TRUE,
  pred_fun = function(object, newdata, type = pred_type) {
    stats::predict(object = object, newdata = newdata, type = type)
  },
  pred_type = "response",
  output_residuals = FALSE,
  seed = 0,
  silent = FALSE,
  .skip_validation = FALSE
)
Arguments
- model
  See documentation for ALE()
- data
  See documentation for ALE()
- ...
  Not used. Inserted to require explicit naming of subsequent arguments.
- y_col
  See documentation for ALE()
- rand_it
  non-negative integer(1). Number of times that the model should be retrained with a new random variable. The default of NULL will generate 1000 iterations, which should give reasonably stable p-values; these are considered "exact" p-values. It can be reduced to as low as 100 for faster test runs, yielding approximate ("approx") p-values that are not as stable. rand_it below 100 is not allowed, as such p-values are inaccurate.
- surrogate
  logical(1). Create p-value distributions based on a surrogate linear model (TRUE) instead of on the original model (default FALSE). Note that while faster surrogate p-values are convenient for interactive analysis, they are not acceptable for definitive conclusions or publication. See details.
- parallel
  See documentation for ALE(). Note that for exact p-values, by default 1000 random variables are trained. So, even with parallel processing, the procedure is very slow.
- model_packages
  See documentation for ALE()
- random_model_call_string
  character(1). If NULL, the ALEpDist() constructor tries to automatically detect and construct the call for p-values. If it cannot, the constructor will fail. In that case, a character string of the full call for the model must be provided that includes the random variable. See details.
- random_model_call_string_vars
  See documentation for model_call_string_vars in ModelBoot(); their operation is very similar.
- positive
  See documentation for ModelBoot()
- pred_fun, pred_type
  See documentation for ALE()
- output_residuals
  logical(1). If TRUE, returns the residuals in addition to the raw data of the generated random statistics (which are always returned). The default FALSE does not return the residuals.
- seed
  See documentation for ALE()
- silent
  See documentation for ALE()
- .skip_validation
  Internal use only. logical(1). Skip non-mutating data validation checks. Changing the default FALSE risks crashing with incomprehensible error messages.
Value
An object of class ALEpDist with properties rand_stats, residual_distribution, residuals, and params.
Properties
- rand_stats
  A named list of tibbles. There is normally one element whose name is the same as y_col, except if y_col is a categorical variable; in that case, the elements are named for each category of y_col. Each element is a tibble whose rows are each of the rand_it_ok iterations of the random variable analysis and whose columns are the ALE statistics obtained for each random variable.
- residual_distribution
  A univariateML object with the closest estimated distribution for the residuals as determined by univariateML::model_select(). This is the distribution used to generate all the random variables.
- residuals
  If output_residuals == TRUE, a matrix of the actual y_col values from data minus the predicted values from the model (without random variables) on the data. The rows correspond to each row of data. The columns correspond to the named elements (y_col or categories) described above for rand_stats. NULL if output_residuals == FALSE (default).
- params
  Parameters used to generate p-value distributions. Most of these repeat selected arguments passed to ALEpDist(). These are either values provided by the user or defaults the user did not change, but the following additional or modified objects are notable:
  * `model`: selected elements that describe the `model` used to generate the random distributions.
  * `rand_it`: the number of random iterations requested by the user, either explicitly (by specifying a whole number) or implicitly with the default `NULL`: exact p distributions imply 1000 iterations and surrogate distributions imply 100, unless an explicit number of iterations is requested.
  * `rand_it_ok`: a whole number with the number of `rand_it` iterations that successfully generated a random variable, that is, those that did not fail for whatever reason. The `rand_it` - `rand_it_ok` failed attempts are discarded.
  * `exactness`: a string. For regular p-values generated from the original model, `'exact'` if `rand_it_ok >= 1000` and `'approx'` otherwise. `'surrogate'` for p-values generated from a surrogate model. `'invalid'` if `rand_it_ok < 100`.
Exact p-values for ALE statistics
Because ALE is non-parametric (that is, it does not assume any particular distribution of data), the {ale} package takes a literal frequentist approach to the calculation of empirical (Monte Carlo) p-values. That is, it literally retrains the model 1000 times, each time modifying it by adding a distinct random variable to the model. (The number of iterations is customizable with the rand_it argument.) The ALEs and ALE statistics are calculated for each random variable. The percentiles of the distribution of these random-variable ALEs are then used to determine p-values for non-random variables. Thus, p-values are interpreted as the frequency of random-variable ALE statistics that exceed the value of the ALE statistic of the actual variable in question. The specific steps are as follows:

1. The residuals of the original model trained on the training data are calculated (residuals are the actual y target values minus the predicted values).
2. The closest distribution of the residuals is detected with univariateML::model_select().
3. 1000 new models are trained by generating a random variable each time with univariateML::rml() and then training a new model with that random variable added.
4. The ALEs and ALE statistics are calculated for each random variable.
5. For each ALE statistic, the empirical cumulative distribution function (stats::ecdf()) is used to create a function to determine p-values according to the distribution of the random variables' ALE statistics.
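The steps above can be sketched in base R. This is a minimal illustration only, not the package's implementation: it uses stats::lm() as a stand-in for an arbitrary model, a normal distribution in place of the distribution fitted by univariateML::model_select(), and an absolute coefficient as a stand-in for the real ALE statistics; all object names are illustrative.

```r
set.seed(0)
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)

# A stand-in model; the real procedure works with any model
model <- stats::lm(y ~ x, data = dat)

# Step 1: residuals of the original model on the training data
res <- dat$y - stats::predict(model, newdata = dat)

# Steps 2-4: draw a random variable from (an approximation of) the residual
# distribution, retrain the model with it added, and record a statistic for
# the random variable (its absolute coefficient, standing in for ALE statistics)
rand_stats <- vapply(seq_len(100), function(i) {
  rand_data <- dat
  rand_data$random_variable <- rnorm(nrow(dat), mean(res), stats::sd(res))
  rand_model <- stats::lm(y ~ x + random_variable, data = rand_data)
  abs(stats::coef(rand_model)[["random_variable"]])
}, numeric(1))

# Step 5: the empirical cumulative distribution of the random-variable
# statistics converts an actual variable's statistic into a p-value
p_fun <- stats::ecdf(rand_stats)
actual_stat <- abs(stats::coef(model)[["x"]])
p_value <- 1 - p_fun(actual_stat)  # frequency of random stats exceeding it
```

Here, x has a strong true effect, so its statistic lies far beyond the random-variable distribution and the resulting p-value is near zero.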
Because the {ale} package is model-agnostic (that is, it works with any kind of R model), the ALEpDist() constructor cannot always automatically manipulate the model object to create the p-values. It can only do so for models that follow the standard R statistical modelling conventions, which include almost all base R algorithms (like stats::lm() and stats::glm()) and many widely used statistics packages (like mgcv and survival), but which exclude most machine learning algorithms (like tidymodels and caret). For non-standard algorithms, the user needs to do a little work to help the ALEpDist() constructor correctly manipulate its model object:

- The full model call must be passed as a character string in the argument random_model_call_string, with two slight modifications as follows:
  * In the formula that specifies the model, you must add a variable named 'random_variable'. This corresponds to the random variables that the constructor will use to estimate p-values.
  * The dataset on which the model is trained must be named 'rand_data'. This corresponds to the modified datasets that will be used to train the random variables.

See the example below for how this is implemented.
If the model generation is unstable (because of a small dataset size or a finicky model algorithm), then one or more iterations might fail, possibly dropping the number of successful iterations below 1000. In that case, the p-values are only considered approximate; they are no longer exact. To guard against this, request rand_it at a sufficiently high number so that even if some iterations fail, at least 1000 will succeed. For example, for an ALEpDist object named p_dist, if p_dist@params$rand_it_ok is 950, you could rerun ALEpDist() with rand_it = 1100 or higher to allow for up to 100 possible failures.
Faster approximate and surrogate p-values
The procedure we have just described requires at least 1000 random iterations for p-values to be considered "exact". Unfortunately, this procedure is rather slow: it takes at least 1000 times as long as training the model once.

With fewer iterations (at least 100), p-values can only be considered approximate ("approx"). With fewer than 100 iterations, p-values are invalid. There might be fewer iterations either because the user requests them with the rand_it argument or because some iterations fail for whatever reason. As long as at least 1000 iterations succeed, p-values will be considered exact.

Because the procedure can be very slow, a faster version of the algorithm generates "surrogate" p-values by substituting the original model with a linear model that predicts the same y_col outcome from all the other columns in data. By default, these surrogate p-values use only 100 iterations, and if the dataset is large, the surrogate model samples 1000 rows. Thus, surrogate p-values can be generated much faster than p-values for slower model algorithms on larger datasets. Although they are suitable for model development and analysis because they are faster to generate, they are less reliable than approximate p-values based on the original model. In any case, definitive conclusions (e.g., for publication) always require exact p-values with at least 1000 iterations on the original model. Note that surrogate p-values are always marked as "surrogate"; even if they are generated from over 1000 iterations, they can never be considered exact because they are not based on the original model.
References
Okoli, Chitu. 2023. "Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE)." arXiv. doi:10.48550/arXiv.2310.09877.
Examples
# \donttest{
# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]
# Create a GAM with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
cut + color + clarity +
ti(carat, by = clarity), # a 2D interaction
data = diamonds_sample
)
summary(gam_diamonds)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
#> cut + color + clarity + ti(carat, by = clarity)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3001.89 260.67 11.516 < 2e-16 ***
#> cut.L 127.31 153.83 0.828 0.408134
#> cut.Q 72.52 111.14 0.653 0.514238
#> cut.C -55.92 83.03 -0.673 0.500810
#> cut^4 51.68 64.61 0.800 0.423979
#> color.L -1632.19 91.78 -17.785 < 2e-16 ***
#> color.Q -472.22 83.82 -5.634 2.34e-08 ***
#> color.C -35.91 76.97 -0.467 0.640944
#> color^4 27.23 70.13 0.388 0.697875
#> color^5 -142.75 65.77 -2.170 0.030237 *
#> color^6 1.52 59.47 0.026 0.979618
#> clarity.L -1523.84 1015.27 -1.501 0.133718
#> clarity.Q -3469.55 893.39 -3.884 0.000110 ***
#> clarity.C -1380.98 790.93 -1.746 0.081137 .
#> clarity^4 678.13 761.51 0.890 0.373429
#> clarity^5 1437.67 622.47 2.310 0.021127 *
#> clarity^6 1305.16 379.50 3.439 0.000609 ***
#> clarity^7 653.21 159.89 4.085 4.78e-05 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(carat) 8.594 8.888 6.954 < 2e-16 ***
#> s(depth) 2.177 2.874 0.778 0.42173
#> s(table) 1.000 1.000 0.107 0.74371
#> s(x) 7.954 8.513 2.788 0.00401 **
#> s(y) 5.234 6.467 8.077 < 2e-16 ***
#> s(z) 3.404 4.479 1.623 0.15244
#> ti(carat):claritySI2 3.737 3.948 14.107 < 2e-16 ***
#> ti(carat):claritySI1 1.000 1.000 52.139 < 2e-16 ***
#> ti(carat):clarityVS2 2.884 3.390 22.095 < 2e-16 ***
#> ti(carat):clarityVS1 3.948 3.996 33.200 < 2e-16 ***
#> ti(carat):clarityVVS2 2.279 2.660 55.728 < 2e-16 ***
#> ti(carat):clarityVVS1 3.924 3.994 36.573 < 2e-16 ***
#> ti(carat):clarityIF 3.911 3.992 38.323 < 2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> R-sq.(adj) = 0.962 Deviance explained = 96.5%
#> GCV = 6.2787e+05 Scale est. = 5.8514e+05 n = 1000
# Create p-value distribution
pd_diamonds <- ALEpDist(
gam_diamonds,
diamonds_sample,
# only 100 iterations for a quick demo; but usually should remain at 1000
rand_it = 100
)
#> Loading required package: intervals
# Examine the structure of the returned object
print(pd_diamonds)
#> <ale::ALEpDist>
#> @ rand_stats :List of 1
#> .. $ price: tibble [100 × 6] (S3: tbl_df/tbl/data.frame)
#> .. ..$ aled : num [1:100] 22.5 55.5 21.9 14.9 23.8 ...
#> .. ..$ aler_min : num [1:100] -167 -446 -209 -143 -194 ...
#> .. ..$ aler_max : num [1:100] 210 367 230 112 273 ...
#> .. ..$ naled : num [1:100] 0.33 0.655 0.323 0.246 0.351 ...
#> .. ..$ naler_min: num [1:100] -1.6 -6 -2.4 -1.4 -2 -5.1 -0.2 -1.2 -0.9 -0.5 ...
#> .. ..$ naler_max: num [1:100] 2 3.3 2.3 1.5 2.9 3 0.6 1.1 1.7 1 ...
#> @ residual_distribution: 'univariateML' Named num [1:4] 7.19e-04 7.39e+02 7.16e-01 1.14
#> .. - attr(*, "logLik")= num -7814
#> .. - attr(*, "call")= language f(x = x, na.rm = na.rm)
#> .. - attr(*, "n")= int 1000
#> .. - attr(*, "model")= chr "Skew Generalized Error"
#> .. - attr(*, "density")= chr "fGarch::dsged"
#> .. - attr(*, "support")= num [1:2] -Inf Inf
#> .. - attr(*, "names")= chr [1:4] "mean" "sd" "nu" "xi"
#> .. - attr(*, "default")= num [1:4] 0 1 3 3
#> .. - attr(*, "continuous")= logi TRUE
#> @ residuals : NULL
#> @ params :List of 11
#> .. $ model :List of 4
#> .. ..$ class : chr [1:3] "gam" "glm" "lm"
#> .. ..$ call : chr "mgcv::gam(formula = price ~ s(carat) + s(depth) + s(table) + \n s(x) + s(y) + s(z) + cut + color + clarity +"| __truncated__
#> .. ..$ print : chr "\nFamily: gaussian \nLink function: identity \n\nFormula:\nprice ~ s(carat) + s(depth) + s(table) + s(x) + s(y)"| __truncated__
#> .. ..$ summary: chr "\nFamily: gaussian \nLink function: identity \n\nFormula:\nprice ~ s(carat) + s(depth) + s(table) + s(x) + s(y)"| __truncated__
#> .. $ y_col : chr "price"
#> .. $ rand_it : num 100
#> .. $ parallel : Named int 4
#> .. ..- attr(*, "names")= chr "system"
#> .. $ model_packages : chr "mgcv"
#> .. $ random_model_call_string : NULL
#> .. $ random_model_call_string_vars: chr(0)
#> .. $ positive : logi TRUE
#> .. $ seed : num 0
#> .. $ rand_it_ok : int 100
#> .. $ exactness : chr "approx"
# In RStudio: View(pd_diamonds)
# Calculate ALEs with p-values
ale_gam_diamonds <- ALE(
gam_diamonds,
p_values = pd_diamonds
)
# Plot the ALE data. The horizontal bands in the plots use the p-values.
plot(ale_gam_diamonds)
# For non-standard models that give errors with the default settings,
# you can use 'random_model_call_string' to specify a model for the estimation
# of p-values from random variables as in this example.
# See details above for an explanation.
pd_diamonds <- ALEpDist(
gam_diamonds,
diamonds_sample,
random_model_call_string = 'mgcv::gam(
price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
cut + color + clarity + random_variable,
data = rand_data
)',
# only 100 iterations for a quick demo; but usually should remain at 1000
rand_it = 100
)
# Examine the structure of the returned object
print(pd_diamonds)
#> <ale::ALEpDist>
#> @ rand_stats :List of 1
#> .. $ price: tibble [100 × 6] (S3: tbl_df/tbl/data.frame)
#> .. ..$ aled : num [1:100] 22.43 69.28 3 24.61 3.15 ...
#> .. ..$ aler_min : num [1:100] -165.8 -556.8 -28.7 -236.9 -25.7 ...
#> .. ..$ aler_max : num [1:100] 208.8 458 31.5 186.1 36.1 ...
#> .. ..$ naled : num [1:100] 0.3287 0.7999 0.0677 0.3541 0.071 ...
#> .. ..$ naler_min: num [1:100] -1.6 -7.5 -0.2 -2.9 -0.2 -6.2 -1.2 -1.3 -1 -0.2 ...
#> .. ..$ naler_max: num [1:100] 2 4.2 0.6 2 0.6 3.1 1.6 1.1 1.9 0.5 ...
#> @ residual_distribution: 'univariateML' Named num [1:4] 7.19e-04 7.39e+02 7.16e-01 1.14
#> .. - attr(*, "logLik")= num -7814
#> .. - attr(*, "call")= language f(x = x, na.rm = na.rm)
#> .. - attr(*, "n")= int 1000
#> .. - attr(*, "model")= chr "Skew Generalized Error"
#> .. - attr(*, "density")= chr "fGarch::dsged"
#> .. - attr(*, "support")= num [1:2] -Inf Inf
#> .. - attr(*, "names")= chr [1:4] "mean" "sd" "nu" "xi"
#> .. - attr(*, "default")= num [1:4] 0 1 3 3
#> .. - attr(*, "continuous")= logi TRUE
#> @ residuals : NULL
#> @ params :List of 11
#> .. $ model :List of 4
#> .. ..$ class : chr [1:3] "gam" "glm" "lm"
#> .. ..$ call : chr "mgcv::gam(formula = price ~ s(carat) + s(depth) + s(table) + \n s(x) + s(y) + s(z) + cut + color + clarity +"| __truncated__
#> .. ..$ print : chr "\nFamily: gaussian \nLink function: identity \n\nFormula:\nprice ~ s(carat) + s(depth) + s(table) + s(x) + s(y)"| __truncated__
#> .. ..$ summary: chr "\nFamily: gaussian \nLink function: identity \n\nFormula:\nprice ~ s(carat) + s(depth) + s(table) + s(x) + s(y)"| __truncated__
#> .. $ y_col : chr "price"
#> .. $ rand_it : num 100
#> .. $ parallel : Named int 4
#> .. ..- attr(*, "names")= chr "system"
#> .. $ model_packages : chr "mgcv"
#> .. $ random_model_call_string : chr "mgcv::gam(\n price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +\n cut + color + clarity + "| __truncated__
#> .. $ random_model_call_string_vars: chr(0)
#> .. $ positive : logi TRUE
#> .. $ seed : num 0
#> .. $ rand_it_ok : int 100
#> .. $ exactness : chr "approx"
# In RStudio: View(pd_diamonds)
# }