Winsorize a numeric vector: winsorize() (staccuracy)

Winsorization truncates the extremes of a numeric range by replacing values beyond a predetermined minimum and maximum with those bounds. winsorize() returns the input vector with every value below the provided minimum replaced by that minimum and every value above the provided maximum replaced by that maximum.
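The replacement rule described above can be sketched in base R with pmax() and pmin(). This is a minimal illustration, not the package's source: the helper name clamp_sketch is hypothetical, and it assumes win_range is given as c(minimum, maximum).

```r
# Hypothetical helper (not part of staccuracy): clamp x to win_range.
# Assumes win_range is ordered as c(minimum, maximum).
clamp_sketch <- function(x, win_range) {
  stopifnot(is.numeric(x), is.numeric(win_range), length(win_range) == 2)
  lo <- win_range[1]
  hi <- win_range[2]
  # pmax() raises values below lo up to lo; pmin() lowers values above hi to hi
  pmin(pmax(x, lo), hi)
}

a <- c(3, 5, 2, 7, 9, 4, 6, 8, 2, 10)
clamp_sketch(a, c(2, 8))
#>  [1] 3 5 2 7 8 4 6 8 2 8
```

Values already inside the range pass through unchanged; only 9 and 10 are pulled down to 8 here.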

win_mae() and win_rmse() return the MAE and RMSE, respectively, computed on winsorized predictions. The underlying idea is that if the actual data has well-defined bounds, a model should not be penalized for predicting beyond those bounds. A model that overshoots at the boundaries may still be superior within the normal range, and its out-of-range predictions are easily corrected by winsorization.
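To make the idea concrete, the following base R sketch clamps the predictions to the range of the actual values before computing each error metric. The names win_mae_sketch and win_rmse_sketch are hypothetical stand-ins, and the sketch assumes no missing values; it is not the package's implementation.

```r
# Hypothetical stand-ins (not the staccuracy functions).
# Assumes no NA values and win_range ordered as c(minimum, maximum).
win_mae_sketch <- function(actual, pred, win_range = range(actual)) {
  clamped <- pmin(pmax(pred, win_range[1]), win_range[2])
  mean(abs(actual - clamped))        # mean absolute error on clamped predictions
}

win_rmse_sketch <- function(actual, pred, win_range = range(actual)) {
  clamped <- pmin(pmax(pred, win_range[1]), win_range[2])
  sqrt(mean((actual - clamped)^2))   # root mean squared error on clamped predictions
}

a <- c(3, 5, 2, 7, 9, 4, 6, 8, 2, 10)
p <- c(2.5, 5.5, 1.5, 6.5, 10.5, 3.5, 6, 7.5, 0.5, 11.5)

win_mae_sketch(a, p)
#> [1] 0.35
win_rmse_sketch(a, p)
#> [1] 0.4743416
```

Predictions such as 0.5 and 11.5 fall outside range(a), which is 2 to 10, so they are scored as 2 and 10 instead.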

Usage

winsorize(x, win_range)

win_mae(actual, pred, win_range = range(actual), na.rm = FALSE)

win_rmse(actual, pred, win_range = range(actual), na.rm = FALSE)

Arguments

x: numeric vector.
win_range: numeric(2). The minimum and maximum allowable values for the predictions pred or for x. For functions with pred, win_range defaults to the minimum and maximum of the provided actual values. For functions with x, there is no default.
actual: numeric vector. Actual (true) values of target outcome data.
pred: numeric vector. Predictions corresponding to each respective element in actual.
na.rm: logical(1). TRUE if missing values should be removed; FALSE if they should be retained. If TRUE and an element of either actual or pred is missing, its paired element in the other vector is also removed.
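The pairwise removal described for na.rm can be pictured with the following base R sketch. The helper drop_incomplete_pairs is hypothetical and only illustrates the documented behavior; the package may implement it differently.

```r
# Hypothetical helper (not part of staccuracy): keep only the positions
# where both the actual value and its paired prediction are present.
drop_incomplete_pairs <- function(actual, pred) {
  stopifnot(length(actual) == length(pred))
  keep <- !is.na(actual) & !is.na(pred)
  list(actual = actual[keep], pred = pred[keep])
}

actual <- c(3, NA, 2, 7)
pred   <- c(2.5, 5.5, NA, 6.5)

pairs <- drop_incomplete_pairs(actual, pred)
pairs$actual
#> [1] 3 7
pairs$pred
#> [1] 2.5 6.5
```

Positions 2 and 3 are dropped from both vectors, so the remaining elements still line up one to one.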

Value

winsorize() returns a winsorized vector.

win_mae() returns the mean absolute error (MAE) of winsorized predicted values pred compared to the actual values. See mae() for details.

win_rmse() returns the root mean squared error (RMSE) of winsorized predicted values pred compared to the actual values. See rmse() for details.

Examples

library(staccuracy)

a <- c(3, 5, 2, 7, 9, 4, 6, 8, 2, 10)
p <- c(2.5, 5.5, 1.5, 6.5, 10.5, 3.5, 6, 7.5, 0.5, 11.5)

a  # the original data
#>  [1]  3  5  2  7  9  4  6  8  2 10
winsorize(a, c(2, 8))  # a winsorized on defined boundaries
#>  [1] 3 5 2 7 8 4 6 8 2 8

# range of the original data
a
#>  [1]  3  5  2  7  9  4  6  8  2 10
range(a)
#> [1]  2 10

# some overzealous predictions
p
#>  [1]  2.5  5.5  1.5  6.5 10.5  3.5  6.0  7.5  0.5 11.5
range(p)
#> [1]  0.5 11.5

# MAE penalizes overzealous predictions
mae(a, p)
#> [1] 0.75

# Winsorized MAE forgives overzealous predictions
win_mae(a, p)
#> [1] 0.35

# RMSE penalizes overzealous predictions
rmse(a, p)
#> [1] 0.9082951

# Winsorized RMSE forgives overzealous predictions
win_rmse(a, p)
#> [1] 0.4743416