A Fully Bayesian Way of Estimating Insurance Relativities

Brayden Tang

27/11/2020


Some Context

I am not an actuarial student (anymore), but I recently completed a typical pricing assignment as part of an evaluation for a job application. I didn’t end up applying; however, I enjoyed having a familiar (and fake) kind of dataset that I used to work with all the time in my past life. I wanted to see if I could get around some of the issues I had with these kinds of datasets three years ago.

The objective of actuarial relativity analysis is, at a high level, to predict the pure premium as accurately as possible for all combinations of rating variables. The pure premium is the amount of money required, on average, to cover the cost of claims only (so no profit allocations or expenses). Rating variables are simply policyholder characteristics (such as the amount of kilometers on a vehicle, age of the policyholder, color of the vehicle, engine size, etc.)

Naturally, the pure premium is simply the rate at which a risk makes any kind of claim multiplied by the average amount per claim. Thus, pure premium is defined as:

\[ \text{frequency} \times \text{severity} = \frac{\text{total claim counts}}{\text{total risks = exposures}} \times \frac{\text{total claim amounts (losses)}}{\text{total claim counts}} \]

\[ = \frac{\text{total claim amounts (losses)}}{\text{total risks}}.\]

We can model either the frequency and severity distributions separately, or we can model pure premium directly. The former approach is often preferred as it offers more flexibility and is, at least logically, more robust.

Relativities

Relativities are simply the marginal impact a particular rating variable has on the response. This has the same interpretation as regression coefficients in a multiple linear model: holding all other variables constant, the regression coefficient represents the change in some quantity (say, the change in the log of severity or the log of frequency) per some change in said rating variable.

Relativities are typically calculated by predicting pure premiums across all of the levels of a particular rating variable, while simultaneously holding all of the other rating variables constant at the levels of some chosen base class. The base class represents one specific combination of rating variables that all other classes are compared against. Other combinations of rating variables are more risky (higher pure premium) or less risky (lower pure premium), relative to this base class.

Thus, each predicted pure premium is divided by the predicted pure premium of the relevant level from the base class to obtain these relativities.

Figure 1: Example relativities with two rating variables

City            Relativity
Small City      0.95
Medium City     1.00
Large City      1.50

Kilometres              Relativity
Less Than 10,000km      0.350
10,000-50,000km         0.760
50,001-100,000km        1.000
More Than 100,000km     1.378


Note that the vehicle described by the levels of each rating variable with relativities 1 (medium city & 50,001-100,000km) is the base class (in a linear model, this is the class described by the intercept). Relative to the base class, other vehicles have higher or lower premiums. The base class premium is multiplied by the corresponding relativities to derive the premiums for the other vehicle classes.

For example, suppose the base class pure premium (the premium for a policyholder who drives in the medium city and has a vehicle with 50,001-100,000km) is $1000. Then, the pure premium for a policyholder who drives in the small city with a vehicle that has more than 100,000km of mileage has a pure premium of $1000 * 0.95 * 1.378 = $1309.10.
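As a quick sanity check of this arithmetic, here is a tiny sketch in R using the made-up relativities from Figure 1; the numbers are illustrative only and are not taken from the actual dataset.

# Hypothetical relativities from Figure 1 (illustrative only).
base_pure_premium <- 1000

city_relativity <- c("Small City" = 0.95, "Medium City" = 1.00, "Large City" = 1.50)
km_relativity <- c(
    "Less Than 10,000km" = 0.350,
    "10,000-50,000km" = 0.760,
    "50,001-100,000km" = 1.000,
    "More Than 100,000km" = 1.378
)

# Small city, more than 100,000km:
unname(base_pure_premium * city_relativity["Small City"] * km_relativity["More Than 100,000km"])
# 1309.1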

The Dataset

I import the dataset below:


As is typical, one row represents one unique combination of rating variables, containing the total number of claims observed, the number of exposures (i.e. the number of risks, typically defined as the amount of policy term earned over a period of one year) and the total amount of claim payments.

While this dataset seems straightforward, it is deceptively difficult to work with when compared to other, more standard datasets. This is primarily due to the aggregation, which causes the rows and columns to be dependent. If we add more rating variables (columns) to the dataset, we naturally get more rows, since there will be more unique combinations of rating variables. In fact, the number of rows will grow exponentially (see the curse of dimensionality).

The consequences of this are dire. For one, model validation becomes very difficult. We cannot naively split the dataset above because each row is unique. If we did, the model would be evaluated against combinations of rating variables that are completely unseen in the training set, forcing the model to extrapolate (likely leading to poor predictions that aren’t reflective of how the model would perform in reality).

Second, explicit feature selection (which is emphasized a lot in pricing, since there are many business reasons to have as simple a rating algorithm as possible) is impossible. We cannot simply drop columns without affecting the number of rows (and therefore completely changing the dataset in the process). Thus, it is not clear how to fairly compare models that are fit to different underlying training datasets.

Third, each row becomes more sparse as well due to the curse of dimensionality, with some combinations of rating variables simply having zero claims or exposures. Naturally, this will lead to ill-defined models - the traditional gamma won’t work, for example, since we will have rows with exactly zero losses.

One solution to the sparsity problem is to group a sparse level of a rating variable with another level. This seems harmless, but in reality it corresponds to a deceptively strong prior, namely that the effects of the two levels on the response are the same. In math, this corresponds to setting \(\beta_{1} - \beta_{2} \sim N(0, 2\epsilon)\), where \(\epsilon\) is some number very close to 0 and \(\beta_{1}, \beta_{2}\) are the effects of the two separate levels being grouped. See this video for more details. Regardless, grouping levels of rating variables together is a very strong prior that is not at all transparent. Rather than doing this, we can use mixed effects/hierarchical models to achieve partial pooling, which is far more flexible and robust.

Aside: The Ideal Dataset

The ideal dataset, at least hypothetically, would be a dataset where one row is equal to one policyholder. The total number of earned exposures for the policyholder would be one column, and the total claim counts and total claim costs (for that policyholder) would be additional columns (from which we could model). The rating variables relevant for that policyholder would then be represented as additional columns.

This would make the number of rows independent of the number of rating variables used. To obtain relativities, the aggregated dataset could be recreated (that is, one row per unique combination of rating variables), from which we could then make our predictions of frequency and severity using the model trained on the policyholder-level data. The model predicts its best guess of \(E[Y | X = \text{rating variables}]\) learned over all observed policyholders.
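As a rough sketch of that re-aggregation step, assuming a hypothetical policyholder-level tibble called policy_data with the same rating variables as this dataset (the object and column names here are assumptions for illustration):

# Hypothetical: collapse policyholder-level rows back to one row per unique
# combination of rating variables.
library(dplyr)

aggregated <- policy_data %>%
    group_by(kilometres, zone, bonus, make) %>%
    summarise(
        vehicle_exposure_years = sum(exposure),
        claim_count = sum(claim_count),
        claim_payments = sum(claim_payments),
        .groups = "drop"
    )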

Post-Adjustments

Traditionally, generalized linear models (Poisson and gamma linear models in particular) are used to predict the frequency and severity of claims. These models produce the expected frequency and severity of a particular rating combination, and assuming independence, these expectations are multiplied together to get the expected pure premium.

However, any model used to predict the frequency and severity of claims is susceptible to overfitting, leading to relativities that are possibly overstated or understated. This often leads to actuaries manually adjusting relativities to be more in line with their a priori beliefs. For example, a vehicle with a larger engine would be expected to be of higher risk than a vehicle with a smaller engine, since presumably vehicles with larger engines are more expensive on average. Sometimes the relativities produced do not reflect this due to uncertainty/variance, and so an actuary may decide to “manually” adjust the values to smooth said estimates. This is often done out of respect for the customer, who may not appreciate being charged more in premium when they perceive themselves as lower risk. Depending on the actuary, they may just adjust the relativities that are not monotonic.

However, these adjustments are arbitrary if we do not actually know how much uncertainty there is in the estimates. In the case of traditional pricing modeling, it is difficult to actually quantify how much uncertainty we are dealing with because the frequency and severity models are estimated completely independently of each other. The expectations produced by each individual model are used to produce the pure premiums, which are then used to calculate relativities.

However, it is clear that if we view frequency and severity as random variables that they are not independent. If we have a zero frequency, we must have a zero severity (or equivalently, if we have observed claims we must have non-zero severity, assuming that claims below a deductible are ignored). Estimating both models separately ignores this dependence which inevitably affects the variance of the pure premium.

In addition, other sources of variation, such as parameter and residual uncertainty, are ignored.

Bayesian modeling will allow us to quantify the vast majority of uncertainty in our data. By modeling the data as a generative process instead, we can account for nearly all sources of variation that are ignored in the traditional, frequentist approach. As a result, actuaries will have the ability to choose reasonable values that are in line with the amount of uncertainty that exists. In addition, it allows actuaries to quantify their confidence in allowing for a non-monotonic relativity, giving them the ability to defend their decisions.

The Bayesian/Generative Model

We describe the fully generative model below:

Let \(X\) be the severity random variable, \(R_{i}\) be the rating variables, and \(N\) be the claim counts. Assume that exposures are a known constant.

Then, for the ith unique combination of rating variables (for all i such that \(X_{i} > 0\)), let:

\[X_{i} \big|N_{i}, \phi, \eta_{i} \sim Gamma\left(\frac{N_{i}}{\phi}, \frac{N_{i}}{\eta_{i} \times \phi}\right),\]

\[\eta_{i} \big| \beta, R_{i} = exp(R_{i} \times \beta),\]
\[N_{i} \big|\lambda_{i} \sim Poisson(\lambda_{i}),\]
\[\lambda_{i} \big|\psi, R_{i} = exp(R_{i} \times \psi + log(\text{exposure}_{i})),\]
\[\beta \sim Normal(0, \tau),\]
\[\psi \sim Normal(0, \omega),\]
\[\phi \sim \text{Half-Cauchy}(0, \alpha).\]

Note that \(\alpha, \tau, \omega\) are constants that must be specified by the user. In addition, note that

\[E[X_{i} \big |N_{i}, \phi, \eta_{i}] = \frac{N_{i}}{\phi} \times \frac{\eta_{i} \times \phi}{N_{i}} = \eta_{i},\]

but

\[Var[X_{i} \big |N_{i}, \phi, \eta_{i}] = \frac{N_{i}}{\phi} \times \frac{\eta_{i}^2 \times \phi^2}{N_{i}^2} = \frac{\eta_{i}^2 \times \phi}{N_{i}}.\]

That is, the more claims that make up a severity observation, the lower the variance of the resulting severity distribution for that particular combination of rating variables. Equivalently, the more claims for a particular combination of rating variables, the more influence that combination has on the overall model fit. This is essentially the same as using weights in a gamma generalized linear model; however, in this case the weights are also random variables and everything is modeled simultaneously.
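A quick simulation check of this mean/variance behaviour (arbitrary parameter values, shape/rate parameterization as above):

# Gamma(shape = N / phi, rate = N / (eta * phi)) should have mean eta and
# variance eta^2 * phi / N, so the variance shrinks as the claim count N grows.
set.seed(1)
eta <- 5000; phi <- 2; N <- 25

draws <- rgamma(1e6, shape = N / phi, rate = N / (eta * phi))
mean(draws)                  # roughly eta = 5000
var(draws); eta^2 * phi / N  # both roughly 2,000,000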

It follows that \[X_{i} \times \frac{N_{i}}{\text{number of exposures for ith combination}} = P_{i},\] where \(P_{i}\) is the pure premium for the ith combination of rating variables, is also a random variable. Since we will have joint posterior draws of the two random variables that the pure premium is a function of, the posterior distribution of the pure premium (and relativities) is also known.

The above formulation should also give insight into which combination of rating variables should, in general, represent the base class. Conditional on \(N_{i}\), the variance of the pure premium for the ith combination is \(\frac{N_{i}^{2} \times Var[X_{i} | N_{i}]}{\text{exposures}_{i}^2} = \frac{N_{i} \times \eta_{i}^{2} \times \phi}{\text{exposures}_{i}^2}\). Since claim counts scale roughly with exposures, choosing the base class with the largest amount of exposures will often yield a low varying (i.e. stable) pure premium (assuming the severity variance is not extremely large).

In addition, this Bayesian approach helps solve problems related to the aggregation of data.

Feature Selection

It was mentioned previously that feature selection is very difficult to carry out on an aggregated dataset where rows and columns are not independent. This is true if we are using explicit feature selection where rating variables are literally removed from the design matrix/training dataset.

However, we can achieve “feature selection” through regularization as well. L1 (LASSO) and L2 (ridge) regularization are equivalent to placing zero-mean Laplace and Normal priors on the regression coefficients, respectively, and taking the maximum a posteriori estimate. L1 regularization has the potential to explicitly select features by producing exactly zero coefficients (although in the Bayesian approach we marginalize rather than maximize, so we will never get exact zeros). The regularization strength typically associated with these techniques is governed by the variance of the Laplace/Normal prior - the larger the variance, the weaker the regularization. Bayesian approaches also allow for even more specific priors depending on the context, such as the horseshoe or spike-and-slab priors.
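To make the correspondence concrete for the Normal case (sketched here with a Gaussian likelihood with noise variance \(\sigma^{2}\), which is not the likelihood used in this post), the negative log posterior under a \(Normal(0, \tau^{2})\) prior on each coefficient is, up to an additive constant,

\[-log \ p(\beta \big| y) = \frac{1}{2\sigma^{2}}\sum_{i}\left(y_{i} - x_{i}^{T}\beta\right)^{2} + \frac{1}{2\tau^{2}}\sum_{j}\beta_{j}^{2} + \text{constant},\]

which is the ridge objective with penalty \(\lambda = \sigma^{2}/\tau^{2}\); replacing the Normal prior with a Laplace prior gives the LASSO penalty in the same way. Shrinking \(\tau^{2}\) therefore strengthens the regularization.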

In the Bayesian context, the variance is chosen based on our confidence, a priori, that the (predictive) effects of each level are non-zero. Alternatively, one can choose to use an empirical Bayes approach and tune this variance parameter using something like cross validation. This is equivalent to the typical machine learning approach - but in the Bayesian context this isn’t as feasible since Bayesian models fit using MCMC are slow. An alternative is to estimate the expected out-of-sample log pointwise predictive density (ELPD) using importance sampling (e.g. Pareto-smoothed importance sampling, PSIS-LOO), which is faster than true cross validation (since no refitting or data splitting is required) but still potentially too slow.

Regardless, the Bayesian model allows us to use regularization while still remaining completely probabilistic. This will pay off when we look at relativities.

Predictive Performance

The other main issue involved model validation in terms of predictive performance. We cannot just naively split the dataset, because each row represents a unique combination of rating variables.

We can, however, use likelihood-based evaluation metrics if we are comparing models fit to the same underlying dataset (among other caveats, like having the same raw target variable and avoiding comparisons between discrete and continuous distributions). One such metric is the aforementioned leave-one-out estimate of the ELPD, which estimates, for each observation, the log predictive density the model would assign to that observation had it been held out of the fit (simulating cross validation). However, just like true data splitting, we still suffer from the same problem of each row being a unique combination of rating variables. Thus, the ELPD on the held-out rating combination will be estimated through extrapolation, which is not true to how the model would behave in reality.
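As a sketch of how this could be computed (assuming the Stan programs are extended with a generated quantities block that stores the pointwise log-likelihood in a variable named log_lik; nothing below is part of the models as written):

# PSIS-LOO estimate of the ELPD via the loo package (hypothetical sketch).
library(loo)

log_lik_pg <- extract_log_lik(fit_poisson_gamma, parameter_name = "log_lik", merge_chains = FALSE)
r_eff_pg <- relative_eff(exp(log_lik_pg))
loo_pg <- loo(log_lik_pg, r_eff = r_eff_pg)

print(loo_pg)  # elpd_loo, p_loo, and the Pareto k diagnostics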

Unfortunately, it is unclear how to obtain absolute measures of predictive performance. The only alternative may be to perform posterior predictive checks, but these are known to be optimistic measures of predictive performance because they reuse the in-sample data. Despite this, some do suggest the use of 50% credible intervals to at least detect overfitting.

Some argue that Bayesian methods cannot really overfit, provided that uninformative priors are avoided and, in general, that marginalization rather than maximization is done as much as possible. Informative priors encourage estimators not to rely fully on the underlying training dataset by introducing bias via the prior. Marginalization is preferable to maximization since marginalization weighs each outcome by its posterior probability.

Regardless, none of this allows for the ability to quantify how well the model actually predicts in absolute terms.

Stan Code

Below is the same model described above, written in Stan. In this case, we set \(\tau\) and \(\omega\) both to 1, which gives “weakly informative” priors. No strong a priori beliefs regarding the effects exist (there is no context regarding the data), but regularization to shrink effects to 0, if justifiable, is desired.

For the dispersion parameter \(\phi\), we use a half-Cauchy(0, 3) distribution. This is again a weakly informative prior.

data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  matrix[Nobs, Nvar] X;
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector<lower=0>[Nobs] severity;
}

parameters {
  vector[Nvar] beta_frequency;
  vector[Nvar] beta_severity;
  real<lower=0> dispersion_severity;
}

model {
  
  beta_frequency ~ std_normal();
  beta_severity ~ std_normal();
  dispersion_severity ~ cauchy(0, 3);
  
  claim_count ~ poisson_log(X * beta_frequency + log(exposure));
  
  /* Rows with zero observed claims have no severity observation, so skip them */
  for (i in 1:Nobs) {
    if (claim_count[i] != 0) {
      severity[i] ~ gamma(
        claim_count[i] / dispersion_severity,
        claim_count[i] / (exp(X[i, ] * beta_severity) * dispersion_severity)
        );
    }
  }
  
}

Next, preprocess the data in a suitable format for Stan.

data_with_severity <- data %>%
    mutate(
        severity = ifelse(claim_count != 0, claim_payments / claim_count, 0), 
        kilometres = as.factor(kilometres),
        zone = as.factor(zone),
        bonus = as.factor(bonus),
        make = as.factor(make),
        observed_pp = claim_payments / vehicle_exposure_years
    )

preprocessing <- recipe(claim_count ~ ., data = data_with_severity) %>%
    step_dummy(all_nominal())

prepped_recipe <- prep(preprocessing, training = data_with_severity)

X <- juice(prepped_recipe) %>%
    mutate(intercept = rep(1, nrow(.))) %>%
    relocate(intercept) %>%
    select(
        -vehicle_exposure_years,
        -claim_payments,
        -severity, 
        -claim_count, 
        -observed_pp
        ) 

data_list <- list(
    Nobs = nrow(X),
    Nvar = ncol(X),
    X = as.matrix.data.frame(X),
    claim_count = data_with_severity$claim_count,
    exposure = data_with_severity$vehicle_exposure_years,
    severity = data_with_severity$severity
)

Now we sample from the posterior distribution as required:

if (refit == TRUE) {
    fit_poisson_gamma <- sampling(
        poisson_gamma_model,
        data = data_list,
        chains = 6,
        iter = 2000,
        seed = 200350623,
        cores = 6,
        verbose = TRUE
        )
    
    saveRDS(fit_poisson_gamma, "rds/fittedPGmodel.rds")
} else {
    fit_poisson_gamma <- readRDS("rds/fittedPGmodel.rds")
}

This function extracts the required samples from the MCMC sampler.

#' Extract the required mean/dispersion parameters for both candidate frequency
#' and severity distributions. 
#'
#' @param fitted_model A Stan fitted model (the result of rstan::sampling).
#'  This model MUST have named parameters beta_frequency, beta_severity, 
#'  and can optionally have parameters dispersion_frequency and 
#'  dispersion_severity.
#' @param exposures A vector of length N containing the exposures
#' @param X The design matrix of rating variables
#'
#' @return A list containing matrices/vectors of posterior samples of the 
#'  required probability distributions. Both frequency and severity parameter
#'  samples are returned on the predictor (not linked) scale.
#' @export
#'
#' @examples 
#' \dontrun{
#' extract_and_get_posterior_samples(
#'   my_stan_fitted,
#'   vehicle_exposure_years,
#'   X_design
#' )
#' }
extract_and_get_posterior_samples <- function(fitted_model, exposures, X) {
    
    samples <- rstan::extract(fitted_model)
    exposure <- matrix(
        log(exposures),
        nrow = nrow(samples$beta_frequency),
        ncol = length(exposures),
        byrow = TRUE
        )
    
    if (is.null(samples$dispersion_frequency)) {
        dispersion_frequency_samples <- NULL
    } else {
        dispersion_frequency_samples <- samples$dispersion_frequency
    }
    
    if (is.null(samples$dispersion_severity)) {
        dispersion_severity_samples <- NULL
    } else {
        dispersion_severity_samples <- samples$dispersion_severity
    }
    
    return(list(
        frequency_samples = samples$beta_frequency %*% t(as.matrix.data.frame(X)) + 
            exposure,
        severity_samples = samples$beta_severity %*% t(as.matrix.data.frame(X)),
        dispersion_frequency_samples = dispersion_frequency_samples,
        dispersion_severity_samples = dispersion_severity_samples,
        exposures = exposure
        )
    )
    
}

We now extract the posterior samples.

# Extract the samples for the Poisson-Gamma model.
pg_posterior_samples <- extract_and_get_posterior_samples(
    fitted_model = fit_poisson_gamma,
    exposures = data_with_severity$vehicle_exposure_years,
    X = X
)

# Apply the link functions.
pg_posterior_samples$frequency_samples <- exp(pg_posterior_samples$frequency_samples)
pg_posterior_samples$severity_samples <- exp(pg_posterior_samples$severity_samples)

Posterior Predictive Checks

Does the underlying generative model fit the dataset well?

Posterior Predictive Check - Frequency

First, we generate samples from the posterior predictive distribution. Then, we perform posterior predictive checks for the frequency, severity, and pure premium distributions (i.e. compare the generated data against the observed data).

# Basically, flatten the matrix of frequencies to a vector, generate a vector 
# of length equal to this vector using each of these frequencies, then convert
# back
set.seed(200350623)
claim_count_posterior_pred <- rpois(
    n = length(pg_posterior_samples$frequency_samples),
    lambda = pg_posterior_samples$frequency_samples
    )

dim(claim_count_posterior_pred) <- dim(pg_posterior_samples$frequency_samples)
posterior_pred_check(
    claim_count_posterior_pred,
    y = data$claim_count,
    n = 50, 
    "claims (sqrt scaled)",
    limits = c(0, 700)
    )

The blue curves represent 50 simulated datasets generated from the posterior predictive. The black curve is the actual, observed dataset (in this case, claim counts). Ideally, we want the black curve to fall well within the blue region as this implies that the actual, observed dataset is a likely realization under our generative model.

The posterior predictive graphs are all zoomed in on the region where the density is not trivially zero for easier inspection.

Overall, we can see that the frequency model looks decent at most of the claim count ranges (especially the lower claim counts), but the rating combinations with larger claim counts seem to be more variable than what the model can explain.

That is, the distribution of claim counts exhibits possible overdispersion. Thus, a possible improvement to the model might be to use the negative binomial distribution instead, rather than the Poisson.

Posterior Predictive Check - Severity

Similarly, we obtain posterior predictive draws of severity for each combination of rating variables.

#' Computes the posterior parameters for the weighted Gamma distribution.
#'
#' @param posterior_samples Posterior parameter samples from the Stan model.
#' @param posterior_predictive_claim_counts Posterior predictive claim counts.
#'  Must be a matrix.
#'
#' @return A list with posterior shape and rate parameters.
#' @export
#'
#' @examples sample_weighted_gamma(
#' posterior_samples, claim_counts_nb)
#'
#'
#'
sample_weighted_gamma <- function(
    posterior_samples,
    posterior_predictive_claim_counts) {
    
    dispersion_severity_posterior <- matrix(
        posterior_samples$dispersion_severity_samples,
        nrow = nrow(posterior_samples$frequency_samples),
        ncol = ncol(posterior_samples$frequency_samples)
        )
    
    shapes_posterior <- posterior_predictive_claim_counts / 
        dispersion_severity_posterior
    
    rates_posterior <- posterior_predictive_claim_counts /
        (posterior_samples$severity_samples * dispersion_severity_posterior)
    
    return(list(shapes = shapes_posterior, rates = rates_posterior))
    
    }

posterior_sev_samples_pg <- sample_weighted_gamma(
    pg_posterior_samples,
    claim_count_posterior_pred  
)

set.seed(200350623)
simulate_severity <- rgamma(
    n = length(pg_posterior_samples$severity_samples),
    shape = posterior_sev_samples_pg$shapes,
    rate = posterior_sev_samples_pg$rates
    )

dim(simulate_severity) <- dim(pg_posterior_samples$severity_samples)
posterior_pred_check(
    simulate_severity,
    y = data_with_severity$severity,
    n = 50,
    variable = "severity (sqrt scaled)",
    limits = c(0, 40000)
    )

The generative model looks off here. It underfits the most important region of largest density (the large spike) and fails to account for uncertainty at higher and lower levels of severity.

A major contributor to this is the fact that there is an unusual amount of density in the 31,000 region for severity. It turns out that this corresponds to a number of rating combinations with one or two claims and total payments in multiples of 31,442. This is likely an artifact inserted into the data, and it is certainly suspicious. However, these rating combinations receive so little weight in the fit, due to the weighting scheme employed, that the overall impact of these large severity amounts should be mitigated.

Ultimately, however, the goal is the pure premium since our relativities are based on these quantities.

Posterior Predictive Check - Pure Premium

simulate_pure_premium <- (
    claim_count_posterior_pred / exp(pg_posterior_samples$exposures)
    ) * simulate_severity
posterior_pred_check(
    simulate_pure_premium,
    y = data_with_severity$observed_pp,
    n = 50,
    variable = "pure premium (sqrt scaled)",
    limits = c(0, 4000)
)

The “peaks” in the distribution are not as extreme as what is observed in the data. The black curve does not fall in the blue region in important areas of the distribution, namely the single peak. Thus, under the generative model the observed data is simply not probable in some regions of pure premium, which is an issue.

Otherwise, the generative model at least has the right shape, for the most part.

Model Revision - Switch Frequency to Negative Binomial

We noticed in the previous section that the Poisson distribution does not account for as much variance as we would like. Indeed, the variance of the Poisson distribution is equal to its mean.

A negative binomial has variance greater than its mean and therefore could address issues where uncertainty is understated for particular rating variable combinations with large claim counts.

The generative model is quite similar to the previous model, but we switch the frequency distribution:

\[X_{i} \big|N_{i}, \phi, \eta_{i} \sim Gamma\left(\frac{N_{i}}{\phi}, \frac{N_{i}}{\eta_{i} \times \phi}\right),\]

\[\eta_{i} \big| \beta, R_{i} = exp(R_{i} \times \beta),\]
\[N_{i} \big|\xi_{i}, \nu \sim \text{Negative Binomial}(\xi_{i}, \nu),\]
\[\xi_{i} \big|\psi, R_{i} = exp(R_{i} \times \psi + log(\text{exposure}_{i})),\]
\[\beta \sim Normal(0, 1),\]
\[\psi \sim Normal(0, 1),\]
\[\phi \sim \text{Half-Cauchy}(0, 3),\]
\[\nu \sim \text{Half-Cauchy}(0, 10).\]

Note that the Negative Binomial uses the mean/dispersion parameterization in Stan (i.e. neg_binomial_2). The Half-Cauchy on \(\nu\) is set to be very wide, since stronger priors concentrated near zero would heavily favor models with significant amounts of over-dispersion (leading to sampling issues if the data does not exhibit strong overdispersion).
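A small simulation of this mean/dispersion parameterization (it matches R’s rnbinom(mu, size), which is used later for the posterior predictive draws); the variance is \(\xi + \xi^{2}/\nu\), so smaller \(\nu\) means more overdispersion:

# Mean/dispersion (NB2) parameterization: variance = mu + mu^2 / size.
set.seed(1)
mu <- 30; nu <- 5

draws <- rnbinom(1e6, mu = mu, size = nu)
mean(draws)                 # roughly 30
var(draws); mu + mu^2 / nu  # both roughly 210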

Compiling the model above in Stan:

data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  matrix[Nobs, Nvar] X;
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector<lower=0>[Nobs] severity;
}

parameters {
  vector[Nvar] beta_frequency;
  vector[Nvar] beta_severity;
  real<lower=0> dispersion_severity;
  real<lower=0> dispersion_frequency;
}

model {
  
  beta_frequency ~ std_normal();
  beta_severity ~ std_normal();
  dispersion_severity ~ cauchy(0, 3);
  dispersion_frequency ~ cauchy(0, 10); /* Larger variance here for the negative binomial 
  because the data did not appear to be that overdispersed and we need to allow for
  larger dispersion parameters to consider these cases */
  
  claim_count ~ neg_binomial_2_log(
    X * beta_frequency + log(exposure), dispersion_frequency
  );
  
  for (i in 1:Nobs) {
    if (claim_count[i] != 0) {
      severity[i] ~ gamma(
        claim_count[i] / dispersion_severity,
        claim_count[i] / (exp(X[i, ] * beta_severity) * dispersion_severity)
        );
    }
  }
  
}

Now we sample from the posterior distribution as required:

if (refit == TRUE) {
    
    fit_nb_gamma <- sampling(
        nb_gamma,
        data = data_list,
        chains = 6,
        iter = 2000,
        seed = 200350623,
        cores = 6,
        verbose = TRUE
        )
    
    saveRDS(fit_nb_gamma, "rds/nb-weighted_gamma.rds")
} else {
    fit_nb_gamma <- readRDS("rds/nb-weighted_gamma.rds")
}

nb_posterior_samples <- extract_and_get_posterior_samples(
    fitted_model = fit_nb_gamma,
    exposures = data_with_severity$vehicle_exposure_years,
    X = X
)

nb_posterior_samples$frequency_samples <- exp(nb_posterior_samples$frequency_samples)
nb_posterior_samples$severity_samples <- exp(nb_posterior_samples$severity_samples)

dispersion_frequency_nb <- matrix(
    nb_posterior_samples$dispersion_frequency_samples,
    nrow = length(nb_posterior_samples$dispersion_frequency_samples),
    ncol = nrow(data)
)

Posterior Predictive Check - Revision I

We again simulate observations from the posterior predictive.

# Frequency
set.seed(200350623)
claim_count_posterior_pred_nb <- rnbinom(
    n = length(nb_posterior_samples$frequency_samples),
    mu = nb_posterior_samples$frequency_samples,
    size = dispersion_frequency_nb
    )

dim(claim_count_posterior_pred_nb) <- dim(dispersion_frequency_nb)

parameters_nb <- sample_weighted_gamma(
    posterior_samples = nb_posterior_samples,
    posterior_predictive_claim_counts = claim_count_posterior_pred_nb
)

# Severity
set.seed(200350623)
simulate_severity_nb <- rgamma(
    n = length(parameters_nb$shapes),
    shape = parameters_nb$shapes,
    rate = parameters_nb$rates
    )

dim(simulate_severity_nb) <- dim(nb_posterior_samples$severity_samples)

# Pure Premium
simulate_pure_premium_nb <- (
    claim_count_posterior_pred_nb / exp(nb_posterior_samples$exposures)
    ) * simulate_severity_nb

Next, complete the posterior predictive checks as before:

results <- mapply(
    posterior_pred_check,
    posterior_predictive_samples = list(
            claim_count_posterior_pred_nb,
            simulate_severity_nb,
            simulate_pure_premium_nb
            ),
    y = list(
            data_with_severity$claim_count,
            data_with_severity$severity,
            data_with_severity$observed_pp
            ),
    variable = c(
            "claims (sqrt scaled)",
            "severity (sqrt scaled)",
            "pure premium (sqrt scaled)"
            ),
    limits = list(c(0, 700), c(0, 40000), c(0, 4000)),
    MoreArgs = list(n = 50),
    SIMPLIFY = FALSE
    )

Finally:

(results[[1]] | results[[2]]) / results[[3]] + plot_annotation(
    title = "Posterior Predictive Check - Negative Binomial Frequency"
)

Overall, the frequency distribution (claims) looks to be better modeled with a Negative Binomial, albeit marginally.

The severity distribution still looks off, and as a result, the pure premium looks only marginally (if that) better than before.

Model Revision - Switch Severity to a Heavier-Tailed Distribution

It looks like the current generative model gives too much density to small severity amounts, but also too much density to larger amounts. Therefore, if we switch to a heavy tailed distribution we might make the tails even heavier (and they are already too heavy) even though we may improve the fit for smaller severity amounts.

Regardless, we can still explore these distributions and see how they compare.

There are many distributions with heavier tails than the gamma.

Log-normal

The log-normal distribution is known to have heavier tails than the gamma distribution. This can be shown by calculating the limit of the ratio of their survival functions.
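A quick numeric illustration (matching the two distributions on an arbitrary mean and variance): the ratio of survival functions keeps growing with the threshold, i.e. the log-normal tail dominates.

# Match a gamma and a log-normal on mean m and variance v, then compare tails.
m <- 5000; v <- 2e6

shape <- m^2 / v; rate <- m / v   # gamma with mean m, variance v
sigma2 <- log(1 + v / m^2)        # log-normal with mean m, variance v
mu_ln <- log(m) - sigma2 / 2

x <- c(10000, 20000, 40000)
plnorm(x, meanlog = mu_ln, sdlog = sqrt(sigma2), lower.tail = FALSE) /
    pgamma(x, shape = shape, rate = rate, lower.tail = FALSE)
# The ratio of survival functions increases with x.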

Stan actually has a log-normal distribution built in. However, it is awkward to link the expected value of the log-normal to predictors, since the mean is \(exp(\mu + \sigma^{2}/2)\) (mixing the location and scale parameters) and the log-normal does not belong to the exponential dispersion family.

Instead of using the log-normal distribution, we can instead log the severity random variable to get a Normal(\(\mu\), \(\sigma\)) random variable (assuming that the severity random variable is log-normal). We can then simulate directly from the log-normal distribution using the posterior parameters to get the posterior predictive of the original (assumed) log-normal severity random variable.

data_list$severity <- ifelse(
    data_with_severity$claim_count != 0,
    log(data_with_severity$severity),
    0
)

The generative model is exactly the same as before, except now we replace \(X_{i} \big | N_{i}, \phi, \eta_{i}\) with a log-normal that has parameters \(R_{i}\beta, \frac{\sigma}{N_{i}^{0.5}}\).

data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  matrix[Nobs, Nvar] X;
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector[Nobs] severity; /*Note the change here - we removed the lower 
  limit since now the severity can technically be < 0 now when logged */
}

parameters {
  vector[Nvar] beta_frequency;
  vector[Nvar] beta_severity;
  real<lower=0> dispersion_severity;
  real<lower=0> dispersion_frequency;
}

model {
  
  beta_frequency ~ std_normal();
  beta_severity ~ std_normal();
  dispersion_severity ~ cauchy(0, 3);
  dispersion_frequency ~ cauchy(0, 10);
  
  claim_count ~ neg_binomial_2_log(
    X * beta_frequency + log(exposure), dispersion_frequency
  );
  
  for (i in 1:Nobs) {
    if (claim_count[i] != 0) {
      severity[i] ~ normal(
        X[i, ] * beta_severity,
        (dispersion_severity / claim_count[i])^0.5
        );
    }
  }
  
}

Extracting the posterior samples:

if (refit == TRUE) {
    
    fit_nb_lognormal <- sampling(
        sm_nb_lognormal,
        data = data_list,
        chains = 6,
        iter = 2000,
        control = list(adapt_delta = 0.95),
        seed = 200350623,
        cores = 6,
        verbose = TRUE
        )

    saveRDS(fit_nb_lognormal, "rds/fitted_nb_lognormal.rds")
} else {
    fit_nb_lognormal <- readRDS("rds/fitted_nb_lognormal.rds")
}

lognormal_sev_posterior_samples <- extract_and_get_posterior_samples(
    fitted_model = fit_nb_lognormal,
    exposures = data_with_severity$vehicle_exposure_years,
    X = X 
)

# Only exponentiate the frequency linear predictor here; the severity samples
# stay on the log scale because rlnorm takes the log-scale mean directly.
lognormal_sev_posterior_samples$frequency_samples <- exp(
    lognormal_sev_posterior_samples$frequency_samples
    )

Simulating from the posterior predictive:

# Frequency
dispersion_frequency_nb <- matrix(
    lognormal_sev_posterior_samples$dispersion_frequency_samples,
    nrow = length(lognormal_sev_posterior_samples$dispersion_frequency_samples),
    ncol = nrow(data)
)

set.seed(200350623)
claim_count_posterior_pred_nb <- rnbinom(
    n = length(lognormal_sev_posterior_samples$frequency_samples),
    mu = lognormal_sev_posterior_samples$frequency_samples,
    size = dispersion_frequency_nb
    )

dim(claim_count_posterior_pred_nb) <- dim(
    lognormal_sev_posterior_samples$frequency_samples
    )

# Severity

# First, calculate the normal standard deviations

dispersion_severity_posterior <- matrix(
        lognormal_sev_posterior_samples$dispersion_severity_samples,
        nrow = nrow(lognormal_sev_posterior_samples$frequency_samples),
        ncol = ncol(lognormal_sev_posterior_samples$frequency_samples)
        )

normal_sd <- (dispersion_severity_posterior / claim_count_posterior_pred_nb)^0.5

# Doesn't matter what we set the sd to here since if we simulate 0 claim counts,
# we simulate 0 claim payments. Set to 1 for convenience.
normal_sd[is.infinite(normal_sd)] <- 1

# Can now simulate as required
set.seed(200350623)
simulate_severity_ln <- rlnorm(
    n = length(dispersion_severity_posterior),
    meanlog = lognormal_sev_posterior_samples$severity_samples,
    sdlog = normal_sd
)

dim(simulate_severity_ln) <- dim(
    lognormal_sev_posterior_samples$severity_samples
    )

# If posterior predictive claim count is 0, then severity should be 0.
simulate_severity_ln[claim_count_posterior_pred_nb == 0] <- 0

Simulating pure premiums:

simulate_pure_premium_ln <- (
    claim_count_posterior_pred_nb / exp(lognormal_sev_posterior_samples$exposures)
    ) * simulate_severity_ln

Posterior Predictive Check - Revision II

results <- mapply(
    posterior_pred_check,
    posterior_predictive_samples = list(
            claim_count_posterior_pred_nb,
            simulate_severity_ln,
            simulate_pure_premium_ln
            ),
    y = list(
            data_with_severity$claim_count,
            data_with_severity$severity,
            data_with_severity$observed_pp
            ),
    variable = c(
            "claims (sqrt scaled)",
            "severity (sqrt scaled)",
            "pure premium (sqrt scaled)"
            ),
    limits = list(c(0, 700), c(0, 40000), c(0, 4000)),
    MoreArgs = list(n = 50),
    SIMPLIFY = FALSE
    )

Like before:

(results[[1]] | results[[2]]) / results[[3]] + plot_annotation(
    title = "Posterior Predictive Check - Log-normal Severity"
)

The fit for severity is mixed, as we expected. There is clearly more mass being put towards the tail and less towards small severity amounts which has improved the fit dramatically in this region. However, more density has been allocated to the larger severity amounts as a consequence which has made the generative model behave worse there.

The pure premium model looks better in some areas and slightly worse in others as a result. The region of highest density is modeled a bit better than with the gamma, but arguably worse than the gamma at lower pure premium values.

Inverse-Gaussian

The inverse-Gaussian has even heavier tails than the log-normal.

This requires writing a custom Stan function since the inverse-gaussian is not actually implemented natively in Stan.
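Before fitting, a quick check of the parameterization used here (mean \(\mu\) and shape \(\lambda\), matching statmod::rinvgauss, which is used later to simulate from the posterior predictive); the variance is \(\mu^{3}/\lambda\), so moderate shapes already give a very long right tail:

# Inverse-Gaussian(mean = mu, shape = lambda): mean = mu, variance = mu^3 / lambda.
library(statmod)

set.seed(1)
mu <- 5000; lambda <- 2000

draws <- rinvgauss(1e6, mean = mu, shape = lambda)
mean(draws)                # roughly 5000
var(draws); mu^3 / lambda  # both roughly 6.25e7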

functions {

/* This is just the observed loglikelihood of the IG, with additive constants
removed */

    real IG_lpdf (real x, real mu, real lambda) {
    
        real lprob;
        
        if (!(x > 0)) {
            reject("x must be strictly greater than 0");
        } 
        
        if (!(mu > 0)) {
            reject("mu must be strictly greater than 0");
        } 
        
        if (!(lambda > 0)) {
            reject("lambda must be strictly greater than 0");
        }   
        
        lprob = 0.5 * log(lambda) - (lambda * (x-mu)^2/(2 * mu^2 * x));
        return lprob;
    }
}

data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  matrix[Nobs, Nvar] X;
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector<lower=0>[Nobs] severity;
}

parameters {
  vector[Nvar] beta_frequency;
  vector[Nvar] beta_severity;
  real<lower=0> dispersion_severity;
  real<lower=0> dispersion_frequency;
}

model {
  
  beta_frequency ~ std_normal();
  beta_severity ~ std_normal();
  dispersion_severity ~ cauchy(0, 3);
  dispersion_frequency ~ cauchy(0, 10);
  
  claim_count ~ neg_binomial_2_log(
    X * beta_frequency + log(exposure), dispersion_frequency
  );
  
  for (i in 1:Nobs) {
    if (claim_count[i] != 0) {
      target += IG_lpdf(
        severity[i] | exp(X[i, ] * beta_severity), 
        claim_count[i] / dispersion_severity
        );
    } 
  }
  
}

Extracting posterior samples like before:

# Change the severity variable back to the original, non-logged version.
data_list$severity <- data_with_severity$severity

if (refit == TRUE) {
    
    fit_nb_IG <- sampling(
        sm_nb_ig,
        data = data_list,
        chains = 6,
        iter = 2000,
        control = list(adapt_delta = 0.97),
        seed = 200350623,
        cores = 6,
        verbose = TRUE
        )

    saveRDS(fit_nb_IG, "rds/fitted_nb_IG.rds")
} else {
    fit_nb_IG <- readRDS("rds/fitted_nb_IG.rds")
}

IG_sev_posterior_samples <- extract_and_get_posterior_samples(
    fitted_model = fit_nb_IG,
    exposures = data_with_severity$vehicle_exposure_years,
    X = X 
)

# Exponentiate the means to get the scalar parameters
IG_sev_posterior_samples$frequency_samples <- exp(
    IG_sev_posterior_samples$frequency_samples
    )

IG_sev_posterior_samples$severity_samples <- exp(
    IG_sev_posterior_samples$severity_samples
)

Again, simulating from the posterior predictive distributions:

# Frequency
dispersion_frequency_nb <- matrix(
    IG_sev_posterior_samples$dispersion_frequency_samples,
    nrow = length(IG_sev_posterior_samples$dispersion_frequency_samples),
    ncol = nrow(data)
)

set.seed(200350623)
claim_count_posterior_pred_nb <- rnbinom(
    n = length(IG_sev_posterior_samples$frequency_samples),
    mu = IG_sev_posterior_samples$frequency_samples,
    size = dispersion_frequency_nb
    )

dim(claim_count_posterior_pred_nb) <- dim(
    IG_sev_posterior_samples$frequency_samples
    )

# Severity

# lambda parameter
dispersion_severity_posterior <- matrix(
        IG_sev_posterior_samples$dispersion_severity_samples,
        nrow = nrow(IG_sev_posterior_samples$frequency_samples),
        ncol = ncol(IG_sev_posterior_samples$frequency_samples)
        )

# Note: like with the gamma, in this case if we have 0 claims then we get
# a lambda of 0. This is fine as the inverse_gaussian with 0 = lambda returns 0
# w.p. 1, regardless of mean.
lambda <- claim_count_posterior_pred_nb / dispersion_severity_posterior

# Now simulate severity draws as before
set.seed(200350623)
simulate_severity_IG <- rinvgauss(
    n = length(dispersion_severity_posterior),
    mean = IG_sev_posterior_samples$severity_samples,
    shape = lambda
)

dim(simulate_severity_IG) <- dim(
    IG_sev_posterior_samples$severity_samples
    )

simulate_pure_premium_IG <- (
    claim_count_posterior_pred_nb / exp(IG_sev_posterior_samples$exposures)
    ) * simulate_severity_IG

Posterior Predictive Check - Revision III

results <- mapply(
    posterior_pred_check,
    posterior_predictive_samples = list(
            claim_count_posterior_pred_nb,
            simulate_severity_IG,
            simulate_pure_premium_IG
            ),
    y = list(
            data_with_severity$claim_count,
            data_with_severity$severity,
            data_with_severity$observed_pp
            ),
    variable = c(
            "claims (sqrt scaled)",
            "severity (sqrt scaled)",
            "pure premium (sqrt scaled)"
            ),
    limits = list(c(0, 700), c(0, 40000), c(0, 4000)),
    MoreArgs = list(n = 50),
    SIMPLIFY = FALSE
    )

Like before:

(results[[1]] | results[[2]]) / results[[3]] + plot_annotation(
    title = "Posterior Predictive Check - Inverse-gaussian Severity"
)

Overall, the fit looks surprisingly better than the log-normal and appears to be the best fitting model of everything tried up to this point. The resulting posterior predictive distribution of the pure premium can generate the actual observed data with reasonable probability in most ranges.

The Random Crossed Effects Model (hierarchical Bayes)

All of the models above have their parameters estimated using only observations that are relevant to that parameter. For example, only rating combinations with zone 1 contribute to the estimation of the coefficient relevant to zone 1. However, many of the rating levels within each rating variable have much smaller claim and/or exposure counts when compared to other, more popular levels. That is, many of the rating levels are sparse.

A common practice is to simply group specific levels together to get rid of this sparsity (discussed in “The Dataset”). However, as mentioned previously, this decision corresponds directly to an incredibly strong prior assumption that is not transparent.

Rather than committing to these post-hoc methods, we can take advantage of partial pooling. All levels have their estimates computed as a weighted average between the global estimate over all levels and their individual estimate, where the weighting depends on the variation between rating levels, the variance within a level, and the amount of data that makes up a level (though in this case, the amount of data per level influences the within-level variance, which controls the pooling, due to the use of weighting in the models to address data aggregation). This is often interpreted as a way to induce sharing of information across the different levels of a rating variable. In the actuarial world, this is very similar in spirit to credibility theory, but credibility is often applied in an ad-hoc fashion.

In addition, the variation between rating levels accounts for extra variation between (and not just within) the individual levels of a rating variable. This leads to wider credible intervals, which allows for better inferences, if desired.
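As a toy sketch of how the pooling weight behaves in the simplest normal-normal setting (this is not the exact weighting used by the crossed model below, which also works through the claim-count weights, but it conveys the idea):

# Classic normal-normal partial pooling: the level estimate is pulled toward the
# grand mean, with more shrinkage for sparse levels.
partial_pool <- function(level_mean, grand_mean, between_var, within_var, n_obs) {
    w <- between_var / (between_var + within_var / n_obs)
    w * level_mean + (1 - w) * grand_mean
}

partial_pool(level_mean = 2, grand_mean = 1, between_var = 0.1, within_var = 1, n_obs = 2)
# 1.17 - a sparse level is shrunk strongly toward the grand mean
partial_pool(level_mean = 2, grand_mean = 1, between_var = 0.1, within_var = 1, n_obs = 500)
# 1.98 - a well-populated level barely moves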

The Generative Random Effects Model

The full generative model is given below:

\[\forall \ i: X_{i} > 0, \ N_{i} \in N_{0},\]

\[X_{i_{jklm}} \ \big| \ N_{i_{jklm}}, \phi, \mu_{i_{jklm}} \sim \text{Inverse Gaussian}\left(\mu_{i_{jklm}}, \frac{N_{i_{jklm}}}{\phi}\right),\]
\[\mu_{i_{jklm}} \ \big| \ \beta, b_{ij}, b_{ik}, b_{il}, b_{im} = exp(\beta + b_{ij} + b_{ik} + b_{il} + b_{im}),\]
\[b_{i_{[jklm]}} \sim \text{Normal}(0, \sigma_{[jklm]}^2),\]
\[\beta \sim \text{Normal}(8, 2.5),\]
\[N_{i_{jklm}} \ \big| \ \xi_{i_{jklm}}, \nu \sim \text{Negative Binomial}(\xi_{i_{jklm}}, \nu),\]
\[\xi_{i_{[jklm]}} \ \big| \ \psi, t_{ij}, t_{ik}, t_{il}, t_{im} = exp(\psi + t_{ij} + t_{ik} + t_{il} + t_{im} + log(\text{exposure}_{i})),\]
\[t_{i_{[jklm]}} \sim \text{Normal}(0, \tau_{[jklm]}^2),\]
\[\psi \sim \text{Normal}(1.6, 2.5),\]
\[\phi \sim \text{Half-Cauchy}(0, 5),\]
\[\nu \sim \text{Half-Cauchy}(0, 12),\]
\[\sigma^2_{[jklm]} \sim \text{Half Student's t}(3, 0, 2.5),\]
\[\tau_{[jklm]}^2 \sim \text{Half Student's t}(3, 0, 2.5),\]

where j, k, l, and m are the kilometre, zone, bonus, and make rating levels for the ith rating combination.

Equivalently, to illustrate the hierarchical structure that is going on:

\[X_{i_{[jklm]}} \ \big| \ \mu_{i_{[jklm]}}, N_{i_{[jklm]}}, \phi \sim \text{Inverse Gaussian}\left(\mu_{i_{[jklm]}}, \frac{N_{i_{[jklm]}}}{\phi}\right),\]
\[log(\mu_{i_{[jklm]}}) \ \big| \ \beta, \sigma_{[jklm]}^2 \sim \text{Normal}(\beta, \sigma_{[jklm]}^2),\]
\[\beta \sim \text{Normal}(8, 2.5),\]
\[\sigma^2_{[jklm]} \sim \text{Half Student's t}(3, 0, 2.5),\]

\[N_{i_{[jklm]}} \ \big| \ \xi_{i_{[jklm]}}, \nu \sim \text{Negative Binomial}(\xi_{i_{[jklm]}}, \nu),\]
\[log(\xi_{i_{[jklm]}}) \ \big| \ \psi, \tau^2_{[jklm]} \sim \text{Normal}(\psi, \tau^2_{[jklm]}),\]
\[\psi \sim \text{Normal}(1.6, 2.5),\]
\[\tau^2_{[jklm]} \sim \text{Half Student's t}(3, 0, 2.5).\]

The average severity of any particular level of a rating variable comes from the same distribution with a common mean \(\beta\) and variance \(\sigma^2_{[jklm]}\). \(\beta\) is the overall grand mean over all levels of a particular rating variable (which is the average severity ignoring all rating variables in this case). \(\sigma^2_{[jklm]}\) represents the observed variation between levels for a particular rating variable. Thus, individual levels within a rating variable are “pulled” towards the overall group mean, acting as additional regularization. More or less shrinkage is induced depending on the variation between levels, the number of observations for a level, and the amount of variance within a level. All rating variables share the same grand mean, since the effects are crossed.

Priors are given to \(\beta\) and \(\sigma^2\) so that these parameters are inferred from the data. These parameters are inferred using all of the data regardless of the observation’s level, which is where the “information sharing” aspect of partial pooling comes into play.

The frequency model for \(N_{i_{[jklm]}}\) follows similarly to the formulation for severity.

Stan Code and Model Evaluation

functions {

/* This is just the observed loglikelihood of the IG, with additive constants
removed */

    real IG_lpdf (real x, real mu, real lambda) {
    
        real lprob;
        
        if (!(x > 0)) {
            reject("x must be strictly greater than 0");
        } 
        
        if (!(mu > 0)) {
            reject("mu must be strictly greater than 0");
        } 
        
        if (!(lambda > 0)) {
            reject("lambda must be strictly greater than 0");
        }   
        
        lprob = 0.5 * log(lambda) - (lambda * (x-mu)^2/(2 * mu^2 * x));
        return lprob;
    }
}

data {
  int<lower=1> Nobs;
  int<lower=1> Nvar;
  int<lower=1> N_lev_km;
  int<lower=1> N_lev_zone;
  int<lower=1> N_lev_bonus;
  int<lower=1> N_lev_make;
  int X[Nobs, Nvar];
  int<lower=0> claim_count[Nobs];
  vector<lower=0>[Nobs] exposure;
  vector<lower=0>[Nobs] severity;
}

parameters {

  real<lower=0> dispersion_severity;
  real<lower=0> dispersion_frequency;
  
  /* Frequency random effects mu and sigma */
  real mu_freq;
  real<lower=0> freq_sigma_km;
  real<lower=0> freq_sigma_zone;
  real<lower=0> freq_sigma_bonus;
  real<lower=0> freq_sigma_make;
  
  vector[N_lev_km] freq_eta_km;
  vector[N_lev_zone] freq_eta_zone;
  vector[N_lev_bonus] freq_eta_bonus;
  vector[N_lev_make] freq_eta_make;
  
  /* Severity random effects mu and sigma */
  real mu_sev;
  real<lower=0> sev_sigma_km;
  real<lower=0> sev_sigma_zone;
  real<lower=0> sev_sigma_bonus;
  real<lower=0> sev_sigma_make;
  
  vector[N_lev_km] sev_eta_km;
  vector[N_lev_zone] sev_eta_zone;
  vector[N_lev_bonus] sev_eta_bonus;
  vector[N_lev_make] sev_eta_make;
  
}

transformed parameters {

    /* Frequency random effects */
    vector[N_lev_km] freq_re_km;
    vector[N_lev_zone] freq_re_zone;
    vector[N_lev_bonus] freq_re_bonus;
    vector[N_lev_make] freq_re_make;
    
    /* Severity random effects */
    vector[N_lev_km] sev_re_km;
    vector[N_lev_zone] sev_re_zone;
    vector[N_lev_bonus] sev_re_bonus;
    vector[N_lev_make] sev_re_make;
    
    freq_re_km = freq_sigma_km * freq_eta_km;
    freq_re_zone = freq_sigma_zone * freq_eta_zone;
    freq_re_bonus = freq_sigma_bonus * freq_eta_bonus;
    freq_re_make = freq_sigma_make * freq_eta_make;

    sev_re_km = sev_sigma_km * sev_eta_km;
    sev_re_zone = sev_sigma_zone * sev_eta_zone;
    sev_re_bonus = sev_sigma_bonus * sev_eta_bonus;
    sev_re_make = sev_sigma_make * sev_eta_make;    
    
}

model {

    /* Global intercepts broadcast to length-Nobs vectors (plus the log-exposure
       offset for frequency) */
    vector[Nobs] mu_nb = mu_freq + rep_vector(0, Nobs) + log(exposure);
    vector[Nobs] mu_ig = mu_sev + rep_vector(0, Nobs);
    
    /* Add the crossed random effects by indexing each row's rating levels */
    mu_nb += freq_re_km[X[, 1]] + freq_re_zone[X[, 2]] + freq_re_bonus[X[, 3]] + freq_re_make[X[, 4]];
    mu_ig += sev_re_km[X[, 1]] + sev_re_zone[X[, 2]] + sev_re_bonus[X[, 3]] + sev_re_make[X[, 4]];
  
  mu_freq ~ normal(1.6, 2.5);
  mu_sev ~ normal(8, 2.5);
  dispersion_severity ~ cauchy(0, 5);
  dispersion_frequency ~ cauchy(0, 12);
  
  /* Eta parameters: all must be N(0, 1)  */
  freq_eta_km ~ std_normal();
  freq_eta_zone ~ std_normal();
  freq_eta_bonus ~ std_normal();
  freq_eta_make ~ std_normal();
  
  sev_eta_km ~ std_normal();
  sev_eta_zone ~ std_normal();
  sev_eta_bonus ~ std_normal();
  sev_eta_make ~ std_normal();
  
  /* Variation between levels for each rating variable */
  /* Default weakly informative again */
  freq_sigma_km ~ student_t(3, 0, 2.5);
  freq_sigma_zone ~ student_t(3, 0, 2.5);
  freq_sigma_bonus ~ student_t(3, 0, 2.5);
  freq_sigma_make ~ student_t(3, 0, 2.5);
  
  sev_sigma_km ~ student_t(3, 0, 2.5);
  sev_sigma_zone ~ student_t(3, 0, 2.5);
  sev_sigma_bonus ~ student_t(3, 0, 2.5);
  sev_sigma_make ~ student_t(3, 0, 2.5);
  
  claim_count ~ neg_binomial_2_log(
    mu_nb, dispersion_frequency
  );
  
  for (i in 1:Nobs) {
  
    if (claim_count[i] != 0) {
      target += IG_lpdf(
        severity[i] | exp(mu_ig[i]), 
        claim_count[i] / dispersion_severity
        );
    } 
  }
  
}

We have to redefine the data list here since the random effects model is parameterized differently.

data_re <- list(
    Nobs = nrow(data_with_severity),
    Nvar = 4, 
    N_lev_km = length(unique(data_with_severity$kilometres)),
    N_lev_zone = length(unique(data_with_severity$zone)),
    N_lev_bonus = length(unique(data_with_severity$bonus)),
    N_lev_make = length(unique(data_with_severity$make)),
    X = data_with_severity[, 1:4],
    claim_count = data_with_severity$claim_count,
    exposure = data_with_severity$vehicle_exposure_years,
    severity = data_with_severity$severity
)

Sampling from the posterior again:

if (refit == TRUE) {

    fit_re <- sampling(
            sm_random_effects,
            data = data_re,
            chains = 6,
            iter = 2000,
            control = list(adapt_delta = 0.98),
            seed = 200350623,
            cores = 6,
            verbose = TRUE
            )
    
    saveRDS(fit_re, file = "rds/fitted_re.rds")
    
} else {
    
    fit_re <- readRDS("rds/fitted_re.rds")
    
}

Extracting the needed samples to simulate from the posterior predictive:

re_samples <- rstan::extract(
    fit_re,
    pars = c(
        "dispersion_severity",
        "dispersion_frequency",
        "mu_freq",
        "mu_sev",
        "freq_sigma_km",
        "freq_sigma_zone",
        "freq_sigma_make",
        "freq_sigma_bonus",
        "sev_sigma_km",
        "sev_sigma_zone",
        "sev_sigma_make",
        "sev_sigma_bonus",
        "freq_re_km",
        "freq_re_zone",
        "freq_re_bonus",
        "freq_re_make",
        "sev_re_km",
        "sev_re_zone",
        "sev_re_bonus",
        "sev_re_make"
        )
    )

Computing the required posterior parameters for the frequency distribution, and then generating posterior predictive samples:

nb_mu <- matrix(
    data = re_samples$mu_freq, nrow = length(re_samples[[1]]), ncol = nrow(data_with_severity)
    ) + 
    matrix(
        data = log(data_with_severity$vehicle_exposure_years),
        nrow = length(re_samples[[1]]),
        ncol = nrow(data_with_severity),
        byrow = TRUE
    ) + 
    re_samples$freq_re_km[, data_with_severity$kilometres] +
    re_samples$freq_re_zone[, data_with_severity$zone] +
    re_samples$freq_re_bonus[, data_with_severity$bonus] + 
    re_samples$freq_re_make[, data_with_severity$make]

nb_disp <- matrix(
    data = re_samples$dispersion_frequency,
    nrow = length(re_samples[[1]]),
    ncol = nrow(data_with_severity)
)

set.seed(200350623)
claim_count_posterior_pred_nb <- rnbinom(
    n = length(nb_mu),
    mu = exp(nb_mu),
    size = nb_disp
    )

dim(claim_count_posterior_pred_nb) <- dim(nb_mu)
ig_mu <- matrix(
    data = re_samples$mu_sev, nrow = length(re_samples[[1]]), ncol = nrow(data_with_severity)
) +
    re_samples$sev_re_km[, data_with_severity$kilometres] +
    re_samples$sev_re_zone[, data_with_severity$zone] +
    re_samples$sev_re_bonus[, data_with_severity$bonus] + 
    re_samples$sev_re_make[, data_with_severity$make]

ig_lambda <- claim_count_posterior_pred_nb / 
    matrix(
        data = re_samples$dispersion_severity,
        nrow = length(re_samples[[1]]),
        ncol = nrow(data_with_severity)
        )

set.seed(200350623)
severity_posterior_predictive <- rinvgauss(
    n = length(ig_lambda),
    mean = exp(ig_mu),
    shape = ig_lambda
)

dim(severity_posterior_predictive) <- dim(ig_lambda)
posterior_predictive_pp <- claim_count_posterior_pred_nb / 
    matrix(
        data = data_with_severity$vehicle_exposure_years,
        nrow = length(re_samples[[1]]),
        ncol = nrow(data_with_severity),
        byrow = TRUE
    ) * severity_posterior_predictive

Finally:

results <- mapply(
    posterior_pred_check,
    posterior_predictive_samples = list(
            claim_count_posterior_pred_nb,
            severity_posterior_predictive,
            posterior_predictive_pp
            ),
    y = list(
            data_with_severity$claim_count,
            data_with_severity$severity,
            data_with_severity$observed_pp
            ),
    variable = c(
            "claims (sqrt scaled)",
            "severity (sqrt scaled)",
            "pure premium (sqrt scaled)"
            ),
    limits = list(c(0, 700), c(0, 40000), c(0, 4000)),
    MoreArgs = list(n = 50),
    SIMPLIFY = FALSE
    )
(results[[1]] | results[[2]]) / results[[3]] + plot_annotation(
    title = "Posterior Predictive Check - Random Effects Model"
)

Overall, we see some improvement, particularly in low-density regions of the pure premium distribution. The predictive variability is noticeably wider than in the previous model in certain regions of the pure premium density, while high-density regions are left essentially unchanged from the unpooled model.

This isn’t really that surprising. The rating combinations that produce the most common pure premium values are backed by many exposures and claims, which leads to small within-group variation (this is formulated explicitly in both generative models) and therefore little pooling.
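
As a rough, supplementary check of this claim (a sketch, not part of the original analysis), we can compare the width of the 90% posterior predictive interval for pure premium between the rating combination with the largest exposure and the one with the smallest:

# Width of each rating combination's 90% posterior predictive pure premium interval
interval_width <- apply(
    X = posterior_predictive_pp,
    MARGIN = 2,
    FUN = function(x) diff(quantile(x, probs = c(0.05, 0.95)))
)

# The highest-exposure combination should have a much narrower interval
# than the lowest-exposure one
interval_width[which.max(data_with_severity$vehicle_exposure_years)]
interval_width[which.min(data_with_severity$vehicle_exposure_years)]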

Relativities

Assuming that the hierarchical Bayesian model above is sufficient, we can calculate relativities by using the simulated pure premiums from the model above.

As mentioned previously, a base class (the class against which all other rating combinations are compared) is required. In the data, this is row 252 (kilometres = 1, zone = 4, bonus = 7, make = 9), as this rating combination has the largest number of exposures.

We can now calculate posterior predictive distributions for the relativities for each rating variable.

base_class_row <- which(
    data$vehicle_exposure_years == max(data$vehicle_exposure_years)
    )

find_rels <- function(column_indices, posterior_pred_pp, base_class_row) {
    
    # Subset the posterior predictive pure premium samples to the requested columns
    posterior_premium <- posterior_pred_pp[, column_indices]
    
    relativities <- posterior_premium /
        posterior_premium[, which(column_indices == base_class_row)]
    
    return(relativities)
}

graph_rels <- function(rels, rv_name) {
    
    base_level <- which(sapply(X = rels, FUN = function(x) all(x == 1)) == TRUE)
    
    error_bars <- tibble(
        Level = as.character(1:ncol(rels)),
        median = apply(rels, MARGIN = 2, FUN = function(x) median(x)),
        lower = apply(rels, MARGIN = 2, FUN = function(x) quantile(x, 0.05)),
        upper = apply(rels, MARGIN = 2, FUN = function(x) quantile(x, 0.95))
    ) %>% 
        filter(Level != as.character(base_level))
    
    graph <- rels %>%
        gather(key = "Level", value = "Relativity") %>%
        filter(Level != as.character(base_level)) %>%
    ggplot(data = ., aes(x = Level, y = Relativity)) + 
        geom_violin() +
        theme_bw() +
        labs(title = paste(rv_name, "Relativity Densities")) +
        geom_errorbar(
            data = error_bars,
            mapping = aes(x = Level, ymin = lower, ymax = upper), 
            width = 0.2,
            inherit.aes = FALSE
            ) +
        coord_flip() +
        stat_summary(fun = median, geom = "point", size = 2, color = "red") +
        geom_text(
            data = error_bars,
            mapping = aes(x = Level, y = median, label = round(median, 2)),
            vjust = -0.5,
            size = 4,
            inherit.aes = FALSE
        )
    
    return(graph)
    
}

create_tables <- function(rels, rv_name) {
    
    table <- tibble(
        temp = 1:ncol(rels),
        `Estimated Relativity (median)` = apply(
            X = rels, MARGIN = 2, FUN = function(x) median(x)
            ),
        `Lower 90 Credible Interval` = apply(
            X = rels, MARGIN = 2, FUN = function(x) quantile(x, 0.05)
            ),
        `Upper 90 Credible Interval` = apply(
            X = rels, MARGIN = 2, FUN = function(x) quantile(x, 0.95)
            ),
        `Standard Deviation` = apply(
            X = rels, MARGIN = 2, FUN = function(x) sd(x)
            )
    )
    
    colnames(table)[1] <- rv_name
    base_level <- which(sapply(X = rels, FUN = function(x) all(x == 1)) == TRUE)
    
    table %>%
        kable(digits = 3) %>%
        kable_styling() %>%
        row_spec(base_level, bold = T, color = "white", background = "teal") %>%
        return()
    
}

Kilometres

# Get the column indices of the needed pure premium distributions
desired_km_columns <- which(
    data$zone == 4 & data$bonus == 7 & data$make == 9
)

km_rels <- as_tibble(find_rels(
    column_indices = desired_km_columns,
    posterior_pred_pp = posterior_predictive_pp,
    base_class_row = base_class_row
    )) 

colnames(km_rels) <- 1:ncol(km_rels)

We can plot the posterior distribution of relativities for each non-base rating level:

graph_rels(km_rels, rv_name = "Kilometres")

The above plot in table format, with the base class row highlighted in teal:

create_tables(km_rels, "Kilometres")
Kilometres Estimated Relativity (median) Lower 90 Credible Interval Upper 90 Credible Interval Standard Deviation
1 1.000 1.000 1.000 0.000
2 1.229 0.957 1.591 0.193
3 1.338 1.034 1.735 0.214
4 1.458 1.112 1.911 0.244
5 1.719 1.293 2.247 0.293

In this case, the estimated relativities (the median of each posterior predictive distribution) follow a monotonic pattern, so no adjustments would likely be needed.
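
One way to quantify the monotonicity claim (a quick sketch using the samples already computed) is the posterior predictive probability that the relativities strictly increase across kilometres levels:

# Proportion of posterior predictive draws in which the kilometres relativities
# strictly increase from level 1 through level 5
mean(apply(km_rels, MARGIN = 1, FUN = function(x) all(diff(x) > 0)))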

The 90% credible intervals describe the range in which the relativity of a future policyholder at a given kilometres level will fall with 90% probability, holding all other rating variables (bonus, make, and zone) constant. In other words, these are prediction intervals.
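
For instance, the interval for the highest kilometres level can be read directly off the posterior predictive samples (output not reproduced here):

# 90% prediction interval for the level 5 kilometres relativity
quantile(km_rels[[5]], probs = c(0.05, 0.95))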

The point is to emphasize that while our single “best prediction” might be 1, 1.229, 1.338, 1.458, and 1.719 for the five rating levels, sets of values such as 1, 1.3, 1.6, 1.8, 2 could reasonably be observed as well. Another possible set is 1, 0.97, 1.4, 1.3, 1.5. These combinations are less likely, but still plausible given the data.
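
One way to see such alternative sets (a small sketch, reusing the document's seed) is to look at a few joint posterior predictive draws of the entire relativity vector, where each row is one internally consistent set of relativities:

# Three joint posterior predictive draws of the kilometres relativities
set.seed(200350623)
km_rels[sample(nrow(km_rels), size = 3), ]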

In fact, we can estimate the probability of any future scenario involving the relativities.

Some examples:

  • The probability that a future level 2 policyholder will have a higher relativity than at least one future level 3, 4, or 5 policyholder (holding all other rating variables constant):
mean(apply(
    X = km_rels[, 2:5],
    MARGIN = 1,
    FUN = min
    ) != km_rels[, 2])
## [1] 0.3578333
  • The probability that a future level 2 policyholder has the largest relativity out of a single group of four future policyholders with levels 2, 3, 4, and 5 (holding all other rating variables constant):
mean(apply(
    X = km_rels[, 2:5],
    MARGIN = 1,
    FUN = function(x) max(x)) == km_rels[, 2]
    )
## [1] 0.009
  • The probability that a future level 3 policyholder exceeds the relativity of a future level 2 policyholder by at least 0.1 (or any arbitrary constant, and again holding all other rating variables constant):
mean(km_rels[, 3] - km_rels[, 2] > 0.1)
## [1] 0.517

…and so on

Zone

# Get the column indices of the needed pure premium distributions
desired_zone_columns <- which(
    data$kilometres == 1 & data$bonus == 7 & data$make == 9
)

zone_rels <- as_tibble(find_rels(
    column_indices = desired_zone_columns,
    posterior_pred_pp = posterior_predictive_pp,
    base_class_row = base_class_row
    )) 

colnames(zone_rels) <- 1:ncol(zone_rels)
graph_rels(zone_rels, rv_name = "Zone")

The above plot in a table:

create_tables(zone_rels, "Zone")
Zone Estimated Relativity (median) Lower 90 Credible Interval Upper 90 Credible Interval Standard Deviation
1 1.531 1.187 1.984 0.244
2 1.251 0.967 1.631 0.200
3 1.094 0.844 1.417 0.176
4 1.000 1.000 1.000 0.000
5 1.144 0.851 1.529 0.206
6 1.049 0.791 1.379 0.179
7 0.760 0.466 1.194 0.233

Bonus

# Get the column indices of the needed pure premium distributions
desired_bonus_columns <- which(
    data$kilometres == 1 & data$zone == 4 & data$make == 9
)

bonus_rels <- as_tibble(find_rels(
    column_indices = desired_bonus_columns,
    posterior_pred_pp = posterior_predictive_pp,
    base_class_row = base_class_row
    )) 

colnames(bonus_rels) <- 1:ncol(bonus_rels)
graph_rels(bonus_rels, rv_name = "Bonus")

The above plot in a table:

create_tables(bonus_rels, "Bonus")
Bonus Estimated Relativity (median) Lower 90 Credible Interval Upper 90 Credible Interval Standard Deviation
1 3.365 2.604 4.373 0.547
2 2.249 1.702 2.966 0.387
3 1.812 1.349 2.415 0.327
4 1.562 1.138 2.125 0.306
5 1.387 1.011 1.881 0.266
6 1.326 0.989 1.757 0.236
7 1.000 1.000 1.000 0.000

Make

# Get the column indices of the needed pure premium distributions
desired_make_columns <- which(
    data$kilometres == 1 & data$zone == 4 & data$bonus == 7
)

make_rels <- as_tibble(find_rels(
    column_indices = desired_make_columns,
    posterior_pred_pp = posterior_predictive_pp,
    base_class_row = base_class_row
    )) 

colnames(make_rels) <- 1:ncol(make_rels)
graph_rels(make_rels, rv_name = "Make")

The above plot in a table:

create_tables(make_rels, "Make")
Make Estimated Relativity (median) Lower 90 Credible Interval Upper 90 Credible Interval Standard Deviation
1 1.152 0.852 1.542 0.215
2 1.171 0.678 1.937 0.391
3 0.949 0.500 1.686 0.370
4 0.499 0.314 0.753 0.134
5 1.230 0.800 1.803 0.307
6 0.768 0.532 1.088 0.169
7 0.942 0.577 1.471 0.273
8 1.180 0.443 2.696 0.728
9 1.000 1.000 1.000 0.000

Credible Intervals for Pure Premiums

If the actual predicted pure premiums are desired instead of relativities, we can derive these easily as well:

tibble(
    kilometres = data$kilometres,
    zone = data$zone,
    bonus = data$bonus,
    make = data$make,
    mean_pred = apply(
        X = posterior_predictive_pp, MARGIN = 2, FUN = function(x) mean(x)
        ),
    median_pred = apply(
        X = posterior_predictive_pp, MARGIN = 2, FUN = function(x) median(x)
        ),
    lower = apply(
        X = posterior_predictive_pp, MARGIN = 2, FUN = function(x) quantile(x, 0.05)
        ),
    upper = apply(
        X = posterior_predictive_pp, MARGIN = 2, FUN = function(x) quantile(x, 0.95)
        )
) %>%
    datatable(
        rownames = FALSE,
        colnames = c(
            "Kilometres",
            "Zone",
            "Bonus",
            "Make",
            "Predicted Pure Premium (mean)",
            "Predicted Pure Premium (median)",
            "Lower 90% Credible Interval",
            "Upper 90% Credible Interval"
        ),
        filter = list(position = 'top', clear = FALSE),
        options = list(
            scrollY = 300, height = "700px", scrollX = 300)
        ) %>%
    formatRound(columns = 5:8, digits = 0) 

As with the relativities, the 90% credible intervals describe a range in which the future pure premium of any policyholder with a specific set of rating variables will lie with 90% probability. For example: for a policyholder with kilometres = 1, zone = 4, bonus = 2, and make = 9, there is a 90% chance that their future pure premium will fall in the interval [228, 345].
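
As a sketch (assuming, as in the relativity code above, that the columns of posterior_predictive_pp line up with the rows of data), the interval for that example combination can be pulled out directly:

# 90% posterior predictive interval for kilometres = 1, zone = 4, bonus = 2, make = 9
example_col <- which(
    data$kilometres == 1 & data$zone == 4 & data$bonus == 2 & data$make == 9
)
quantile(posterior_predictive_pp[, example_col], probs = c(0.05, 0.95))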

Note that some of the rating combinations have credible intervals of [0, 0]. This is not an error: these rating combinations have close to a 100% probability of observing no claims. In particular, at least 95% of the time there will be 0 claims and therefore a pure premium of 0. That being said, the posterior means reveal that the model does not assign zero probability to the event that the pure premium is greater than 0. Indeed, some of these posterior distributions exhibit significant right skewness.

The table below provides the probability that the predicted pure premium is greater than 0 for the rating combinations that have 90% credible intervals of [0, 0]:

related_0_medians_lower <- which(
    apply(
        X = posterior_predictive_pp, MARGIN = 2, FUN = function(x) quantile(x, 0.05)
        ) == 0 
)

related_0_medians_upper <- which(
    apply(
        X = posterior_predictive_pp, MARGIN = 2, FUN = function(x) quantile(x, 0.95)
        ) == 0 
)

related_0_medians <- intersect(related_0_medians_lower, related_0_medians_upper)

tibble(
    kilometres = data$kilometres[related_0_medians],
    zone = data$zone[related_0_medians],
    bonus = data$bonus[related_0_medians],
    make = data$make[related_0_medians],
    median_pred = apply(
        X = posterior_predictive_pp[, related_0_medians],
        MARGIN = 2,
        FUN = function(x) median(x)
        ),
    prob_gt_0 = apply(
        X = posterior_predictive_pp[, related_0_medians],
        MARGIN = 2,
        FUN = function(x) mean(x > 0)
        )
) %>%
    datatable(
        rownames = FALSE,
        colnames = c(
            "Kilometres",
            "Zone",
            "Bonus",
            "Make",
            "Predicted Pure Premium (median)",
            "Pr(pure premium > 0)"
        ),
        filter = list(position = 'top', clear = FALSE),
        options = list(scrollY = 300, searching = FALSE, scrollX = 300)) %>%
    formatRound(columns = 5, digits = 3) %>%
    formatRound(columns = 6, digits = 5)

The last column illustrates the posterior predictive probability that the future pure premium of a policyholder exceeds 0 for a particular set of rating combinations. Notice how all of these probabilities are greater than 0.

By taking the complement of these probabilities, 1 - Pr(pure premium > 0), we get the probability that the pure premium is exactly 0. Notice that every Pr(pure premium > 0) is less than 0.05, the tail cutoff used for the credible intervals. That is, for the 51 rating combinations above, there is less than a 0.05 probability of observing a non-zero pure premium, and, equivalently, more than a 0.95 probability of a pure premium of exactly 0.
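
A minimal sketch of that complement calculation, reusing the objects defined above (output not reproduced here):

# Probability of a pure premium of exactly 0 for the flagged rating combinations
prob_eq_0 <- 1 - apply(
    X = posterior_predictive_pp[, related_0_medians],
    MARGIN = 2,
    FUN = function(x) mean(x > 0)
)

# Each of these should be at least 0.95, consistent with the [0, 0] intervals
all(prob_eq_0 >= 0.95)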

Appendix: Model Convergence Checking

For the sake of brevity, only the convergence of the most complicated model (the hierarchical Bayesian model) is illustrated. The other models converged and appear to have mixed well, based on the same checks done in this section. This is not surprising, since all of the models except the hierarchical Bayesian model are relatively simple in terms of parameterization.

Checking the number of effective samples and the Gelman-Rubin Rhat

Autocorrelation within the MCMC chains increases the variance of Monte Carlo estimates, so inferences drawn from highly autocorrelated chains may be unreliable. The effective sample size corrects for this by discounting the nominal number of posterior samples according to the amount of autocorrelation exhibited (or, in the case of negative autocorrelation, increasing it).
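
A common formulation of the effective sample size (shown here for reference; the estimator rstan actually uses differs in how the autocorrelation sum is truncated) is

\[ N_{\text{eff}} = \frac{MN}{1 + 2\sum_{t=1}^{\infty}\hat{\rho}_t}, \]

where \(M\) is the number of chains, \(N\) the number of post-warmup draws per chain, and \(\hat{\rho}_t\) the estimated autocorrelation at lag \(t\).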

Burkner (2017) recommends effective sample sizes of at least 1000 to obtain stable inferences.

Viewing the summary statistics:

table_check_n_eff <- as_tibble(summary(fit_re)$summary[, 9:10]) %>%
    bind_cols(parameter = rownames(summary(fit_re)$summary), .) %>%
    arrange(n_eff)

table_check_n_eff %>%
    datatable(
        rownames = FALSE,
        colnames = c(
            "Parameter",
            "Effective Sample Size",
            "Rhat"
        ),
        options = list(scrollY = 300, scrollX = 300)) %>%
    formatRound(columns = 2, digits = 0) %>%
    formatRound(columns = 3, digits = 4)

As we can see here, the minimum effective sample size obtained is 1285, which is larger than 1000. Therefore, the Monte Carlo estimates are likely precise enough.

Gelman-Rubin’s Rhat assesses convergence by comparing the variance between and within chains. By the law of total variance, the marginal posterior variance of a parameter can be estimated as a weighted average of the average within-chain variance and the between-chain variance of the chain means.

If all of the chains have converged to the same distribution, the between-chain variation should be negligible, and therefore the ratio of this weighted-average estimator to the average within-chain variance should be approximately 1.
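
In symbols, a sketch of the usual statistic (ignoring the chain-splitting refinement) is

\[ \hat{R} = \sqrt{\frac{\widehat{\text{var}}^{+}(\theta)}{W}}, \qquad \widehat{\text{var}}^{+}(\theta) = \frac{N-1}{N}\,W + \frac{1}{N}\,B, \]

where \(W\) is the average within-chain variance and \(B\) is \(N\) times the variance of the chain means.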

A common recommendation is that all parameters have Rhat values less than 1.01. As we can see here, the largest Rhat we observe is 1.0041183, which does not exceed 1.01. This is good evidence that all of the separate Markov chains have converged to the same posterior distribution and that there are no convergence issues.
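
Both of the extremes quoted above can be pulled directly from the fit summary (a quick check, reusing fit_re):

# Minimum effective sample size and maximum Rhat across all parameters
fit_summary <- summary(fit_re)$summary
min(fit_summary[, "n_eff"], na.rm = TRUE)
max(fit_summary[, "Rhat"], na.rm = TRUE)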

Traceplots

We can further verify that each separate chain has converged to the same posterior distribution, even though the chains were initialized at different starting points.

Similar to the Rhat metric, the traceplots show whether the separate chains have converged to the same distribution. Well-mixed chains are often described as “fuzzy caterpillars”: each chain should largely overlap the others and wander randomly about a constant level (the posterior mean). If the chains differ from one another, or if there is a clear systematic drift away from that level, it is very likely that some (or all) of the chains have not sampled from the true posterior distribution.

Graphing the traceplots for the parameters that generate our predicted pure premiums (and therefore, relativities) below:

traceplot(
    fit_re, 
    pars = c(
        "dispersion_frequency",
        "dispersion_severity",
        "mu_freq",
        "freq_sigma_km"
        )
)

traceplot(
    fit_re, 
    pars = c(
        "freq_sigma_zone",
        "freq_sigma_bonus",
        "freq_sigma_make",
        "mu_sev",
        "sev_sigma_km",
        "sev_sigma_zone",
        "sev_sigma_bonus",
        "sev_sigma_make"
        )
)

traceplot(
    fit_re, 
    pars = c(
        "freq_re_km",
        "freq_re_zone"
        )
)

traceplot(
    fit_re, 
    pars = c(
        "freq_re_bonus",
        "freq_re_make"
        )
)

traceplot(
    fit_re, 
    pars = c(
        "sev_re_km",
        "sev_re_zone"
        )
)

traceplot(
    fit_re, 
    pars = c(
        "sev_re_bonus",
        "sev_re_make"
        )
)

The traceplots look ideal.

In conclusion, there are likely no convergence issues.