Getting Model Fits

Sometimes our likelihood function involves calculating a series of intermediate values. A good example is when we have a model that describes our fit to the data, which may be combined with an error term to capture random noise around this prediction. Once the MCMC has finished, we may want to extract this model fit so that we can see how good it really is. Having already written down the code for calculating the model fit once within the likelihood, we don’t want to have to write it down again when visualising results as this duplication of work is not only a waste of time but also can introduce errors.

This vignette demonstrates how intermediate values can be obtained from MCMC output using a single likelihood function.

A simple model of population growth

For this example we will use some data on population growth and we will attempt to fit a simple curve through these points. The data can be loaded directly from within the drjacoby package, and takes the form of a simple data.frame of sampling times and population sizes:

# load example data
data("population_data")
head(population_data)
#>   pop time
#> 1  14   33
#> 2  26   37
#> 3  36   41
#> 4  30   43
#> 5  37   44
#> 6  57   48

# plot points
plot(population_data$time, population_data$pop, ylim = c(0, 120), pch = 20,
     xlab = "Time (days)", ylab = "Population size")

For our model we will assume a discrete-time logistic growth model, in which the population starts at size N0 and grows at a rate r each timestep, but also experiences density-dependent death causing it to plateau out at a maximum population size Nmax. If we use Nt to denote the population size at time t then we can write this model as follows:

$$ \begin{align} N_{t+1} = rN_t(1 - N_t / K), \hspace{10mm} \mbox{where } K = N_{\mbox{max}}\left(\frac{r}{r-1}\right). \end{align} $$

There are three parameters in this model, and so our parameters dataframe should contain the three variables with suitable ranges as follows:

# define parameters dataframe
df_params <- define_params(name = "N_0", min = 1, max = 100,
                           name = "r", min = 1, max = 2,
                           name = "N_max", min = 50, max = 200)

Within our likelihood function we need to calculate our expected population growth curve. This curve is perfectly smooth, but the real data is noisy so we will use a Poisson distribution to link the data to the model.

Crucially, when defining the likelihood function we will create an option to return the model prediction, rather than the log-likelihood. We use the misc input to the function to control whether this option is active or not. If we set misc$output_type = 1 then we will obtain the log-likelihood, but if we set misc$output_type = 2 then we will obtain the complete curve of model-predicted values:

# define log-likelihood function
loglike <- function(params, data, misc) {
  
  # extract parameter values
  N_0 <- params["N_0"]
  r <- params["r"]
  N_max <- params["N_max"]
  
  # calculate expected population growth
  K <- N_max * r / (r - 1)
  x <- rep(0, 100)
  x[1] <- N_0
  for (i in 2:length(x)) {
    x[i] <- r * x[i-1] * (1 - x[i-1] / K)
  }
  
  # option to return model prediction rather than log-likelihood
  if (misc$output_type == 2) {
    return(x)
  }
  
  # calculate log-likelihood
  ret <- sum(dpois(data$pop, lambda = x[data$time], log = TRUE))
  
  # return
  return(ret)
}

# define R logprior function
logprior <- function(params, misc) {
  dunif(params["N_0"], 1, 100, log = TRUE) +
    dunif(params["r"], 1, 2, log = TRUE) +
    dunif(params["N_max"], 50, 200, log = TRUE)
}

When running the MCMC we want to make sure misc$output_type = 1:

# run MCMC
mcmc <- run_mcmc(data = population_data,
                 df_params = df_params,
                 misc = list(output_type = 1),
                 loglike = loglike,
                 logprior = logprior,
                 burnin = 1e3,
                 samples = 1e4,
                 chains = 3,
                 pb_markdown = TRUE)
#> MCMC chain 1
#> burn-in
#>   |                                                                              |======================================================================| 100%
#> acceptance rate: 42.5%
#> sampling phase
#>   |                                                                              |======================================================================| 100%
#> acceptance rate: 43.9%
#> chain completed in 2.880754 seconds
#> MCMC chain 2
#> burn-in
#>   |                                                                              |======================================================================| 100%
#> acceptance rate: 43.1%
#> sampling phase
#>   |                                                                              |======================================================================| 100%
#> acceptance rate: 44%
#> chain completed in 2.877564 seconds
#> MCMC chain 3
#> burn-in
#>   |                                                                              |======================================================================| 100%
#> acceptance rate: 42.9%
#> sampling phase
#>   |                                                                              |======================================================================| 100%
#> acceptance rate: 44%
#> chain completed in 2.979498 seconds
#> total MCMC run-time: 8.74 seconds

Once we have MCMC output we should perform our usual checks to make sure that everything has converged nicely (we will skip this step here in the interest of time). Assuming everything is OK, we can move on to exploring model fits. We will do this by running the same likelihood function but with misc$output_type = 2, using the posterior parameter draws as inputs. We may not want to use every posterior parameter draw, in which case we can use the sample_chains() function to sub-sample the output down as needed:

# sample from posterior
param_draws <- sample_chains(mcmc, 1000)

# get matrix of model predictions
model_matrix <- apply(param_draws, 1, function(x) {
  loglike(params = x, data = population_data, misc = list(output_type = 2))
})

Plotting these against the data we can see that - reassuringly - the posterior model fits make a smooth curve through the data points:

matplot(model_matrix, type = 'l', lty = 1, col = "#99000010", ylim = c(0, 120),
        xlab = "Time (days)", ylab = "Population size")
points(population_data$time, population_data$pop, pch = 20)

Notice that we only had to define the model fitting computation once in this pipeline, as the same likelihood function was used for both inference and obtaining the fit. There are more complicated versions of this approach that may be useful in some settings, for example using switch functions or multiple if statements to allow for several different types of output from the likelihood function, or even varying aspects of the core model such as the error distribution using flags contained in misc.