Parallelisation of inference

In the SIR model vignette we ran four separate MCMC chains in serial. For more computationally intensive models you can speed this up if you have many CPU cores available.

Within-model parallelism

Because mcstate uses dust, it can run each particle in parallel between the time steps for which you have data. For this to be possible, OpenMP must be enabled at compilation, which you can check by using the has_openmp() method on your model:

model$public_methods$has_openmp()
#> [1] TRUE
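
If this returns FALSE, dust can report more detail on the OpenMP support it detected; this sketch assumes a version of dust that exports dust_openmp_support():

# Report details of the OpenMP support detected for this installation
dust::dust_openmp_support()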

When constructing the particle filter object, specify the n_threads argument; for example, to use 4 threads:

filter <- mcstate::particle_filter$new(data = data,
                                       model = model,
                                       n_particles = 100,
                                       compare = compare,
                                       n_threads = 4)

When you pass this filter object through to mcstate::pmcmc, the particle filter will then run using 4 threads.
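
For example (a sketch: pars and control here stand for the pmcmc parameters and control objects constructed as in the SIR model vignette):

# 'pars' and 'control' as constructed in the SIR model vignette
samples <- mcstate::pmcmc(pars, filter, control = control)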

You can wrap this n_threads argument with dust::dust_openmp_threads, for example

dust::dust_openmp_threads(100, action = "fix")
#> Requested number of threads '100' exceeds a limit of '1'
#>  See dust::dust_openmp_threads() for details
#> [1] 1

which will adjust the number of threads used (alternative actions include “message” or “error”).
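
So, to request 4 threads but fall back gracefully on systems that support fewer, you might construct the filter as (a sketch reusing the objects from above):

filter <- mcstate::particle_filter$new(data = data,
                                       model = model,
                                       n_particles = 100,
                                       compare = compare,
                                       n_threads = dust::dust_openmp_threads(4, action = "fix"))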

Between-chain parallelism

If your model is quick to run, like the SIR example, it may be more efficient to use your cores to run chains in parallel rather than particles. This is because relatively little time is spent in the parallelised fraction of the code and more in the serial parts (the MCMC bookkeeping, the compare function within the particle filter, copying data between the R and C++ interfaces; essentially anything other than the number crunching in the core of the model).

We can do this by passing the n_workers argument to mcstate::pmcmc_control; this will create separate worker processes and share chains out across them. You can have fewer workers than chains, and each worker can use more than one core if needed (subject to a few constraints documented in the help page).

In this case, the number of threads used to create the particle filter is ignored (because the particle filter will be created on each process) and the total number of threads to use should be provided by the n_threads_total argument to mcstate::pmcmc_control.

For example, if you had 16 cores available and were running 4 chains over 2 workers, each worker using 8 threads, you might write:

control <- mcstate::pmcmc_control(1000, n_chains = 4, n_workers = 2,
                                  n_threads_total = 16)

or, if your system does not support OpenMP, you could use 4 workers for these chains:

control <- mcstate::pmcmc_control(1000, n_chains = 4, n_workers = 4,
                                  n_threads_total = 16)

If your system does support OpenMP, then once chains start finishing, their threads will be reallocated to the chains still running.

The random numbers are configured so that the final result depends only on the seed used to create your filter object, and not on the number of worker processes or threads used. However, by default the algorithm differs slightly from that used without workers. To use this parallel behaviour even when n_workers is 1, set use_parallel_seed = TRUE in mcstate::pmcmc_control.
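
For example, a control object that uses the parallel seeding algorithm even though only one worker process is requested:

control <- mcstate::pmcmc_control(1000, n_chains = 4, n_workers = 1,
                                  use_parallel_seed = TRUE)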

Considerations

Parallelism at the level of particles (increasing the number of threads available to the particle filter) has very low overhead and is potentially very efficient, as the “serial” part of the calculation here is just the comparison function. On Linux we see linear scaling of performance with cores; i.e., if you double the number of cores you halve the running time.

However, on Windows we do not realise this level of efficiency (for reasons under investigation). In that case you will want to use multiple worker processes; these are essentially isolated from each other and on Windows can achieve higher efficiency. You should only need 2-4 worker processes to use 100% of your CPU, with the total number of threads set to the number of cores available. This approach may take a few seconds longer over the whole run than a perfectly efficient parallel run, but potentially significantly less time than a run limited by the thread scaling we observe on Windows.
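
For example, a sketch of such a configuration, using parallel::detectCores() (from base R's parallel package; it may over-report cores on systems with hyperthreading) to set the thread total:

# 4 isolated worker processes, sharing all available cores between them
cores <- parallel::detectCores()
control <- mcstate::pmcmc_control(1000, n_chains = 4, n_workers = 4,
                                  n_threads_total = cores)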