With the core approach in dust, you can run models in parallel
efficiently up to the number of cores your workstation has available.
Getting more than 32 or 64 cores is hard though, and dust
provides no multi-node parallelism (e.g., MPI).
Instead, we have developed a system for running dust
models
on GPUs (graphical processing units), specifically NVIDIA GPUs via CUDA
(Compute Unified Device Architecture).
This vignette is written in reverse-order, starting with how to run a
model on a GPU, before covering how to write a dust
model
that can be run on a GPU. The reason for this is that we do not expect
that people will write these directly; instead we expect that most
people will use the odin.dust
package to generate these interfaces automatically, without having to
write a single line of C++ or CUDA code.
real_type = double
we want to get the same results.The sirs
model includes GPU support, and will be the
focus of this vignette. However, the installed version cannot be run
directly on the GPU for a couple of reasons:
float
) mode rather than double precision
(double
), and changing that requires re-compilationIn addition, you will also need:
You can check with the command-line tool nvidia-smi
if
you have suitable hardware and drivers and with
dust::dust_cuda_configuration(quiet = FALSE, forget = TRUE)
if you have suitable build tools.
So here, rather than using dust::dust_example
, we
compile the model and pass in arguments:
path <- system.file("examples/sirs.cpp", package = "dust")
sirs <- dust::dust(path, gpu = TRUE, real_type = "float")
#> ℹ 34 functions decorated with [[cpp11::register]]
#> ✔ generated file 'cpp11.R'
#> ✔ generated file 'cpp11.cpp'
#> Re-compiling sirs817bc717gpu
#> ─ installing *source* package ‘sirs817bc717gpu’ ...
#> ** using staged installation
#> ** libs
#> make[1]: Entering directory '/tmp/RtmppiOue1/fileb32bb676504e5/src'
#> make[1]: Leaving directory '/tmp/RtmppiOue1/fileb32bb676504e5/src'
#> make[1]: Entering directory '/tmp/RtmppiOue1/fileb32bb676504e5/src'
#> g++ -std=gnu++14 -I"/usr/share/R/include" -DNDEBUG -I'/home/rfitzjoh/lib/R/library/cpp11/include' -g -Wall -Wextra -pedantic -Wmaybe-uninitialized -Wno-unused-parameter -Wno-cast-function-type -Wno-missing-field-initializers -O2 -I/home/rfitzjoh/lib/R/library/dust/include -DHAVE_INLINE -fopenmp -fpic -g -O2 -fdebug-prefix-map=/build/r-base-QwogzP/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c cpp11.cpp -o cpp11.o
#> nvcc -std=c++14 -O2 -I. -I/usr/share/R/include -I/home/rfitzjoh/lib/R/library/dust/include -I/home/rfitzjoh/.local/share/R/dust/cub -I'/home/rfitzjoh/lib/R/library/cpp11/include' -gencode=arch=compute_75,code=sm_75 -Xcompiler -fPIC -Xcompiler -fopenmp -x cu -c dust.cu -o dust.o
#> g++ -std=gnu++14 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o sirs817bc717gpu.so cpp11.o dust.o -lcudart -fopenmp -L/usr/lib/R/lib -lR
#> make[1]: Leaving directory '/tmp/RtmppiOue1/fileb32bb676504e5/src'
#> installing to /tmp/RtmppiOue1/devtools_install_b32bb69aa2d4d/00LOCK-fileb32bb676504e5/00new/sirs817bc717gpu/libs
#> ** checking absolute paths in shared objects and dynamic libraries
#> ─ DONE (sirs817bc717gpu)
#>
#> ℹ Loading sirs817bc717gpu
Notice in compilation that nvcc
is used to compile the
model, rather than g++
or clang++
. The
additional option -gencode=arch=compute_XX,code=sm_XX
was
added by dust
and will include the CUDA compute version
supported by the graphics cards found on the current system. You can use
dust::dust_cuda_options
to set additional options, passing
in the return value for the gpu
argument above.
Once compiled with GPU support, the static method
has_gpu_support
will report TRUE
:
and the static method gpu_info
will report on the GPU
devices available:
sirs$public_methods$gpu_info()
#> $has_cuda
#> [1] TRUE
#>
#> $cuda_version
#> [1] '10.1.243'
#>
#> $devices
#> id name memory version
#> 1 0 GeForce RTX 2080 Ti 11016.31 75
If you have more than one GPU, the id
in the
devices
section will be useful for targeting the correct
device.
The object is initialised as usual but with slight differences:
gpu_config
argument needs to be provided to
indicate which GPU device we are running on. Minimally this is an
integer indicating the device that you want to use (on this machine the
only option is 0
), but you can also provide a list with
elements device_id
and run_block_size
.pars <- list()
n_particles <- 8192
model_gpu <- sirs$new(pars, 0, n_particles, gpu_config = 0L, seed = 1L)
Once initialised, a model can only be run on either the GPU or CPU, so we’ll create a CPU version here for comparison:
By leaving gpu_config
as NULL
we indicate
that the model should run on the CPU.
Once created, the uses_gpu
method indicates if the model
is set up to run on the GPU (rather than CPU):
Running the model on the CPU is fairly slow as this is a lot of particles and we’re not running in parallel:
Running the model on the GPU however:
This is much faster! However, ~8,000 particles is unlikely to saturate a modern GPU and (overhead-aside) this will run about as quickly for potentially a hundred thousand particles. For example running 2^17 (131,072) particles only takes a little longer
model_large <- sirs$new(list(), 0, 2^17, gpu_config = 0L, seed = 1L)
(t_large <- system.time(model_large$run(400)))
#> user system elapsed
#> 0.033 0.000 0.033
This is heaps faster, the GPU model ran in 7.3% of the time as the CPU model but simulated 16 times as many particles (i.e., 219 times as fast per particle). With the relatively low times here, much of this time is just moving the data around, and with over a hundred thousand particles this is nontrivial. Of course, doing anything quickly with all these particles is its own problem.
All methods will automatically run on the GPU; this includes
run
, simulate
, compare_data
and
filter
. The last two are typically used from the mcstate
interface.
The sirs model from above:
class sirs {
public:
using real_type = double;
using internal_type = dust::no_internal;
using rng_state_type = dust::random::generator<real_type>;
// __align__(16) is required before the data_type definition when using NVCC
// This is so when loaded into shared memory it is aligned correctly
struct __align__(16) data_type {
double incidence;
};
struct shared_type {
real_type S0;
real_type I0;
real_type R0;
real_type alpha;
real_type beta;
real_type gamma;
real_type dt;
size_t freq;
real_type exp_noise;
};
sirs(const dust::pars_type<sirs>& pars): shared(pars.shared) {
}
size_t size() {
return 4;
}
std::vector<real_type> initial(size_t time) {
std::vector<real_type> state(4);
state[0] = shared->S0;
state[1] = shared->I0;
state[2] = shared->R0;
state[3] = 0;
return state;
}
void update(size_t time, const real_type * state, rng_state_type& rng_state,
real_type * state_next) {
real_type S = state[0];
real_type I = state[1];
real_type R = state[2];
real_type N = S + I + R;
real_type p_SI = 1 - exp(- shared->beta * I / N);
real_type p_IR = 1 - exp(-(shared->gamma));
real_type p_RS = 1 - exp(- shared->alpha);
real_type dt = shared->dt;
real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);
state_next[0] = S - n_SI + n_RS;
state_next[1] = I + n_SI - n_IR;
state_next[2] = R + n_IR - n_RS;
state_next[3] = (time % shared->freq == 0) ? n_SI : state[3] + n_SI;
}
real_type compare_data(const real_type * state, const data_type& data,
rng_state_type& rng_state) {
const real_type incidence_modelled = state[3];
const real_type incidence_observed = data.incidence;
const real_type lambda = incidence_modelled +
dust::random::exponential<real_type>(rng_state, shared->exp_noise);
return dust::density::poisson(incidence_observed, lambda, true);
}
private:
dust::shared_ptr<sirs> shared;
};
// Helper function for accepting values with defaults
inline double with_default(double default_value, cpp11::sexp value) {
return value == R_NilValue ? default_value : cpp11::as_cpp<double>(value);
}
namespace dust {
template <>
dust::pars_type<sirs> dust_pars<sirs>(cpp11::list pars) {
// Initial state values
sirs::real_type I0 = 10.0;
sirs::real_type S0 = 1000.0;
sirs::real_type R0 = 0.0;
// Time scaling
// [[dust::param(freq, required = FALSE, default = 1)]]
size_t freq = std::max(1.0, with_default(1.0, pars["freq"]));
sirs::real_type dt = 1 / static_cast<sirs::real_type>(freq);
sirs::real_type exp_noise = 1e6;
// [[dust::param(alpha, required = FALSE, default = 0.1)]]
sirs::real_type alpha = with_default(0.1, pars["alpha"]);
// [[dust::param(beta, required = FALSE, default = 0.2)]]
sirs::real_type beta = with_default(0.2, pars["beta"]);
// [[dust::param(gamma, required = FALSE, default = 0.1)]]
sirs::real_type gamma = with_default(0.1, pars["gamma"]);
sirs::shared_type shared{S0, I0, R0, alpha, beta, gamma, dt, freq, exp_noise};
return dust::pars_type<sirs>(shared);
}
template <>
sirs::data_type dust_data<sirs>(cpp11::list data) {
return sirs::data_type{cpp11::as_cpp<double>(data["incidence"])};
}
namespace gpu {
template <>
size_t shared_int_size<sirs>(dust::shared_ptr<sirs> shared) {
return 1;
}
template <>
size_t shared_real_size<sirs>(dust::shared_ptr<sirs> shared) {
return 5;
}
template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
int * shared_int,
sirs::real_type * shared_real) {
using real_type = sirs::real_type;
using dust::gpu::shared_copy_data;
shared_real = shared_copy_data<real_type>(shared_real, shared->alpha);
shared_real = shared_copy_data<real_type>(shared_real, shared->beta);
shared_real = shared_copy_data<real_type>(shared_real, shared->gamma);
shared_real = shared_copy_data<real_type>(shared_real, shared->dt);
shared_real = shared_copy_data<real_type>(shared_real, shared->exp_noise);
shared_int = shared_copy_data<int>(shared_int, shared->freq);
}
template <>
__device__
void update_gpu<sirs>(size_t time,
const dust::gpu::interleaved<sirs::real_type> state,
dust::gpu::interleaved<int> internal_int,
dust::gpu::interleaved<sirs::real_type> internal_real,
const int * shared_int,
const sirs::real_type * shared_real,
sirs::rng_state_type& rng_state,
dust::gpu::interleaved<sirs::real_type> state_next) {
using dust::random::binomial;
using real_type = sirs::real_type;
const real_type alpha = shared_real[0];
const real_type beta = shared_real[1];
const real_type gamma = shared_real[2];
const real_type dt = shared_real[3];
const int freq = shared_int[0];
const real_type S = state[0];
const real_type I = state[1];
const real_type R = state[2];
const real_type N = S + I + R;
const real_type p_SI = 1 - exp(- beta * I / N);
const real_type p_IR = 1 - exp(- gamma);
const real_type p_RS = 1 - exp(- alpha);
const real_type n_SI = binomial<real_type>(rng_state, S, p_SI * dt);
const real_type n_IR = binomial<real_type>(rng_state, I, p_IR * dt);
const real_type n_RS = binomial<real_type>(rng_state, R, p_RS * dt);
state_next[0] = S - n_SI + n_RS;
state_next[1] = I + n_SI - n_IR;
state_next[2] = R + n_IR - n_RS;
state_next[3] = (time % freq == 0) ? n_SI : state[3] + n_SI;
}
template <>
__device__
sirs::real_type compare_gpu<sirs>(const dust::gpu::interleaved<sirs::real_type> state,
const sirs::data_type& data,
dust::gpu::interleaved<int> internal_int,
dust::gpu::interleaved<sirs::real_type> internal_real,
const int * shared_int,
const sirs::real_type * shared_real,
sirs::rng_state_type& rng_state) {
using real_type = sirs::real_type;
const real_type exp_noise = shared_real[4];
const real_type incidence_modelled = state[3];
const real_type incidence_observed = data.incidence;
const real_type lambda = incidence_modelled +
dust::random::exponential<real_type>(rng_state, exp_noise);
return dust::density::poisson(incidence_observed, lambda, true);
}
}
}
This is somewhat more complicated than the models described in
vignette("dust.Rmd")
. There are several important
components required to run on the GPU.
Within the dust::gpu
namespace, we declare the size of
the shared parameters for the model. These are parameters that
will be the same across all instances of a parameter set, as opposed to
quantities that change between particles.
namespace dust {
namespace gpu {
template <>
size_t shared_int_size<sirs>(dust::shared_ptr<sirs> shared) {
return 1;
}
template <>
size_t shared_real_size<sirs>(dust::shared_ptr<sirs> shared) {
return 5;
}
}
}
These can be omitted if your model has no such parameters.
The next definition makes the above a bit clearer, it defines the method for copying these parameters
namespace dust {
namespace gpu {
template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
int * shared_int,
sirs::real_type * shared_real) {
using real_type = sirs::real_type;
using dust::gpu::shared_copy_data;
shared_real = shared_copy_data<real_type>(shared_real, shared->alpha);
shared_real = shared_copy_data<real_type>(shared_real, shared->beta);
shared_real = shared_copy_data<real_type>(shared_real, shared->gamma);
shared_real = shared_copy_data<real_type>(shared_real, shared->dt);
shared_real = shared_copy_data<real_type>(shared_real, shared->exp_noise);
shared_int = shared_copy_data<int>(shared_int, shared->freq);
}
}
}
In the CPU version of the model we have a nice smart pointer to a
struct (dust::shared_ptr<sirs>
) from which we can
access parameters by name (e.g., shared->alpha
). No such
niceties in CUDA where we need access to a single contiguous block of
memory. The dust::shared_copy_data
is a small utility to
make the bookkeeping here a bit easier, but this could have been written
out as:
namespace dust {
template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
int * shared_int,
sirs::real_type * shared_real) {
using real_type = sirs::real_type;
shared_real[0] = shared->alpha;
shared_real[1] = shared->beta;
shared_real[2] = shared->gamma;
shared_real[3] = shared->dt;
shared_real[4] = shared->exp_noise;
shared_int[0] = shared->freq;
}
}
The dust::shared_copy_data
template has specialisations
where the object being copied is a vector.
There are two methods that are not used here, but could be included,
to define the size of per-particle internal storage space. This is
required if your model needs to store intermediate calculations during
the update
time if those will not fit on the stack.
namespace dust {
namespace gpu {
template <>
size_t dust::internal_real_size<sirs>(dust::shared_ptr<sirs> shared) {
return 0;
}
template <>
size_t dust::internal_int_size<sirs>(dust::shared_ptr<sirs> shared) {
return 0;
}
}
}
Most interestingly we have the update_gpu
method that
actually does the update on the GPU
template <>
__device__
void update_gpu<sirs>(size_t time,
const dust::gpu::interleaved<sirs::real_type> state,
dust::gpu::interleaved<int> internal_int,
dust::gpu::interleaved<sirs::real_type> internal_real,
const int * shared_int,
const sirs::real_type * shared_real,
sirs::rng_state_type& rng_state,
dust::gpu::interleaved<sirs::real_type> state_next) {
using real_type = sirs::real_type;
const real_type alpha = shared_real[0];
const real_type beta = shared_real[1];
const real_type gamma = shared_real[2];
const real_type dt = shared_real[3];
const int freq = shared_int[0];
const real_type S = state[0];
const real_type I = state[1];
const real_type R = state[2];
const real_type N = S + I + R;
const real_type p_SI = 1 - exp(- beta * I / N);
const real_type p_IR = 1 - exp(- gamma);
const real_type p_RS = 1 - exp(- alpha);
const real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
const real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
const real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);
state_next[0] = S - n_SI + n_RS;
state_next[1] = I + n_SI - n_IR;
state_next[2] = R + n_IR - n_RS;
state_next[3] = (time % freq == 0) ? n_SI : state[3] + n_SI;
}
Note that this totally duplicates the logic from the CPU version
(which was a method of the sirs
object)
void update(size_t time, const real_type * state, rng_state_type& rng_state,
real_type * state_next) {
real_type S = state[0];
real_type I = state[1];
real_type R = state[2];
real_type N = S + I + R;
real_type p_SI = 1 - exp(- shared->beta * I / N);
real_type p_IR = 1 - exp(-(shared->gamma));
real_type p_RS = 1 - exp(- shared->alpha);
real_type dt = shared->dt;
real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);
state_next[0] = S - n_SI + n_RS;
state_next[1] = I + n_SI - n_IR;
state_next[2] = R + n_IR - n_RS;
state_next[3] = (time % shared->freq == 0) ? n_SI : state[3] + n_SI;
}
Differences include:
internal_int
, internal_real
,
shared_int
and shared_real
)dust::gpu::interleaved<>
type, which prevents slow
uncoalesced reads from global memory on the GPUshared_int
and
shared_real
elements are now by position in the array,
rather than the previous pointer/name based access__device__
annotation, which compiles the function
for use on the GPUFinally, if running a particle filter on the GPU, a version of the
compare_data
function is required that can run on the
GPU:
template <>
__device__
sirs::real_type compare_gpu<sirs>(const dust::gpu::interleaved<sirs::real_type> state,
const sirs::data_type& data,
dust::gpu::interleaved<int> internal_int,
dust::gpu::interleaved<sirs::real_type> internal_real,
const int * shared_int,
const sirs::real_type * shared_real,
sirs::rng_state_type& rng_state) {
using real_type = sirs::real_type;
const real_type exp_noise = shared_real[4];
const real_type incidence_modelled = state[3];
const real_type incidence_observed = data.incidence;
const real_type lambda = incidence_modelled +
dust::random::exponential<real_type>(rng_state, exp_noise);
return dust::density::poisson(incidence_observed, lambda, true);
}
This is very similar to the CPU version
real_type compare_data(const real_type * state, const data_type& data,
rng_state_type& rng_state) {
const real_type incidence_modelled = state[3];
const real_type incidence_observed = data.incidence;
const real_type lambda = incidence_modelled +
dust::random::exponential<real_type>(rng_state, shared->exp_noise);
return dust::density::poisson(incidence_observed, lambda, true);
}
with similar differences to the update function:
__device__
annotationDebugging on a GPU is a pain, especially because there are typically
many particles, and error recovery is not straightforward. In addition,
most continuous integration systems do not provide GPUs, so testing your
GPU code becomes difficult. To make this easier, dust
allows running GPU code on the CPU - this will be typically slower than
the CPU code, but allows easier debugging and verification that the
model is behaving. We use this extensively in dust
’s tests
and also in models built using dust
that will run on the
GPU.
To do this, compile the model with your preferred real type, but set
the gpu
argument to FALSE
sirs_cpu <- dust::dust(path, gpu = FALSE, real_type = "float")
#> ℹ 34 functions decorated with [[cpp11::register]]
#> ✔ generated file 'cpp11.R'
#> ✔ generated file 'cpp11.cpp'
#> Re-compiling sirs817bc717
#> ─ installing *source* package ‘sirs817bc717’ ...
#> ** using staged installation
#> ** libs
#> make[1]: Entering directory '/tmp/RtmppiOue1/fileb32bb452d008e/src'
#> make[1]: Leaving directory '/tmp/RtmppiOue1/fileb32bb452d008e/src'
#> make[1]: Entering directory '/tmp/RtmppiOue1/fileb32bb452d008e/src'
#> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I'/home/rfitzjoh/lib/R/library/cpp11/include' -g -Wall -Wextra -pedantic -Wmaybe-uninitialized -Wno-unused-parameter -Wno-cast-function-type -Wno-missing-field-initializers -O2 -I/home/rfitzjoh/lib/R/library/dust/include -DHAVE_INLINE -fopenmp -fpic -g -O2 -fdebug-prefix-map=/build/r-base-QwogzP/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c cpp11.cpp -o cpp11.o
#> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I'/home/rfitzjoh/lib/R/library/cpp11/include' -g -Wall -Wextra -pedantic -Wmaybe-uninitialized -Wno-unused-parameter -Wno-cast-function-type -Wno-missing-field-initializers -O2 -I/home/rfitzjoh/lib/R/library/dust/include -DHAVE_INLINE -fopenmp -fpic -g -O2 -fdebug-prefix-map=/build/r-base-QwogzP/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c dust.cpp -o dust.o
#> g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o sirs817bc717.so cpp11.o dust.o -fopenmp -L/usr/lib/R/lib -lR
#> make[1]: Leaving directory '/tmp/RtmppiOue1/fileb32bb452d008e/src'
#> installing to /tmp/RtmppiOue1/devtools_install_b32bb3d78d7ef/00LOCK-fileb32bb452d008e/00new/sirs817bc717/libs
#> ** checking absolute paths in shared objects and dynamic libraries
#> ─ DONE (sirs817bc717)
#>
#> ℹ Loading sirs817bc717
Note that above the model is compiled with g++
, not
nvcc
. However, the “GPU” code is still compiled into the
model. We can then initialise this with and without a
gpu_config
argument
model1 <- sirs_cpu$new(pars, 0, 10, seed = 1L)
model2 <- sirs_cpu$new(pars, 0, 10, gpu_config = 0L, seed = 1L)
And run the models using the “GPU” code and the normal CPU code, but this time both on the CPU
model1$run(10)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 981 987 977 986 994 948 958 961 976 968
#> [2,] 23 16 26 10 7 48 37 31 21 28
#> [3,] 6 7 7 14 9 14 15 18 13 14
#> [4,] 5 5 5 2 0 11 6 5 3 6
model2$run(10)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 981 987 977 986 994 948 958 961 976 968
#> [2,] 23 16 26 10 7 48 37 31 21 28
#> [3,] 6 7 7 14 9 14 15 18 13 14
#> [4,] 5 5 5 2 0 11 6 5 3 6
These should be exactly the same, and this can form the basis of
tests. Note that using float
rather than
double
can throw up a few issues with algorithms, so it’s
worth checking with single-precision in your tests.
If you hit problems, then R -d cuda-gdb
can work in
place of the usual R -d gdb
to work with a debugger, though
because of the huge numbers of particles you will typically work with,
debugging remains a challenge. Our usual strategy has been to try and
recreate any issue purely in CPU code and debug as normal (see this
blog post for hints on doing this effectively).