--- title: "Introduction to dust" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to dust} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( error = FALSE, collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5) lang_output <- function(x, lang) { cat(c(sprintf("```%s", lang), x, "```"), sep = "\n") } cc_output <- function(x) lang_output(x, "cc") r_output <- function(x) lang_output(x, "r") plain_output <- function(x) lang_output(x, "plain") ``` Stochastic models can be used in statistical inference via methods such as [particle filtering](https://en.wikipedia.org/wiki/Particle_filter) but in practice doing so requires that the models can be run over and over again very quickly. While R has excellent support for sampling from distributions, it is not necessarily well suited for this sort of problem because R is single threaded, so we are forced to evaluate realisations of the stochastic process in series (one after another) rather than in parallel. The `dust` package provides tools to help write stochastic models that can be evaluated in parallel. It does not directly provide statistical methods; see the [mcstate](https://mrc-ide.github.io/mcstate/) package for that. Instead, it focuses on providing: * a fast random number generator that can be evaluated in parallel (see `vignette("rng")` for details) * a way of wrapping a user-provided model, itself written as a C++ class * a lightweight interface that drives this model from R * a set of useful primitives for developing sequential Monte Carlo methods. ## A simple example - random walk Consider a unbiased random walk; at each time step we move our position with a draw from a normal distribution with mean 0 and some standard deviation. We consider a single individual moving but will eventually simulate a family of these individuals, each independent. All simulations have a sense of time - a unitless measure of time "step" will be used but it's up to you how it is interpreted (time is a non-negative integer, implemented using `size_t`). ### Model code To implement this model in dust we write out a C++ file that looks like: ```{r walk_code, echo = FALSE, results = "asis"} cc_output(readLines(dust:::dust_file("examples/walk.cpp"))) ``` There are two main parts here; the class (`walk`) and an interface function (`dust::dust_pars`). The class is what does most of the work. It is important that this does not use anything in the R API as it may be called in parallel. Therefore use only C++ standard library types. It has to provide every public type or method as the example above. First, the five types: * the type `real_type` has to exist and indicate what sort of numeric type you use for "real numbers". Typically this will be `double`, but if you need a different type (typically `float`) you can use it here. * the type `data_type` describes data that the model may be compared with as it runs. Here we use `dust::no_data` as the model has no such data but see `vignette("data")` for the interface here. * the type `shared_type` is whatever read-only data your model needs that is shared across particles. Typically these reflect *parameters* of the model (see `dust::dust_pars` below), and while these will typically be shared across all particles that is is not always the case (see `vignette("multi")`) * the type `internal_type` is whatever internal data or space your model needs to run. This is read-write and per-particle (in contrast with `shared_type` above) * the type `rng_state_type` is one of the valid RNG types. Typically `using rng_state_type = dust::random::generator;` will select a reasonable choice, but you could also force a specific generator such as `dust::random::xoshiro256starstar` here (see `vignette("rng")`) **The constructor** must take only one argument, being `const dust::pars_type&`. You can do what you want in the constructor - here we just copy `shared` (containing parameters) into the object with the `pars` argument (we use the private member `shared` here but this can be anything you want, though the type is important). The `dust::pars_type<>` template is a wrapper that contains two fields `shared` (as `dust::shared_ptr`) and `internal` (as `internal_type`). **The method `size()`** must return the "size" of the system - how many state variables it has. Here we're considering a single individual moving randomly so the size is always 1. Many models will have a size that depends on their parameters, though the size cannot be changed after construction. There are other forms of the `dust::pars_type` constructor which would allow specifying the internal data; typically this is used to create scratch space (allocating vectors as needed) rather than compute values because the scratch space and model state can become separated from each other (by direct setting of state, or by shuffling model state between particles). ### Constructing a model The function `dust::dust` will "compile" a model for us to use: ```r path_walk <- system.file("examples/walk.cpp", package = "dust") walk <- dust::dust(path_walk) ``` However, this is also bundled into the package and can be loaded with: ```{r walk_load} walk <- dust::dust_example("walk") ``` The object itself is an "R6ClassGenerator" object: ```{r walk_object} walk ``` Create an instance of the model using `walk$new`. There are three required arguments: * `pars`: passed to `dust_pars` to initialise the model * `time`: the initial time (0 seems like a sensible choice here given our model has no internal sense of time) * `n_particles`: the number of particles to create ```{r walk_create} model <- walk$new(list(sd = 1), 0, 20) model ``` This returns an R6 object that can be used to simulate from or interact with the model. For example, our initial model state is ```{r walk_initial_state} model$state() ``` Here there is one row per model state variable (there is only one here) and one column per particle (there are 20) and we can run the model for 100 time steps, which returns the state at the end of the walk (and not at any intermediate times): ```{r walk_run_1} model$run(100) ``` We can also directly retrieve the state from our object ```{r walk_state_1} model$state() ``` At this point our particles have been run for 100 time steps with standard deviation 1 at each step so they [will be distributed following Normal(0, 10)](https://en.wikipedia.org/wiki/Random_walk#Gaussian_random_walk). This is easier to see if we simulate a lot of particles, here 20,000: ```{r walk_simulate_many} model <- walk$new(list(sd = 1), 0, 20000) invisible(model$run(100)) hist(model$state(), freq = FALSE, las = 1, col = "steelblue2", main = "", ylim = c(0., 0.04), xlab = "State") curve(dnorm(x, 0, 10), col = "orange", add = TRUE, lwd = 2) ``` ### Running a model in parallel The approach above still runs everything in serial, one particle after another. We can configure this system to run in parallel by providing the extra argument `n_threads` to the constructor. Provided that your system can compile with openmp the following code will execute in parallel using 2 threads ```{r, walk_parallel} model <- walk$new(list(sd = 1), 0, 20, n_threads = 2) model$run(100) ``` Running this same code in series will give the same results: ```{r, walk_serial} model <- walk$new(list(sd = 1), 0, 20, n_threads = 1) model$run(100) ``` We use as many random number generators as there are particles, so if you run fewer particles or more, increase the threads or decrease, the results will be the same (see `vignette("design")` for more on this). You should be careful when selecting the number of threads. `dust` will never use more than one thread at a time without it being requested, but avoid using `parallel::detectCores()` to work out how many threads you have available as it will often return an overestimate. This is particularly the case in a shared-use system such as a cluster or CRAN's servers. We provide a helper function `dust::dust_openmp_threads` which can be used to try and find a safe number of threads available to you while respecting various environment variables which seek to control this (`MC_CORES`, `OMP_THREAD_LIMIT`, etc). For example, ```{r} dust::dust_openmp_threads(100, "fix") ``` ## A more interesting example Consider now an SIR model (Susceptible - Infected - Recovered). This sort of model is common in epidemiology, and is often extended to add additional compartments (e.g., SEIR which adds an Exposed compartment) or by structuring each compartment based on properties such as age. Here, we show a simple example with just 3 compartments: ```{r sir_code, echo = FALSE, results = "asis"} cc_output(readLines(dust:::dust_file("examples/sir.cpp"))) ``` There a few changes here compared to the version that was shown before for the walk model, but the core parts are only slightly different: * the `initial` method is slightly more complicated as we initialise a 3-element vector * our `shared_type` object has significantly more elements * the `dust_pars` function is much more complicated * there is a `compare_data` method and a nontrivial `data_type` definition, but ignore these for now (see `vignette("data")`) The other difference is a new template specialisation of `dust_info` - this is optional but can be used to report back to R information about the model based on the input parameters. Here it returns information about variable order and core parameters. We also have added C++ pseudo-attributes with `[[dust::param]]` to describe the parameters that this function takes. This is optional, but allows use of a `coef` method with the class (see below). As before, we'll use the version bundled with the package ```{r sir_compile} sir <- dust::dust_example("sir") ``` To get information about the parameters, use `coef()` on the generator ```{r sir_coef} coef(sir) ``` The model is initialised the same way as before: ```{r sir_create} model <- sir$new(list(), 0, 20) ``` We can use the `$info()` method to retrieve information about our model: ```{r sir_info} model$info() ``` and get its state as before: ```{r sir_initial_state} model$state() ``` Because we have 5 states per particle, this is a 5 x 20 matrix. Suppose that as we run the model we mostly want information on the "I" compartment. So we'll run the model for a bit and retrieve the number of infected individuals, then continue, etc. To do this we use the `$set_index()` method to indicate which state elements should be returned after using `$run()`: ```{r sir_set_index} model$set_index(2L) ``` (you can pass any integer vector here, of any length, provided all the indices lie between 1 and the length of your state vector) Now, when using `run`, the number of infected individuals is returned ```{r sir_run} model$run(10) model$run(20) ``` This is useful when you have many compartments or variables that you are not that interested in during running the model. For a particle filter you might be fitting to the sum of all infected individuals over a number of compartments, so rather than returning hundreds of values back (times hundreds of particles) you can return back much less data and keep things nice and fast. If the index vector is named, then those names will be used as row names on the returned object: ```{r sir_run_named} model$set_index(c(I = 2L)) model$run(30) ``` We can always use `$state()` to get the whole state vector: ```{r sir_state} model$state() ``` or select only variables of interest: ```{r sir_state_select} model$state(c(S = 1L, R = 3L)) ``` Again, this copies names from the index if they are present. In order to run the simulation beginning-to-end, we use the `$simulate` method on a dust object, which runs over a set of time steps and records the state at each. ```{r sir_run_collect} model <- sir$new(list(), 0, 200) model$set_index(2L) times <- seq(0, 600, by = 5) state <- model$simulate(times) ``` The output here is a 1 x 200 x 121 matrix (n state x n particles x n times) ```{r sir_run_dim} dim(state) ``` we need to drop the first dimension and transpose the others for ease of plotting: ```{r sir_transform} traces <- t(drop(state)) ``` Plotting this over time (with 4 time steps per day - see the sir code above) ```{r sir_average} day <- times / 4 matplot(day, traces, type = "l", lty = 1, col = "#00000022", xlab = "Day", ylab = "Number infected (I)") lines(day, rowMeans(traces), col = "red", lwd = 2) ``` ## Other methods There are a few other methods on the dust objects that may be useful. ### Reordering particles This method exists to support particle filtering, and allows resampling or reordering of particles. ```{r reorder_setup} model <- walk$new(list(sd = 1), 0, 20) model$run(1) ``` Suppose that we wanted to reorder these particles so that they were in decreasing order: ```{r reorder_index} index <- order(model$state()) index ``` We then pass this `index` to the reorder method: ```{r reorder_apply} model$reorder(index) model$state() ``` We can then continue our random walk. There is no need to sample every particle and particles can appear multiple times in the sample, but the total number must be conserved. Suppose that we want to to sample particles based on how close they are to 0: ```{r reorder_weighted_index} p <- dnorm(model$state()) index <- sample(length(p), replace = TRUE , prob = p) index ``` We can then apply this sampling: ```{r reorder_weighted_apply} model$reorder(index) model$state() ``` This is not terribly useful on its own but is a key part of a particle filter. When this reordering happens, only the model state is copied around; the `internal` data and random number state are left behind. ### Set particle state A particle state is determined by three mutable things; `pars`, `state` and `time`; these can all be updated for a model after it has been created. We have found setting one or more of these at a time important; * Resetting the model with a new set of parameters (`pars`), initial conditions (`state`) and times (`time`) * Changing `pars` at some point in the simulation to introduce some new aspect of the model * Changing `state` to manually move around some individuals within a model * Setting `time` along with `state` when initialising the model from a previously saved state The `update_state` method allows setting any or all of these components. By default every particle starts from the initial condition specified by your model classes `initial()` method. However, you can specify a state directly using the `$update_state()` method. Here, we initialise our SIR model with only 1 infected individual rather than 10: ```{r update_state} model <- sir$new(list(), 0, 20) model$update_state(state = c(1000, 1, 0, 0, 0)) model$state() ``` Now, when we run the model, far more of the epidemics fail to take off as the infected individual goes disappears before infecting anyone. ```{r} times <- seq(0, 600, by = 5) state <- model$simulate(times) day <- times / 4 matplot(day, t(state[2, , ]), type = "l", lty = 1, col = "#00000022", xlab = "Day", ylab = "Number infected (I)") ``` You can optionally set the initial time along with the state. This is useful if your model depends on time (e.g., you use the time step in a calculation by transforming it into some more meaningful measure of time). You can also set the initial state to a range of different values. Suppose we set the initial number of infections to be Poisson distributed with a mean of 10, we might write: ```{r} I0 <- rpois(20, 10) state0 <- rbind(1010 - I0, I0, 0, 0, 0, deparse.level = 0) model$update_state(state = state0, time = 0L) model$time() model$state() ``` ### Reset the model One particularly common case is to "reset" the model in order to run it again. you should not simply recreate the model from its constructor as that will re-seed a new set of random state -- it would be preferable continue with the generator state in your old model. Suppose we run our model with sd of 1 for 100 time steps ```{r} model <- walk$new(list(sd = 1), 0, 10, seed = 1L) y1 <- model$run(100) ``` we then use `reset` to set new parameters into the model and set the time back to zero and can run again ```{r} model$update_state(pars = list(sd = 2), time = 0) y2 <- model$run(100) ``` The state created in `y2` will have started from our new starting point and time zero, but have used the same random number state through both simulations, which is generally what we want. ## Use within a package You should not use `dust::dust()` within a package, because that would cause the model to compile each time you use it, rather than when the package builds. It may also cause issues when trying to use the model in parallel (e.g., with the `parallel` package). Instead you should use `dust::dust_package()` which will generate appropriate code for you. To use dust in a package, put your dust models in `inst/dust` and run `dust::dust_package()` on the package's root directory. ```{r pkg_setup, include = FALSE} desc <- c( "Package: example", "Title: Example Dust in a Package", "Version: 0.0.1", "LinkingTo: cpp11, dust", "Authors@R: c(person('A', 'Person', role = c('aut', 'cre')", " email = 'person@example.com'))", "License: CC0") ns <- "useDynLib('example', .registration = TRUE)" path <- tempfile() dir.create(path) dir.create(file.path(path, "inst/dust"), FALSE, TRUE) writeLines(desc, file.path(path, "DESCRIPTION")) writeLines(ns, file.path(path, "NAMESPACE")) path_ex <- system.file("examples", package = "dust", mustWork = TRUE) file.copy(file.path(path_ex, "walk.cpp"), file.path(path, "inst/dust")) ``` A skeleton package might contain: ```{r pkg_tree, echo = FALSE} withr::with_dir(path, fs::dir_tree()) ``` This is the normal R package skeleton, though missing R and src directories (for now). The DESCRIPTION file contains ```{r pkg_desc, echo = FALSE, results = "asis"} plain_output(readLines(file.path(path, "DESCRIPTION"))) ``` The important things here are: * the package name (`Package`). We're using `example`, and names with a dot may not work as expected * including [`cpp11`](https://github.com/r-lib/cpp11) and `dust` in `LinkingTo`, which allows package compilation to find their respective header files * a `useDynLib` call to your package in the `NAMESPACE` file Our `NAMESPACE` file contains: ```{r pkg_ns, echo = FALSE, results = "asis"} plain_output(readLines(file.path(path, "NAMESPACE"))) ``` The files in `inst/dust` are the same files as seen above, with `walk.cpp` containing ```{r pkg_walk, echo = FALSE, results = "asis"} cc_output(readLines(file.path(path, "inst/dust/walk.cpp"))) ``` There can be as many of these files as you want within the directory `inst/dust`. To prepare the package, run `dust::dust_package()`: ```{r pkg_generate} dust::dust_package(path) ``` The directory structure now has more files: ```{r pkg_tree_after, echo = FALSE} withr::with_dir(path, fs::dir_tree()) ``` The file `src/walk.cpp` are generated by dust and should not be edited. They include your model, but also a bit of helper code: ```{r pkg_walk_c, echo = FALSE, results = "asis"} cc_output(readLines(file.path(path, "src/walk.cpp"))) ``` The file `R/dust.R` contains the R interface generated by dust with the constructor objects (all models' constructors will be collected into this file, which also should not be edited). ```{r pkg_dust_r, echo = FALSE, results = "asis"} r_output(readLines(file.path(path, "R/dust.R"))) ``` Finally, `R/cpp11.R` and `src/cpp11.cpp` are files created by cpp11 that should not be edited. Your package can include as much R code as you want, and can be developed like any other R package. But any time you change the code in `inst/dust` you should rerun `dust::dust_package()`.