--- title: "Running models on GPUs with CUDA" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Running models on GPUs with CUDA} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- With the core approach in dust, you can run models in parallel efficiently up to the number of cores your workstation has available. Getting more than 32 or 64 cores is hard though, and `dust` provides no multi-node parallelism (e.g., [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)). Instead, we have developed a system for running `dust` models on GPUs (graphical processing units), specifically NVIDIA GPUs via CUDA (Compute Unified Device Architecture). This vignette is written in reverse-order, starting with how to run a model on a GPU, before covering how to write a `dust` model that can be run on a GPU. The reason for this is that we do not expect that people will write these directly; instead we expect that most people will use the [`odin.dust`](https://mrc-ide.github.io/odin.dust/) package to generate these interfaces automatically, without having to write a single line of C++ or CUDA code. ## Principles * A GPU model can be run on the CPU, and vice-versa. * The same random number generator sequence will be used in both cases. * Differences in results, if they exist, come from different precision/rounding between compute platforms. When using `real_type = double` we want to get the same results. * Just the model update can be defined for the GPU, and comparisons and shuffling could still happen on the CPU. Defining a comparison function is also possible, allowing a full particle filter run on a GPU. ## Running a model with GPU support The `sirs` model includes GPU support, and will be the focus of this vignette. However, the installed version cannot be run directly on the GPU for a couple of reasons: * this would complicate distribution of binaries as we'd depend on all systems having a copy of the CUDA runtime * we would have to compile support for many possible GPU architectures at once * it is very likely that we'd want to run the model in single precision (`float`) mode rather than double precision (`double`), and changing that requires re-compilation In addition, you will also need: * CUDA toolkit v10.2 or higher, v11.1 or higher preferred (compile time and run time) * CUDA capable GPU (run time) * nvidia drivers (run time) You can check with the command-line tool `nvidia-smi` if you have suitable hardware and drivers and with `dust::dust_cuda_configuration(quiet = FALSE, forget = TRUE)` if you have suitable build tools. So here, rather than using `dust::dust_example`, we compile the model and pass in arguments: ```r path <- system.file("examples/sirs.cpp", package = "dust") sirs <- dust::dust(path, gpu = TRUE, real_type = "float") #> ℹ 34 functions decorated with [[cpp11::register]] #> ✔ generated file 'cpp11.R' #> ✔ generated file 'cpp11.cpp' #> Re-compiling sirs817bc717gpu #> ─ installing *source* package ‘sirs817bc717gpu’ ... 
So here, rather than using `dust::dust_example`, we compile the model and pass in arguments:

```r
path <- system.file("examples/sirs.cpp", package = "dust")
sirs <- dust::dust(path, gpu = TRUE, real_type = "float")
#> ℹ 34 functions decorated with [[cpp11::register]]
#> ✔ generated file 'cpp11.R'
#> ✔ generated file 'cpp11.cpp'
#> Re-compiling sirs817bc717gpu
#> ─  installing *source* package ‘sirs817bc717gpu’ ...
#> ** using staged installation
#> ** libs
#> make[1]: Entering directory '/tmp/RtmppiOue1/fileb32bb676504e5/src'
#> make[1]: Leaving directory '/tmp/RtmppiOue1/fileb32bb676504e5/src'
#> make[1]: Entering directory '/tmp/RtmppiOue1/fileb32bb676504e5/src'
#> g++ -std=gnu++14 -I"/usr/share/R/include" -DNDEBUG -I'/home/rfitzjoh/lib/R/library/cpp11/include' -g -Wall -Wextra -pedantic -Wmaybe-uninitialized -Wno-unused-parameter -Wno-cast-function-type -Wno-missing-field-initializers -O2 -I/home/rfitzjoh/lib/R/library/dust/include -DHAVE_INLINE -fopenmp -fpic -g -O2 -fdebug-prefix-map=/build/r-base-QwogzP/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c cpp11.cpp -o cpp11.o
#> nvcc -std=c++14 -O2 -I. -I/usr/share/R/include -I/home/rfitzjoh/lib/R/library/dust/include -I/home/rfitzjoh/.local/share/R/dust/cub -I'/home/rfitzjoh/lib/R/library/cpp11/include' -gencode=arch=compute_75,code=sm_75 -Xcompiler -fPIC -Xcompiler -fopenmp -x cu -c dust.cu -o dust.o
#> g++ -std=gnu++14 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o sirs817bc717gpu.so cpp11.o dust.o -lcudart -fopenmp -L/usr/lib/R/lib -lR
#> make[1]: Leaving directory '/tmp/RtmppiOue1/fileb32bb676504e5/src'
#> installing to /tmp/RtmppiOue1/devtools_install_b32bb69aa2d4d/00LOCK-fileb32bb676504e5/00new/sirs817bc717gpu/libs
#> ** checking absolute paths in shared objects and dynamic libraries
#> ─  DONE (sirs817bc717gpu)
#>
#> ℹ Loading sirs817bc717gpu
```

Notice in compilation that `nvcc` is used to compile the model, rather than `g++` or `clang++`. The additional option `-gencode=arch=compute_XX,code=sm_XX` was added by `dust` and will include the CUDA compute version supported by the graphics cards found on the current system. You can use `dust::dust_cuda_options` to set additional options, passing in the return value for the `gpu` argument above (as in the earlier sketch).

Once compiled with GPU support, the static method `has_gpu_support` will report `TRUE`:

```r
sirs$public_methods$has_gpu_support()
#> [1] TRUE
```

and the static method `gpu_info` will report on the GPU devices available:

```r
sirs$public_methods$gpu_info()
#> $has_cuda
#> [1] TRUE
#>
#> $cuda_version
#> [1] '10.1.243'
#>
#> $devices
#>   id                name   memory version
#> 1  0 GeForce RTX 2080 Ti 11016.31      75
```

If you have more than one GPU, the `id` in the `devices` section will be useful for targeting the correct device.

The object is initialised as usual but with slight differences:

* you will probably want a (much) larger number of particles to take advantage of your GPU. As a rule of thumb we would suggest at least 10,000, but depending on model and card you may still see per-particle increases in compute speed as you use up to 1,000,000 particles. See below for more discussion of this.
* the `gpu_config` argument needs to be provided to indicate which GPU device we are running on. Minimally this is an integer indicating the device that you want to use (on this machine the only option is `0`), but you can also provide a list with elements `device_id` and `run_block_size` (a sketch of this follows below).

```r
pars <- list()
n_particles <- 8192
model_gpu <- sirs$new(pars, 0, n_particles, gpu_config = 0L, seed = 1L)
```

Once initialised, a model can only be run on either the GPU or CPU, so we'll create a CPU version here for comparison:

```r
model_cpu <- sirs$new(pars, 0, n_particles, seed = 1L)
```

By leaving `gpu_config` as `NULL` we indicate that the model should run on the CPU.
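For finer control, `gpu_config` accepts a list instead of a bare integer. A minimal sketch (the block size of 128 is purely illustrative; CUDA block sizes are typically multiples of 32, and tuning this is model- and card-dependent):

```r
# Select the first device explicitly and set the block size used by the
# "run" kernel
config <- list(device_id = 0L, run_block_size = 128L)
model_tuned <- sirs$new(pars, 0, n_particles, gpu_config = config, seed = 1L)
```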
Once created, the `uses_gpu` method indicates if the model is set up to run on the GPU (rather than the CPU):

```r
model_gpu$uses_gpu()
#> [1] TRUE
model_cpu$uses_gpu()
#> [1] FALSE
```

Running the model on the CPU is fairly slow, as this is a lot of particles and we're not running in parallel:

```r
(t_cpu <- system.time(model_cpu$run(400)))
#>    user  system elapsed
#>   0.451   0.000   0.451
```

Running the model on the GPU, however:

```r
(t_gpu <- system.time(model_gpu$run(400)))
#>    user  system elapsed
#>   0.008   0.000   0.008
```

This is much faster! However, ~8,000 particles is unlikely to saturate a modern GPU and (overhead aside) this will run about as quickly for potentially a hundred thousand particles. For example, running 2^17 (131,072) particles only takes a little longer:

```r
model_large <- sirs$new(list(), 0, 2^17, gpu_config = 0L, seed = 1L)
(t_large <- system.time(model_large$run(400)))
#>    user  system elapsed
#>   0.033   0.000   0.033
```

This is **heaps** faster: the GPU model ran in 7.3% of the time of the CPU model while simulating 16 times as many particles (i.e., roughly 219 times as fast per particle). With the relatively low times here, much of this time is just moving the data around, and with over a hundred thousand particles this is nontrivial. Of course, _doing_ anything quickly with all these particles is its own problem.

All methods will automatically run on the GPU; this includes `run`, `simulate`, `compare_data` and `filter`. The last two are typically used from the [`mcstate` interface](https://mrc-ide.github.io/mcstate/reference/particle_filter.html).
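As a rough way of finding where your own card saturates, you can scan over particle counts; a minimal sketch (the counts and the 400-step horizon are arbitrary, and a careful benchmark would repeat runs and discard the first to avoid measuring start-up and transfer costs):

```r
# Time one run of 400 steps at increasing particle counts; the
# per-particle cost should fall until the GPU is saturated
for (n in as.integer(2^(13:17))) {
  model <- sirs$new(list(), 0, n, gpu_config = 0L, seed = 1L)
  t <- system.time(model$run(400))[["elapsed"]]
  message(sprintf("n = %6d: %.3f s (%.1f ns per particle per step)",
                  n, t, 1e9 * t / (n * 400)))
}
```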
## Writing a GPU-capable model

The sirs model from above:

```cc
class sirs {
public:
  using real_type = double;
  using internal_type = dust::no_internal;
  using rng_state_type = dust::random::generator<real_type>;

  // __align__(16) is required before the data_type definition when using NVCC
  // This is so when loaded into shared memory it is aligned correctly
  struct __align__(16) data_type {
    double incidence;
  };

  struct shared_type {
    real_type S0;
    real_type I0;
    real_type R0;
    real_type alpha;
    real_type beta;
    real_type gamma;
    real_type dt;
    size_t freq;
    real_type exp_noise;
  };

  sirs(const dust::pars_type<sirs>& pars) : shared(pars.shared) {
  }

  size_t size() {
    return 4;
  }

  std::vector<real_type> initial(size_t time) {
    std::vector<real_type> state(4);
    state[0] = shared->S0;
    state[1] = shared->I0;
    state[2] = shared->R0;
    state[3] = 0;
    return state;
  }

  void update(size_t time, const real_type * state,
              rng_state_type& rng_state, real_type * state_next) {
    real_type S = state[0];
    real_type I = state[1];
    real_type R = state[2];
    real_type N = S + I + R;

    real_type p_SI = 1 - exp(- shared->beta * I / N);
    real_type p_IR = 1 - exp(-(shared->gamma));
    real_type p_RS = 1 - exp(- shared->alpha);

    real_type dt = shared->dt;
    real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
    real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
    real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);

    state_next[0] = S - n_SI + n_RS;
    state_next[1] = I + n_SI - n_IR;
    state_next[2] = R + n_IR - n_RS;
    state_next[3] = (time % shared->freq == 0) ? n_SI : state[3] + n_SI;
  }

  real_type compare_data(const real_type * state, const data_type& data,
                         rng_state_type& rng_state) {
    const real_type incidence_modelled = state[3];
    const real_type incidence_observed = data.incidence;
    const real_type lambda = incidence_modelled +
      dust::random::exponential<real_type>(rng_state, shared->exp_noise);
    return dust::density::poisson(incidence_observed, lambda, true);
  }

private:
  dust::shared_ptr<sirs> shared;
};

// Helper function for accepting values with defaults
inline double with_default(double default_value, cpp11::sexp value) {
  return value == R_NilValue ? default_value : cpp11::as_cpp<double>(value);
}

namespace dust {

template <>
dust::pars_type<sirs> dust_pars<sirs>(cpp11::list pars) {
  // Initial state values
  sirs::real_type I0 = 10.0;
  sirs::real_type S0 = 1000.0;
  sirs::real_type R0 = 0.0;

  // Time scaling
  // [[dust::param(freq, required = FALSE, default = 1)]]
  size_t freq = std::max(1.0, with_default(1.0, pars["freq"]));
  sirs::real_type dt = 1 / static_cast<sirs::real_type>(freq);

  sirs::real_type exp_noise = 1e6;

  // [[dust::param(alpha, required = FALSE, default = 0.1)]]
  sirs::real_type alpha = with_default(0.1, pars["alpha"]);

  // [[dust::param(beta, required = FALSE, default = 0.2)]]
  sirs::real_type beta = with_default(0.2, pars["beta"]);

  // [[dust::param(gamma, required = FALSE, default = 0.1)]]
  sirs::real_type gamma = with_default(0.1, pars["gamma"]);

  sirs::shared_type shared{S0, I0, R0, alpha, beta, gamma, dt, freq, exp_noise};
  return dust::pars_type<sirs>(shared);
}

template <>
sirs::data_type dust_data<sirs>(cpp11::list data) {
  return sirs::data_type{cpp11::as_cpp<double>(data["incidence"])};
}

namespace gpu {

template <>
size_t shared_int_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 1;
}

template <>
size_t shared_real_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 5;
}

template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
                       int * shared_int,
                       sirs::real_type * shared_real) {
  using real_type = sirs::real_type;
  using dust::gpu::shared_copy_data;
  shared_real = shared_copy_data<real_type>(shared_real, shared->alpha);
  shared_real = shared_copy_data<real_type>(shared_real, shared->beta);
  shared_real = shared_copy_data<real_type>(shared_real, shared->gamma);
  shared_real = shared_copy_data<real_type>(shared_real, shared->dt);
  shared_real = shared_copy_data<real_type>(shared_real, shared->exp_noise);
  shared_int = shared_copy_data<int>(shared_int, shared->freq);
}

template <>
__device__
void update_gpu<sirs>(size_t time,
                      const dust::gpu::interleaved<sirs::real_type> state,
                      dust::gpu::interleaved<int> internal_int,
                      dust::gpu::interleaved<sirs::real_type> internal_real,
                      const int * shared_int,
                      const sirs::real_type * shared_real,
                      sirs::rng_state_type& rng_state,
                      dust::gpu::interleaved<sirs::real_type> state_next) {
  using dust::random::binomial;
  using real_type = sirs::real_type;
  const real_type alpha = shared_real[0];
  const real_type beta = shared_real[1];
  const real_type gamma = shared_real[2];
  const real_type dt = shared_real[3];
  const int freq = shared_int[0];
  const real_type S = state[0];
  const real_type I = state[1];
  const real_type R = state[2];
  const real_type N = S + I + R;
  const real_type p_SI = 1 - exp(- beta * I / N);
  const real_type p_IR = 1 - exp(- gamma);
  const real_type p_RS = 1 - exp(- alpha);
  const real_type n_SI = binomial<real_type>(rng_state, S, p_SI * dt);
  const real_type n_IR = binomial<real_type>(rng_state, I, p_IR * dt);
  const real_type n_RS = binomial<real_type>(rng_state, R, p_RS * dt);
  state_next[0] = S - n_SI + n_RS;
  state_next[1] = I + n_SI - n_IR;
  state_next[2] = R + n_IR - n_RS;
  state_next[3] = (time % freq == 0) ? n_SI : state[3] + n_SI;
}
template <>
__device__
sirs::real_type compare_gpu<sirs>(const dust::gpu::interleaved<sirs::real_type> state,
                                  const sirs::data_type& data,
                                  dust::gpu::interleaved<int> internal_int,
                                  dust::gpu::interleaved<sirs::real_type> internal_real,
                                  const int * shared_int,
                                  const sirs::real_type * shared_real,
                                  sirs::rng_state_type& rng_state) {
  using real_type = sirs::real_type;
  const real_type exp_noise = shared_real[4];
  const real_type incidence_modelled = state[3];
  const real_type incidence_observed = data.incidence;
  const real_type lambda = incidence_modelled +
    dust::random::exponential<real_type>(rng_state, exp_noise);
  return dust::density::poisson(incidence_observed, lambda, true);
}

}
}
```

This is somewhat more complicated than the models described in `vignette("dust.Rmd")`. There are several important components required to run on the GPU.

Within the `dust::gpu` namespace, we declare the size of the *shared parameters* for the model. These are parameters that will be the same across all instances of a parameter set, as opposed to quantities that change between particles.

```cc
namespace dust {
namespace gpu {

template <>
size_t shared_int_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 1;
}

template <>
size_t shared_real_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 5;
}

}
}
```

These can be omitted if your model has no such parameters.

The next definition makes the above a bit clearer: it defines the method for copying these parameters

```cc
namespace dust {
namespace gpu {

template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
                       int * shared_int,
                       sirs::real_type * shared_real) {
  using real_type = sirs::real_type;
  using dust::gpu::shared_copy_data;
  shared_real = shared_copy_data<real_type>(shared_real, shared->alpha);
  shared_real = shared_copy_data<real_type>(shared_real, shared->beta);
  shared_real = shared_copy_data<real_type>(shared_real, shared->gamma);
  shared_real = shared_copy_data<real_type>(shared_real, shared->dt);
  shared_real = shared_copy_data<real_type>(shared_real, shared->exp_noise);
  shared_int = shared_copy_data<int>(shared_int, shared->freq);
}

}
}
```

In the CPU version of the model we have a nice smart pointer to a struct (`dust::shared_ptr<sirs>`) from which we can access parameters by name (e.g., `shared->alpha`). There are no such niceties in CUDA, where we need access to a single contiguous block of memory. The `dust::gpu::shared_copy_data` helper is a small utility to make the bookkeeping here a bit easier, but this could have been written out as:

```cc
namespace dust {
namespace gpu {

template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
                       int * shared_int,
                       sirs::real_type * shared_real) {
  shared_real[0] = shared->alpha;
  shared_real[1] = shared->beta;
  shared_real[2] = shared->gamma;
  shared_real[3] = shared->dt;
  shared_real[4] = shared->exp_noise;
  shared_int[0] = shared->freq;
}

}
}
```

The `dust::gpu::shared_copy_data` template has specialisations where the object being copied is a vector.

There are two methods that are not used here, but could be included, to define the size of per-particle internal storage space. This is required if your model needs to store intermediate calculations during the `update` step and those will not fit on the stack.
```cc
namespace dust {
namespace gpu {

template <>
size_t internal_real_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 0;
}

template <>
size_t internal_int_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 0;
}

}
}
```

Most interestingly, we have the `update_gpu` function that actually does the update on the GPU

```cc
template <>
__device__
void update_gpu<sirs>(size_t time,
                      const dust::gpu::interleaved<sirs::real_type> state,
                      dust::gpu::interleaved<int> internal_int,
                      dust::gpu::interleaved<sirs::real_type> internal_real,
                      const int * shared_int,
                      const sirs::real_type * shared_real,
                      sirs::rng_state_type& rng_state,
                      dust::gpu::interleaved<sirs::real_type> state_next) {
  using real_type = sirs::real_type;
  const real_type alpha = shared_real[0];
  const real_type beta = shared_real[1];
  const real_type gamma = shared_real[2];
  const real_type dt = shared_real[3];
  const int freq = shared_int[0];
  const real_type S = state[0];
  const real_type I = state[1];
  const real_type R = state[2];
  const real_type N = S + I + R;
  const real_type p_SI = 1 - exp(- beta * I / N);
  const real_type p_IR = 1 - exp(- gamma);
  const real_type p_RS = 1 - exp(- alpha);
  const real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
  const real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
  const real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);
  state_next[0] = S - n_SI + n_RS;
  state_next[1] = I + n_SI - n_IR;
  state_next[2] = R + n_IR - n_RS;
  state_next[3] = (time % freq == 0) ? n_SI : state[3] + n_SI;
}
```

Note that this totally duplicates the logic from the CPU version (which was a method of the `sirs` class)

```cc
void update(size_t time, const real_type * state,
            rng_state_type& rng_state, real_type * state_next) {
  real_type S = state[0];
  real_type I = state[1];
  real_type R = state[2];
  real_type N = S + I + R;

  real_type p_SI = 1 - exp(- shared->beta * I / N);
  real_type p_IR = 1 - exp(-(shared->gamma));
  real_type p_RS = 1 - exp(- shared->alpha);

  real_type dt = shared->dt;
  real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
  real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
  real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);

  state_next[0] = S - n_SI + n_RS;
  state_next[1] = I + n_SI - n_IR;
  state_next[2] = R + n_IR - n_RS;
  state_next[3] = (time % shared->freq == 0) ? n_SI : state[3] + n_SI;
}
```
Differences include:

* the presence of the four auxiliary data elements (`internal_int`, `internal_real`, `shared_int` and `shared_real`)
* the data types that vary across particles are a special `dust::gpu::interleaved<>` type, which prevents slow uncoalesced reads from global memory on the GPU
* all accesses into the `shared_int` and `shared_real` elements are now by position in the array, rather than the previous pointer/name-based access
* the `__device__` annotation, which compiles the function for use on the GPU

## Data comparison functions

Finally, if running a particle filter on the GPU, a version of the `compare_data` function is required that can run on the GPU:

```cc
template <>
__device__
sirs::real_type compare_gpu<sirs>(const dust::gpu::interleaved<sirs::real_type> state,
                                  const sirs::data_type& data,
                                  dust::gpu::interleaved<int> internal_int,
                                  dust::gpu::interleaved<sirs::real_type> internal_real,
                                  const int * shared_int,
                                  const sirs::real_type * shared_real,
                                  sirs::rng_state_type& rng_state) {
  using real_type = sirs::real_type;
  const real_type exp_noise = shared_real[4];
  const real_type incidence_modelled = state[3];
  const real_type incidence_observed = data.incidence;
  const real_type lambda = incidence_modelled +
    dust::random::exponential<real_type>(rng_state, exp_noise);
  return dust::density::poisson(incidence_observed, lambda, true);
}
```

This is very similar to the CPU version

```cc
real_type compare_data(const real_type * state, const data_type& data,
                       rng_state_type& rng_state) {
  const real_type incidence_modelled = state[3];
  const real_type incidence_observed = data.incidence;
  const real_type lambda = incidence_modelled +
    dust::random::exponential<real_type>(rng_state, shared->exp_noise);
  return dust::density::poisson(incidence_observed, lambda, true);
}
```

with similar differences to the update function:

* argument types are different (interleaved types, internals and shared data passed explicitly)
* accessing of shared parameters is by position, not name
* the `__device__` annotation
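From R, the comparison is driven by attaching data and calling `compare_data` (most users will do this via `mcstate`). A minimal sketch, using made-up incidence data; we assume a data.frame with a `time` column, as `dust::dust_data` expects by default:

```r
# Made-up incidence data; dust_data() converts the data.frame into the
# list-of-(time, data) format that set_data() expects
data <- data.frame(time = seq(4, 400, by = 4),
                   incidence = rpois(100, 20))
model <- sirs$new(list(), 0, n_particles, gpu_config = 0L, seed = 1L)
model$set_data(dust::dust_data(data))

# Advance to the first data point, then compute the per-particle
# log-likelihood of that data point; this also runs on the GPU
model$run(4)
model$compare_data()
```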
## Developing a GPU model

Debugging on a GPU is a pain, especially because there are typically many particles, and error recovery is not straightforward. In addition, most continuous integration systems do not provide GPUs, so testing your GPU code becomes difficult. To make this easier, `dust` allows running GPU code on the CPU -- this will typically be slower than the normal CPU code, but allows easier debugging and verification that the model is behaving. We use this extensively in `dust`'s tests and also in models built using `dust` that will run on the GPU.

To do this, compile the model with your preferred real type, but set the `gpu` argument to `FALSE`

```r
sirs_cpu <- dust::dust(path, gpu = FALSE, real_type = "float")
#> ℹ 34 functions decorated with [[cpp11::register]]
#> ✔ generated file 'cpp11.R'
#> ✔ generated file 'cpp11.cpp'
#> Re-compiling sirs817bc717
#> ─  installing *source* package ‘sirs817bc717’ ...
#> ** using staged installation
#> ** libs
#> make[1]: Entering directory '/tmp/RtmppiOue1/fileb32bb452d008e/src'
#> make[1]: Leaving directory '/tmp/RtmppiOue1/fileb32bb452d008e/src'
#> make[1]: Entering directory '/tmp/RtmppiOue1/fileb32bb452d008e/src'
#> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I'/home/rfitzjoh/lib/R/library/cpp11/include' -g -Wall -Wextra -pedantic -Wmaybe-uninitialized -Wno-unused-parameter -Wno-cast-function-type -Wno-missing-field-initializers -O2 -I/home/rfitzjoh/lib/R/library/dust/include -DHAVE_INLINE -fopenmp -fpic -g -O2 -fdebug-prefix-map=/build/r-base-QwogzP/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c cpp11.cpp -o cpp11.o
#> g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I'/home/rfitzjoh/lib/R/library/cpp11/include' -g -Wall -Wextra -pedantic -Wmaybe-uninitialized -Wno-unused-parameter -Wno-cast-function-type -Wno-missing-field-initializers -O2 -I/home/rfitzjoh/lib/R/library/dust/include -DHAVE_INLINE -fopenmp -fpic -g -O2 -fdebug-prefix-map=/build/r-base-QwogzP/r-base-4.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c dust.cpp -o dust.o
#> g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o sirs817bc717.so cpp11.o dust.o -fopenmp -L/usr/lib/R/lib -lR
#> make[1]: Leaving directory '/tmp/RtmppiOue1/fileb32bb452d008e/src'
#> installing to /tmp/RtmppiOue1/devtools_install_b32bb3d78d7ef/00LOCK-fileb32bb452d008e/00new/sirs817bc717/libs
#> ** checking absolute paths in shared objects and dynamic libraries
#> ─  DONE (sirs817bc717)
#>
#> ℹ Loading sirs817bc717
```

Note that above the model is compiled with `g++`, not `nvcc`. However, the "GPU" code is still compiled into the model. We can then initialise this with and without a `gpu_config` argument

```r
model1 <- sirs_cpu$new(pars, 0, 10, seed = 1L)
model2 <- sirs_cpu$new(pars, 0, 10, gpu_config = 0L, seed = 1L)
```

and run the models using the "GPU" code and the normal CPU code, but this time both on the CPU

```r
model1$run(10)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,]  981  987  977  986  994  948  958  961  976   968
#> [2,]   23   16   26   10    7   48   37   31   21    28
#> [3,]    6    7    7   14    9   14   15   18   13    14
#> [4,]    5    5    5    2    0   11    6    5    3     6
model2$run(10)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,]  981  987  977  986  994  948  958  961  976   968
#> [2,]   23   16   26   10    7   48   37   31   21    28
#> [3,]    6    7    7   14    9   14   15   18   13    14
#> [4,]    5    5    5    2    0   11    6    5    3     6
```

These should be exactly the same, and this can form the basis of tests (a sketch follows at the end of this vignette). Note that using `float` rather than `double` can throw up a few issues with algorithms, so it's worth checking with single precision in your tests.

If you hit problems, then `R -d cuda-gdb` can work in place of the usual `R -d gdb` to work with a debugger, though because of the huge numbers of particles you will typically work with, debugging remains a challenge. Our usual strategy has been to try to recreate any issue purely in CPU code and debug as normal (see [this blog post](https://reside-ic.github.io/blog/debugging-memory-errors-with-valgrind-and-gdb/) for hints on doing this effectively).
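As a concrete template for the testing strategy above, something like the following sketch can run on a GPU-less CI machine (plain `stopifnot` is used here, but a `testthat` expectation works just as well):

```r
# Fresh models with a shared seed so the RNG streams are identical
model1 <- sirs_cpu$new(pars, 0, 10, seed = 42L)
model2 <- sirs_cpu$new(pars, 0, 10, gpu_config = 0L, seed = 42L)

# The CPU codepath and the "GPU" codepath (here run on the CPU) must
# agree exactly: same precision, same RNG sequence
stopifnot(identical(model1$run(100), model2$run(100)))
```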