Environments

It is likely that the code that you want to run uses other code written by someone else, or code that you have written yourself. This vignette covers how you can make this code available to your tasks, covering:

  • Loading packages automatically
  • Using environments to avoid sending large files over the network
library(hipercow)
hipercow_init(driver = "example")
#> ✔ Initialised hipercow at '.' (/tmp/Rtmpehsscb/hv-20241209-12d81201e93)
#> ✔ Configured hipercow to use 'example'

Basics

There are two main things that make up an “environment”:

  • a set of packages that you would like attached (loaded via library())
  • a set of source files that you would like loaded into the session (via source())

Because you can do anything you want within a source file (including loading packages!) this is very flexible.

Here’s what is going on with the default environment:

id <- task_create_expr(list(session = sessionInfo(), objects = ls()))
#> ✔ Submitted task 'b3f22ca77a03494dba1d539d31299a22' using 'example'
task_wait(id)
#> [1] TRUE
task_result(id)
#> $session
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#> [1] compiler_4.4.2  cli_3.6.3       withr_3.0.2     hipercow_1.0.52
#> [5] rlang_1.1.4    
#> 
#> $objects
#> character(0)

You can see that there are few attached packages (these are packages that are loaded via library and whose functions are available) but there are a couple of extra loaded packages, for examplehipercow and its dependencies such as cli (these are packages that R has loaded but functions of which are not available without ::).

The objects field shows the objects loaded into the session; this is empty because the session has nothing in it.

Loading packages

Your code might want to use functions from a package. For example, we might want to use the say function from the cowsay package

id <- task_create_expr(say("Moooo", by = "cow"))
#> ✔ Submitted task 'c709d29144ba821930828d62e4e1d34f' using 'example'
task_wait(id)
#> [1] FALSE
task_result(id)
#> <simpleError in say("Moooo", by = "cow"): could not find function "say">

This task fails and the error message here indicates the problem: the say() function is not found. We could run this as cowsay::say but this might be inconvenient or just not what we prefer to do. We need to arrange for library(cowsay) to be run before trying to run our expression.

To do this, we update the default environment:

hipercow_environment_create(packages = "cowsay")
#> ✔ Created environment 'default'

This time our task succeeds:

id <- task_create_expr(say("Moooooo", by = "cow"))
#> ✔ Submitted task 'c7487225026d45703d31c1b29827a12f' using 'example'
task_wait(id)
#> [1] FALSE

Looking at the logs, we can see the glorious output of our task:

task_log_show(id)
#> 
#> ── hipercow 1.0.52 running at '/tmp/Rtmpehsscb/hv-20241209-12d81201e93' ────────
#> ℹ library paths:
#> • /tmp/RtmpxLCJQj/Rinst12221aef0f75
#> • /github/workspace/pkglib
#> • /usr/local/lib/R/site-library
#> • /usr/lib/R/site-library
#> • /usr/lib/R/library
#> ℹ id: c7487225026d45703d31c1b29827a12f
#> ℹ starting at: 2024-12-09 19:21:58.909533
#> ℹ Task type: expression
#> • Expression: say("Moooooo", by = "cow")
#> • Locals: (none)
#> • Environment: default
#>   R_GC_MEM_GROW: 3
#> ℹ Loading environment 'default'...
#> • packages: cowsay
#> • sources: (none)
#> • globals: (none)
#> ✖ status: failure
#> ✖ Error: there is no package called ‘cowsay’
#> ℹ finishing at: 2024-12-09 19:21:58.909533 (elapsed: 0.3269 secs)

We can also see the logs say:

ℹ Loading environment 'default'...
• packages: cowsay
• sources: (none)
• globals: (none)

As the task starts up, and before it runs our code, it has loaded default and that has attached cowsay.

You can see what is in the default environment by using hipercow_environment_show()

hipercow_environment_show("default")
#> 
#> ── hipercow environment 'default' ──────────────────────────────────────────────
#> • packages: cowsay
#> • sources: (none)
#> • globals: (none)

which prints the same information.

You can list as many packages as you want, and they will be loaded in order.

Loading your own functions

Now, suppose we want to change the length of the moo that the cow produces. We might write a function.

moo <- function(n) {
  sprintf("M%s!", strrep("o", n))
}

Ignoring the cluster for a minute, if you wanted to use this code you would ordinarily use source to load it:

source("moo.R")
moo(5)
#> [1] "Mooooo!"

We need to do the same thing on the cluster, as by default it would not be found:

id <- task_create_expr(say(moo(5), by = "cow"))
#> ✔ Submitted task '9df62aa7fbe162d8c9f43f1ebf44a9ed' using 'example'
task_wait(id)
#> [1] FALSE
task_result(id)
#> <packageNotFoundError in library(p, character.only = TRUE): there is no package called ‘cowsay’>

The same error as before, but the consequence is different - this is because our code is not present. We update our environment call to add moo.R:

hipercow_environment_create(packages = "cowsay", sources = "moo.R")
#> ✔ Updated environment 'default'

(note that this replaces the previous definition of default). Now, we can run our task:

id <- task_create_expr(say(moo(5), by = "cow"))
#> ✔ Submitted task '7bf86e9838c2cfdd5c9357296c6fb8de' using 'example'
task_wait(id)
#> [1] FALSE
task_log_show(id)
#> 
#> ── hipercow 1.0.52 running at '/tmp/Rtmpehsscb/hv-20241209-12d81201e93' ────────
#> ℹ library paths:
#> • /tmp/RtmpxLCJQj/Rinst12221aef0f75
#> • /github/workspace/pkglib
#> • /usr/local/lib/R/site-library
#> • /usr/lib/R/site-library
#> • /usr/lib/R/library
#> ℹ id: 7bf86e9838c2cfdd5c9357296c6fb8de
#> ℹ starting at: 2024-12-09 19:22:01.051555
#> ℹ Task type: expression
#> • Expression: say(moo(5), by = "cow")
#> • Locals: (none)
#> • Environment: default
#>   R_GC_MEM_GROW: 3
#> ℹ Loading environment 'default'...
#> • packages: cowsay
#> • sources: moo.R
#> • globals: (none)
#> ✖ status: failure
#> ✖ Error: there is no package called ‘cowsay’
#> ℹ finishing at: 2024-12-09 19:22:01.051555 (elapsed: 0.3064 secs)

You can list as many source files as you want. If you are brave, you could use dir() to do:

hipercow_environment_create(
  sources = dir(glob2rx("R/", pattern = "*.R", full.names = TRUE)))

which would load all .R files within the directory R/ of your hipercow root. Be careful not to start recursively loading files though; this would happen if you included source() commands within your R files which, when followed, result in a cycle.

Defining globals

A more advanced use of environments is to load large objects so that they are ready when your task runs. Suppose we have a large object we need to work with, for example a large loaded geojson file or a fit from a model run, and we want to run a function on it.

The obvious thing to do is to write:

myfn(my_big_data, another_argument)

where

  • myfn is our analysis function
  • my_big_data is the large object
  • another_argument is some other argument we’re passing to the call

When we do this, hipercow will save all the inputs to disk, and this could be quite big. This could be even worse if this is part of a bulk call, where multiple copies may be saved.

This problem can be so bad that we prevent it by default:

task_create_expr(length(my_big_data))
#> Error in `task_create_expr()`:
#> ! Object too large to save with task: 'my_big_data'
#> ✖ Objects saved with a hipercow task can only be 1 MB
#> ℹ You can increase the limit by increasing the value of the option
#>   'hipercow.max_size_local', even using 'Inf' to disable this check entirely
#>   (see the `vignette(hipercow::details)` vignette)
#> ℹ Better again, create large objects from your 'sources' argument to your
#>   environment, and then advertise this using the 'globals' argument (see the
#>   `vignette(hipercow::environments)` vignette)

So, how do we get this object into our session? Suppose that read_big_data() is the function that we used to read the object in the first place. We might write a source file data.R (the name is not important):

my_big_data <- read_big_data("bigfile.shp")

and add this to our sources, but also add my_big_data to globals. This tells hipercow that the object my_big_data will be globally available and that if it sees it in an expression that you create it should not try and save it:

hipercow_environment_create(sources = "data.R", globals = "my_big_data")
#> ✔ Updated environment 'default'

We can now create the task and run it:

id <- task_create_expr(length(my_big_data))
#> ✔ Submitted task '9373c2e0df356488db8f1009d8d4994e' using 'example'
task_wait(id)
#> [1] TRUE
task_result(id)
#> [1] 1000000

If you use this approach, you should be aware of a few things:

  • By default, there is no guarantee that the object you see when creating the task is the same as the object loaded by sources. You can enable this by setting the option hipercow.validate_globals (see vignette("details"))
  • You should generally load (via source()) your data into the local session too so you know what you are working with on the cluster
  • Don’t update the definitions in the files you source while some tasks are running or you will get confused about what was loaded

Other points

Don’t rely on the environment being the same when you submitted your tasks and when you load them, if you change the environment. That is, don’t write:

hipercow_environment_create(packages = "pkg.a")
task_create_expr(use_pkg_a())
hipercow_environment_create(packages = "pkg.b")
task_create_expr(use_pkg_b())

If you do this then the second call to hipercow_environment_create overwrites the default environment, and if the first task has not started yet, it will load the second environment and not the first!

You can have as many named environments as you want, so we might write that as:

hipercow_environment_create(packages = "pkg.a", name = "env_a")
task_create_expr(use_pkg_a(), environment = "env_a")
hipercow_environment_create(packages = "pkg.b", name = "env_a")
task_create_expr(use_pkg_b(), environment = "env_b")

Relationship with provisioning

When you provision, you install packages. This defines the set of packages that are available, but none will be loaded by default. You will be able to load any package that you have installed via library(), or access the functions by using the namespace operator ::, but by default only the normal minimal set of packages will be loaded.

If you specify packages within your provision script but do not load them, that’s is fine - the packages exist in the library but may not be used.

If you use packages in an environment that you do not provision, then your task will fail to start once the package is not found.

When you provision a set of packages, it provisions all the dependencies. Some people will add tidyverse to their pkgdepends.txt, and then use (say) ggplot2 in their environment because that is a dependency of tidyverse. This is fine.