This vignette primarily exists so that we can point to it from other bits of documentation, or requests for help, with details about doing particular chores with hipercow.
There are a few influential options that control hipercow’s behaviour, and you may want to set these if you wish to depart from the defaults that we have set.
hipercow.auto_install_missing_packages
Logical, controlling if hipercow
should install missing
packages as it needs them on your machine. The default is
TRUE
, which will generally cause
hipercow.windows
and conan2
to be installed
the first time you try and use hipercow
.
hipercow.progress
Logical, controlling if we should display a progress bar in some
contexts (e.g., task_wait()
). This affects the default
behaviour of any hipercow function accepting a progress
argument, with this option becoming the default value in the absence of
an explicit argument. That is, you can always disable individual
progress bars but this option lets you disable them by default. The
default value of this option is TRUE
(i.e., display
progress bars by default).
hipercow.timeout
Numeric, controlling the default timeout for
task_wait()
, task_log_watch()
,
hipercow_bundle_wait()
and hipercow_hello()
.
If not set, the default is Inf
(i.e., wait forever).
hipercow.validate_globals
Logical, controlling if we validate the value of “global” variables
when a task starts. If TRUE
, then when we save a task
(e.g., with task_create_expr()
) and where you have
created an environment with hipercow_environment_create()
and specified values for the globals
argument,
then we hash the values that we find in your current session and when
the task starts on the cluster we validate that the values that we find
are the same as in your local session. The default is FALSE
(i.e., this validation is not done).
hipercow.max_size_local
A number, being the maximum size object (in bytes) that we will save
when creating a task (e.g., with task_create_expr()
). When
we save a task we need to save all local variables that you reference in
the call; often these are small and it’s no problem. However, if you
pass in a 5GB shapefile then you will end up filling up the disk with
copies of this file, and your task will spend a lot of time reading and
writing them. It’s usually better to arrange for large objects to be
found from your scripts via your environment. The default value is
1e6
(1,000,000 bytes, or 1 MB). Set to Inf
to
disable this check.
hipercow.default_envvars
A hipercow_envvars
object, with environment variables
that will be added to every task created. Hipercow sets a few
environment variables by default to improve the user experience. These
variables can be overridden per-task by passing a envvars
parameter when creating tasks, or globally by defining this option.
Environment variables for a task are computed by collecting (in increasing order of preference):
hipercow
default environment variables (defined in
hipercow’s DEFAULT_ENVVARS
); currently
R_GC_MEM_GROW=3
default_envvars
argument to hipercow_driver
);
see vignette("windows")
for the windows defaultshipercow.default_envvars
envvars
to a task creation functionEnvironment variables set later override those earlier.
hipercow.development
Use the development library (if it exists) for bootstrapping. We may ask you to set this temporarily if we are diagnosing a bug with you. Don’t set this yourself generally as the library will often not exist, or be out of date!
hipercow.timeout
A number, representing seconds, for timing out when performing web requests. The default is 10s but if we have chosen this poorly you may need to increase or decrease it.
These options are only relevant when using hipercow’s rrq integration.
hipercow.rrq_offload_threshold_size
Objects passed to and from rrq tasks are usually stored in Redis. However since these are all stored in memory, larger objects are offloaded to the disk instead. This option controls the threshold used to decide whether or not to offload objects. Objects larger than the configured value (in bytes) are offloaded to disk.
The default value is 100000, ie. 100kB.
This option is used when the queue is first created. Changing it afterwards will have no effect.
We make heavy use of the cli
package, see its documentation on
options. Particular options that you might care about:
cli.progress_show_after
: Delay in
seconds before showing the progress bar; you might reduce this to make
bars appear more quickly.
cli.progress_clear
: Retain the
progress bar on screen after completion.
You can set options either within a session by using the
options()
function or for all future sessions by editing
your .Rprofile
.
Changing an option with options()
looks like:
You might put this at the top of your cluster script and it will affect any future tasks that you submit within that session after you run it. You’ll probably run this line next time, so it would have an effect there too. It won’t affect any other project.
If you always want that value, you can add it to your
.Rprofile
. The easiest way of doing this is by using the
usethis
package and running:
and then adding a call to options()
anywhere in that
file. If you are comfortable with using (and exiting) vi
on
the command line you may prefer using vi ~/.Rprofile
.
After changing your .Rprofile
you will need to restart R
for the changes to take effect.
You can set multiple options at once if you want by running (for example):
By default, we try and track a “suitable” copy of R based on your current version. We will try and use exactly the same version if available, otherwise the oldest cluster version that is newer than yours, or failing that the most recent cluster version.
You can control the R version used when configuring; for example to
use 4.3.0
exactly you can use:
Not all versions are installed; at the moment we have
4.0.5
, 4.1.3
, 4.2.3
4.3.0
and 4.3.2
installed on the Windows
cluster but will add more (we’ll also sort out some sort of general
mechanism for you to see what is installed)
Try not to use versions that are more than one “minor” version old
(the middle version number). You can see the current version on CRAN’s landing page. If it is
4.3.x
then versions in the 4.2.x
and
4.3.x
series will likely work well for you, but
4.1.x
and earlier will not work well because binary
versions of packages are no longer available and because some packages
will start depending on features present in newer versions, so may not
be available.
The R session that you use to send tasks to the cluster is not important; only the directory from which you send the tasks. So you can stop the R session, reboot your computer or shut down your laptop and your tasks will keep running (or keep waiting for the cluster to pick them up). If you come back to the same working directory later you can ask about a task’s status, fetch its results etc easily.
If you run out of disk space, terrible things will happen, and the error messages that you get may be misleading. Worse, you may run out of space, then have some space gets freed up so that it’s not obvious what the original problem was.
The underlying reason for this is that we store information about the
state of tasks on your network share; if you run out of space while
writing the final result of a task, then that task will remain as
running
indefinitely because we fail to write out that it
succeeded.