The orderly.db package is an orderly2 plugin which provides support for interacting with databases. It includes all the core features that orderly (version 1) supported, but with the new syntax. This documentation is based on the original orderly vignette; there is a section at the end of the vignette with some comparisons of old and new syntaxes.
The root orderly_config.yml configuration specifies the locations of databases (there can be any number) within the plugins section. For example,
plugins:
  orderly.db:
    source:
      driver: RPostgres::Postgres
      args:
        host: dbhost.example.org
        port: 5432
        user: myusername
        password: s3cret
        dbname: mydb
This database will be referred to elsewhere as source, and it will be connected with the RPostgres::Postgres driver (from the RPostgres package). Arguments within the args block will be passed to the driver, in this case being the equivalent of:
DBI::dbConnect(RPostgres::Postgres(), host = "dbhost.example.org", port = 5432,
               user = "myusername", password = "s3cret", dbname = "mydb")
The values used in the args blocks can be environment variables (e.g., password: $DB_PASSWORD), in which case they will be resolved from the environment before connecting (including from orderly_envir.yml). This is useful for keeping secrets out of source control.
For SQLite databases, the args block will typically contain only dbname, which is the path to the database file.
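For example, a local SQLite configuration might look something like the following sketch (the driver here is RSQLite::SQLite from the RSQLite package; the file name is purely illustrative):

plugins:
  orderly.db:
    source:
      driver: RSQLite::SQLite
      args:
        # Path to the database file, relative to the orderly root
        dbname: local/mydata.sqlite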
Within a report’s orderly.R file, you can extract data from the database as:
orderly.db::orderly_db_data(as = "cars", database = "source",
                            query = "SELECT * FROM mtcars WHERE cyl = 4")
In this case, the query SELECT * FROM mtcars WHERE cyl = 4 will be run against the source database to create an object cars in the report environment. The actual report code can use that object without ever having created the database connection or evaluated the query.
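For example, the rest of orderly.R can treat cars as an ordinary data frame; the summary and output file below are purely illustrative:

# cars is an ordinary data frame in the report environment
print(summary(cars$mpg))
write.csv(cars, "cars.csv", row.names = FALSE)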
You can interpolate parameters into the query; for example you might have:
orderly2::orderly_parameter(cyl = NULL)
orderly.db::orderly_db_data(as = "cars", database = "source",
                            query = "SELECT * FROM mtcars WHERE cyl = ?cyl")
In the above, we declare an orderly parameter cyl with no default (see ?orderly2::orderly_parameter) and use ?cyl to interpolate that into the query (see ?DBI::sqlInterpolate, which does the interpolation for us). You can only use orderly variables for this; if you need more complex substitution you can create strings using paste or sprintf as suits your needs.
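As a sketch of that sort of string building (the cols parameter here is hypothetical):

orderly2::orderly_parameter(cols = NULL)
# Build the query as an ordinary string before passing it to orderly.db
query <- sprintf("SELECT %s FROM mtcars WHERE cyl = 4", cols)
orderly.db::orderly_db_data(as = "cars", database = "source", query = query)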
You can establish temporary views in the database in order to simplify queries. For example, you might write:
orderly.db::orderly_db_view(as = "cars4", database = "source",
                            query = "SELECT * FROM mtcars WHERE cyl = 4")
orderly.db::orderly_db_data(as = "cars", database = "source",
                            query = "SELECT * FROM cars4")
Here, we establish a view called cars4 within the database; this is scoped to the current connection (orderly.db ensures that the same connection is used for all calls to the database within a report run). Then we query against that view to pull the data down as the cars R object. In a real case this view could be more complicated, involving joins with other tables.
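For instance, a more realistic view might join several tables before the data is extracted; the table and column names below are purely illustrative:

orderly.db::orderly_db_view(as = "orders_enriched", database = "source",
                            query = "SELECT o.*, c.region FROM orders o JOIN customers c ON o.customer_id = c.id")
orderly.db::orderly_db_data(as = "orders", database = "source",
                            query = "SELECT * FROM orders_enriched WHERE region = 'north'")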
If you need to perform complicated SQL queries, then you can export the database connection directly into the report environment.
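A sketch of such a call, assuming the plugin exposes an orderly_db_connection() helper alongside orderly_db_data() and orderly_db_view():

orderly.db::orderly_db_connection(as = "con", database = "source")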
This will save the connection to the source database as the R object con. We have used this where a report requires running queries in a loop that depend on the results of a previous query, or on additional data loaded into the report. Note that this reduces the amount of tracking that orderly can do, as we have no way of knowing what is done with the connection once it has been passed to the script.
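With the connection available, the report can then drive DBI directly; a sketch of the looping pattern described above, with purely illustrative table and column names:

# Fetch a set of ids, then run one dependent query per id
ids <- DBI::dbGetQuery(con, "SELECT id FROM studies WHERE active = 1")$id
results <- lapply(ids, function(i) {
  DBI::dbGetQuery(con, paste0("SELECT * FROM measurements WHERE study_id = ", i))
})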
The contents of orderly_config.yml may contain things like secrets (passwords) or hostnames that vary depending on deployment (e.g., testing locally vs running on a remote system). To customise this, you can use environment variables within the configuration. So rather than writing
plugins:
  orderly.db:
    source:
      driver: RPostgres::Postgres
      args:
        host: localhost
        port: 5432
        user: myuser
        dbname: databasename
        password: p4ssw0rd
you might write
plugins:
  orderly.db:
    source:
      driver: RPostgres::Postgres
      args:
        host: $MY_DBHOST
        port: $MY_DBPORT
        user: $MY_DBUSER
        dbname: $MY_DBNAME
        password: $MY_PASSWORD
Environment variables used this way must begin with a dollar sign and consist only of uppercase letters, numbers and the underscore character. You can then set the environment variables in an .Renviron file (either within the project or in your home directory) or your .profile file. Alternatively, you can create a file orderly_envir.yml in the same directory as orderly_config.yml with key-value pairs, such as
MY_DBHOST: localhost
MY_DBPORT: 5432
MY_DBUSER: myuser
MY_DBNAME: databasename
MY_PASSWORD: p4ssw0rd
This file is read every time that orderly_config.yml is read (in contrast with .Renviron, which is read only at the start of a session), and will likely be more pleasant to work with. The advantage of using environment variables is that you can add the orderly_envir.yml file to your .gitignore and avoid committing system-dependent data to the central repository.
In general, you can ignore this section if you only use one global database.
The above approach can be used to switch databases by using different environment variables, but that can become tiresome. If you have multiple database “instances” corresponding to different realisations of the same logical database (e.g., production and staging), then you can configure and switch between these directly from orderly commands. At VIMC we have several copies of our main database: one called production, which is the canonical copy, and then several staging copies that we use for experimentation.
To configure this situation, list common arguments within the args block as before, then add logical databases as named entries in an instances field:
plugins:
  orderly.db:
    source:
      driver: RPostgres::Postgres
      args:
        port: 5432
        user: user
        dbname: mydb
      instances:
        production:
          host: production.example.org
          password: $PASSWORD_PRODUCTION
        staging:
          host: staging.example.org
          password: $PASSWORD_STAGING
      default_instance: $DEFAULT_INSTANCE
Here, the production and staging instances have different hostnames (production.example.org and staging.example.org) and different passwords (retrieved using environment variables), and the default instance is set with another environment variable ($DEFAULT_INSTANCE, which must be one of production or staging). To switch between databases, you can set that variable, or pass the instance argument to orderly.db::orderly_db_data and friends, as:
orderly.db::orderly_db_data(as = "cars", database = "source", query = query,
                            instance = "production")
or with instance = "staging" to use the staging copy. If you pass a value of NULL for instance, then we fall back on the default instance (and if none is specified, throw an error). You might like to make this a parameter to your report to allow changing the instance dynamically when running.
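A sketch of that pattern, wiring an orderly parameter straight through to the instance argument (the parameter name and default here are illustrative):

# Declare an 'instance' parameter and pass it through to the query
orderly2::orderly_parameter(instance = "staging")
orderly.db::orderly_db_data(as = "cars", database = "source",
                            query = "SELECT * FROM mtcars WHERE cyl = 4",
                            instance = instance)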
Querying for data
In orderly v1 this was declared in the data: section of orderly.yml; it becomes
orderly.db::orderly_db_data(as = "cars", database = "source",
                            query = "SELECT * FROM mtcars WHERE cyl = 4")
Establishing views
In orderly v1 this was declared in the views: section of orderly.yml; it becomes
orderly.db::orderly_db_view(as = "cars4", database = "source",
                            query = "SELECT * FROM mtcars WHERE cyl = 4")
Exporting connections
In orderly v1 this was declared via the connection: field of orderly.yml; as shown above, it becomes
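# Again assuming the plugin's orderly_db_connection() helper shown earlier
orderly.db::orderly_db_connection(as = "con", database = "source")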
In orderly v1, we saved a copy of the captured data within orderly’s data/ directory with the idea that this would help with future reproducibility; however, in practice this did not prove useful. We do store some basic metrics about the extraction within the packet metadata (number of rows fetched, names of the columns, etc.), which can be used to track what was extracted, or at least to monitor for changes between different versions.
The control over different database instances used to be a top-level argument to orderly_run, but now that the database is no longer a core part of orderly that is not possible. Instead, the instance is just something that is passed through to the orderly.db::orderly_db_* calls, and can be derived from any variable you fancy. You can then configure its behaviour from an orderly parameter, from environment variables, or via some other mechanism.
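For example, one possible pattern is to derive the instance from an environment variable inside orderly.R, falling back to the default instance when it is unset (the ORDERLY_DB_INSTANCE name here is purely illustrative):

instance <- Sys.getenv("ORDERLY_DB_INSTANCE", NA_character_)
if (is.na(instance)) {
  # NULL falls back to default_instance (or errors if none is configured)
  instance <- NULL
}
orderly.db::orderly_db_data(as = "cars", database = "source",
                            query = "SELECT * FROM mtcars WHERE cyl = 4",
                            instance = instance)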