The EpiEstim
package
allows you to estimate the time varying reproduction number
(Rt) from epidemic data. This vignette will take you through
the key stages in the process of estimating Rt and also
introduce some additional packages (incidence
and
projections
) that can work alongside EpiEstim in a typical
outbreak analysis workflow. Please also see the ‘Examples’ and ‘FAQs’
sections.
The standard way that outbreak data is presented is in the form of a linelist, which incorporates detailed information on the reported cases of an outbreak. For example:
Case ID | Date Reported | Date Symptom Onset | Date Exposure | M/F | Age | Outcome | Classification |
---|---|---|---|---|---|---|---|
1 | dd/mm/yy | dd/mm/yy | dd/mm/yy | F | 26 | recovered | confirmed |
2 | dd/mm/yy | dd/mm/yy | dd/mm/yy | M | 32 | died | suspected |
3 | dd/mm/yy | dd/mm/yy | dd/mm/yy | M | 41 | recovered | confirmed |
4 | dd/mm/yy | dd/mm/yy | dd/mm/yy | F | 68 | recovered | probable |
… | … | … | … | … | … | … | … |
In order to estimate Rt, EpiEstim needs to be supplied with an estimate of the serial interval distribution (step A) and the incidence of confirmed cases (step B).
The serial interval is the time between symptom onset in successive cases. The serial interval distribution can be parameterised either by estimates taken from the literature or it can be estimated directly from the linelist if you have data for infector-infected pairs.
Serial interval from the literature:
If a disease has been well characterised, then you have the option to use estimates of the serial interval that can be found in the literature. You can either supply the mean and standard deviation (parametric distribution) or the full distribution (non-parametric distribution). See examples A and B in the Examples section.
Serial interval from infector-infected pairs:
This feature was recently incorporated into EpiEstim by Thompson et al 2019. If you have data for infector-infected pairs of cases you can estimate the serial interval distribution directly from your own data and account for its uncertainty.
For example, if “Case ID” represents the person who has been exposed and infected and “Infector ID” represents the potential source of infection, then you can format the contact data as follows:
Case ID | Infector ID | Source of Exposure |
---|---|---|
299 | 208 | funeral |
302 | 272 | hospital |
341 | 301 | household |
389 | 312 | other |
… | … | … |
Combined with dates of symptom onset in the line list, these can be used to estimate the serial interval distribution in this outbreak.
In an ideal world we would have access to data in the format above, where dates of symptom onset for infector/infected individuals are very well defined, but often we do not know the exact date of symptom onset. In this situation, EpiEstim allows you to account for this uncertainty using “interval-censored data”, where there are lower and upper bounds for the symptom onset time in infector/infected cases (Thompson et al 2019).
This type of data may look like this:
Day infector symptom onset (LB) | Day infector symptom onset (UB) | Day infected symptom onset (LB) | Day infected symptom onset (UB) |
---|---|---|---|
21 | 25 | 26 | 28 |
23 | 26 | 26 | 30 |
34 | 39 | 41 | 43 |
46 | 52 | 54 | 55 |
… | … | … | … |
For an example of how we can parameterise the SI using this method, see example C in the Examples section.
It is important to note that, particularly for outbreaks of emerging diseases, the serial interval estimate can be updated when you gather more data as the epidemic progresses.
The package incidence
can be used to easily plot
incidence, as shown in the following example using a pre-loaded
EpiEstim
dataset of an Influenza outbreak in 2009
(“Flu2009”).
data("Flu2009") # load example dataset
head(Flu2009$incidence) # displays the first n rows of the Flu2009$incidence data
#> dates I
#> 1 2009-04-27 1
#> 2 2009-04-28 1
#> 3 2009-04-29 0
#> 4 2009-04-30 2
#> 5 2009-05-01 5
#> 6 2009-05-02 3
For instance, in this example, incidence (I) is a count of
individuals with a symptom onset on a particular date. We can plot this
using incidence
as follows:
incidence
can convert the dates of symptom onset from
linelists with various formats into an incidence object that can be
easily understood by EpiEstim
.
Please see the incidence
vignette overview for more
details, features and examples: https://cran.r-project.org/web/packages/incidence/vignettes/overview.html
Local and Imported cases:
If you have data that distinguishes between local and imported cases, this can be supplied to EpiEstim with local and imported incidence in two separate columns, for example in this dataset:
Date | Incidence Overall | Incidence Local | Incidence Imported |
---|---|---|---|
dd/mm/yy | 31 | 28 | 3 |
dd/mm/yy | 48 | 46 | 2 |
dd/mm/yy | 45 | 44 | 1 |
dd/mm/yy | 52 | 48 | 4 |
… | … | … | … |
Temporally aggregated incidence data:
This vignette provides examples using incidence data that is
available on a daily basis. If you have incidence data that is reported
over aggregated timescales, such as weekly data or incidence reported 3
times a week, you will need to provide an additional parameter ‘dt’ to
the estimate_R()
function introduced in step C.
For full details about using EpiEstim for temporally aggregated incidence data, see the “EpiEstim_aggregated_incidence” vignette.
Once you have an incidence object (based on the dates of symptom onset) and information on the serial interval distribution, we can use the renewal equation (a form of branching process model) to estimate Rt. Rt is the average number of secondary cases that an infected individual would infect if the conditions remained as they were at time t (Cori et al 2013).
Here It represents the incidence of symptom onset at time t, which is approximated by a Poisson process using the renewal equation. Rt is what we want to estimate, It-s is the past incidence and ω is the serial interval distribution.
estimate_R()
This is where the estimate_R()
function comes in. We
supply EpiEstim
with the incidence and tell the package
which method we want to use to parameterise the serial interval.
estimate_R(
incid = incidence_object,
method = c("parametric_si", "non_parametric_si", "si_from_data",
"uncertain_si", "si_from_sample"), # choose one e.g. method = "parametric_si"
si_data = NULL, # if using "si_from_data" supply infector-infected data here
si_sample = NULL, # if using "si_from_sample" supply matrix where each column is an SI to be explored
config = make_config(list())
)
As mentioned, the serial interval (SI) can be defined using one of the following methods:
Parametric - "parametric_si"
uses the
mean and standard deviation. Ex. The user can supply
mean_si = 3
and std_si = 1
, which generates an
offset gamma distribution as the default:
Non-parametric - "non_parametric_si"
uses the discrete distribution of the SI. Ex. The user can supply
si_distr = c(0.000, 0.233, 0.359, 0.198, 0.103, 0.053, 0.027, 0.014, 0.007, 0.003, 0.002, 0.001)
,
which generates:
Uncertain - If we are particularly uncertain about
the parameters of the SI distribution, then we can use
"uncertain_si"
to draw the mean and standard deviation from
truncated normal distributions and generate multiple potential SI
distributions and integrate the results.
Ex. Here, the mean of the SI will be drawn from a normal distribution with a mean of 3 and SD of 1, truncated at 1 and 4 (left). The SD of the SI will be drawn from a normal distribution with a mean of 1.5 and SD of 0.5, truncated at 0.5 and 2.5 (centre). The drawn values will parameterise multiple SI distributions to be considered, but for clarity, the example below only displays 5 (right).