EpiEstim Vignette

The EpiEstim package allows you to estimate the time varying reproduction number (Rt) from epidemic data. This vignette will take you through the key stages in the process of estimating Rt and also introduce some additional packages (incidence and projections) that can work alongside EpiEstim in a typical outbreak analysis workflow. Please also see the ‘Examples’ and ‘FAQs’ sections.

Overview

Input data

The standard way that outbreak data is presented is in the form of a linelist, which incorporates detailed information on the reported cases of an outbreak. For example:

Case ID Date Reported Date Symptom Onset Date Exposure M/F Age Outcome Classification
1 dd/mm/yy dd/mm/yy dd/mm/yy F 26 recovered confirmed
2 dd/mm/yy dd/mm/yy dd/mm/yy M 32 died suspected
3 dd/mm/yy dd/mm/yy dd/mm/yy M 41 recovered confirmed
4 dd/mm/yy dd/mm/yy dd/mm/yy F 68 recovered probable

In order to estimate Rt, EpiEstim needs to be supplied with an estimate of the serial interval distribution (step A) and the incidence of confirmed cases (step B).

Step A: Serial Interval

The serial interval is the time between symptom onset in successive cases. The serial interval distribution can be parameterised either by estimates taken from the literature or it can be estimated directly from the linelist if you have data for infector-infected pairs.

Serial interval from the literature:

If a disease has been well characterised, then you have the option to use estimates of the serial interval that can be found in the literature. You can either supply the mean and standard deviation (parametric distribution) or the full distribution (non-parametric distribution). See examples A and B in the Examples section.

Serial interval from infector-infected pairs:

This feature was recently incorporated into EpiEstim by Thompson et al 2019. If you have data for infector-infected pairs of cases you can estimate the serial interval distribution directly from your own data and account for its uncertainty.

For example, if “Case ID” represents the person who has been exposed and infected and “Infector ID” represents the potential source of infection, then you can format the contact data as follows:

Case ID Infector ID Source of Exposure
299 208 funeral
302 272 hospital
341 301 household
389 312 other

Combined with dates of symptom onset in the line list, these can be used to estimate the serial interval distribution in this outbreak.

In an ideal world we would have access to data in the format above, where dates of symptom onset for infector/infected individuals are very well defined, but often we do not know the exact date of symptom onset. In this situation, EpiEstim allows you to account for this uncertainty using “interval-censored data”, where there are lower and upper bounds for the symptom onset time in infector/infected cases (Thompson et al 2019).

This type of data may look like this:

Day infector symptom onset (LB) Day infector symptom onset (UB) Day infected symptom onset (LB) Day infected symptom onset (UB)
21 25 26 28
23 26 26 30
34 39 41 43
46 52 54 55

For an example of how we can parameterise the SI using this method, see example C in the Examples section.

It is important to note that, particularly for outbreaks of emerging diseases, the serial interval estimate can be updated when you gather more data as the epidemic progresses.

Step B: Incidence

The package incidence can be used to easily plot incidence, as shown in the following example using a pre-loaded EpiEstim dataset of an Influenza outbreak in 2009 (“Flu2009”).

data("Flu2009") # load example dataset
head(Flu2009$incidence) # displays the first n rows of the Flu2009$incidence data 
#>        dates I
#> 1 2009-04-27 1
#> 2 2009-04-28 1
#> 3 2009-04-29 0
#> 4 2009-04-30 2
#> 5 2009-05-01 5
#> 6 2009-05-02 3

For instance, in this example, incidence (I) is a count of individuals with a symptom onset on a particular date. We can plot this using incidence as follows:

plot(as.incidence(Flu2009$incidence$I, dates = Flu2009$incidence$dates))

incidence can convert the dates of symptom onset from linelists with various formats into an incidence object that can be easily understood by EpiEstim.

Please see the incidence vignette overview for more details, features and examples: https://cran.r-project.org/web/packages/incidence/vignettes/overview.html

Local and Imported cases:

If you have data that distinguishes between local and imported cases, this can be supplied to EpiEstim with local and imported incidence in two separate columns, for example in this dataset:

Date Incidence Overall Incidence Local Incidence Imported
dd/mm/yy 31 28 3
dd/mm/yy 48 46 2
dd/mm/yy 45 44 1
dd/mm/yy 52 48 4

Temporally aggregated incidence data:

This vignette provides examples using incidence data that is available on a daily basis. If you have incidence data that is reported over aggregated timescales, such as weekly data or incidence reported 3 times a week, you will need to provide an additional parameter ‘dt’ to the estimate_R() function introduced in step C.

For full details about using EpiEstim for temporally aggregated incidence data, see the “EpiEstim_aggregated_incidence” vignette.

Estimate Rt

Once you have an incidence object (based on the dates of symptom onset) and information on the serial interval distribution, we can use the renewal equation (a form of branching process model) to estimate Rt. Rt is the average number of secondary cases that an infected individual would infect if the conditions remained as they were at time t (Cori et al 2013).

Here It represents the incidence of symptom onset at time t, which is approximated by a Poisson process using the renewal equation. Rt is what we want to estimate, It-s is the past incidence and ω is the serial interval distribution.

Step C: estimate_R()

This is where the estimate_R() function comes in. We supply EpiEstim with the incidence and tell the package which method we want to use to parameterise the serial interval.

estimate_R(
  incid = incidence_object,
  method = c("parametric_si", "non_parametric_si", "si_from_data",
             "uncertain_si", "si_from_sample"), # choose one e.g. method = "parametric_si"
  si_data = NULL, # if using "si_from_data" supply infector-infected data here
  si_sample = NULL, # if using "si_from_sample" supply matrix where each column is an SI to be explored
  config = make_config(list())
)

As mentioned, the serial interval (SI) can be defined using one of the following methods:

Parametric - "parametric_si" uses the mean and standard deviation. Ex. The user can supply mean_si = 3 and std_si = 1, which generates an offset gamma distribution as the default:

Non-parametric - "non_parametric_si" uses the discrete distribution of the SI. Ex. The user can supply si_distr = c(0.000, 0.233, 0.359, 0.198, 0.103, 0.053, 0.027, 0.014, 0.007, 0.003, 0.002, 0.001), which generates:

Uncertain - If we are particularly uncertain about the parameters of the SI distribution, then we can use "uncertain_si" to draw the mean and standard deviation from truncated normal distributions and generate multiple potential SI distributions and integrate the results.

Ex. Here, the mean of the SI will be drawn from a normal distribution with a mean of 3 and SD of 1, truncated at 1 and 4 (left). The SD of the SI will be drawn from a normal distribution with a mean of 1.5 and SD of 0.5, truncated at 0.5 and 2.5 (centre). The drawn values will parameterise multiple SI distributions to be considered, but for clarity, the example below only displays 5 (right).