Title: | Demographic analysis of DHS and other household surveys |
---|---|
Description: | This package includes tools for calculating demographic indicators from household survey data. Initially developed for for processing and analysis from Demographic and Health Surveys (DHS) and Multiple Indicator Cluster Surveys (MICS). The package provides tools to calculate standard child mortality, adult mortality, and fertility indicators stratified arbitrarily by age group, calendar period, pre-survey time periods, birth cohorts and other survey variables (e.g. residence, region, wealth status, education, etc.). Design-based standard errors and sample correlations are available for all indicators via Taylor linearisation or jackknife. |
Authors: | Jeff Eaton [aut, cre], Bruno Masquelier [aut] |
Maintainer: | Jeff Eaton <[email protected]> |
License: | GPL-3 |
Version: | 0.2.6 |
Built: | 2024-11-20 03:20:25 UTC |
Source: | https://github.com/mrc-ide/demogsurv |
Construct a model matrix for aggregating over age groups
.mm_aggr(mf, agegr)
.mm_aggr(mf, agegr)
mf |
Model frame for predicted rates |
agegr |
Numeric vector defining ages in years for splits. |
Calculate age-specific fertility rate (ASFR) and total fertility rate (TFR)
calc_asfr( data, by = NULL, agegr = 3:10 * 5, period = NULL, cohort = NULL, tips = c(0, 3), clusters = ~v021, strata = ~v024 + v025, id = "caseid", dob = "v011", intv = "v008", weight = "v005", varmethod = "lin", bvars = grep("^b3\\_[0-9]*", names(data), value = TRUE), birth_displace = 1e-06, origin = 1900, scale = 12, bhdata = NULL, counts = FALSE, clustcounts = FALSE )
calc_asfr( data, by = NULL, agegr = 3:10 * 5, period = NULL, cohort = NULL, tips = c(0, 3), clusters = ~v021, strata = ~v024 + v025, id = "caseid", dob = "v011", intv = "v008", weight = "v005", varmethod = "lin", bvars = grep("^b3\\_[0-9]*", names(data), value = TRUE), birth_displace = 1e-06, origin = 1900, scale = 12, bhdata = NULL, counts = FALSE, clustcounts = FALSE )
data |
A dataset (data.frame), for example a DHS individual recode (IR) dataset. |
by |
A formula specifying factor variables by which to stratify analysis. |
agegr |
Numeric vector defining ages in years for splits. |
period |
Numeric vector defining calendar periods to stratify analysis, use |
cohort |
Numeric vector defining birth cohorts to stratify analysis, use |
tips |
Break points for TIme Preceding Survey. |
clusters |
Formula or data frame specifying cluster ids from largest level to smallest level, ‘~0’ or ‘~1’ is a formula for no clusters. |
strata |
Formula or vector specifying strata, use ‘NULL’ for no strata. |
id |
Variable name for identifying each individual respondent (character string). |
dob |
Variable name for date of birth of each individual (character string). |
intv |
Variable name for interview date (character string). |
weight |
Formula or vector specifying sampling weights. |
varmethod |
Method for variance calculation. Currently "lin" for Taylor linearisation or "jk1" for unstratified jackknife, or "jkn", for stratified jackknife. |
bvars |
Names of variables giving child dates of birth. If |
birth_displace |
Numeric value to displace multiple births date of birth by. Default is '1e-6'. |
origin |
Origin year for date arguments. 1900 for CMC inputs. |
scale |
Scale for dates inputs to calendar years. 12 for CMC inputs. |
bhdata |
A birth history dataset ( |
counts |
Whether to include counts of births & person-years ('pys')
in the returned |
clustcounts |
Whether to return additional attributes storing cluster
specific counts of births |
Events and person-years are calculated using normalized weights. Unweighted
aggregations may be output by specifying weights=NULL
(default) or
weights=~1
.
The assumption is that all dates in the data are specified in the same
format, typically century month code (CMC). The period
argument is
specified in calendar years (possibly non-integer).
Default values for agegr
, period
, and tips
parameters returns
age-specific fertility rates over the three-years preceding the survey,
the standard fertility indicator produced in DHS reports.
A data.frame
consisting of estimates and standard errors. The full
covariance matrix of the estimates can be retrieved by vcov(val)
.
data(zzir) ## Replicate DHS Table 5.1 ## ASFR and TFR in 3 years preceding survey by residence calc_asfr(zzir, ~1, tips=c(0, 3)) reshape2::dcast(calc_asfr(zzir, ~v025, tips=c(0, 3)), agegr ~ v025, value.var = "asfr") calc_tfr(zzir, ~v025) calc_tfr(zzir, ~1) ## Replicate DHS Table 5.2 ## TFR by resdience, region, education, and wealth quintile calc_tfr(zzir, ~v102) # residence calc_tfr(zzir, ~v101) # region calc_tfr(zzir, ~v106) # education calc_tfr(zzir, ~v190) # wealth calc_tfr(zzir) # total ## Calculate annual TFR estimates for 10 years preceding survey tfr_ann <- calc_tfr(zzir, tips=0:9) ## Sample covariance of annual TFR estimates arising from complex survey design cov2cor(vcov(tfr_ann)) ## Alternately, calculate TFR estimates by calendar year tfr_cal <- calc_tfr(zzir, period = 2004:2015, tips=NULL) tfr_cal ## sample covariance of annual TFR estimates arising from complex survey design ## Generate estimates split by period and TIPS cov2cor(vcov(tfr_cal)) calc_tfr(zzir, period = c(2010, 2013, 2015), tips=0:5) ## ASFR estimates by birth cohort asfr_coh <- calc_asfr(zzir, cohort=c(1980, 1985, 1990, 1995), tips=NULL) reshape2::dcast(asfr_coh, agegr ~ cohort, value.var = "asfr")
data(zzir) ## Replicate DHS Table 5.1 ## ASFR and TFR in 3 years preceding survey by residence calc_asfr(zzir, ~1, tips=c(0, 3)) reshape2::dcast(calc_asfr(zzir, ~v025, tips=c(0, 3)), agegr ~ v025, value.var = "asfr") calc_tfr(zzir, ~v025) calc_tfr(zzir, ~1) ## Replicate DHS Table 5.2 ## TFR by resdience, region, education, and wealth quintile calc_tfr(zzir, ~v102) # residence calc_tfr(zzir, ~v101) # region calc_tfr(zzir, ~v106) # education calc_tfr(zzir, ~v190) # wealth calc_tfr(zzir) # total ## Calculate annual TFR estimates for 10 years preceding survey tfr_ann <- calc_tfr(zzir, tips=0:9) ## Sample covariance of annual TFR estimates arising from complex survey design cov2cor(vcov(tfr_ann)) ## Alternately, calculate TFR estimates by calendar year tfr_cal <- calc_tfr(zzir, period = 2004:2015, tips=NULL) tfr_cal ## sample covariance of annual TFR estimates arising from complex survey design ## Generate estimates split by period and TIPS cov2cor(vcov(tfr_cal)) calc_tfr(zzir, period = c(2010, 2013, 2015), tips=0:5) ## ASFR estimates by birth cohort asfr_coh <- calc_asfr(zzir, cohort=c(1980, 1985, 1990, 1995), tips=NULL) reshape2::dcast(asfr_coh, agegr ~ cohort, value.var = "asfr")
Should replicate mortality rates reported in DHS reports.
calc_dhs_mx(sib, period = c(0, 84))
calc_dhs_mx(sib, period = c(0, 84))
sib |
A dataset as 'data.frame'. |
period |
Interval 'period' defined in the months before the survey. |
Default arguments are configured to calculate under 5 mortality from a DHS Births Recode file.
calc_nqx( data, by = NULL, agegr = c(0, 1, 3, 5, 12, 24, 36, 48, 60)/12, period = NULL, cohort = NULL, tips = c(0, 5, 10, 15), clusters = ~v021, strata = ~v024 + v025, weight = "v005", dob = "b3", dod = "dod", death = "death", intv = "v008", varmethod = "lin", origin = 1900, scale = 12 )
calc_nqx( data, by = NULL, agegr = c(0, 1, 3, 5, 12, 24, 36, 48, 60)/12, period = NULL, cohort = NULL, tips = c(0, 5, 10, 15), clusters = ~v021, strata = ~v024 + v025, weight = "v005", dob = "b3", dod = "dod", death = "death", intv = "v008", varmethod = "lin", origin = 1900, scale = 12 )
data |
A dataset (data.frame), for example a DHS births recode (BR) dataset. |
by |
A formula specifying factor variables by which to stratify analysis. |
agegr |
Numeric vector defining ages in years for splits. |
period |
Numeric vector defining calendar periods to stratify analysis, use |
cohort |
Numeric vector defining birth cohorts to stratify analysis, use |
tips |
Break points for TIme Preceding Survey. |
clusters |
Formula or data frame specifying cluster ids from largest level to smallest level, ‘~0’ or ‘~1’ is a formula for no clusters. |
strata |
Formula or vector specifying strata, use ‘NULL’ for no strata. |
weight |
Formula or vector specifying sampling weights. |
dob |
Variable name for date of birth (character string). |
dod |
Variable name for date of death (character string). |
death |
Variable name for event variable (character string). |
intv |
Variable name for interview date (character string). |
varmethod |
Method for variance calculation. Currently "lin" for Taylor linearisation or "jk1" for unstratified jackknife, or "jkn", for stratified jackknife. |
origin |
Origin year for date arguments. 1900 for CMC inputs. |
scale |
Scale for dates inputs to calendar years. 12 for CMC inputs. |
data(zzbr) zzbr$death <- zzbr$b5 == "no" # b5: child still alive ("yes"/"no") zzbr$dod <- zzbr$b3 + zzbr$b7 + 0.5 ## Calculate 5q0 from birth history dataset. ## Note this does NOT exactly match DHS calculation. ## See calc_dhs_u5mr(). u5mr <- calc_nqx(zzbr) u5mr ## Retrieve sample covariance and correlation vcov(u5mr) # sample covariance cov2cor(vcov(u5mr)) # sample correlation ## 5q0 by sociodemographic characteristics calc_nqx(zzbr, by=~v102) # by urban/rural residence calc_nqx(zzbr, by=~v190, tips=c(0, 10)) # by wealth quintile, 0-9 years before calc_nqx(zzbr, by=~v101+v102, tips=c(0, 10)) # by region and residence ## Compare unstratified standard error estiamtes for linearization and jackknife calc_nqx(zzbr, varmethod = "lin") # unstratified design calc_nqx(zzbr, strata=NULL, varmethod = "lin") # unstratified design calc_nqx(zzbr, strata=NULL, varmethod = "jk1") # unstratififed jackknife calc_nqx(zzbr, varmethod = "jkn") # stratififed jackknife ## Calculate various child mortality indicators (neonatal, infant, etc.) calc_nqx(zzbr, agegr=c(0, 1)/12) # neonatal calc_nqx(zzbr, agegr=c(1, 3, 5, 12)/12) # postneonatal calc_nqx(zzbr, agegr=c(0, 1, 3, 5, 12)/12) # infant (1q0) calc_nqx(zzbr, agegr=c(12, 24, 36, 48, 60)/12) # child (4q1) calc_nqx(zzbr, agegr=c(0, 1, 3, 5, 12, 24, 36, 48, 60)/12) # u5mr (5q0) ## Calculate annaul 5q0 by calendar year calc_nqx(zzbr, period=2005:2015, tips=NULL)
data(zzbr) zzbr$death <- zzbr$b5 == "no" # b5: child still alive ("yes"/"no") zzbr$dod <- zzbr$b3 + zzbr$b7 + 0.5 ## Calculate 5q0 from birth history dataset. ## Note this does NOT exactly match DHS calculation. ## See calc_dhs_u5mr(). u5mr <- calc_nqx(zzbr) u5mr ## Retrieve sample covariance and correlation vcov(u5mr) # sample covariance cov2cor(vcov(u5mr)) # sample correlation ## 5q0 by sociodemographic characteristics calc_nqx(zzbr, by=~v102) # by urban/rural residence calc_nqx(zzbr, by=~v190, tips=c(0, 10)) # by wealth quintile, 0-9 years before calc_nqx(zzbr, by=~v101+v102, tips=c(0, 10)) # by region and residence ## Compare unstratified standard error estiamtes for linearization and jackknife calc_nqx(zzbr, varmethod = "lin") # unstratified design calc_nqx(zzbr, strata=NULL, varmethod = "lin") # unstratified design calc_nqx(zzbr, strata=NULL, varmethod = "jk1") # unstratififed jackknife calc_nqx(zzbr, varmethod = "jkn") # stratififed jackknife ## Calculate various child mortality indicators (neonatal, infant, etc.) calc_nqx(zzbr, agegr=c(0, 1)/12) # neonatal calc_nqx(zzbr, agegr=c(1, 3, 5, 12)/12) # postneonatal calc_nqx(zzbr, agegr=c(0, 1, 3, 5, 12)/12) # infant (1q0) calc_nqx(zzbr, agegr=c(12, 24, 36, 48, 60)/12) # child (4q1) calc_nqx(zzbr, agegr=c(0, 1, 3, 5, 12, 24, 36, 48, 60)/12) # u5mr (5q0) ## Calculate annaul 5q0 by calendar year calc_nqx(zzbr, period=2005:2015, tips=NULL)
Create episode dataset split by period, age group, and time preceding survey indicator (TIPS)
create_tips_data( dat, period = do.call(seq.int, as.list(range(dat$intvy) + c(-16, 1))), agegr = 3:12 * 5, tips = 0:15, dobvar = "sibdob", dodvar = "sibdod" )
create_tips_data( dat, period = do.call(seq.int, as.list(range(dat$intvy) + c(-16, 1))), agegr = 3:12 * 5, tips = 0:15, dobvar = "sibdob", dodvar = "sibdod" )
dat |
A dataset as 'data.frame'. |
period |
Numeric vector defining calendar periods to stratify analysis, use 'NULL' for no periods. |
agegr |
Numeric vector defining ages *in years* for splits. |
tips |
Break points for TIme Preceding Survey. |
dobvar |
Variable name for date of birth (character string). |
dodvar |
Variable name for date of death (character string). |
This is a wrapper for the pyears
function
in the survival
package with convenient stratifications for
demographic analyses.
demog_pyears( formula, data, period = NULL, agegr = NULL, cohort = NULL, tips = NULL, origin = 1900, scale = 12, dob = "(dob)", intv = "(intv)", tstart = "tstart", tstop = "tstop", event = "event", weights = NULL )
demog_pyears( formula, data, period = NULL, agegr = NULL, cohort = NULL, tips = NULL, origin = 1900, scale = 12, dob = "(dob)", intv = "(intv)", tstart = "tstart", tstop = "tstop", event = "event", weights = NULL )
formula |
a formula object.
The response variable will be a vector of follow-up times
for each subject, or a |
data |
a data frame in which to interpret the variables named in
the |
period |
Numeric vector defining calendar periods to stratify analysis, use |
agegr |
Numeric vector defining ages in years for splits. |
cohort |
Numeric vector defining birth cohorts to stratify analysis, use |
tips |
Break points for TIme Preceding Survey. |
origin |
Origin year for date arguments. 1900 for CMC inputs. |
scale |
a scaling for the results. As most rate tables are in units/day, the default value of 365.25 causes the output to be reported in years. |
dob |
Variable name for date of birth (character string). |
intv |
Variable name for interview date (character string). |
tstart |
Variable name for the start of follow up time, example is date of birth. Default is 'tstart'. |
tstop |
Variable name for the end of follow up time, examples include interview date or date of death. Default is 'tend'. |
event |
Variable name for the event indicator, example is birth or death. Default is 'event'. |
weights |
case weights. |
Note that event
must be a binary variable per the internals
of the pyears()
function. The function could
be updated to work around this stipulation.
Calculate the covariance matrix for a vector of estimates of the form fn(L * x/n)
using unstratified (JK1) or stratified (JKn) jackknife calculation removing a
single cluster at a time. The calculation assumes infinite population sampling.
jackknife(x, n, strataid = NULL, L = diag(nrow(x)), fn = function(x) x)
jackknife(x, n, strataid = NULL, L = diag(nrow(x)), fn = function(x) x)
x |
|
n |
|
strataid |
integer or factor vector consisting of id for each strata. Optional, length should be number of columns of x if supplied. |
L |
|
fn |
function to transorm ratio x/n. |
If strataid
is provided, then the stratified (JKn) covariance is calculated, while
if strataid = NULL
then the unstratified (JK1) covariance is calculated. The
latter corresponds to the unstratified jackknife covariance reported in DHS survey
reports. The calculations are equivalent for strataid = rep(1, ncol(x))
.
a data frame with q
rows consisting of estimates calculated as
fn(L * rowSums(x) / rowSums(n) )
, standard error, and 95% CIs calculated
on the untransformed scale and then transformed. The covariance matrix is
returned as the "var"
attribute and can be accessed by vcov(val)
.
Pedersen J, Liu J (2012) Child Mortality Estimation: Appropriate Time Periods for Child Mortality Estimates from Full Birth Histories. PLoS Med 9(8): e1001289. https://doi.org/10.1371/journal.pmed.1001289.
Convert respondent-level sibling history data to one row per sibling
reshape_sib_data( data, widevars = grep("^v", names(data), value = TRUE), longvars = grep(sibvar_regex, names(data), value = TRUE), idvar = "caseid", sib_vars = sub("(.*)_.*", "\\1", longvars), sib_idvar = "mmidx", sibvar_regex = "^mm[idx0-9]" )
reshape_sib_data( data, widevars = grep("^v", names(data), value = TRUE), longvars = grep(sibvar_regex, names(data), value = TRUE), idvar = "caseid", sib_vars = sub("(.*)_.*", "\\1", longvars), sib_idvar = "mmidx", sibvar_regex = "^mm[idx0-9]" )
data |
A dataset as |
widevars |
Character vector of respondent-level variable names to include. |
longvars |
Character vector of variables corresponding to each sibling. |
idvar |
Vector of variable names uniquely identifying each respondent. |
sib_vars |
Vector of same length as longvars giving variable names in long dataset. |
sib_idvar |
Variable name uniquely identifying each sibling record. Should appear amongst |
sibvar_regex |
Optionally, a regular expression to identify variable names for |
data(zzir) zzsib <- reshape_sib_data(zzir) zzsib$death <- factor(zzsib$mm2, c("dead", "alive")) == "dead" zzsib$sex <- factor(zzsib$mm1, c("female", "male")) # drop mm2 = 3: "missing" calc_nqx(zzsib, by=~sex, agegr=seq(15, 50, 5), tips=c(0, 7), dob="mm4", dod="mm8")
data(zzir) zzsib <- reshape_sib_data(zzir) zzsib$death <- factor(zzsib$mm2, c("dead", "alive")) == "dead" zzsib$sex <- factor(zzsib$mm1, c("female", "male")) # drop mm2 = 3: "missing" calc_nqx(zzsib, by=~sex, agegr=seq(15, 50, 5), tips=c(0, 7), dob="mm4", dod="mm8")
An example DHS births recode dataset with each row representing a child ever born to individuals eligible for the women's questionnaire. This data is not from any actual DHS survey.
zzbr
zzbr
A data frame with 23,666 rows and 105 variables. To keep the model dataset small only variables starting with 'caseid', 'v0', 'v1', or 'b', have been included.
https://dhsprogram.com/data/Download-Model-Datasets.cfm
An example DHS individual recode dataset with each row representing an individual eligible for the women's questionnaire. This data is not from any actual DHS survey.
zzir
zzir
A data frame with 8,348 rows and 819 variables. To keep the model dataset small only variables starting with 'caseid', 'v0', 'v1', 'v2', 'b', or 'mm' have been included.
https://dhsprogram.com/data/Download-Model-Datasets.cfm