Title: | Utility Functions For Naomi Datasets |
---|---|
Description: | This package contains utility functions for creating and manipulating datasets for the Naomi model and related projects. |
Authors: | Jeffrey Eaton [aut, cre] |
Maintainer: | Jeffrey Eaton <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.13 |
Built: | 2024-10-12 03:43:48 UTC |
Source: | https://github.com/mrc-ide/naomi.utils |
Allocate areas at the most granular level to survey regions via spatial join based on largest overlapping area.
allocate_areas_survey_regions(areas_wide, survey_region_boundaries)
areas_wide |
wide format area hierarchy, created by |
survey_region_boundaries |
survey_region_boundaries dataset created by
|
The function sf::st_join(..., largest = TRUE) is used to construct a spatial join based on the area of largest overlap.
If the mapping is clean, the following should be satisfied:
All areas are allocated to a survey region. This might not happen if an area is non-overlapping with the survey geometry.
All survey regions should contain some areas. This might not happen if all areas overlapping a region are not cleanly nested and have a larger overlap with other regions.
The function assert_survey_region_areas() implements these checks.
These conditions are not comprehensive and do not guarantee the mapping is accurate, but will catch some basic errors.
A simple features data frame consisting of a mapping of all areas to a survey_region_id.
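The largest-overlap rule can be illustrated without any spatial library. The sketch below assumes the overlap areas have already been computed and stored in a hypothetical table `overlaps`; the helper `allocate_largest_overlap()` is illustrative only and is not part of the package, which delegates this step to sf::st_join(..., largest = TRUE).

```r
# Hypothetical overlap table: one row per intersecting (area, survey region)
# pair, with the overlapping area in arbitrary units.
overlaps <- data.frame(
  area_id          = c("A1", "A1", "A2", "A3", "A3"),
  survey_region_id = c(1, 2, 2, 2, 3),
  overlap          = c(8.0, 2.0, 5.0, 1.5, 6.5),
  stringsAsFactors = FALSE
)

# For each area, keep the survey region with the largest overlap.
allocate_largest_overlap <- function(overlaps) {
  by_area <- split(overlaps, overlaps$area_id)
  picked <- lapply(by_area, function(d) d[which.max(d$overlap), ])
  out <- do.call(rbind, picked)
  rownames(out) <- NULL
  out[, c("area_id", "survey_region_id")]
}

allocation <- allocate_largest_overlap(overlaps)
print(allocation)
```

An area that straddles two regions (such as A1 or A3 here) is assigned wholly to one of them, which is why the checks described above remain useful after the join.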
Checks for consistent area IDs between two datasets
assert_area_id_check(df1, df2, key)
df1 |
a dataframe containing area_id |
df2 |
a dataframe containing area_id |
key |
list of columns to compare |
If area IDs are present in only one of the two datasets, an error will be generated along with a map of the mismatching IDs.
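A minimal sketch of the comparison, assuming a single key column named `area_id`. The helper `find_area_id_mismatch()` is hypothetical and only reports the mismatches; it does not draw the map or raise the error that assert_area_id_check() is described as producing.

```r
# Two hypothetical datasets whose area IDs only partly agree.
df1 <- data.frame(area_id = c("MWI_1_1", "MWI_1_2", "MWI_1_3"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(area_id = c("MWI_1_2", "MWI_1_3", "MWI_1_4"),
                  stringsAsFactors = FALSE)

# Report IDs that appear in one dataset but not in the other.
find_area_id_mismatch <- function(df1, df2, key = "area_id") {
  list(
    only_in_df1 = setdiff(df1[[key]], df2[[key]]),
    only_in_df2 = setdiff(df2[[key]], df1[[key]])
  )
}

mismatch <- find_area_id_mismatch(df1, df2)
has_mismatch <- length(unlist(mismatch)) > 0
print(mismatch)
```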
Checks valid age groups
assert_pop_age_group(var)
var |
a value in dataframe extracted using |
If additional age groups are present, or age group values are missing, an error will be generated.
Checks for: (1) consistent area IDs between population data and boundaries file; (2) population data age groups consistent with Naomi age groups.
assert_pop_data_check(pop_data, boundaries)
pop_data |
population dataframe |
boundaries |
boundary file containing area IDs |
If area IDs are present in only one of the two datasets, an error will be generated along with a map of the mismatching IDs. If age groups that do not match the Naomi age groups are present, an error will be generated.
Assign each survey cluster with geocoordinates to an area, ensuring that the assigned area is contained in the specified survey region.
assign_dhs_cluster_areas(survey_clusters, survey_region_areas)
survey_clusters |
Interim survey clusters dataset created by
|
survey_region_areas |
Dataset of the areas contained in each survey region, created by
|
survey_region_areas is a list of candidate location areas for each cluster. Candidate areas are joined and the nearest area is then selected based on distance. Usually the coordinate should be contained in an area (distance = 0).
For each survey cluster with geographic coordinates, the area ID containing the cluster is assigned by:
Identify all areas contained in the survey region in which the cluster is located. This comprises the set of candidate areas where the cluster could be located.
Calculate the nearest distance from the cluster coordinates to each candidate area. This distance is 0 if a cluster is contained in an area.
Select the area with the smallest distance; in most cases this is an area containing the cluster (distance = 0).
sf::st_distance() is substantially slower than sf::st_join(). This function could be (maybe much) more efficient by first using st_join() to assign the majority of clusters that are contained in an area, then calculating the distance for the remaining clusters that were not contained inside any of the candidate areas.
Survey clusters dataset with an area_id assigned for each cluster with geographic coordinates and formatted conforming to survey_clusters schema.
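The selection step can be sketched with a precomputed distance table. The data frame `candidates` below is hypothetical (in the package the distances would come from sf::st_distance()), and the base R code only illustrates picking the nearest candidate area per cluster.

```r
# Hypothetical candidate table: each cluster paired with every area in its
# survey region, plus the distance (in metres) from the cluster to the area.
candidates <- data.frame(
  cluster_id = c(1, 1, 2, 2, 3),
  area_id    = c("A1", "A2", "A1", "A2", "A3"),
  distance   = c(0, 1200, 350, 90, 0),
  stringsAsFactors = FALSE
)

# For each cluster, keep the candidate area with the smallest distance.
nearest <- do.call(
  rbind,
  lapply(split(candidates, candidates$cluster_id),
         function(d) d[which.min(d$distance), ])
)
rownames(nearest) <- NULL
print(nearest)
```

Cluster 2 is not contained in any candidate area, so it falls back to the closest one; clusters 1 and 3 are contained (distance = 0).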
Calculate age/sex/area stratified survey estimates for biomarker outcomes
calc_survey_hiv_indicators( survey_meta, survey_regions, survey_clusters, survey_individuals, survey_biomarker, areas, sex = c("male", "female", "both"), age_group_include = NULL, area_top_level = min(areas$area_level), area_bottom_level = max(areas$area_level), artcov_definition = c("both", "arv", "artself"), by_res_type = FALSE )
survey_meta |
Survey metadata. |
survey_regions |
Survey regions. |
survey_clusters |
Survey clusters. |
survey_individuals |
Survey individuals. |
survey_biomarker |
Survey biomarkers. |
areas |
Areas. |
sex |
Sex. |
age_group_include |
Vector of age groups to include |
area_top_level |
Area top level. |
area_bottom_level |
Area bottom level. |
artcov_definition |
Definition to use for calculating ART coverage. |
by_res_type |
Whether to stratify estimates by urban/rural res_type; logical. |
All other data will be subsetted based on the survey_id values appearing in survey_meta, so to calculate estimates for only a subset of surveys it is sufficient to pass a subset of survey_meta and the full data frames for the others.
Much of this function needs to be parsed out into more generic functions and rewritten to be more efficient.
Age group would be more efficient if traversing a tree structure.
Need generic function to calculate
Flexibility about age/sex stratifications to calculate.
The argument artcov_definition controls whether to use both ARV biomarker and self-report (artcov_definition = "both"; default), ARV biomarker only (artcov_definition = "arv"), or self-report ART use only (artcov_definition = "artself"). If the option is "both", then all HIV-positive individuals are used as the denominator and no missing data on either indicator are incorporated. If the option is "arv" or "artself", then missing values in those variables, respectively, are treated as missing.
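One possible reading of the three options is sketched below. The helper `calc_artcov()` is hypothetical and operates on 0/1 vectors for HIV-positive respondents; in particular, treating a missing value as "not on ART" under "both" is an assumption made for illustration, not confirmed package behaviour.

```r
# Hypothetical per-respondent ART indicator under each definition.
# arv, artself: 0/1 vectors with NA for missing, HIV-positive respondents only.
calc_artcov <- function(arv, artself,
                        definition = c("both", "arv", "artself")) {
  definition <- match.arg(definition)
  switch(
    definition,
    # Assumption: on ART if either source says so; missing counts as 0,
    # so every respondent stays in the denominator.
    both    = as.integer((!is.na(arv) & arv == 1) |
                           (!is.na(artself) & artself == 1)),
    # Single-source definitions keep missing values as missing.
    arv     = as.integer(arv),
    artself = as.integer(artself)
  )
}

arv     <- c(1, 0, NA, NA)
artself <- c(0, 0, 1, NA)

both_def <- calc_artcov(arv, artself, "both")
arv_def  <- calc_artcov(arv, artself, "arv")
mean(both_def)               # coverage over all four respondents
mean(arv_def, na.rm = TRUE)  # coverage over the two with an ARV result
```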
This function is useful for checking level of coarseness of a simplified versus raw shapefile and any slivers in a shapefile.
check_boundaries(sh1, sh2 = NULL)
sh1 |
Bottom shapefile with red boundaries |
sh2 |
Top shapefile with red boundaries |
Check whether PJNZ contains .shiny90 file
check_pjnz_shiny90(pjnz)
pjnz |
file path to PJNZ |
TODO: Check whether the .shiny90 file is valid.
Logical whether PJNZ file contains a .shiny90 file
Compare boundaries of two shapefiles by overlaying them
compare_boundaries(sh1, sh2 = NULL, aggregate = FALSE)
sh1 |
is bottom shapefile with red boundaries |
sh2 |
is top shapefile with red boundaries |
aggregate |
whether to aggregate shapefiles |
Copy a PJNZ file to a new location and delete everything except for the .DP and .PJN files.
copy_pjnz_extract(pjnz, out, shiny90 = NULL, force_shiny90 = FALSE)
pjnz |
file path to source PJNZ |
out |
file path to save output |
shiny90 |
file path to external .shiny90 zip (optional) |
force_shiny90 |
Logical whether or not to force replacement of a .shiny90 file already in the PJNZ with the provided path. The default behaviour is not to replace the .shiny90 file if it already exists in the PJNZ. |
Both pjnz and out must be length 1. To apply to multiple files, use the Map function, e.g. Map(copy_pjnz_extract, pjnz_list, out_list).
The file must be renamed (pjnz cannot equal out) to avoid inadvertently deleting components from an archived PJNZ file.
The default is force_shiny90 = FALSE.
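The Map pattern pairs each source path with its output path and calls the function once per pair. The sketch below uses a hypothetical stand-in, `copy_stub()`, in place of copy_pjnz_extract() so that it runs without any PJNZ files; the file names are made up.

```r
# Hypothetical source and destination paths (no files are touched).
pjnz_list <- c("a.PJNZ", "b.PJNZ")
out_list  <- c("a_extract.PJNZ", "b_extract.PJNZ")

# Stand-in for copy_pjnz_extract(): enforces the same length-1 and
# "must be renamed" constraints, then reports what it would do.
copy_stub <- function(pjnz, out) {
  stopifnot(length(pjnz) == 1, length(out) == 1, pjnz != out)
  paste(pjnz, "->", out)
}

# Map() calls copy_stub(pjnz_list[i], out_list[i]) for each i.
res <- Map(copy_stub, pjnz_list, out_list)
print(res)
```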
Create dataset of individual demographic and HIV outcomes.
create_individual_hiv_dhs(surveys, clear_rdhs_cache = FALSE)
surveys |
data.frame of surveys, returned by |
The following fields are extracted:
survey_id
cluster_id
household
line
sex
age
dob_cmc
interview_cmc
indweight
hivstatus
arv
artself
vls
cd4
artall
hivweight
data.frame consisting of survey ID, cluster ID and individual demographic and HIV outcomes. See details.
## Not run:
surveys <- create_surveys_dhs("MWI")
individuals <- create_individual_hiv_dhs(surveys)
## End(Not run)
Create survey region boundaries dataset from DHS spatial data repository
create_survey_boundaries_dhs( surveys, levelrnk_select = NULL, verbose_download = FALSE )
surveys |
data.frame of surveys, returned by |
levelrnk_select |
A named vector specifying which LEVELRNK to select for a given survey if multiple level ranks are available. Defaults to NULL, in which case the level with the largest number of regions is selected. See details. |
verbose_download |
Whether to print messages from |
For some surveys, the DHS spatial data repository and the survey clusters datasets contain boundaries at multiple levels (e.g. admin 1 and admin 2). In these cases, the admin level with the largest number of regions is selected by default. The options for multiple level surveys will be printed as messages. To select a different level, supply a named vector with survey_id / LEVELRNK pairs, for example levelrnk_select = c("MWI2015DHS" = 1). See examples.
A simple features data frame containing DHS region code, region name, and region boundaries for each survey.
## Not run:
surveys <- create_surveys_dhs("MWI")
region_boundaries <- create_survey_boundaries_dhs(surveys)

## Select three regions
levelrnk_select <- c("MWI2015DHS" = 1)
region_boundaries <- create_survey_boundaries_dhs(surveys, levelrnk_select)
## End(Not run)
Create male circumcision outcomes dataset from DHS
create_survey_circumcision_dhs(surveys, clear_rdhs_cache = FALSE)
surveys |
data.frame of surveys, returned by |
The following fields are extracted:
survey_id
individual_id
circumcised
circ_age
circ_where
circ_who
data.frame consisting of survey ID, individual ID and male circumcision outcomes. See details.
## Not run:
surveys <- create_surveys_dhs("MWI")
circ <- create_survey_circumcision_dhs(surveys)
## End(Not run)
Create survey clusters dataset from DHS household recode and geocluster datasets.
create_survey_clusters_dhs(surveys, clear_rdhs_cache = FALSE)
surveys |
data.frame of surveys, returned by |
data.frame consisting of survey clusters, survey region id, and cluster geographic coordinates if available.
## Not run:
surveys <- create_surveys_dhs("MWI")
survey_regions <- create_survey_boundaries_dhs(surveys)
surveys <- surveys_add_dhs_regvar(surveys, survey_regions)
survey_clusters <- create_survey_clusters_dhs(surveys)
## End(Not run)
Create survey individuals and biomarker dataset from DHS extract
create_survey_individuals_dhs(dat)
create_survey_biomarker_dhs(dat)
dat |
data.frame of merged individual extract, returned by
|
data.frame matching UNAIDS data schema
Create DHS survey meta data table
create_survey_meta_dhs(surveys)
surveys |
data.frame of surveys, returned by |
data.frame of survey metadata specification.
## Not run:
surveys <- create_surveys_dhs("MWI")
survey_meta <- create_survey_meta_dhs(surveys)
## End(Not run)
Construct survey regions dataset by identifying the smallest area_id that contains the whole survey region.
create_survey_regions_dhs(survey_region_areas)
survey_region_areas |
Area allocation to survey regions, created by
|
Survey regions dataset conforming to schema.
Construct a surveys dataset from DHS API. Uses rdhs to identify the DHS country code from the ISO3, selects relevant surveys, then constructs the survey_id and survey_mid_calendar_quarter.
create_surveys_dhs( iso3, survey_type = c("DHS", "AIS", "MIS"), survey_characteristics = 23 )
iso3 |
Three letter ISO3 country code. |
survey_type |
DHS survey types to access. See |
survey_characteristics |
DHS survey characteristic IDs to filter on. See |
A data frame containing the response from the dhs_surveys API endpoint and the survey_id and survey_mid_calendar_quarter.
## Not run:
create_surveys_dhs("MWI")
## End(Not run)
Convert nested hierarchy from wide to long format
gather_areas(x)
x |
Wide format nested hierarchy. |
Generate a Naomi area ID consisting of ISO3, area level and a random alphanumeric string of nchar characters.
generate_area_id(iso3, level, nchar = 5)
iso3 |
three character ISO3 code |
level |
area level as an integer |
nchar |
number of alpha numeric digits to generate |
This function is not vectorized. It generates a single area ID.
This function does not set the seed. Set the seed before calling the function if you want to reproduce the same results.
An area_id in the format <ISO3>_<level>_<xyz12>.
generate_area_id("ISO", 1)
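The ID format and the note about seeding can be illustrated with a base R sketch. The helper `sketch_area_id()` is hypothetical; drawing the suffix from lowercase letters and digits is an assumption based on the `<xyz12>` pattern above, and the package's own character pool may differ.

```r
# Hypothetical re-creation of the ID format <ISO3>_<level>_<random suffix>.
sketch_area_id <- function(iso3, level, nchar = 5) {
  pool <- c(letters, 0:9)  # assumed character pool
  suffix <- paste(sample(pool, nchar, replace = TRUE), collapse = "")
  paste(iso3, level, suffix, sep = "_")
}

# Setting the same seed before each call reproduces the same ID.
set.seed(1)
id1 <- sketch_area_id("MWI", 2)
set.seed(1)
id2 <- sketch_area_id("MWI", 2)
identical(id1, id2)
```

Because the function generates a single ID, producing several IDs means calling it once per area, for example with vapply().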
Find Calendar Quarter Midpoint of Two Dates
get_mid_calendar_quarter(start_date, end_date)
start_date |
vector coercible to Date |
end_date |
vector coercible to Date |
A vector of calendar quarters
start <- c("2005-04-01", "2010-12-01", "2016-01-01")
end <- c("2005-08-01", "2011-05-01", "2016-06-01")
mid_calendar_quarter <- get_mid_calendar_quarter(start, end)
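For intuition, the midpoint calculation might look like the sketch below. The helper `mid_quarter_sketch()` is hypothetical, and both the "CY<year>Q<quarter>" label format and the choice of the exact midpoint date are assumptions; the package function may round or label differently.

```r
# Hypothetical midpoint-quarter calculation for vectors coercible to Date.
mid_quarter_sketch <- function(start_date, end_date) {
  start_date <- as.Date(start_date)
  end_date   <- as.Date(end_date)
  # Number of days between the two dates, then step halfway along.
  span_days <- as.numeric(difftime(end_date, start_date, units = "days"))
  mid_date  <- start_date + span_days / 2
  year    <- format(mid_date, "%Y")
  quarter <- (as.integer(format(mid_date, "%m")) - 1) %/% 3 + 1
  paste0("CY", year, "Q", quarter)  # assumed label format
}

start <- c("2005-04-01", "2016-01-01")
end   <- c("2005-08-01", "2016-06-01")
mid_q <- mid_quarter_sketch(start, end)
print(mid_q)
```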
Prepare output from hintr debug rds for debugging
hintr_inputs_ready(jobid, root = ".")
jobid |
The issue ID, the name of the folder in sharepoint |
root |
The debug root dir |
Path to local debug
Download debug from server and upload into sharepoint
naomi_debug( id, jobid, dest_folder = "Shared Documents/2023_debug", server = NULL )
id |
The model fit or calibrate ID to download debug for |
jobid |
The issue ID, the name of the folder to create in sharepoint |
dest_folder |
The root destination folder in sharepoint |
server |
The folder to download debug from, defaults to production server |
Path to local debug
Extract Gridded Population of the World (GPW) raster data
naomi_extract_gpw(areas, gpw_path = "~/Data/population/GPW 4.11/")
areas |
Naomi area hierarchy dataset with boundaries. |
gpw_path |
Local path to GPW v4.11 raster files. |
This function relies on accessing GPW population files via a local path to the GPW v4.11 rasters because the files are very large.
Datasets are downloaded from:
Age/sex stratified populations for 2010: https://sedac.ciesin.columbia.edu/data/set/gpw-v4-basic-demographic-characteristics-rev11/data-download (each file ~2GB).
Total population in 2000, 2005, 2010, 2015, 2020 (unraked): https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-count-rev11/data-download (each file ~400MB).
Downloaded datasets should be saved in the following directory structure under gpw_path:
~/Data/population/GPW 4.11/
├── Demographic characteristics
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a000_004_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a005_009_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a010_014_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a015_019_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a020_024_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a025_029_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a030_034_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a035_039_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a040_044_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a045_049_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a050_054_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a055_059_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a060_064_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a065_069_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a070_074_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a075_079_2010_30_sec_tif
│   ├── gpw-v4-basic-demographic-characteristics-rev11_a080_084_2010_30_sec_tif
│   └── gpw-v4-basic-demographic-characteristics-rev11_a085plus_2010_30_sec_tif
└── Unraked
    ├── gpw-v4-population-count-rev11_2000_30_sec_tif
    ├── gpw-v4-population-count-rev11_2005_30_sec_tif
    ├── gpw-v4-population-count-rev11_2010_30_sec_tif
    ├── gpw-v4-population-count-rev11_2015_30_sec_tif
    └── gpw-v4-population-count-rev11_2020_30_sec_tif
A data frame formatted as Naomi population dataset.
Extract WorldPop raster data
naomi_extract_worldpop( areas, iso3 = areas$area_id[areas$area_level == 0], years = c(2010, 2015, 2020) )
areas |
Naomi area hierarchy dataset with boundaries. |
iso3 |
ISO3 country code. |
years |
Years to extract WorldPop data |
Raster files are downloaded from the WorldPop FTP. Some files are very large. It is recommended to run this on a fast internet connection.
A data frame formatted as Naomi population dataset
Plot area hierarchy levels
plot_area_hierarchy_summary(areas, nrow = 1)
areas |
area hierarchy sf object |
nrow |
number of rows, integer. |
A ggplot2 object illustrating the area hierarchy
Summary plot of survey cluster coordinates outside boundaries
plot_survey_coordinate_check( survey_clusters, survey_region_boundaries, survey_region_areas )
survey_clusters |
Survey clusters dataset. |
survey_region_boundaries |
Survey region boundaries dataset. |
The survey_region_boundaries dataset is used to define the scope of what is plotted. A subset of regions can be plotted by subsetting that dataset to the desired range.
A list of grobs, one for each survey.
Read Spectrum region code from PJNZ file
read_pjnz_region_code(pjnz)
pjnz |
file path to source PJNZ |
Read shape file from ZIP
read_sf_zip(zfile, pattern = "shp$")
zfile |
Path to zip file |
pattern |
Pattern of files to read from the zip; defaults to files ending with 'shp' |
Reads all files in ZIP archive zfile matching pattern with function read_fn and returns as a list.
read_sf_zip_list(zfile, pattern = "\\.shp$", read_fn = sf::read_sf)
zfile |
path to a zip directory |
pattern |
string pattern passed to |
read_fn |
function used to read matched files. |
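The two default patterns, "shp$" for read_sf_zip() and "\\.shp$" for read_sf_zip_list(), differ slightly in what they match. The sketch below applies both to a hypothetical vector of file names with base R's grepl(); it does not open any archive.

```r
# Hypothetical file names as they might appear inside a zip archive.
files <- c("areas.shp", "areas.dbf", "areas_shp", "notes.shp.xml")

# "shp$" matches any name ending in "shp", with or without a dot.
loose  <- grepl("shp$", files)
# "\\.shp$" requires a literal dot before "shp".
strict <- grepl("\\.shp$", files)

data.frame(files, loose, strict)
```

Only "areas_shp" is treated differently: it matches the looser pattern but not the stricter one.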
Read country from .zip.shiny90 file
read_shiny90_country(shiny90_zip)
shiny90_zip |
path to .shiny90 export file |
Shiny90 country / region name.
Recode age group from Naomi 1 to Naomi 2
recode_naomi1_age_group(x)
x |
Character vector of age groups in Naomi 1 format |
Character vector of age groups in Naomi 2 format
recode_naomi1_age_group(c("15-19", "15+", "00+"))
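A rough sketch of such a recode is shown below, following the 15-49 to Y015_049 pattern this manual gives for the ART/ANC recode. The helper `recode_age_sketch()` is hypothetical, and encoding open-ended groups such as "15+" with an upper bound of 999 is an assumption that should be checked against the package output.

```r
# Hypothetical recode from "15-49" / "15+" style labels to "Y015_049" style.
recode_age_sketch <- function(x) {
  open_ended <- grepl("\\+$", x)
  # Lower bound: the leading run of digits.
  start <- as.integer(sub("^([0-9]+).*$", "\\1", x))
  # Upper bound: digits after the hyphen, or 999 (assumed) if open-ended.
  end <- rep(999L, length(x))
  end[!open_ended] <- as.integer(sub("^[0-9]+-([0-9]+)$", "\\1", x[!open_ended]))
  sprintf("Y%03d_%03d", start, end)
}

recoded <- recode_age_sketch(c("15-19", "15+", "00+"))
print(recoded)
```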
Update ART and ANC programme data set to Naomi 2.0 specifications
recode_naomi1_art(art)
recode_naomi1_anc(anc)
art |
Data frame of ART data conforming to Naomi 1.0 schema. |
anc |
Data frame of ANC testing data conforming to Naomi 1.0 schema. |
Rename current_art column to art_current.
Recode year column to calendar_quarter in ART dataset.
Recode age_group column from 15-49 format to Y015_049.
Recode ancrt_* columns to anc_*.
Data frame of ART data conforming to Naomi 2.0 schema.
The variable name for the survey region variable is taken from the DHS survey boundaries datasets produced by create_survey_boundaries_dhs().
Utility function to merge survey region variable name to surveys dataset from survey_region_boundaries dataset.
surveys_add_dhs_regvar(surveys, survey_region_boundaries)
surveys |
surveys dataset, data.frame. |
survey_region_boundaries |
survey_region_boundaries dataset, sf object. |
This will throw an error if the REGVAR is not unique to each survey_id within the survey_region_boundaries dataset.
The surveys data.frame
Validate naomi population dataset
validate_naomi_population(population, areas, area_level)
area_level |
area level(s) at which population is supplied |
Check that:
Column names match schema
Population stratification has exactly area_id / sex / age_group for each year data are supplied
Invisibly TRUE or raises error.
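The stratification check can be sketched in base R as follows. The helper `check_strata()` and the `calendar_quarter` column name are assumptions for illustration; here "exactly" is read as every area_id / sex / age_group combination appearing once, with none missing, in each period.

```r
# Hypothetical check that each period has one row per stratum combination.
check_strata <- function(population,
                         strata = c("area_id", "sex", "age_group"),
                         period = "calendar_quarter") {
  missing_cols <- setdiff(c(strata, period), names(population))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  for (d in split(population, population[[period]])) {
    if (anyDuplicated(d[strata]) > 0) {
      stop("Duplicated strata in ", d[[period]][1])
    }
    # A complete cross has (n areas) x (n sexes) x (n age groups) rows.
    expected <- prod(vapply(d[strata], function(col) length(unique(col)),
                            integer(1)))
    if (nrow(d) != expected) {
      stop("Incomplete strata in ", d[[period]][1])
    }
  }
  invisible(TRUE)
}

population <- expand.grid(
  area_id = c("A1", "A2"),
  sex = c("male", "female"),
  age_group = c("Y000_004", "Y005_009"),
  calendar_quarter = "CY2015Q2",
  stringsAsFactors = FALSE
)
population$population <- 100

ok <- check_strata(population)
```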
Validation of mapping to survey region areas
validate_survey_region_areas( survey_region_areas, survey_region_boundaries, warn = FALSE )
survey_region_areas |
Allocation of areas to survey regions, returned by
|
survey_region_boundaries |
survey_region_boundaries dataset created by
|
warn |
Raise a warning instead of an error (default |
Conducts checks on survey_region_areas:
All areas have been mapped to a survey region in each survey.
All survey regions contain at least one area. Otherwise no clusters could have come from that survey region.
Passing these checks does not confirm the mapping is accurate, but these checks will flag inconsistencies that need cleaning.
Invisibly TRUE or raises an error.
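The two checks and the warn switch can be sketched on a plain data frame. The helper `check_region_mapping()` and its inputs are hypothetical: `mapping` stands in for survey_region_areas (one row per area) and `region_ids` for the region IDs found in survey_region_boundaries.

```r
# Hypothetical version of the two mapping checks.
check_region_mapping <- function(mapping, region_ids, warn = FALSE) {
  problems <- character(0)
  # Check 1: every area has been allocated to a survey region.
  if (anyNA(mapping$survey_region_id)) {
    problems <- c(problems, "some areas are not allocated to a survey region")
  }
  # Check 2: every survey region contains at least one area.
  empty <- setdiff(region_ids, mapping$survey_region_id)
  if (length(empty) > 0) {
    problems <- c(problems,
                  paste("regions with no areas:", paste(empty, collapse = ", ")))
  }
  if (length(problems) > 0) {
    msg <- paste(problems, collapse = "; ")
    if (warn) warning(msg) else stop(msg)
  }
  invisible(TRUE)
}

mapping <- data.frame(area_id = c("A1", "A2", "A3"),
                      survey_region_id = c(1, 2, 2))

clean <- check_region_mapping(mapping, region_ids = c(1, 2))
# Region 3 has no areas; with warn = TRUE this warns instead of stopping.
flagged <- suppressWarnings(
  check_region_mapping(mapping, region_ids = c(1, 2, 3), warn = TRUE)
)
```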
Save an sf object as a zipped archive with the four ESRI shapefile components .shp, .dbf, .prj, .shx. This wraps sf::write_sf().
write_sf_shp_zip(obj, zipfile, overwrite = FALSE)
obj |
an object of class |
zipfile |
path to write zip output file. Must have file extension .zip. |
overwrite |
logical whether to overwrite |
Return value of file.copy(), TRUE if file successfully written.
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
write_sf_shp_zip(nc, "nc.zip")