--- title: "Naomi data model" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Naomi data model} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, include = FALSE, message = FALSE} library(dplyr) library(forcats) library(ggplot2) library(sf) library(naomi) ``` ```{r load_datasets, message = FALSE, include = FALSE} #' areas area_levels <- naomi::demo_area_levels area_hierarchy <- naomi::demo_area_hierarchy area_boundaries <- naomi::demo_area_boundaries #' population population_agesex <- naomi::demo_population_agesex age_group_meta <- naomi::get_age_groups() fertility <- data.frame(area_id = character(0), time = numeric(0), age_group = character(0), calendar_quarter = character(0), asfr = numeric(0)) #' surveys survey_meta <- naomi::demo_survey_meta survey_regions <- naomi::demo_survey_regions survey_clusters <- naomi::demo_survey_clusters survey_individuals <- naomi::demo_survey_individuals survey_biomarker <- naomi::demo_survey_biomarker survey_hiv_indicators <- naomi::demo_survey_hiv_indicators #' programme art_number <- naomi::demo_art_number anc_testing <- naomi::demo_anc_testing ``` ## Data model diagramme * Coloured tables are required inputs to Naomi model. * White backgrond tables are in the ADR only. ![data model](figure/data_model.png) ## Areas data * `area_levels` contains metadata describing the levels in the area hierarchy for the country. * `area_hierarchy` contains the nested hierarchy of area IDs at each level described in `area_levels`. * `area_boundaries` defines the spatial boundaries for each area in `area_hierarchy`. The fields `center_x` and `center_y` define in `area_hierarchy` defines longitude/latitude coordinates within the area. This field is currently optional. The R package will construct these centers from the boundaries if they are not provided. They might wish to be provided for two reasons: 1. Offset centers might be provided to avoid overlapping centroids when creating bubble plots (e.g. Zomba and Zomba City). 2. In future modelling we might rely on population-weighted centroids to estimate average distances between areas. From a conceptual perspective, `area_hierarchy` and `area_boundaries` each have one record per `area_id` and it would make sense for them to be in a single table schema. They are separate schemas for convenience so that `area_hierarchy` can be saved as human-readable CSV file while `area_boundaries` is saved as `.geojson` format by default. The figures below show example code for generating a typical plot from the Areas schemas: ```{r, fig.width=7, fig.height=4, warning=FALSE} area_hierarchy %>% left_join(area_levels %>% select(area_level, area_level_label)) %>% mutate(area_level_label = area_level_label %>% fct_reorder(area_level)) %>% ggplot() + geom_sf(data = . %>% left_join(area_boundaries) %>% st_as_sf()) + geom_label(aes(center_x, center_y, label = area_sort_order), alpha = 0.5) + facet_wrap(~area_level_label, nrow = 1) + naomi:::th_map() ``` ## Population data * `age_group_meta` contains metadata definining a standardised set of age groups. This is containted in `naomi::get_age_groups()`. * `population_agesex` contains population estimates by area, sex, and five-year age group. Estimates are required at the highest level of the area hierarchy for all age groups from 0-4 through 80+. * `fertility` contains age-specific fertility rate (ASFR) estimates by area. Time is identified as `quarter_id` defined as the number of calendar quarters since the year 1900 (inspired by DHS Century Month Code [CMC]): $$ \mathrm{quarter\_id} = (\mathrm{year} - 1900) * 4 + \mathrm{quarter}.$$ The function `interpolate_population_agesex()` interpolates population estimates to specified `quarter_ids`. ```{r} naomi::get_age_groups() ``` ## Survey data * `survey_meta` contains meta data about each household survey. * `survey_hiv_indicators` is analytical table with area-level indicators. This is the table used as inputs to Naomi. Indicators are calculated for all stratifications of area/age/sex. Typically the most granular stratification would be selected for model input. The remaining tables are harmonized survey microdatasets used for calculating the indicators dataset. The table `survey_hiv_indicators` should also contain all survey HIV prevalence inputs required for Spectrum and EPP. It should be further extended to also calculate other indicators required by Spectrum, e.g. HIV testing outcomes for shiny90, proportion ever had sex, breastfeeding duration, and fertlity by HIV status. ## Programme data * `art_number` reports the number currently receiving ART at the end of each quarter by area. * `anc_testing` reports antenatal clinic (ANC) attendees and outcomes during the quarter. The model is currently specified to accept ART numbers by age 0-14 (`age_group = `r filter(naomi::get_age_groups(), age_group_label == "0-14")``) and age 15+ (`age_group = `r filter(naomi::get_age_groups(), age_group_label == "15+")$age_group``) either both sexes together (`sex = "both"`) or by sex (`sex = "female"`/`sex = "male"`). Possible extension may allow ART inputs by finer stratification. For `art_number` it is important to distinguish between zero persons receiving ART (e.g. no ART available in the area) versus missing data about the number on ART in an area. Current specification requires a value `art_current = 0` for an area with no ART whereas no entry for a given area will be interpreted as missing data. This could be revised, for example to require explicit input for all areas with a code for missing data. The `anc_testing` data is currently input for all ages of pregnant women aggregated, that is `age_group = `r filter(naomi::get_age_groups(), age_group_label == "15-49")$age_group`` for age 15-49.