Spectrum
stores projection data as PJNZ
files. These are just zips
containing several files of projection data.
unzip(system.file("testdata", "Botswana2018.PJNZ", package = "specio"),
list = TRUE)
#> Name Length Date
#> 1 Botswana_ 2018 updated ART.DP 2038692 2018-03-23 16:21:00
#> 2 Botswana_ 2018 updated ART.DPUA 656 2018-03-23 15:23:00
#> 3 Botswana_ 2018 updated ART.DPUAD 382382 2018-03-23 15:23:00
#> 4 Botswana_ 2018 updated ART.ep1 1946 2018-03-20 12:44:00
#> 5 Botswana_ 2018 updated ART.ep3 1292 2018-03-20 12:44:00
#> 6 Botswana_ 2018 updated ART.ep4 10866 2018-03-20 12:44:00
#> 7 Botswana_ 2018 updated ART.ep5 410266 2018-03-20 12:44:00
#> 8 Botswana_ 2018 updated ART.PJN 6470 2018-03-23 16:21:00
#> 9 Botswana_ 2018 updated ART.SPT 12156 2018-01-22 10:21:00
#> 10 Botswana_ 2018 updated ART.TYP 11 2018-03-20 12:44:00
#> 11 Botswana_ 2018 updated ART.xml 512995 2018-03-20 12:44:00
#> 12 Botswana_ 2018 updated ART_meta.csv 986 2018-01-22 10:21:00
#> 13 Botswana_ 2018 updated ART_surv.csv 21363 2018-01-22 10:21:00
#> 14 epptmplt/epptmplt_en/Concentrated (C).wst 85305 2018-03-20 12:44:00
#> 15 epptmplt/epptmplt_en/Urban Rural (G).wst 29419 2018-03-20 12:44:00
specio
Note that large portions of the general information about the files
are taken from a write up of the Spectrum-EPP Communications between
Jeff Eaton and Tim Brown in December 2016. Where this relates to
specio
this information has been updated to reflect the
best of current knowledge.
.DP
The DP
file contains demographic projection data. The
data is persisted as a csv
with a column containing tags
used to identify the field.
The csv
contains 4 named columns, Tag
,
Description
, Notes
and Data
.
There are also an arbitrary number of extra columns all of which contain
more data.
#> 'data.frame': 7852 obs. of 10 variables:
#> $ Tag : chr "" "<FirstYear MV2>" "" "" ...
#> $ Description: chr "" "" "" "<Value>" ...
#> $ Notes : chr "" "" "" "" ...
#> $ Data : chr "" "" "" "1970" ...
#> $ X : chr "" "" "" "" ...
#> $ X.1 : chr "" "" "" "" ...
#> $ X.2 : chr "" "" "" "" ...
#> $ X.3 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ X.4 : chr "" "" "" "" ...
#> $ X.5 : chr "" "" "" "" ...
The Tag
column contains only field tags
e.g. <BigPop MV3>
and end tags
<End>
. The field tag is used to identify the start of
data related to a particular property which runs up to the next end
tag.
The Description
column can sometimes contain information
about the data in the other columns but is frequently blank and is
rarely used in EPP data. For each property it contains a
<Value>
tag indicating where the data in the data
column begins. The data will start either on the same row as the value
tag or in the row one below.
For example the first 3 columns of the Botswana 2018 data for a particular field tag are
#> Tag Description Notes
#> 534 <CD4ThreshHoldAdults MV>
#> 535 CD4 count threshold for eligibility - Adults
#> 536 <Value>
#> 537 <End>
#> Data X X.1 X.2
#> 534
#> 535
#> 536 200 200 200 200
#> 537
The Notes
column optionally contains information about
each row of data. This may be things such as row names for the data or
information about a group of data contained in the next few rows. When
reading in EPP data, because this column is used in different ways for
different fields, we tend to not read it in and instead label data via
configuration.
The Data
column and onwards contains all of the data
related to a field. It can be of many types including scalars, vectors,
arrays, multi-dimensional arrays and collections of arrays. Therefore
each field requires some configuration for specio
to be
able to extract the data from the file.
.xml
The xml
file contains a seralised Java object
representing the workset. It contains all data used by the Java popup
when EPP is launched within Spectrum. Each property is contained within
top level Java epp2011.core.sets.Workset
class.
Note that for array properties reading the index is important as any
zeros in the array are sometimes omitted from the seralised data and so
we must add them back in at the appropriate indices when we read in the
data from the xml
file.
Note also that the default representation for an NA
value in the serialized Java is -1
so we convert these when
they are encountered.
Some data is persisted as properties of the class e.g.
<void property="aidsNormalizeRange">
<array class="int" length="2">
<void index="0">
<int>1975</int>
</void>
<void index="1">
<int>1993</int>
</void>
</array>
</void>
The type can be read from the second line after the property is declared. This can be another object itself.
Data may also be persisted as a method, we can see this for data which represents a collection of data frames e.g.
<void method="add">
<object class="epp2011.core.sets.ProjectionSet" id="ProjectionSet1">
<void property="PMTCTData">
<array class="[D" length="14">
<void index="0">
<array class="double" length="41">
<void index="0">
<double>-1.0</double>
</void>
<void index="1">
<double>-1.0</double>
</void>
...
</array>
</void>
<void index="1">
<array class="double" length="41">
<void index="0">
<double>-1.0</double>
</void>
...
</array>
</void>
...
</array>
</void>
<void property="PMTCTSiteSampleSizes">
<array class="[I" length="14">
...
These can be identified by the object class
ProjectionSet
and array class [D
or
[I
meaning array of double or int arrays respectively.
.SPT
- EPP results for SpectrumReturns the results of the national epidemic to Spectrum. The
SPT
file is the primary file for sending the results of
EPPs fitting back to Spectrum. The file contains 6 or more sections each
of which is delimited by either single a =
or two
==
. It has the following overall structure:
Lines after the line containing the keyword BASEYEAR
through to the delimiter =
, the end of the projection
indicator. Each separate line contains the projection year, HIV
prevalence and HIV incidence for the workset as a whole.
The examples are taken from an existing EPP file for Ukraine.
EPP 5.0 // Indicates the EPP file version
Ukraine // Country name
AGERANGE 15-49 // Age range – specifies whether fitting was done assuming 15-49 or 15+
BASEYEAR 2009 // Base year – this is an artifact of when EPP fixed population size for
// concentrated epidemics in a given year
1970,0.00000,0.00000 // Projection year, national HIV prevalence %, national HIV incidence %
…
2019,0.84569,0.03831
2020,0.86060,0.03875 // Projection year, national HIV prevalence %, national HIV incidence %
= // End of section 1 indicator
Lines after the end of projection indicator =
at the end
of Section 1 through to the ==
delimiter. Each line
contains the projection year, female/male incidence ratio if available
(-1 otherwise e.g. as in generalized epidemics where F/M ratio cannot be
determined in EPP), percent of HIV+ individuals who are IDU in that
year, number of IDU AIDS deaths and number of IDU non-AIDS deaths in
that year.
= // End of section 1 indicator
1970,-1.00000,0.00000,0,0 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
….
2019,0.14165,16.79733,344,681 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
2020,0.12591,15.69599,314,677 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
== // End of section 2 indicator
Lines starting after ==
delimiter at the end of section
2 through to the next =
. This repeats the same information
as section 1.
== // End of section 2 indicator
Ukraine_May 2015: // Workset name
POP 24051282 INC 100.0 // Total population in base year & percent of base year
// incidence in the workset – always 100%
1970,0.00000,0.00000 // Year, workset HIV prevalence %, workset HIV incidence %
…
2020,0.86060,0.03875 // Year, workset HIV prevalence %, workset HIV incidence %
= // End of section 3 indicator
Lines after the end of projection indicator =
at the end
of section 3 through to the ==
delimiter. This repeats the
same information as in section 2.
= // End of section 3 indicator
1970,-1.00000,0.00000,0,0 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
…
2020,0.12591,15.69599,314,677 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
== // End of section 4 indicator
Subsequent sections then describe each sub-population or sub-epidemic projection in detail with the following information:
Starting after the ==
at the end of section 4. After
providing some detail on the sub-population (specified below), each
subsequent line gives the projection year, HIV prevalence, HIV incidence
and total population size in EPP for that particular sub-population
== // End of section 4 indicator
Ukraine_May 2015\IDUs:BOTH,IDU,75.0 // Sub-pop name, :, special population indicators, % male
POP 292826 INC 32.8823 // Population in base year & percent of base year incidence in sub-population
IDUMORT 1.0700 // Excess IDU mortality among HIV+ IDU
1970,0.00000,0.00000,300000 // Year, sub-pop HIV prevalence %,
// sub-pop HIV incidence %, pop size in year
…
2020,10.97878,0.40086,255083 // Year, sub-pop HIV prevalence %,
// sub-pop HIV incidence %, pop size in year
= // End of section 5 indicator
Starting after the =
at the end of section 5 through the
next ==
. Each line gives the projection year, F/M ratio for
that group, percent of the group who are IDUs, number of IDU AIDS deaths
and the number of IDU non-AIDS deaths.
If sub-population has IDU characteristic, then % of HIV+ who are IDUs will be 100% and IDUs will be AIDS & non-AIDS deaths for the sub-population. Otherwise, all three numbers will be zero.
= // End of section 5 indicator
1970,0.33333,100.00000,0,0 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths for sub-pop
…
2020,0.33333,100.00000,314,677
== // End of section 6 indicator
Section 5 and 6 are then repeated for each sub-population and/or sub-epidemic until the end of the file.
When dealing with generalised epidemics, the F/M ratio will always be
returned as -1.0
, indicating to Spectrum to use its own
patterns since EPP does not track gender in generalised epidemics. In
addition, there will generally not be any IDU. Thus, lines in section 2
will appear as:
1981,-1.00000, 0.00000,0,0
1982,-1.00000, 0.00000,0,0
...
If a -1.0
ratio occurs in a concentrated epidemic, it
signifies that a value could not be calculated for that year as the
incidence was zero. Otherwise the F/M ratio for that particular workset,
sub-epidemic or sub-population projection will be shown.
For each sub-population (an actual curve fit to data), the special population indicators are of two types:
URBAN
, RURAL
or
BOTH
LORISK
, FSW
,
MSW
, IDU
, CLIENT
,
MSM
, PRI
(prisoner) or TG
(transgender).SPU
- EPP uncertainty results for SpectrumNote that this is missing from the example above as it has been
removed manually to save disk space as this is a large file. The
SPU
file contains uncertainty results for the national
epidemic.
The purpose of the SPU
file is to pass the resamples
done during the IMIS process to Spectrum for use in its own uncertainty
calculations. Normally, 3000 resamples are done. The SPU
file passes first the overall national Bayesian medians followed by the
number of unique national resamples. Because a particular resample may
get selected multiple times, each resample is also provided with a COUNT
of the number of times it was resampled, followed by a series of lines
containing the prevalence and incidence each year in the format:
An annotated description of the format is as follows:
EPP 5.0 3000 // EPP file format version followed by number of resamples
Botswana // Country name
BASEYEAR 2009 // Baseyear for populations (deprecated and not used)
1970, 0.00000, 0.00000 // Bayesian median prevalence and incidence series
1971, 0.05806, 0.04821
…
2020, 32.93643, 2.14343
== // End of Bayesian median series indicator
COUNT 2.0 // Number of times following series was resampled
1970, 0.00000, 0.00000 // Prevalence & incidence series for 1st unique resample
1971, 0.04905, 0.00332
…
2020, 31.89423, 2.43432
== // End of series indicator
COUNT 5.0 // Number of times following series was resampled
1970, 0.00000, 0.00000 // Prevalence & incidence series for 2nd unique resample
1971, 0.05902, 0.00102
…
2020, 31.45333, 2.98174
==
… // this continues until all unique curves in the 3,000
// resamples are specified. COUNTs will sum to 3,000
specio
As seen above the PJNZ
file contains several other files
other than those read in by specio
. Where information about
these files is known it is included below. This information is taken
from a document written by Tim Brown in 2016, some of this may be out of
date so should not be considered the source of truth for details about
how these files are used.
ep1
- year and demographic inputs that EPP needs from
SpectrumThe ep1
file provides the essential information EPP
needs to set up the epidemic projection and calculate parameters derived
from the demographics to ensure Spectrum-EPP consistency on populations
over time. In the absence of HIV, this data is used by EPP to exactly
reproduce the Spectrum populations for each year. This information
includes:
NOTE: in the following file specifications, words entered all in CAPS are keywords to be used to easily identify the information contained on that line or in that section of the file. All population numbers are entered as integers. All values are separated by commas, so no extraneous commas should occur in country names or in projection names.
Each population line consists of five numbers separated by commas:
Year, 15-49_or 15+_population, 15 year olds, 50 year olds, net_migration_15-49_or_15+
Population values will be provided for the full epidemiological projection period, from the projection start year to the projection end year.
The format for the ep1
file is as follows
//Same as the invoking argument for
//country name (may have blanks but
COUNTRY,country name,country code //no commas), country code is UN code
PROJNAME,“C:\Users\tim\DATA\proj_name.PJN” //Same as invoking argument(fully
//qualified filename for base proj file)
FIRSTPROJYR,1970 //First year of the epidemiological projection
LASTPROJYR,2016 //Last year of epidemiological projection
IDUMORT,1.07 //Excess IDU mortality in percent per year
AGERANGE 15-49 //The age range surveillance data addresses
POPSTART //Start of the non-AIDS population projection
1970,3459875,148280,98854,34589
1971,3632868,155694,103796,35281
1972,3814512,163479,108986,35986
1973,4005237,171653,114435,36706
1974,4205499,180236,120157,37440
1975,4415774,189247,126165,38189
...
2012,8900434,381447,254298,65234
2013,8989438,385262,256841,65886
2014,9079332,389114,259410,66545
2015,9170126,393005,262004,67211
2016,9261827,396935,264624,67883
POPEND
For example, here are the first few lines from the file for Peru:
COUNTRY,Peru,604
PROJNAME,C:\Users\Tim Brown\AppData\Roaming\Futures Institute\Spectrum\Temp\Peru 2015 FinalPJNZ~BC99.tmp\Peru 2015 Final.PJN
FIRSTPROJYR,1970
LASTPROJYR,2021
IDUMORT,1.07
AGERANGE 15-49
POPSTART
1970,5928014,291589,83621,0
1971,6117343,300730,86139,0
1972,6314019,310432,88774,0
1973,6518406,320827,91641,0
1974,6731110,332084,94737,0
1975,6949523,343962,98008,-3022
1976,7175224,356148,101368,-4632
1977,7408225,368362,104835,-6078
typ
- type of the EPP workset either
GENERALIZED
or CONCENTRATED
File is used to communicate epidemic type back to Spectrum. It
consists of a single line containing either GENERALIZED
or
CONCENTRATED
depending on type selected by user.
ep4
- ART data and parametersThe ep4
file provides the information about
antiretroviral therapy using the CD4 compartment model adopted in
Spectrum. This includes the following parameters (with associated
keywords indicated in parentheses):
CD4LOWLIMITS
): the lower CD4 limit
for each of the CD4 compartments in the modelLAMBDA
): the progression rate through the CD4
compartmentsNEWINFECTSCD4
): the
percent of new infections going into each of the CD4 compartmentsMU
): the annual mortality of
those not on ARTALPHA1
, ALPHA2
and
ALPHA3
): the annual mortality for those on ART for the 1st
6 months, 2nd 6 months and for more than one yearINFECTREDUC
): the percent by
which the person’s risk of infecting another is reduced by being on
ART.ARTSTART
/ARTEND
).In addition, this file contains information on the number of
HIV-positive 15 year olds entering the 15-49 or 15+ population and the
number of HIV-positive 50 years olds leaving the 15-49 population,
disaggregated by on ART (HIVPOS_15YEAROLDS
,
HIVPOS_50YEAROLDS
) and not on ART
(HIVPOS_15YEAROLDSART
, HIVPOS_50YEAROLDSART
).
For 15 year olds a CD4 distribution is provided for both on ART and off
ART groups , while 50 year olds are assumed to have the same CD4
distribution as the population as a whole. The file also contains
indicators for special populations (SPECPOP
), the type of
ART coverage (ARTCOVERAGE
: MALE_FEMALE
,
CD4_PERCENT
, CD4_NUMBER
) and data on dropout
rates (ARTDROPOUTRATE
) and CD4 medians at initiation
(CD4MEDIAN
).