Supplementary Information: Composable probabilistic models can lower barriers to rigorous infectious disease modelling

Setup

This supplementary document contains the full implementation details for the case studies presented in the main text. The code here demonstrates how our proof of concept compositional modelling DSL recreates and extends existing epidemiological models.

Required definitions

We first define the latent models that are reused across case studies. These definitions mirror those in the main text.

# Core packages needed for case studies
using EpiAware, Distributions
using DynamicPPL, Turing, LinearAlgebra
using ReverseDiff
using Chain, CSV, DataFramesMeta, Dates
using CairoMakie
using CodeTracking, Revise
ar2 = AR(;
    damp_priors=[truncated(Normal(0.2, 0.2), 0, 1),
        truncated(Normal(0.1, 0.05), 0, 1)],
    init_priors=[Normal(0, 0.2), Normal(0, 0.2)],
    ϵ_t=HierarchicalNormal(std_prior=HalfNormal(0.1))
)
ma1 = MA(;
    θ_priors=[truncated(Normal(0.0, 0.2), -1, 1)],
    ϵ_t=HierarchicalNormal(std_prior=HalfNormal(0.1))
)
using Accessors
arma21 = @set ar2.ϵ_t = ma1
arima211 = DiffLatentModel(arma21, Normal(0, 0.2); d=1)

Case studies

We demonstrate how our proof of concept compositional modelling DSL can recreate and extend existing epidemiological models. Each case study shows how complex models are built by composing reusable components, validating their behaviour through prior predictive checks, and fitting them to real data. The examples progress from simple renewal models to more complex observation processes that account for reporting delays and temporal effects. The first two case studies have also been replicated using the EpiAwareR R interface [1], in order to demonstrate cross-ecosystem accessibility. The EpiAwareR package vignettes provide step-by-step R implementations showing how R users can compose the same models using familiar syntax.

All code and data for reproducing the analyses in this paper are available at: https://github.com/EpiAware/ComposableProbabilisticIDModels.

On the derivation of the renewal equation from an age-dependent branching process: an epidemic modelling perspective

In On the derivation of the renewal equation from an age-dependent branching process: an epidemic modelling perspective, Mishra et al (2020) [2] demonstrate the mathematical correspondance between age-dependent branching processes and time-since-infection epidemiological models, as a renewal model with time-varying reproduction number \(R_t\). Renewal models use the renewal equation to model how new infections arise from previous infections, weighted by the generation time distribution (or serial interval) [3]. This is analogous to an autoregressive process where the autoregressive coefficients have epidemiological meaning rather than being estimated parameters. They show how solutions to the renewal equation, when combined with a negative binomial observation model, define a Bayesian hierarchical framework for inference on reported case time series data, demonstrating this on test-confirmed cases of COVID-19 in South Korea.

Data

Mishra et al used daily reported test-confirmed cases of COVID-19 in South Korea between January to July 2020. This data is curated by the covidregionaldata R package [4], but we have saved the South Korean data locally.

using Chain, CSV, DataFramesMeta, Dates
datapath = "data/south_korea_data.csv"
south_korea_data = @chain datapath begin
    CSV.read(DataFrame)
    (y_t = _.cases_new, dates = _.date)
end

Model

Our model is inspired by Mishra et al and uses a log-scale time-varying reproductive number \(\log R_t\) modelled as an AR(2) process, which in turn specifies the latent infections \(I_t\) as a solution to the renewal equation conditional on the trajectory of \(\log R_t\). The latent infection process is then linked directly to reported cases \(C_t\) on matching days using a negative binomial link distribution. The key difference from Mishra et al is in the initialization. They seed the renewal equation with importations (independent daily effects \(\mu_t \sim \text{Exponential}(0.5)\)), whilst we initialize by solving for the growth rate corresponding to the initial reproduction number and extrapolating backwards without allowing for ongoing importations.

\[ \begin{aligned} \rho_1, \rho_2, Z_0, Z_{-1}, I_0, \sigma, \phi &\sim \pi(\cdot), \\ \epsilon_t & \sim \text{Normal}(0, \sigma)~~ \text{i.i.d } \forall t, \\ Z_t &= \rho_1 Z_{t-1} + \rho_2 Z_{t-2} + \epsilon_t,~~ t = 1, 2, 3, \dots\\ R_t & = \exp(Z_t), \\ r & \text{ solves } G(r) = 1 / R_1, \\ I_{t} & = I_0 e^{rt},~~ t = 0, -1, -2, -3,-n+1, \\ I_t &= R_t \sum_{s \geq 1} g_s I_{t-s}, ~~ t = 1, 2, 3, \dots, \\ y_t & \sim \text{NegBin}(I_t, \phi), ~~ t = 1, 2, 3, \dots. \end{aligned} \]

Where \(\pi\) is a prior distribution for the hyperparmeters of the AR(2) model, initial states \(Z_0, Z_{-1}\), the autoregressive coefficients \(\rho_1,\rho_2\) and innovations standard deviation \(\sigma\), along with initial condition value for the latent infections \(I_0\) and observation overdispersion \(\phi\). \(g_t\) is the generation distribution probability mass function (pmf). \(r\) is the growth rate determined by the by \(R_1 = \exp(Z_1)\) using the implicit relationship [5].

\[ G(r) = \sum_{j \geq 1} g_j \exp(- r j) = 1 / R_1. \]

This means that we determine the initial condition of the latent infecteds before \(t=0\), \(I_{-1}, I_{-2}, \dots, I_{-n+1}\) jointly with sampling \(R_1\) where \(n\) is the maximum support value of the \(g_t\) pmf.

Latent model

We reuse the AR(2) model ar2 defined in the Setup section, but update the first damping coefficient prior ρ₁ to match Mishra et al. They specify ρ₁ ~ N(0.8, 0.05) constrained to [0,1], reflecting strong autocorrelation in the log reproduction number process.

ar2_mishra = @set ar2.damp_prior = arraydist([
    truncated(Normal(0.8, 0.05), 0, 1),
    truncated(Normal(0.1, 0.05), 0, 1)
])

The init_priors=[Normal(0, 0.2), Normal(0, 0.2)] represent the initial states Z₀ and Z₋₁ of the log reproduction number process. On the natural scale, this prior for initial R values has approximate mean 1 and standard deviation 0.2. Prior predictive samples (Figure 1 A) show that a priori these priors assign a few percent chance of achieving very high \(R_t\) values, i.e. \(R_t \sim 10-1000\) is not excluded.

Infection generating process

The renewal equation requires a discrete generation time pmf \(g_t\). Our prototype provides a constructor that converts continuous distributions into the required discrete pmf using double interval censoring [6]. This is called inside the EpiData struct which we use to define the expected model specific input data. Mishra et al used a \(\text{Gamma}(6.5, 0.62)\) serial interval distribution,

truth_SI = Gamma(6.5, 0.62)
model_data = EpiData(gen_distribution=truth_SI)

Figure 1 D compares the discretized generation interval with the underlying continuous distribution. As expected the observed discrete generation interval differs from the underlying continuous distrbution due to the double censoring process.

As in the earlier example, we define our renewal model using a specialised struct, Renewal. This requires the discretized generation time (from model_data) and an initialisation prior. We use a lognormal prior for the initial latent infections as it has a skewed right tail allow some weight on large initial values.

log_I0_prior = Normal(log(1.0), 0.1)
renewal = Renewal(model_data; initialisation_prior=log_I0_prior)

This results in the following model structure.

renewal
Renewal(
    data =
        EpiAware.EpiInfModels.EpiData{Float64, typeof(exp)}([0.026663134095601056, 0.14059778064943784, 0.2502660305615846, 0.24789569560506844, 0.1731751163417783, 0.09635404000022223, 0.04573437575216367, 0.019313826994143808], 8, exp),
    initialisation_prior = Normal(0.0, 0.1),
    recurrent_step =
        EpiAware.EpiInfModels.ConstantRenewalStep{Float64}([0.019313826994143808, 0.04573437575216367, 0.09635404000022223, 0.1731751163417783, 0.24789569560506844, 0.2502660305615846, 0.14059778064943784, 0.026663134095601056]))

To demonstrate the infection model independently, we define a fixed \(R_t\) trajectory that decreases from 3 to 0.5 over 50 days and pass this to the generate_latent_infs function which, like the generate_latent function, relies on multiple dispatch to contruct the model that the struct it is passed defines.

Rt = [0.5 + 2.5 / (1 + exp(t - 15)) for t in 1:50]
renewal_mdl = generate_latent_infs(renewal, log.(Rt))

The implied distribution of \(I_t\) trajectories conditional on this \(R_t\) trajectory can then be sampled independently of other model components (Figure 1 B). As expected this gives us a “classical” outbreak like dynamic with infection incidence initially increasing exponentially, the rate of growth slowing over time, and then finally the infection incidence “turning over”.

The full infection generating process, that is the model defined in Section 2.1 without the link to case data, can be constructed by passing samples of \(Z_t\) into the renewal model sampler.

Observation model

In Mishra et al the overdispersion parameter \(\phi\) sets the relationship between the mean and variance of the negative binomial errors. We default to a prior on \(\sqrt{1/\phi}\) (referred to as the cluster factor) because this quantity is approximately the coefficient of variation of the observation noise and, therefore, is easier to reason on a priori beliefs. A prior for \(\phi\) was not specified in Mishra et al, so we use \(\sqrt{1/\phi} \sim \text{HalfNormal}(0.1)\).

negbin = NegativeBinomialError(cluster_factor_prior=HalfNormal(0.1))

As with the latent model, we can generate a Turing.jl model conditional on a fixed latent infection trajectory. Here we simulated this to look like an outbreak with a symmetrical growth and decay phase.

I_t = [1000 * exp(-(t - 15)^2 / (2 * 4)) for t in 1:30]
negbin_mdl = generate_observations(negbin, missing, I_t)

Here we use a missing argument to indicate observed variables that are to be sampled rather than to be used to accumulate log posterior density. Prior predictive samples (Figure 1 C) show the dispersion around the mean implied by our choice of prior.

Fitting to data

We now compose the three model components into an EpiProblem, which defines the complete generative model for the time range tspan.

tspan = (45, 80)
mishra = EpiProblem(epi_model = renewal,
    latent_model = ar2_mishra,
    observation_model = negbin,
    tspan = tspan)

We create training data by subsetting the full data to match tspan.

training_data = @chain south_korea_data begin
    @set _.y_t = _.y_t[first(tspan):last(tspan)]
    @set _.dates = _.dates[first(tspan):last(tspan)]
end

As a last step before fitting the model, we generate a new Turing.jl model using generate_epiaware, the EpiProblem we just defined, and our training data.

using Turing
mishra_mdl = generate_epiaware(mishra, training_data)

Following the same compositional pattern as our modelling DSL, we also support composing inference approaches using EpiMethod, which, for example, can combine pre-sampling steps with sampling algorithms. We use the No U-Turns (NUTS) sampler with batched implementation of pathfinder variational inference [7] that returns the single pathfinder route with maximum estimated evidence lower bound to estimate the intialise the sampler.

using ReverseDiff
inference_method = EpiMethod(
    pre_sampler_steps=[ManyPathfinder(nruns=3, maxiters=100)],
    sampler=NUTSampler(
        target_acceptance=0.9,
        adtype=AutoReverseDiff(compile=true),
        ndraws=1000,
        nchains=4,
        mcmc_parallel=MCMCThreads(),
        nadapts=1000)
)

We now need to combine our inference approach with our generated model. We can do this using the apply_method.

mishra_results = apply_method(mishra_mdl,
    inference_method,
    training_data
)

This gives the following posterior estimates for our model parameters.

summarystats(mishra_results.samples)
Summary Statistics
         parameters      mean       std      mcse    ess_bulk   ess_tail                  Symbol   Float64   Float64   Float64     Float64    Float64   Flo ⋯

  latent.ar_init[1]    0.0845    0.1993    0.0056   1263.6841   602.7069    1. ⋯
  latent.ar_init[2]    0.0511    0.1911    0.0067    821.7404   505.4683    1. ⋯
  latent.damp_AR[1]    0.7794    0.0471    0.0018    742.1930   349.2913    1. ⋯
  latent.damp_AR[2]    0.1289    0.0430    0.0020    458.6735   360.9220    1. ⋯
         latent.std    0.4810    0.0599    0.0037    260.8339   440.5815    1. ⋯
      latent.ϵ_t[1]    0.8835    0.7774    0.0277    791.3335   621.3829    1. ⋯
      latent.ϵ_t[2]    1.2386    0.7636    0.0244    991.5469   691.4004    0. ⋯
      latent.ϵ_t[3]    1.5945    0.8366    0.0264   1008.7536   571.0975    1. ⋯
      latent.ϵ_t[4]    1.2603    0.8295    0.0245   1141.7212   701.3931    1. ⋯
      latent.ϵ_t[5]    2.0800    0.7977    0.0261    915.9277   673.3197    1. ⋯
      latent.ϵ_t[6]    2.1352    0.7652    0.0246    967.3797   702.3582    1. ⋯
      latent.ϵ_t[7]    0.9115    0.6763    0.0193   1235.5973   758.4562    1. ⋯
      latent.ϵ_t[8]    0.6195    0.6181    0.0195   1002.1109   798.5719    1. ⋯
      latent.ϵ_t[9]   -0.2934    0.5779    0.0186    970.9564   840.9962    1. ⋯
     latent.ϵ_t[10]   -1.8894    0.6873    0.0280    592.0248   711.3857    1. ⋯
     latent.ϵ_t[11]   -2.1907    0.6931    0.0343    412.3206   306.5206    1. ⋯
     latent.ϵ_t[12]   -0.6047    0.5729    0.0193    887.9072   591.7452    1. ⋯
          ⋮              ⋮         ⋮         ⋮          ⋮          ⋮           ⋱
                                                   2 columns and 24 rows omitted
Figure 1: Model components and posterior analysis for Section 2.1. (A) Prior samples from the AR(2) latent process for log \(R_t\) over 50 days, showing potential reproductive number trajectories. (B) Prior samples from the renewal model conditional on a fixed \(R_t\) trajectory, demonstrating infection dynamics. (C) Prior samples from the negative binomial observation model around a latent infection curve, illustrating observation noise. (D) Comparison of the continuous serial interval distribution (green line) with its discretised pmf (bars) used in the renewal model. (E) Posterior predictive distribution for daily cases, with median (purple line) and 50% (dark ribbon) and 95% (light ribbon) credible intervals compared to observed data (black points). (F) Posterior predictive distribution for time-varying \(R_t\) on a log scale, with median (green line) and 50% (dark ribbon) and 95% (light ribbon) credible intervals.

Figure 1 shows that the compositional model defined using our proof of concept recovers the main finding in Mishra et al; that the \(R_t\) in South Korea peaked at a very high value (\(R_t \sim 10\) at peak, Figure 1 F) before rapidly dropping below 1 in early March 2020, with the model capturing both the epidemic trajectory and day-to-day variation in case counts (Figure 1 E).

EpiNow2: Estimate real-time case counts and time-varying epidemiological parameters

In this case study, we replicate a common EpiNow2 configuration using our proof of concept framework. EpiNow2 [8] is a widely-used R package for estimating the time-varying reproduction number and making forecasts for epidemics in real-time. EpiNow2 is built around three core modeling components that work together to reconstruct epidemic dynamics from delayed and noisy case count observations. These are: a discrete renewal equation to model how new infections arise from previous infections, weighted by the generation time distribution (the time between successive infections in a transmission chain); the delay between infection and case reporting, which typically involves multiple sequential delays including incubation periods and reporting lags; observation error to capture overdispersion and model misspecification, with additional modifiers such as day-of-week effects, and underascertainment to account for biases in reporting patterns. We recreate the core EpiNow2 functionality by reusing model components from the main text and Section 2.1 and composing them with new delay and temporal effect components.

Data

We use the example dataset included with the EpiNow2 R package, which contains daily confirmed COVID-19 cases from Italy during the first wave in 2020, from 22nd February to 30th June.

using Chain, CSV, DataFramesMeta, Dates
datapath = "data/italy_data.csv"
italy_data = @chain datapath begin
    CSV.read(DataFrame)
    (y_t = _.confirm, dates = Date.(_."date"))
end

Model

Our model reuses the renewal infection model and ARIMA(2,1,1) latent process from earlier sections, making it piecewise constant by week, whilst building an observation model that accounts for reporting delays and day-of-week effects using discrete convolutions and a partially pooled day of the week effect. Unlike Section 2.1 where the serial interval distribution was used (as in Mishra et al), EpiNow2 uses the generation time distribution, which represents the time between successive infections in a transmission chain. Mathematically, we represent the complete model as:

\[ \begin{aligned} \rho_1, \rho_2, Z_0, Z_{-1}, I_0, \sigma_Z, \sigma_\omega, \phi &\sim \pi(\cdot), \\ \epsilon_w & \sim \text{Normal}(0, \sigma_Z)~~ \text{i.i.d } \forall w, \\ Z_w - Z_{w-1} &= \rho_1 (Z_{w-1} - Z_{w-2}) + \rho_2 (Z_{w-2} - Z_{w-3}) + \epsilon_w,~~ w = 1, 2, 3, \dots\\ R_t &= \exp\left(Z_{\lfloor t/7 \rfloor}\right),~~ \text{Piecewise } R_t \text{ constant by week}\\ I_t &= R_t \sum_{s \geq 1} g_s I_{t-s}, ~~ t = 1, 2, 3, \dots, \\ S_t &= \sum_{s\geq1} I_{t-s} \eta_s,~~ \text{Incubation delay: infections to symptom onset}\\ D_t &= \sum_{s\geq1} S_{t-s} \xi_s,~~ \text{Reporting delay: symptom onset to case reports}\\ \tilde{\omega}_k &\sim \text{N}(0, \sigma^2_\omega),~ k = 0,\dots, 6,~~ \text{Day-of-week modifier} \\ \omega &= 7 \times \text{softmax}(\tilde{\omega}),\\ y_t &\sim \text{NegBin}( D_t \omega_{t~\text{mod}~7}, \phi),~~ \text{Link to data}. \end{aligned} \tag{1}\]

Where \(\pi\) represents prior distributions for model hyperparameters. \(g_t\) is the generation time distribution pmf, \(\eta_t\) is the incubation period distribution pmf, and \(\xi_t\) is the reporting delay distribution pmf (all obtained by double interval censoring continuous distributions [6]). \(S_t\) represents symptom onsets (infections convolved with incubation period) and \(D_t\) represents delayed case reports (symptom onsets convolved with reporting delay). The vector \(\omega = [\omega_0,\dots,\omega_6]\) encodes day-of-week effects on reporting.

Latent model

We reuse the ARIMA(2,1,1) model arima211 defined in the Setup section but modify it to better reflect outbreak dynamics.

arima211_outbreak = @set arima211.init_prior = Normal(0.3, 0.3)

The latent_init prior of Normal(0.3, 0.3) for the initial log reproduction number Z₁ corresponds on the natural scale to R₁ with approximate mean 1.4 and standard deviation 0.4, favouring growing epidemics more appropriate for the COVID-19 outbreak context.

We then broadcast this modified ARIMA process to a weekly timescale, as discussed in more detail in the following section.

weekly_arima211 = broadcast_weekly(arima211_outbreak)

This approach models the log reproduction number as constant within each week whilst allowing weekly changes to follow the ARIMA(2,1,1) process. In EpiNow2, although stationary process representations of \(R_t\) are available, the default latent process is a differenced stationary (Matern) Gaussian Process; that is stationary only on its increments. Similarly, our piecewise constant weekly model applies the ARIMA(2,1,1) model to \(\log R_{w}\) at the weekly scale, then broadcasts this to daily values. The weekly piecewise constant formulation we use is an option in the EpiNow2 package but not part of the default model. Prior predictive samples (Figure 2 A) show the behaviour of the piecewise constant by week process.

Infection generating process

Unlike Section 2.1 where the serial interval distribution was used (as in Mishra et al), EpiNow2 uses the generation time distribution, which represents the time between successive infections in a transmission chain. Whilst the serial interval (time between symptom onsets) is often used as a proxy for the generation time because it is more readily observable, using the serial interval can be problematic [9] and is primarily done in practice when generation time estimates are unavailable, particularly early in an outbreak [10].

Here we follow the EpiNow2 getting started vignette to parameterise our model, this has a Gamma distribution with uncertain parameters for the generation time: shape ~ Normal(1.4, 0.48) and rate ~ Normal(0.38, 0.25). Since our prototype does not yet support uncertain delay distributions, we use the mean parameter values, giving a Gamma(1.4, 0.38) distribution with mean 3.68 days

gen_time_dist = Gamma(1.4, 1 / 0.38)
epinow2_data = EpiData(gen_distribution=gen_time_dist)

We use a similar renewal structure as we did in Section 2.1, updating it to use the generation time distribution rather than the serial interval.

initialisation_prior = Normal(log(1.0), 2.0)
renewal_gt = Renewal(epinow2_data; initialisation_prior=initialisation_prior)

As before, to demonstrate the infection model independently, we supply a fixed \(R_t\) trajectory that decreases from 3 to 0.5 over 50 days. Figure 2 B shows prior samples from the renewal process conditional on this \(R_t\) trajectory.

Observation model

We build the observation model through the composition of modular components reusing the negative binomial link observation model (negbin) from Section 2.1, and then layer additional modelling components on top. We start by adding a day of the week ascertainment model using the ascertainment_dayofweek helper function.

dayofweek_negbin = ascertainment_dayofweek(negbin;
    latent_model=HierarchicalNormal(std_prior=HalfNormal(1.0))
)

Unpacking this helper function reveals a nested stack of modelling constructions.

function ascertainment_dayofweek(model::AbstractTuringObservationModel;
        latent_model::AbstractTuringLatentModel = HierarchicalNormal(),
        transform = (x, y) -> x .* y, latent_prefix = "DayofWeek")
    return Ascertainment(model, broadcast_dayofweek(latent_model), transform, latent_prefix)
end

First, transformation and broadcasting

function broadcast_dayofweek(model::AbstractTuringLatentModel; link = x -> 7 * softmax(x))
    transformed_model = TransformLatentModel(model, link)
    return BroadcastLatentModel(transformed_model, 7, RepeatEach())
end

which uses the TransformLatentModel and BroadcastLatentModel structs to dispatch the transformation \(\omega = 7 \times \text{softmax}(\tilde{\omega})\) and then broadcast this along the time series.

Second is the linkage of the day of week model to the latent infections time series as a multiplicative ascertainment process, which produces

@model function EpiAwareBase.generate_observations(obs_model::Ascertainment, y_t, Y_t)
    @submodel expected_obs_mod = generate_latent(
        obs_model.latent_model, length(Y_t)
    )

    expected_obs = obs_model.transform(Y_t, expected_obs_mod)

    @submodel y_t = generate_observations(obs_model.model, y_t, expected_obs)
    return y_t
end

This day-of-week model assumes a static weekly pattern; to allow temporal evolution, the weekly pattern could be modelled by concatenating latent processes, where each day’s effect becomes a time-varying process.

In addition to ascertainment, we also need to model the incubation period (infection to symptom onset) and a reporting delay (symptom onset to case reporting). Again following the EpiNow2 getting started vignette, the incubation period uses a LogNormal distribution with uncertain parameters: meanlog ~ Normal(1.6, 0.064) and sdlog ~ Normal(0.42, 0.069), giving a mean of 5.41 days at the parameter means. The reporting delay uses a LogNormal with meanlog = 0.58 and sdlog = 0.47, giving a mean of 1.99 days.

Since our prototype does not yet support uncertain delay distributions, we use the mean parameter values:

incubation_distribution = LogNormal(1.6, 0.42)
reporting_distribution = LogNormal(0.58, 0.47)

The LatentDelay struct uses double interval censoring [6] to convert continuous delay models into pmfs and stacks with the temporal modifier model. Figure 2 D compares the continuous delay distributions with their discretised pmf representations. We compose two delay layers sequentially:

incubation_dayofweek_negbin = LatentDelay(dayofweek_negbin, incubation_distribution)
delay_dayofweek_negbin = LatentDelay(incubation_dayofweek_negbin, reporting_distribution)

This code demonstrates component reuse (base negbin from Section 2.1), layered composition (adding day-of-week effects via ascertainment_dayofweek), and sequential composition (adding two delay layers via nested LatentDelay). Note that since these are deterministic delays, we could more efficiently compute this by first convolving the two delay distributions or by automating this using the struct-in-struct approach with multiple dispatch. Figure 2 C demonstrates how these delay and day-of-week effects work together to transform a latent infection signal.

Fitting to data

We compose these modelling subcomponents into one EpiProblem model. The delay padding accounts for the combined length of the incubation period and reporting delay pmfs, allowing the model to properly account for infections that have occurred but not yet been observed by the end of the training period.

incubation_pmf_length = length(incubation_dayofweek_negbin.rev_pmf)
reporting_pmf_length = length(delay_dayofweek_negbin.rev_pmf)
delay_padding = incubation_pmf_length + reporting_pmf_length

tspan = (1, 40 + delay_padding)
epinow2 = EpiProblem(epi_model=renewal_gt,
    latent_model=weekly_arima211,
    observation_model=delay_dayofweek_negbin,
    tspan=tspan)

This EpiProblem uses the renewal model with generation time data (renewal_gt) and the piecewise constant by week ARIMA(2,1,1) latent model (weekly_arima211). The observation model reuses negbin from Section 2.1, layering on delays and day-of-week effects.

We filter the data as before to the timespan of interest.

italy_training_data = @chain italy_data begin
    @set _.y_t = _.y_t[first(tspan):last(tspan)]
    @set _.dates = _.dates[first(tspan):last(tspan)]
end

We again construct a Turing model using generate_epiaware.

epinow2_mdl = generate_epiaware(epinow2, italy_training_data)

We reuse the same inference_method defined in Section 2.1..

epinow2_results = apply_method(epinow2_mdl,
    inference_method,
    italy_training_data
)

Here is the summarised posterior:

summarystats(epinow2_results.samples)
Summary Statistics
            parameters      mean       std      mcse    ess_bulk   ess_tail                  Symbol   Float64   Float64   Float64     Float64    Float64    ⋯

    latent.latent_init    0.5618    0.1966    0.0083    548.6762   564.7769    ⋯
     latent.ar_init[1]   -0.0741    0.1656    0.0061    736.3826   777.5034    ⋯
     latent.ar_init[2]   -0.0728    0.1528    0.0059    664.8443   604.6462    ⋯
     latent.damp_AR[1]    0.2411    0.1438    0.0048    796.6071   439.8779    ⋯
     latent.damp_AR[2]    0.1042    0.0466    0.0018    588.0963   281.3383    ⋯
           latent.θ[1]    0.0440    0.2006    0.0067    899.9353   805.4505    ⋯
            latent.std    0.1721    0.0533    0.0023    522.0696   535.0555    ⋯
         latent.ϵ_t[1]   -1.3658    0.6426    0.0246    715.3240   499.2763    ⋯
         latent.ϵ_t[2]   -1.3201    0.6507    0.0224    857.0045   507.3722    ⋯
         latent.ϵ_t[3]    0.5510    0.5564    0.0198    813.6620   567.0359    ⋯
         latent.ϵ_t[4]    0.5302    0.5368    0.0202    734.7493   630.3145    ⋯
         latent.ϵ_t[5]   -0.2029    0.7430    0.0229   1060.1479   708.1389    ⋯
         latent.ϵ_t[6]   -0.0235    0.9382    0.0275   1166.0891   771.8094    ⋯
        init_incidence    5.5885    0.5933    0.0250    566.8938   638.0469    ⋯
     obs.DayofWeek.std    0.1597    0.0733    0.0040    354.4191   356.2343    ⋯
  obs.DayofWeek.ϵ_t[1]    0.0726    0.5009    0.0183    739.6065   733.0768    ⋯
  obs.DayofWeek.ϵ_t[2]   -0.4525    0.5012    0.0192    683.4531   707.1766    ⋯
           ⋮                ⋮         ⋮         ⋮          ⋮          ⋮        ⋱
                                                    2 columns and 6 rows omitted

We see the model has converged and the summary statistics are similar to Section 2.1. Figure 2 shows that the compositional model recovers similar \(R_t\) dynamics to Section 2.1 (Figure 2 F), whilst explicitly accounting for reporting delays (Figure 2 D) and day-of-week effects that help disentangle true transmission changes from reporting artifacts (Figure 2 E).

Figure 2: Model components and posterior analysis for Section 2.2. (A) Prior samples from the piecewise constant by week ARIMA(2,1,1) latent process for log \(R_t\) over 50 days, showing non-stationary reproductive number trajectories that are constant within each week. (B) Prior samples from the renewal model conditional on a fixed \(R_t\) trajectory. (C) Prior samples from the composite observation model including incubation period, reporting delays and day-of-week effects around a latent infection curve. (D) Comparison of the continuous incubation period (green line) and reporting delay (blue line) distributions with the combined discretised delay pmf (bars). (E) Posterior predictive distribution for daily cases, with median (purple line) and 50% (dark ribbon) and 95% (light ribbon) credible intervals compared to observed data (black points). (F) Posterior predictive distribution for time-varying \(R_t\) on a log scale, with median (green line) and 50% (dark ribbon) and 95% (light ribbon) credible intervals.

Contemporary statistical inference for infectious disease models using Stan

In Contemporary statistical inference for infectious disease models using Stan, Chatzilena et al. [11] demonstrate Bayesian inference for compartmental disease models using Stan’s ODE solvers. Their single strain influenza example compares deterministic and stochastic formulations of the SIR model, showing how to perform parameter inference and model comparison for ODEs embedded within a probabilistic programming framework. We recreate this example using EpiAware, introducing the SIR model as an alternative infection generating process to the renewal model used in the previous case studies. Unlike Stan, we integrate with existing solver libraries from the SciML ecosystem [12] rather than encoding bespoke solvers, allowing us to leverage state-of-the-art numerical methods and automatic solver selection.

Data

The analysis uses data from an influenza outbreak in an English boarding school, reported in the British Medical Journal in 1978. Of the 763 children at the school, 512 became ill over 14 days. We accessed this data using the R package outbreaks [13].

using OrdinaryDiffEqTsit5, OrdinaryDiffEqRosenbrock, SciMLSensitivity, LogExpFunctions

N = 763
datapath = "data/influenza_england_1978_school.csv"
england_data = @chain datapath begin
    CSV.read(DataFrame)
    @transform(:ts = :date - minimum(:date))
    @transform(@byrow :ts = Dates.value(:ts) + 1.0)
    (y_t = _.in_bed, dates = _.date, ts = _.ts)
end

Model

The mathematical description of the model is:

\[ \begin{aligned} \beta, \gamma, I(0), \kappa_0, \rho, \sigma &\sim \pi(\cdot), \\ R_0 &= \beta / \gamma, & \text{(Basic reproduction number)} \\ r_{SI}(t) &= \beta S(t) I(t), & \text{(Infection rate)} \\ r_{IR}(t) &= \gamma I(t), & \text{(Recovery rate)} \\ \frac{dS}{dt} &= -r_{SI}(t), \\ \frac{dI}{dt} &= r_{SI}(t) - r_{IR}(t), \\ \frac{dR}{dt} &= r_{IR}(t), \\ \epsilon_t &\sim \text{Normal}(0, \sigma^2), & \text{(Observation innovations)} \\ \kappa_{t} &= \rho \kappa_{t-1} + \epsilon_t, & \text{(Ascertainment process)} \\ \lambda_t &= \text{softplus}(N \cdot I(t))\exp(\kappa_t), & \text{(Expected observations)} \\ Y_t &\sim \text{Poisson}(\lambda_t). & \text{(Observation link)} \end{aligned} \]

Here we connect a continuous time model, where we index continuously evolving variables using \((t)\), to daily incrementing processes which we index with a subscript \(t\). The population variables \(S(t)\), \(I(t)\), and \(R(t)\) represent proportions of the total population \(N\) that are susceptible, infected, and recovered at time \(t\). The transmission rate \(\beta\) governs infection spread, whilst the recovery rate \(\gamma\) determines infectious period duration. The basic reproduction number \(R_0\) characterises transmission potential, assumed constant in this model. All individuals are assumed to be either susceptible or infected at the beginning of the outbreak, therefore only \(I(0)\) is needed to define the initial conditions. We follow Chatzilena et al. and treat the number “in bed” as a proxy for the infected compartment. \(\pi\) represents the priors over each parameter.

The softplus transformation \(\text{softplus}(x) = \log(1 + \exp(x))\) ensures numerical stability when scaling the infected proportion \(I(t)\) to counts using population size \(N\), smoothly maintaining positivity even when ODE solvers return small negative values near zero.

The log-scale ascertainment rate is represented as an autoregressive AR(1) process \(\kappa_t\) on daily increments, with damping parameter \(\rho\), initial state \(\kappa_0\), and innovation standard deviation \(\sigma\). Chatzilena et al. parameterise this as a time-discretised Ornstein-Uhlenbeck process, equivalent to an AR(1) process. Setting \(\kappa_t = 0\) for all \(t\) gives the deterministic model with direct Poisson link. The aim of the stochastic component is to absorb noise generated by model mis-specification, improving robustness when the SIR model is an approximation to more complex transmission dynamics.

Latent model

Unlike the renewal models in previous case studies, the SIR model does not require a time-varying latent process for \(R_t\) as transmission dynamics are fully determined by the ODE parameters.

Infection generating process

Our prototype provides SIRParams and ODEProcess structs for defining ODE-based infection models. The SIRParams struct specifies priors for the transmission rate \(\beta\), recovery rate \(\gamma\), and initial proportion infected. Following Chatzilena et al., we use log-normal and gamma priors for \(\beta\) and \(\gamma\), though we modify their specifications to improve numerical stability and prior interpretability. Chatzilena et al. use \(\text{Gamma}(0.004, 0.002)\) (shape, rate parameterisation) for the recovery rate, which has shape < 1 and places most prior mass near zero. While this acts as a weakly informative prior, it creates numerical difficulties during sampling as \(\gamma \to 0\) implies \(R_0 \to \infty\).We instead use \(\text{Gamma}(8, 0.03125)\) which has a prior mean of 0.25, implying a mean 4-day infectious period appropriate for influenza, while ensuring the prior is bounded away from zero. Chatzilena et al. use \(\text{LogNormal}(0, 1)\) for the transmission rate, which combined with their recovery rate prior ensures \(R_0 > 1\) but with extreme values. We instead use \(\text{LogNormal}(-0.5, 0.5)\) which, combined with our recovery rate prior, implies an \(R_0\) prior with median approximately 3 and 10% prior probability of \(R_0 < 1\), providing a more interpretable weakly informative prior for influenza. We use \(\text{Beta}(1, 99)\) for the initial proportion infected to concentrate prior mass near small outbreak initiations. Prior predictive samples (Figure 3 A) show the resulting SIR trajectories.

sir_params = SIRParams(
    tspan = (0.0, england_data.ts[end]),
    infectiousness = LogNormal(-0.5, 0.5),
    recovery_rate = Gamma(8, 0.03125),
    initial_prop_infected = Beta(1, 99)
)

The SIR differential equations are implemented in the vector field function:

function _sir_vf(du, u, p, t)
    S, I, R = u
    β, γ = p
    du[1] = -β * S * I
    du[2] = β * S * I - γ * I
    du[3] = γ * I
    return nothing
end

The ODEProcess composes the parameter model with solver details. This follows the standard SciML compositional approach of problem definition composed with solution method [12], specialised here to probabilistically generated SIR parameters. The ODE model returns the infected proportion \(I(t)/N\); returning proportions rather than counts improves numerical stability across different population sizes. We choose a switching ODE solver which switches between explicit (Tsit5) and implicit (Rosenbrock23) solvers, helping avoid the ODE solver failing when the sampler tries extreme parameter values.

sir_process = ODEProcess(
    params = sir_params,
    sol2infs = sol -> sol[2, :],
    solver_options = Dict(
        :alg => AutoTsit5(Rosenbrock23()),
        :verbose => false,
        :saveat => england_data.ts
    )
)

The ODEProcess generates infections by solving the ODE system and extracting the infected compartment via sol2infs. EpiAware composes this into a generative model:

@model function EpiAwareBase.generate_latent_infs(epi_model::ODEProcess, Z_t)
    n = isnothing(Z_t) ? 0 : size(Z_t, 1)

    @submodel sol = _generate_ode_solution(epi_model, n)
    I_t = epi_model.sol2infs(sol)

    return I_t
end

Observation model

We link the infected proportion to observed cases using TransformObservationModel, which applies population scaling and the softplus transformation before passing to the Poisson link.

pois = PoissonError()
transform_pois = TransformObservationModel(pois, x -> softplus.(N .* x))

For the stochastic model, we add time-varying ascertainment using Ascertainment, which wraps an observation model with a latent process. Here we reuse the same AR latent model type from earlier case studies, but instead of modelling infection dynamics, we apply it to the observation process. The AR(1) process \(\kappa_t\) operates on the log scale and enters multiplicatively via \(\exp(\kappa_t)\). We specify weakly informative priors: damping near zero (strongly autocorrelated), initial state near zero (minimal baseline adjustment), and small innovation standard deviation. Unlike earlier uses where we called generate_latent() directly, Ascertainment takes the latent model as an argument and handles generation internally. Prior predictive samples (Figure 3 B) show how this modifies expected observations.

ascertainment_ar = AR(
    damp_priors = [HalfNormal(0.005)],
    init_priors = [Normal(0, 0.001)],
    ϵ_t = HierarchicalNormal(std_prior = HalfNormal(0.02))
)

ascertainment_model = Ascertainment(
    model = pois,
    latent_model = ascertainment_ar
)

transform_ascertainment = @set transform_pois.model = ascertainment_model

The Ascertainment wrapper generates the latent process and applies it to observations:

@model function EpiAwareBase.generate_observations(obs_model::Ascertainment, y_t, Y_t)
    @submodel expected_obs_mod = generate_latent(
        obs_model.latent_model, length(Y_t)
    )

    expected_obs = obs_model.transform(Y_t, expected_obs_mod)

    @submodel y_t = generate_observations(obs_model.model, y_t, expected_obs)
    return y_t
end

Fitting to data

We compose both models using EpiProblem. Just as we swap lower-level model components (e.g., replacing a Poisson with an Ascertainment-wrapped Poisson), we can swap components at the EpiProblem level to compare modelling assumptions whilst keeping other parts fixed. We reuse the same inference_method defined in Section 2.1. The only additional requirement for ODE-based models is making SciMLSensitivity available for reverse-mode automatic differentiation through the ODE solutions.

tspan = (1, length(england_data.y_t))
deterministic_sir = EpiProblem(
    epi_model = sir_process,
    latent_model = Null(),
    observation_model = transform_pois,
    tspan = tspan
)

stochastic_sir = @set deterministic_sir.observation_model = transform_ascertainment
deterministic_sir_mdl = generate_epiaware(deterministic_sir, england_data)
deterministic_sir_results = apply_method(
    deterministic_sir_mdl, inference_method, england_data
)
summarystats(deterministic_sir_results.samples)
Summary Statistics
  parameters      mean       std      mcse   ess_bulk   ess_tail      rhat   e     Symbol   Float64   Float64   Float64    Float64    Float64   Float64     ⋯

           β    1.8795    0.0564    0.0028   408.6360   412.5023    1.0074     ⋯
           γ    0.4788    0.0119    0.0006   413.1089   441.7420    1.0040     ⋯
          I₀    0.0005    0.0001    0.0000   381.6523   425.5782    1.0090     ⋯
                                                                1 column omitted
stochastic_sir_mdl = generate_epiaware(stochastic_sir, england_data)
stochastic_sir_results = apply_method(
    stochastic_sir_mdl, inference_method, england_data
)
summarystats(stochastic_sir_results.samples)
Summary Statistics
                    parameters      mean       std      mcse    ess_bulk   ess                       Symbol   Float64   Float64   Float64     Float64    Fl ⋯

                             β    1.8818    0.0634    0.0021    896.0199   730 ⋯
                             γ    0.4847    0.0147    0.0006    711.3084   479 ⋯
                            I₀    0.0005    0.0002    0.0000    860.6886   671 ⋯
  obs.Ascertainment.ar_init[1]    0.0000    0.0010    0.0000   1198.0745   582 ⋯
  obs.Ascertainment.damp_AR[1]    0.0052    0.0038    0.0001    892.9072   554 ⋯
         obs.Ascertainment.std    0.0478    0.0242    0.0012    426.6135   420 ⋯
      obs.Ascertainment.ϵ_t[1]    0.0489    0.9612    0.0296   1048.5278   694 ⋯
      obs.Ascertainment.ϵ_t[2]    0.0445    0.9768    0.0238   1660.9292   643 ⋯
      obs.Ascertainment.ϵ_t[3]   -0.3196    0.9544    0.0281   1159.8692   732 ⋯
      obs.Ascertainment.ϵ_t[4]    0.5427    0.9047    0.0285    971.1511   734 ⋯
      obs.Ascertainment.ϵ_t[5]    0.0917    0.8463    0.0285    880.6414   600 ⋯
      obs.Ascertainment.ϵ_t[6]   -0.4713    0.8596    0.0264   1072.3979   562 ⋯
      obs.Ascertainment.ϵ_t[7]    0.6100    0.8997    0.0282   1017.3369   680 ⋯
      obs.Ascertainment.ϵ_t[8]    1.3195    0.9915    0.0339    905.0531   539 ⋯
      obs.Ascertainment.ϵ_t[9]    1.0188    1.0082    0.0345    908.0064   485 ⋯
     obs.Ascertainment.ϵ_t[10]    0.1037    0.9010    0.0235   1554.4053   887 ⋯
     obs.Ascertainment.ϵ_t[11]   -0.5264    0.9854    0.0316    967.8084   538 ⋯
               ⋮                    ⋮         ⋮         ⋮          ⋮           ⋱
                                                    3 columns and 2 rows omitted
Figure 3: Model components and posterior analysis for Section 2.3. (A) Prior S (top) and I (bottom) compartment trajectories. (B) Prior samples from the stochastic observation model including time-varying ascertainment and Poisson sampling around a latent infection curve. (C) Prior (grey) and posterior \(R_0\) distributions comparing deterministic (blue) and stochastic (orange) models. (D) Posterior S (top) and I (bottom) compartment trajectories for both deterministic (blue) and stochastic (orange) models with 50% credible intervals. (E) Posterior predictive cases for both models, with median and 50% credible intervals compared to observed data (black points). (F) Posterior \(R_t\) over time on a log scale for both models with reference line at \(R_t=1\).

Figure 3 shows that both compositional SIR models recover similar results to Chatzilena et al. Using the deterministic SIR model infers a reproductive number (\(R_0\)) of 3.9 (3.7–4.2 95% CrI), whilst using the stochastic SIR model infers \(R_0\) of 3.9 (3.5–4.2 95% CrI) (Figure 3 C). The deterministic SIR model posterior predictive trajectories (Figure 3 E) appear less well calibrated than those generated using the stochastic SIR model, with several data points falling outside the 95% prediction envelopes. This indicates that the flexible observation model used in the stochastic SIR model is compensating for model misspecification in the deterministic SIR model. Both models show similar posterior compartment dynamics (Figure 3 D) and time-varying reproduction numbers (Figure 3 F).

References

1.
Funk S. EpiAwareR: R interface to the EpiAware compositional infectious disease modelling framework. 2025. Available: https://github.com/sbfnk/EpiAwareR
2.
Mishra S, Berah T, Mellan TA, Unwin HJT, Vollmer MA, Parag KV, et al. On the derivation of the renewal equation from an age-dependent branching process: an epidemic modelling perspective.
3.
Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. American journal of epidemiology. 2013;178: 1505–1512.
4.
Palmer J, Sherratt K, Martin-Nielsen R, Bevan J, Gibbs H, CMMID COVID-19 Working Group, et al. Covidregionaldata: Subnational data for COVID-19 epidemiology. Journal of Open Source Software. 2021;6: 3290. doi:10.21105/joss.03290
5.
Wallinga J, Lipsitch M. How generation intervals shape the relationship between growth rates and reproductive numbers. Proceedings of the Royal Society B: Biological Sciences. 2007;274: 599–604.
6.
Charniga K, Park SW, Akhmetzhanov AR, Cori A, Dushoff J, Funk S, et al. Best practices for estimating and reporting epidemiological delay distributions of infectious diseases. PLoS computational biology. 2024;20: e1012520.
7.
Zhang L, Carpenter B, Gelman A, Vehtari A. Pathfinder: Parallel quasi-newton variational inference. Journal of Machine Learning Research. 2022;23: 1–49.
8.
Abbott S, Hellewell J, Thompson R, Sherratt K, Gibbs H, Bosse N, et al. Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts [version 2; peer review: 1 approved, 1 approved with reservations]. Wellcome Open Research. 2020;5. doi:10.12688/wellcomeopenres.16006.2
9.
Park SW, Sun K, Abbott S, Sender R, Bar-On YM, Weitz JS, et al. Inferring the differences in incubation-period and generation-interval distributions of the delta and omicron variants of SARS-CoV-2. Proceedings of the National Academy of Sciences. 2023;120: e2221887120.
10.
Zhao S, Lin Q, Ran J, Musa SS, Yang G, Wang W, et al. Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in china, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak. International journal of infectious diseases. 2020;92: 214–217.
11.
Chatzilena A, Zyl G van, Manson AL, Earl AM, Jamieson FB, Joloba ML, et al. Contemporary tuberculosis transmission in Lagos, Nigeria: Spatial patterns and model calibration using genomic epidemiology. Epidemics. 2019;29: 100363. doi:10.1016/j.epidem.2019.100363
12.
Rackauckas C, Nie Q. DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in julia. The Journal of Open Research Software. 2017;5: 15. doi:10.5334/jors.151
13.
Jombart T, Frost S, Nouvellet P, Lythgoe K, Ghozzi S, Cori A, et al. Outbreaks: A collection of disease outbreak data. 2020. Available: https://CRAN.R-project.org/package=outbreaks