Outbreak Detection

Surveillance algorithms usually work on regularly spaced, aggregated time series of case counts. Let \(\mathbf{x} = (x_1, \dots, x_T)\) be such a time series with entries at regularly spaced, discrete timepoints \(t\). An entry \(x_t\) of that time series is the number of cases observed in the corresponding time period.

Time Point Classification

Based on this we can view the problem as a sequential supervised learning problem [Die02], in which the sequence of counts is paired with a sequence of outbreak labels \((\mathbf{x}, \mathbf{y})\), with \(\mathbf{x} = (x_1, \dotsc, x_T)\), \(x_t \in \mathbb{N}_0\), and \(\mathbf{y} = (y_1, \dotsc, y_T)\), \(y_t \in \mathbb{B}\). Each timepoint \(t\) is assigned a boolean label that indicates whether outbreak cases were present in the aggregation time interval. We call this problem time point classification. This is the standard formulation of common surveillance algorithms.

Time Series Classification

The time point formulation can be extended into a time series formulation by dividing the time series \(\mathbf{x}\) into smaller time series \(\mathbf{x}_j\) and assigning the label of the last time point of each to the whole sub-series. Thus a data set \(\{(\mathbf{x}_j, y_j)\}_{j=1}^T\) is obtained. This formulation is especially useful for incorporating reporting delay: the information available for time point \(t = j\) can differ considerably depending on whether \(j\) is relatively recent, e.g. \(j = T\), or already some time in the past, because information sometimes arrives slowly in epidemiological surveillance systems. We call this problem formulation time series classification.
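The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `to_series_dataset` and the optional `window` parameter are hypothetical and not part of epysurv, and the text does not specify how the sub-series are cut, so trailing sub-series ending at each time point are assumed here.

```python
from typing import List, Optional, Sequence, Tuple


def to_series_dataset(
    counts: Sequence[int],
    labels: Sequence[bool],
    window: Optional[int] = None,
) -> List[Tuple[List[int], bool]]:
    """Pair each sub-series ending at time point j with the label of its last point.

    `window` (hypothetical parameter) limits the sub-series length;
    with `window=None` each sub-series is the full prefix x_1, ..., x_j.
    """
    if len(counts) != len(labels):
        raise ValueError("counts and labels must have the same length")
    dataset = []
    for j in range(1, len(counts) + 1):
        start = 0 if window is None else max(0, j - window)
        # The sub-series inherits the label of its final time point.
        dataset.append((list(counts[start:j]), labels[j - 1]))
    return dataset


counts = [3, 2, 4, 9, 3]
labels = [False, False, False, True, False]
dataset = to_series_dataset(counts, labels, window=3)
```

Each of the \(T\) time points yields one labelled sub-series, so the resulting data set has the same number of entries as the original series.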

Models

As of now all models included in epysurv work on univariate time series of counts. Extensions to multivariate time series and incorporation of spatial data exist in the R surveillance package, but their inclusion is only planned for later releases.

The currently included models can be viewed as semi-supervised techniques from a machine learning or anomaly detection perspective [CBK09]. All models are fitted to historic data, assuming that these data represent the normal state of the system. From the fitted model, an estimate for the case count of the current week is computed. This estimate is compared to the number of cases reported in the current week. If the observed case count exceeds the expected number by some threshold, an alarm is raised. Most models in fact compute a predictive distribution for the case count and raise an alarm if the observed number exceeds a certain quantile of this distribution.

Window-based Approaches

The simplest outbreak detection algorithms are window-based approaches. They compute the expectation for the current week from a moving window of fixed size. For example, the EarsC1 algorithm computes its predictive distribution from the mean and standard deviation of the last seven timepoints, using a normal distribution.
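A minimal sketch of the window-based idea described above, using only the Python standard library. It is not the epysurv or R surveillance implementation: the function name `window_alarm`, the default `alpha`, and the choice to exclude the current observation from the baseline are assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev
from typing import Sequence


def window_alarm(counts: Sequence[float], window: int = 7, alpha: float = 0.001) -> bool:
    """Return True if the last count exceeds the (1 - alpha) normal quantile
    computed from the `window` observations that precede it."""
    if len(counts) < window + 1:
        raise ValueError("need at least window + 1 observations")
    baseline = counts[-(window + 1):-1]  # the `window` points before the current one
    current = counts[-1]
    # Upper quantile of a normal distribution with the baseline mean and
    # sample standard deviation: mean + z * sd.
    z = NormalDist().inv_cdf(1 - alpha)
    threshold = mean(baseline) + z * stdev(baseline)
    return current > threshold


history = [4, 5, 6, 5, 4, 6, 5]
quiet_week = window_alarm(history + [5])
unusual_week = window_alarm(history + [12])
```

With this baseline, a current count of 5 stays below the threshold, while a count of 12 exceeds it and raises an alarm.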

Because of the short time interval considered, these approaches are naturally insensitive to seasonality and trend. However, recent outbreaks can contaminate the baseline data, which reduces the sensitivity of the algorithms.

This category includes the Ears family [HTST03], the CDC algorithm [SWHK89], and the RKI algorithm [SSHohle16].

GLM-based Approaches

Approaches based on Generalized Linear Models (GLMs) form a popular group of outbreak detection algorithms. They compute a predictive distribution for the current week by fitting a GLM to previous data. An alarm is raised if the current observation is unlikely under the predictive distribution, as controlled by a significance level \(\alpha\). Often Poisson or negative binomial models are used to account for the count nature of the data. Terms that accommodate seasonality and trend are often incorporated as well. GLM-based approaches include the classical Farrington algorithm [FABC96] and its more recent extension [NEF+13].
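As an illustration of the general shape of such a model, a Poisson GLM with a linear trend and one harmonic seasonal term could be written as follows. This is a sketch only: the symbols \(\mu_t\), \(\beta_0\), \(\beta_1\), \(\gamma_1\), \(\gamma_2\) and the period of 52 (weekly data) are assumptions, and this is not the exact specification of the Farrington algorithm or its extension.

```latex
\[
x_t \sim \operatorname{Poisson}(\mu_t), \qquad
\log \mu_t = \beta_0 + \beta_1 t
  + \gamma_1 \sin\!\left(\frac{2\pi t}{52}\right)
  + \gamma_2 \cos\!\left(\frac{2\pi t}{52}\right)
\]
\[
\text{alarm at } t \iff x_t > q_{1-\alpha},
\]
```

where \(q_{1-\alpha}\) denotes the \((1-\alpha)\) quantile of the predictive distribution obtained from the fitted model.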

Cusum-based Approaches

Both window-based and GLM-based approaches have the downside that they only incorporate evidence from the current week. Larger outbreaks that build up slowly could therefore easily be missed. Cusum-based approaches are inspired by models from statistical process control [Oakland2007] and incorporate evidence from previous timepoints. Instead of computing a predictive distribution, they accumulate evidence that the observed case counts originate from an epidemic until a certain threshold is exceeded. At that point an alarm is raised and the sum is reset.
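The accumulate-and-reset behaviour described above can be sketched with a basic one-sided cumulative sum. This is a simplified textbook form, not the method of any specific algorithm in this family: the function name `cusum_alarms` and the parameters `reference` (expected count plus allowance) and `threshold` are assumptions chosen for illustration.

```python
from typing import List, Sequence


def cusum_alarms(counts: Sequence[float], reference: float, threshold: float) -> List[bool]:
    """One-sided CUSUM: accumulate the excess of each count over `reference`,
    raise an alarm when the sum exceeds `threshold`, then reset the sum."""
    alarms = []
    s = 0.0
    for x in counts:
        # Evidence never drops below zero, so quiet periods do not build up credit.
        s = max(0.0, s + x - reference)
        alarm = s > threshold
        alarms.append(alarm)
        if alarm:
            s = 0.0  # reset after an alarm
    return alarms


counts = [5, 6, 7, 7, 8, 5]
alarms = cusum_alarms(counts, reference=5.5, threshold=4.0)
```

In this example no single count is extreme, but the small excesses accumulate over several timepoints until the fifth observation pushes the sum over the threshold.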

Cusum-based approaches include the Cusum [RLM99], generalized likelihood ratio methods based on Poisson [Hohle2006] or negative binomial [Hohle2008] distributions, and the OutbreakP method [FrisenASchioler09].

CBK09

Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection. ACM Computing Surveys, 41(3):1–58, jul 2009. URL: http://www.cs.umn.edu/sites/cs.umn.edu/files/tech_reports/07-017.pdf, doi:10.1145/1541880.1541882.

Die02

Thomas G Dietterich. Machine Learning for Sequential Data: A Review. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), 15–30. Springer, Berlin, Heidelberg, 2002. URL: http://link.springer.com/10.1007/3-540-70659-3_2, arXiv:0-387-31073-8, doi:10.1007/3-540-70659-3_2.

FABC96

C. P. Farrington, N. J. Andrews, A. D. Beale, and M. A. Catchpole. A Statistical Algorithm for the Early Detection of Outbreaks of Infectious Disease. Journal of the Royal Statistical Society. Series A (Statistics in Society), 159(3):547, 1996. URL: https://www.jstor.org/stable/10.2307/2983331?origin=crossref, doi:10.2307/2983331.

FrisenASchioler09

Marianne Frisén, E Andersson, and L Schiöler. Robust outbreak surveillance of epidemics in Sweden. Statistics in Medicine, 28(3):476–493, 2009. URL: www.interscience.wiley.com, arXiv:NIHMS150003, doi:10.1002/sim.3483.

HTST03

Lori Hutwagner, William Thompson, G Matthew Seeman, and Tracee Treadwell. The bioterrorism preparedness and response Early Aberration Reporting System (EARS). Journal of Urban Health: Bulletin of the New York Academy of Medicine, 80(2 Suppl 1):i89–i96, 2003. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3456557/pdf/11524_2006_Article_200.pdf, doi:10.1007/PL00022319.

NEF+13

Angela Noufaily, Doyo G. Enki, Paddy Farrington, Paul Garthwaite, Nick Andrews, and André Charlett. An improved algorithm for outbreak detection in multiple surveillance systems. Statistics in Medicine, 32(7):1206–1222, 2013. URL: http://ojphi.org, doi:10.1002/sim.5595.

RLM99

G Rossi, L Lampugnani, and M Marchi. An Approximate Cusum Procedure for. Statistics in Medicine, 2122(November 1997):2111–2122, 1999.

SSHohle16

Maëlle Salmon, Dirk Schumacher, and Michael Höhle. Monitoring Count Time Series in R: Aberration Detection in Public Health Surveillance. Journal of Statistical Software, 2016. URL: http://www.jstatsoft.org/v70/i10/, arXiv:1411.1292, doi:10.18637/jss.v070.i10.

SWHK89

Donna F Stroup, G David Williamson, Joy L Herndon, and John M Karon. Detection of aberrations in the occurrence of notifiable diseases surveillance data. Statistics in Medicine, 8(3):323–329, 1989.