User Guide

Using epysurv models should be straightforward if you are familiar with scikit-learn and pandas.

Data Format

Let’s first consider the models in the epysurv.models.timepoint package. Each model has fit and predict methods that take a pandas.DataFrame representing an epidemiological count time series of the following form:

            n_cases  n_outbreak_cases
2004-01-05        0                 0
2004-01-12        0                 0
2004-01-19        2                 0
2004-01-26        2                 0
2004-02-02        1                 0

The data frame needs to have a regular DatetimeIndex and two columns containing case counts. n_cases represents the total number of cases observed and n_outbreak_cases the number of cases that are labeled as belonging to an outbreak. Therefore n_cases should always be greater than or equal to n_outbreak_cases, as there cannot be more outbreak cases than cases in total. Note also that each row represents the number of cases observed in the period between the row’s time point and the next time point. So in the above example the first row denotes that there were zero cases observed from 2004-01-05 up to 2004-01-11 inclusive.
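As a minimal sketch, the example frame above can be built with plain pandas. The variable names and the two sanity checks are illustrative assumptions, not part of epysurv:

```python
import pandas as pd

# Weekly time points starting on Monday 2004-01-05, as in the example above.
index = pd.date_range(start="2004-01-05", periods=5, freq="W-MON")

data = pd.DataFrame(
    {
        "n_cases": [0, 0, 2, 2, 1],
        "n_outbreak_cases": [0, 0, 0, 0, 0],
    },
    index=index,
)

# Illustrative checks of the format requirements described above.
# A regular index has exactly one distinct spacing between consecutive rows.
is_regular = data.index.to_series().diff().dropna().nunique() == 1
# Outbreak cases can never exceed the total number of cases.
counts_consistent = bool((data["n_cases"] >= data["n_outbreak_cases"]).all())

print(data)
print(is_regular, counts_consistent)
```

Running such checks before fitting is optional, but it catches irregular indices and inconsistent labels early.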

Fitting

When the data frame is passed to fit, the outbreak cases are subtracted from the total cases to obtain the in-control time series, i.e. the time series without outbreaks.

If you do not have any labeled outbreak data, but just the raw counts, the n_cases column will be taken as is under the assumption that your data is in fact in-control data. A warning is still issued in this case.
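The preprocessing described above can be sketched in plain pandas. The helper in_control_counts and its warning text are hypothetical and only mirror the described behavior; they are not the library's actual implementation:

```python
import warnings

import pandas as pd


def in_control_counts(data: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: derive the in-control series from a count frame."""
    if "n_outbreak_cases" in data.columns:
        # Labeled data: remove the outbreak cases from the totals.
        return data["n_cases"] - data["n_outbreak_cases"]
    # Unlabeled data: assume the raw counts are already in control, but warn.
    warnings.warn("No outbreak labels found; treating n_cases as in-control data.")
    return data["n_cases"]


index = pd.date_range(start="2004-01-05", periods=3, freq="W-MON")
labeled = pd.DataFrame(
    {"n_cases": [3, 4, 2], "n_outbreak_cases": [0, 1, 2]}, index=index
)

print(in_control_counts(labeled).tolist())  # [3, 3, 0]
```

Calling the helper on a frame that has only an n_cases column returns that column unchanged and emits the warning.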

Prediction

At prediction time only the total case counts are required. The data frame passed to predict needs to consist of observations that are spaced at the same regular time intervals as the training data. All data points should lie strictly in the future of the training data. The data frame returned is the original data augmented by an alarm column that indicates whether or not the model predicts an outbreak at that time point.

            n_cases  alarm
2011-01-03        1    0.0
2011-01-10        0    0.0
2011-01-17        3    0.0
2011-01-24        3    0.0
2011-01-31        3    0.0
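The two requirements on the prediction data (same regular spacing, strictly after the training data) can be checked up front. The following is a sketch using plain pandas; index_step and check_prediction_frame are hypothetical helpers, not epysurv functions:

```python
import pandas as pd


def index_step(index: pd.DatetimeIndex) -> pd.Timedelta:
    """Return the single spacing of a regular index, or raise if irregular."""
    steps = index.to_series().diff().dropna().unique()
    if len(steps) != 1:
        raise ValueError("index is not regularly spaced")
    return pd.Timedelta(steps[0])


def check_prediction_frame(train: pd.DataFrame, test: pd.DataFrame) -> None:
    """Hypothetical check that `test` is a valid continuation of `train`."""
    if index_step(train.index) != index_step(test.index):
        raise ValueError("training and prediction data have different spacing")
    if test.index.min() <= train.index.max():
        raise ValueError("prediction data must lie strictly after training data")


train = pd.DataFrame(
    {"n_cases": [2, 1, 0, 4]},
    index=pd.date_range(start="2010-12-06", periods=4, freq="7D"),
)
test = pd.DataFrame(
    {"n_cases": [1, 0, 3, 3, 3]},
    index=pd.date_range(start="2011-01-03", periods=5, freq="7D"),
)

check_prediction_frame(train, test)  # passes silently
```

Swapping the two arguments raises a ValueError, because the data would then precede the frame it is supposed to follow.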

Using Time Series Classification Models

For each model in the epysurv.models.timepoint package there is a corresponding model in the epysurv.models.timeseries package. These models perform the same task, but make a binary prediction (alarm / no alarm) for an entire time series instead of a single time point. See Time Series Classification for a more detailed discussion. Accordingly, both fit and predict take an iterable of pairs, each consisting of a data frame in the format described above and a label: Iterable[Tuple[DataFrame, bool]]. The label indicates whether the last time point of the time series is to be considered an outbreak. The predict method in this case returns only a time series of alarms.
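One way to produce input of type Iterable[Tuple[DataFrame, bool]] from a single labeled series is to cut it into growing windows. This is only a sketch: the helper labeled_windows, its min_length parameter, and the rule that a window is labeled as an outbreak when its last row has n_outbreak_cases > 0 are all assumptions made for illustration:

```python
from typing import Iterator, Tuple

import pandas as pd


def labeled_windows(
    data: pd.DataFrame, min_length: int = 3
) -> Iterator[Tuple[pd.DataFrame, bool]]:
    """Hypothetical helper: yield (time series, label) pairs from one frame."""
    for end in range(min_length, len(data) + 1):
        window = data.iloc[:end]
        # Assumed labeling rule: the last time point is an outbreak if it
        # contains at least one outbreak case.
        label = bool(window["n_outbreak_cases"].iloc[-1] > 0)
        yield window, label


data = pd.DataFrame(
    {"n_cases": [0, 1, 2, 5, 1], "n_outbreak_cases": [0, 0, 0, 3, 0]},
    index=pd.date_range(start="2004-01-05", periods=5, freq="W-MON"),
)

pairs = list(labeled_windows(data))
print([(len(window), label) for window, label in pairs])
# [(3, False), (4, True), (5, False)]
```

Each yielded window keeps the regular DatetimeIndex of the original frame, so it still matches the data format described earlier.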