The Poisson distribution is a natural first option when dealing with (practically) unbounded count data. The scanstatistics package provides the function scan_poisson
, which is an expectation-based scan statistic for univariate Poisson-distributed data proposed by Neill et al. (2005).
Theoretical motivation
For the expectation-based Poisson scan statistic, the null hypothesis of no anomaly states that at each location \(i\) and duration \(t\), the observed count is Poisson-distributed with expected value \(\mu_{it}\): \[
H_0 \! : Y_{it} \sim \textrm{Poisson}(\mu_{it}),
\] for locations \(i=1,\ldots,m\) and durations \(t=1,\ldots,T\), with \(T\) being the maximum duration considered. Under the alternative hypothesis, there is a space-time cluster \(W\) consisting of a spatial zone \(Z \subset \{1,\ldots,m\}\) and a time window \(D = \{1, 2, \ldots, d\} \subseteq \{1,2,\ldots,T\}\) such that the counts in \(W\) have their expected values inflated by a factor \(q_W > 1\) compared to the null hypothesis: \[
H_1 \! : Y_{it} \sim \textrm{Poisson}(q_W \mu_{it}), ~~(i,t) \in W.
\] For locations and durations outside of this window, counts are assumed to be distributed as under the null hypothesis. Calculating the scan statistic then involves three steps:
- For each space-time window \(W\), find the maximum likelihood estimate of \(q_W\), treating all \(\mu_{it}\)’s as constants.
- Plug the estimated \(q_W\) into (the logarithm of) a likelihood ratio with the likelihood of the alternative hypothesis in the numerator and the likelihood under the null hypothesis (in which \(q_W=1\)) in the denominator, again for each \(W\).
- Take the scan statistic as the maximum of these likelihood ratios, and the corresponding window \(W^*\) as the most likely cluster (MLC).
Using the Poisson scan statistic
The first argument to scan_poisson
should be a data table with columns ‘location’, ‘duration’, ‘count’ and ‘mu’. The latter two columns contain the observed counts and the estimated Poisson expected value parameters respectively, and the table holds data for the period in which we want to detect anomalies. Locations should be encoded as the integers 1, 2, …, which means that factor variables can be used for this purpose. The duration column counts time backwards, so that a duration of 1 is the most recent time interval, duration 2 is the second most recent, and so on.
We will create such a table by subsetting the NM_popcas
table, which holds the population and the number of brain cancer cases for each year between 1973-1991 and each county of New Mexico. Note that the population numbers are (perhaps poorly) interpolated from the censuses conducted in 1973, 1982, and 1991.
data(NM_popcas)
tab <- NM_popcas[year >= 1986 & year < 1990, ]
tab[, duration := max(year) - year + 1]
tab[, location := county]
We still need to add the column ‘mu’, which should hold the predicted Poisson expected value parameter \(\mu_{it}\) for each location \(i\) and time interval \(t\). In this example we would like to detect a potential cluster of brain cancer in the counties of New Mexico during the years 1986-1989. Thus, we will use data from the years prior to 1986 to estimate the Poisson parameter for all counties in the years following. A simple generalized linear model (GLM) with a linear time trend and an offset for county population size will suffice to demonstrate the scan statistic. We fit such a model and create the needed column as follows:
mod_poisson <- glm(count ~ offset(log(population)) + 1 + I(year - 1985),
data = NM_popcas[year < 1986, ],
family = poisson(link = "log"))
# Add the expected value parameter column
tab[, mu := predict(mod_poisson, tab, type = "response")]
We can now calculate the Poisson scan statistic. To give us more confidence in our detection results, we will perform 99 Monte Carlo replications, by which data is generated using the parameters from the null hypothesis and a new scan statistic calculated. This is then summarized in a \(p\)-value, calculated as the proportion of times the replicated scan statistics exceeded the observed one. The output of scan_poisson
is an object of class “scanstatistic”, which comes with the print method seen below.
set.seed(1)
poisson_result <- scan_poisson(tab, zones, n_mcsim = 99)
print(poisson_result)
## Data distribution: Poisson
## Type of scan statistic: Expectation-based
## Number of locations considered: 32
## Maximum duration considered: 4
## Number of spatial zones: 415
## Number of Monte Carlo replicates: 99
## p-value of observed statistic: 0.01
## Most likely event duration: 4
## ID of locations in most likely cluster: 15, 26
As we can see, the most likely cluster for an anomaly stretches from 1986-1989 and involves the locations numbered 15 and 26, which correspond to the counties
counties <- as.character(NM_geo$county)
counties[c(15, 26)]
[1] "losalamos" "santafe"
These are the same counties detected by Kulldorff et al. (1998), though their analysis was retrospective rather than prospective as ours was. Ours was also data dredging (adjective) as we used the same study period with hopes of detecting the same cluster.
A heuristic score for locations
We can score each county according to how likely it is to be part of a cluster in a heuristic fashion using the function score_locations
, and visualize the results on a heatmap as follows:
# Calculate scores and add column with county names
county_scores <- score_locations(poisson_result)
county_scores[, county := counties]
# Create a table for plotting
score_map_df <- merge(NM_map, county_scores, by = "county", all.x = TRUE)
score_map_df[subregion == "cibola",
relative_score := county_scores[county == "valencia", relative_score]]
ggplot() +
geom_polygon(data = score_map_df,
mapping = aes(x = long, y = lat, group = group,
fill = relative_score),
color = "grey") +
scale_fill_gradient(low = "#e5f5f9", high = "darkgreen",
guide = guide_colorbar(title = "Relative\nScore")) +
geom_text(data = centroids,
mapping = aes(x = long, y = lat, label = subregion),
alpha = 0.5) +
ggtitle("County scores")

As we can see, this does not match up entirely with the previous results as Torrance was not part of the most likely cluster.
Finding the top-scoring clusters
Finally, if we want to know not just the most likely cluster, but say the five top-scoring space-time clusters, we can use the function top_clusters
. The clusters returned can either be overlapping or non-overlapping in the spatial dimension, according to our liking.
top5 <- top_clusters(poisson_result, k = 5, overlapping = FALSE)
# Find the counties corresponding to the spatial zones of the 5 clusters.
top_counties <- top5$zone %>%
purrr::map(get_zone, zones = zones) %>%
purrr::map(function(x) counties[x])
# Add the counties corresponding to the zones as a column
top5[, counties := top_counties]
To get \(p\)-values for these clusters, the values of the cluster-specific statistics in the table above can be compared to the replicate scan statistics calculated earlier. These \(p\)-values will be conservative, since secondary clusters from the original data are compared to the most likely clusters from the replicate data sets.
top5[, pvalue := mc_pvalue(statistic, poisson_result$replicated)]
top5
## zone duration statistic
## 1: 49 4 9.1806711
## 2: 3 2 6.8196550
## 3: 140 4 3.5377879
## 4: 10 4 3.4072029
## 5: 9 2 0.8372729
## counties pvalue
## 1: losalamos,santafe 0.01
## 2: chaves 0.01
## 3: bernalillo,lincoln,sierra,socorro,torrance,valencia 0.48
## 4: guadalupe 0.50
## 5: grant 1.00