1.4 Inferring an outbreak source location

Detailed case data can be analysed in isolation in an attempt to infer a region in space that is likely to contain the source of an outbreak. In theory, the recorded movements of each case define the exact spaces each case has occupied and therefore within that record will be the exact location(s) at which each case has received an infectious dose of Legionella bacteria. It is therefore assumed within the analyses described below that the infectious dose has come from a source within relatively close spatial proximity to the locations in space occupied by a case.

The approaches described below are all raster-based GIS operations. Initially data for each case is analysed in isolation to identify those areas in space where a potential source of infection could be located (for that individual case). Each raster operation can treat case data in two ways:

1. Firstly, it can treat each point in space occupied by a case equally, no matter how long that space was occupied, when it was occupied or how many times it was occupied.

2. Secondly, it can weight each point in space in an attempt to assign relative importance to each location. Weighting might be assigned based on the time spent at a location, the number of times a location is visited or the date a space was occupied (based on an incubation-period based weighting).

Having created a raster grid for each case, highlighting the areas in space that could potentially contain the outbreak source, the rasters are combined for all cases to form a composite. The composite raster output is intended to illustrate areas of commonality in space between all cases that could be suggestive of a common source of infection (and can therefore be used to target additional investigation). When creating a composite raster it is important that the relative importance of each of the input case rasters is equal, otherwise a single case could significantly skew the results.
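One possible way of giving each case raster equal importance is to rescale it so that its cell values sum to 1 before the rasters are added together. The following is a minimal sketch of that idea only. It assumes each raster is held as a plain list of rows, and the function names (normalise, composite) and the example values are hypothetical rather than taken from any GIS package.

```python
def normalise(raster):
    """Rescale a case raster so that its cell values sum to 1."""
    total = sum(sum(row) for row in raster)
    if total == 0:
        return [[0.0 for _ in row] for row in raster]
    return [[value / total for value in row] for row in raster]


def composite(case_rasters):
    """Sum the rescaled case rasters cell by cell."""
    rows, cols = len(case_rasters[0]), len(case_rasters[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for raster in case_rasters:
        scaled = normalise(raster)
        for r in range(rows):
            for c in range(cols):
                out[r][c] += scaled[r][c]
    return out


# Invented example: case A has a much "heavier" raw raster than case B.
case_a = [[4, 4, 0],
          [4, 4, 0],
          [0, 0, 0]]
case_b = [[0, 0, 0],
          [0, 1, 1],
          [0, 0, 0]]

result = composite([case_a, case_b])
```

After rescaling, each case contributes a total of 1 to the composite, so the cell shared by both cases (row 1, column 1) stands out even though case A's raw values were larger everywhere. Other rescaling choices (for example dividing by the maximum cell value) are equally plausible.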

Generally speaking, more time will have been spent at each 'point' location visited than at each point in space along a travel route (although this may not always be true). Assuming the likelihood of becoming infected with Legionnaires' disease is dose-dependent and that longer periods of time spent within a dose contour reflect higher received doses, we can normally assume those 'point' locations visited to be the more likely spaces in which infection has taken place. It may therefore be more sensible to apply the methods described below, in the first instance, to the point-based data describing each location visited by a case. It is entirely plausible, however, that in certain scenarios Legionnaires' disease has been contracted whilst travelling, and so the analysis of travel route data can also produce meaningful results.

When considering the outputs of the analyses described below it is useful to overlay any data available on potential source locations (e.g. cooling towers, air scrubbers etc) to identify whether the region inferred by the analysis is within close spatial proximity to a potential source location.

Considerations for a cross-border outbreak: the analytical methods discussed in this section rely on individual-based case data rather than aggregated case data. They are therefore unlikely to be used within a cross-border outbreak scenario where only aggregate data is available.

1.4.1 Basic buffer analysis

Data requirements: Case data

Description: Buffering is the most basic approach that aims to identify areas of commonality in space between cases. Buffers can be generated around the point locations visited/travel routes taken by each case in an attempt to describe an area in which we assume the infectious source could potentially be found. If a grid-based buffering operation is used we can buffer the locations visited/travel routes taken by each individual case and then assign each cell within the buffer a value of 1. Combining the output for each case to form a composite output would give higher counts in the areas that more cases had been near to. Those areas of commonality could suggest an area where the source of infection is more likely to be located. A possible criticism of this methodology, however, is that it assumes any location within the buffer is of equal significance whether it is 1 m or 1 km from the location actually occupied by a case. Figure 1.5 shows an example output from a grid-based buffer analysis.

GIS tools required: A raster-based buffer tool is required for buffering locations visited/travel routes. A raster-based 'sum' tool is required to create a composite raster grid.
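The buffer-and-sum procedure can be sketched in a few lines of code. This is an illustration under stated assumptions, not a substitute for a GIS buffer tool: locations are given as (row, column) cell indices, the cell size and buffer radius are invented values in metres, and the function names (buffer_raster, sum_rasters) are hypothetical.

```python
import math


def buffer_raster(points, rows, cols, radius, cell_size=1.0):
    """Return a grid holding 1 in every cell whose centre lies within
    `radius` of any location visited by a case, and 0 elsewhere."""
    grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for pr, pc in points:
                if math.hypot(r - pr, c - pc) * cell_size <= radius:
                    grid[r][c] = 1
                    break
    return grid


def sum_rasters(rasters):
    """Cell-by-cell sum of equally sized rasters."""
    return [[sum(values) for values in zip(*rows)] for rows in zip(*rasters)]


# Invented locations for three cases on a 5 x 5 grid of 100 m cells.
cases = [[(1, 1)], [(1, 2)], [(3, 3)]]
buffers = [buffer_raster(pts, 5, 5, radius=100.0, cell_size=100.0)
           for pts in cases]
composite = sum_rasters(buffers)
```

Each cell of the composite counts the number of cases whose buffer covers it. Because every case raster contains only 0s and 1s, no single case can contribute more than 1 to any cell.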

Figure 1.5 Example output from a basic buffer analysis.

1.4.2 'Euclidean' buffer analysis

Data requirements: Case data

Description: A potential improvement upon the basic buffering technique would be to use a Euclidean distance measure. In the same way as the basic buffer analysis described in section 1.4.1, the locations visited/travel routes taken by each case are 'buffered' individually; however, the Euclidean distance function can be thought of as a type of continuous buffer, with each cell in the raster grid being assigned a value corresponding to its distance from the closest location visited/travel route taken. Those buffers can then be summed together in the same way as in the basic buffer analysis; however, as the Euclidean distance technique assigns lower values to those cells closer to the locations visited/travel routes taken, it will be those areas in the composite raster with the lowest values that will be of interest. A potential criticism of the Euclidean distance function is that it allocates a value to a cell based only on its distance to the closest point location/travel route, therefore losing any further information provided by other point locations/travel routes. Figure 1.6 shows an example output from a grid-based Euclidean distance buffer analysis.

GIS tools required: A raster-based Euclidean distance tool is required to perform Euclidean distance buffering of locations visited/travel routes. A raster-based 'sum' tool is required to create a composite raster grid.
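As a rough illustration of the Euclidean distance approach, the sketch below assigns each cell the distance to the closest location visited by a case, sums the case rasters and reports the cell with the lowest composite value. Locations are assumed to be (row, column) cell indices, distances are measured in cell units, and the function names and example data are hypothetical.

```python
import math


def distance_raster(points, rows, cols, cell_size=1.0):
    """Each cell holds the distance to the closest visited location."""
    return [
        [min(math.hypot(r - pr, c - pc) for pr, pc in points) * cell_size
         for c in range(cols)]
        for r in range(rows)
    ]


def sum_rasters(rasters):
    """Cell-by-cell sum of equally sized rasters."""
    return [[sum(values) for values in zip(*rows)] for rows in zip(*rasters)]


def lowest_cell(raster):
    """Return the (row, col) index of the smallest value in the raster."""
    cells = ((r, c) for r in range(len(raster)) for c in range(len(raster[0])))
    return min(cells, key=lambda rc: raster[rc[0]][rc[1]])


# Invented data: case 1 visited two locations, cases 2 and 3 one each.
cases = [[(2, 1), (0, 4)], [(2, 3)], [(2, 2)]]
composite = sum_rasters([distance_raster(pts, 5, 5) for pts in cases])
best = lowest_cell(composite)
```

Note that the second location visited by case 1, at (0, 4), has no influence on the value at the lowest cell, because only the closest location counts. This is the loss of information described above.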

Figure 1.6 Example output from a 'Euclidean' buffer analysis

1.4.3 Kernel density analysis

Data requirements: Case data

Description: When analysing the locations visited/travel routes taken by each case, kernel density analysis can overcome the criticism of the Euclidean distance approach because, instead of using only one distance to allocate a value to a cell, the distances to every point location visited (including each point along a travel route) can be considered. Essentially, a smoothed surface is produced by the kernel (of a given search radius or 'bandwidth') visiting each point and weighting the surrounding raster cells based on their distance from that point. Raster cell values are greatest at the point itself and diminish with distance from that point until the search radius (or 'bandwidth') is reached, where a value of 0 is assigned. The functional form of the kernel can be chosen by the GIS user, but a Gaussian (normal) form is often used, with the bandwidth chosen according to the amount of variation the analyst wishes to retain. The values of each kernel surface, taken at each point for an individual case, are then summed together to produce a single smoothed surface. The output raster surfaces for all cases can then be summed to form a composite. Within that composite surface it is those areas with the highest values that we can assume are more likely to be within close proximity of the source of infection.
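The kernel density procedure can be sketched as follows. This is a minimal illustration rather than the algorithm of any particular GIS tool. It assumes a quartic kernel, chosen here only because it falls to exactly 0 at the bandwidth as described above; a Gaussian form would need to be truncated at the search radius to behave the same way. Locations are (row, column) cell indices, the bandwidth is in cell units, and all names and data are hypothetical.

```python
import math


def quartic_kernel(distance, bandwidth):
    """Kernel value: 1 at the point itself, falling to 0 at the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    u = distance / bandwidth
    return (1.0 - u * u) ** 2


def kernel_density(points, rows, cols, bandwidth):
    """Sum the kernel surfaces of every location visited by one case."""
    grid = [[0.0] * cols for _ in range(rows)]
    for pr, pc in points:
        for r in range(rows):
            for c in range(cols):
                d = math.hypot(r - pr, c - pc)
                grid[r][c] += quartic_kernel(d, bandwidth)
    return grid


def sum_rasters(rasters):
    """Cell-by-cell sum of equally sized rasters."""
    return [[sum(values) for values in zip(*rows)] for rows in zip(*rasters)]


def highest_cell(raster):
    """Return the (row, col) index of the largest value in the raster."""
    cells = ((r, c) for r in range(len(raster)) for c in range(len(raster[0])))
    return max(cells, key=lambda rc: raster[rc[0]][rc[1]])


# Invented data: three cases on a 5 x 5 grid, bandwidth of 2 cells.
cases = [[(2, 1), (0, 4)], [(2, 3)], [(2, 2)]]
composite = sum_rasters([kernel_density(pts, 5, 5, 2.0) for pts in cases])
peak = highest_cell(composite)
```

Unlike the Euclidean distance raster, every location within the bandwidth of a cell contributes to that cell's value, so a case with several nearby locations adds more than a case with one.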

An additional advantage of using a kernel density function is that features incorporated in the analysis can be weighted such that certain features have greater influence upon the kernel surface than others. For example, a higher weight could be given to those locations visited on the days that infection was more likely to have occurred (this can be inferred from the date of symptom onset and the assumed incubation period), or weights could be based on the time spent at a location. Assuming the likelihood of becoming infected with Legionnaires' disease is dose-dependent and that longer periods of time spent within a dose contour reflect higher received doses, weighting the analysis by time spent within a 'cell' could help better predict a likely source location. By conducting kernel density analysis with and without various weightings, different patterns may emerge that could add insight to an outbreak investigation.
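A weighting scheme of this kind might be sketched as below. The scheme is an assumption made purely for illustration: each visit is weighted by the hours spent at the location, and visits falling outside an assumed incubation window before symptom onset receive a weight of 0. The window limits (2 to 10 days), the quartic kernel, the function names and the visit records are all placeholders that an investigator would replace with values appropriate to the outbreak.

```python
import math
from datetime import date


def quartic_kernel(distance, bandwidth):
    """Kernel value: 1 at the point itself, falling to 0 at the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    u = distance / bandwidth
    return (1.0 - u * u) ** 2


def visit_weight(visit_date, hours, onset_date, min_days=2, max_days=10):
    """Hours spent at a location, counted only if the visit falls inside
    the assumed incubation window before symptom onset."""
    days_before = (onset_date - visit_date).days
    return float(hours) if min_days <= days_before <= max_days else 0.0


def weighted_density(visits, onset_date, rows, cols, bandwidth):
    """Kernel surface for one case, each visit scaled by its weight."""
    grid = [[0.0] * cols for _ in range(rows)]
    for (pr, pc), visit_date, hours in visits:
        w = visit_weight(visit_date, hours, onset_date)
        if w == 0.0:
            continue
        for r in range(rows):
            for c in range(cols):
                d = math.hypot(r - pr, c - pc)
                grid[r][c] += w * quartic_kernel(d, bandwidth)
    return grid


# Invented visit records: ((row, col), date of visit, hours spent).
onset = date(2024, 3, 15)
visits = [
    ((1, 1), date(2024, 3, 10), 8.0),   # 5 days before onset, long stay
    ((3, 3), date(2024, 3, 10), 1.0),   # 5 days before onset, short stay
    ((0, 4), date(2024, 2, 18), 6.0),   # 26 days before onset, excluded
]
surface = weighted_density(visits, onset, 5, 5, bandwidth=2.0)
```

Running the same data with all weights set to 1 and comparing the two surfaces is one simple way of seeing how much the weighting changes the picture.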

A potential limitation of the kernel density method, however, is that it requires a search radius (or 'bandwidth') parameter to be defined, which can have a significant impact upon the resulting output. If the search radius is too large then the resulting output will be 'over-smoothed', revealing only very general trends, whereas a very small value may identify localised variation but to such an extent that it reveals no new insight beyond simply visualising the source data itself. In the case of a Legionnaires' disease outbreak the search radius could perhaps be seen as the assumed distance at which an outbreak source could potentially infect an individual. However, due to the differing nature of the various aerosol-emitting facilities that could be responsible for an outbreak, this value is not fixed. It is therefore advisable to perform any kernel density analyses with varying search radii and to compare the results. Figure 1.7 provides an example output of a kernel density analysis, and Figure 1.8 provides an example of a weighted kernel density analysis based on the same underlying data, weighted by the time spent at each location.
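The effect of the search radius can be explored by repeating the analysis over several bandwidths and comparing a simple summary of each surface. The sketch below counts the number of cells with a non-zero density for each bandwidth. It again assumes a quartic kernel and (row, column) cell indices, and the bandwidths, locations and function names are hypothetical.

```python
import math


def quartic_kernel(distance, bandwidth):
    """Kernel value: 1 at the point itself, falling to 0 at the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    u = distance / bandwidth
    return (1.0 - u * u) ** 2


def kernel_density(points, rows, cols, bandwidth):
    """Sum the kernel surfaces of every location in `points`."""
    grid = [[0.0] * cols for _ in range(rows)]
    for pr, pc in points:
        for r in range(rows):
            for c in range(cols):
                d = math.hypot(r - pr, c - pc)
                grid[r][c] += quartic_kernel(d, bandwidth)
    return grid


def nonzero_cells(surface):
    """Number of cells that receive any density at all."""
    return sum(1 for row in surface for value in row if value > 0)


# Invented locations on a 7 x 7 grid, analysed with three bandwidths.
points = [(1, 1), (1, 2), (5, 5)]
coverage = {}
for bandwidth in (1.5, 3.0, 6.0):
    surface = kernel_density(points, 7, 7, bandwidth)
    coverage[bandwidth] = nonzero_cells(surface)
```

With the smallest bandwidth the surface covers little more than the input locations themselves, while with the largest every cell in the grid is non-zero and the two clusters merge into one general trend.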

GIS tools required: A raster-based kernel density tool is required to perform kernel density analysis on the locations visited/travel routes taken. A raster-based 'sum' tool is required to create a composite raster grid.

Examples from the literature: Kernel density analysis was employed as part of the response to the 2010 Legionnaires' disease outbreak in South Wales. The analyses identified an area in the Rhymney Valley in which two cooling towers were located.

Figure 1.7 Example output from kernel density analysis.

Figure 1.8 Example output from a weighted kernel density analysis (weighted by time spent at location)

Legionnaires' disease outbreak investigation toolbox
