Spatial clustering of COVID-19 in Punjab

14 August 2020
NEW RESEARCH NOTE BASED ON FINDINGS FROM PUNJAB

In April 2020, the Centre for Policy Research (CPR) began a collaboration with the Government of Punjab (GoP) to revise the State’s COVID-19 testing strategy. This is the second in a series of research notes based on the findings of this collaboration.

Understanding the spatial dimensions of COVID-19 spread is critical to planning for the next stage of the pandemic. This note is based on data from positive cases between April 1st 2020 and May 28th 2020. Using polling booths and their corresponding zones as units of observation, our analysis reveals that:

  • Every district in Punjab has a distinct pattern of spread;
  • Even within each district, cases are clustered within very localised spaces. Our analysis finds that the disease outbreak is in fact clustered at the polling booth level.

Although this is only an early slice of the testing data in Punjab, we believe the larger empirical insights are generalizable. Infection rates can vary widely from one neighborhood to the next, so the polling booth offers a convenient scale at which to understand spatial clustering. Furthermore, being a contact of a confirmed COVID-19 case is one of the strongest predictors of a detected infection, which is why it forms the basis of contact tracing protocols. Given that so much of the testing arises from contact tracing, and contacts are likely to be spatially proximate to the initial seed of an infection, we expect the data to continue to demonstrate this highly spatially clustered structure.

The policy implications of this analysis are significant. If cases are indeed as localised as this note shows, a very agile and mobile policy response is required. Healthcare workers, for instance, can be redeployed to specific clusters to ensure that facilities near where outbreaks occur are strengthened. Since the analysis of this data, the lockdown has been lifted and movement within and beyond Punjab has grown. This puts even greater pressure on governments to efficiently contain outbreaks at a disaggregated level, for which data and logistical demands are of utmost and urgent importance.

Disparate trends at the district level

To understand COVID-19 positivity in Punjab, our team plotted the number of positive cases between April 1st and May 28th for every district in the state. This data is shown in Figure 1 as a 7-day moving average. This is a particularly valuable time period to study as it coincides almost perfectly with the lockdown, so we can more closely observe the patterns of infection transmission in the population.
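As a minimal sketch of the smoothing described above, the snippet below computes a 7-day moving average over a list of daily positive counts. The note does not say whether the window is trailing or centred, so the trailing window used here is an assumption, and the counts are hypothetical rather than taken from the Punjab data.

```python
from collections import deque


def moving_average(daily_counts, window=7):
    """Trailing moving average.

    For the first few days, the average is taken over the days
    available so far (fewer than `window`).
    """
    recent = deque(maxlen=window)  # automatically drops the oldest day
    smoothed = []
    for count in daily_counts:
        recent.append(count)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical daily positive counts for one district (illustrative only).
daily_positives = [0, 0, 1, 0, 2, 0, 0, 14, 3, 0, 1, 0, 0, 0]
smoothed = moving_average(daily_positives)
```

A single-day jump such as the 14 in this series is spread across the following week, which is why isolated testing events show up as a broad bump rather than a one-day spike.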

What is immediately discernible from the figure is a single spike in many districts around May 3rd, a feature that we explain below as arising from the testing of one specific population: pilgrims returning from Hazoor Sahib in Nanded. Secondly, each district presents a very distinct pattern of epidemic spread during the period of the lockdown.

Jalandhar, for instance, had a steady stream of cases. Ludhiana had very few cases to begin with, but saw a rise during the beginning of May and then sudden starts and stops later in the month. SAS Nagar had intermittent increases in cases in early April, another cluster around the beginning of May, and nothing thereafter. Tarn Taran saw a single cluster at the beginning of May and similarly nothing thereafter. Interestingly, the majority of districts saw very few cases at all.

The takeaway from these disparate experiences is that, throughout the lockdown period (and excluding the single spike), in every week since April 1st at most two districts accounted for at least 80% of the infections in Punjab.
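The weekly concentration figure above can be checked with a calculation along the following lines. This is a sketch under stated assumptions: the record layout (a date and a district label per positive case), the function names, and the sample records are all hypothetical, and weeks are counted in 7-day blocks from the start date.

```python
from collections import Counter, defaultdict
from datetime import date


def min_units_for_share(counts, share=0.8):
    """Smallest number of units whose combined cases reach `share` of the total."""
    total = sum(counts)
    if total == 0:
        return 0
    running = 0
    # Take the largest units first, so the count returned is the minimum.
    for n_units, cases in enumerate(sorted(counts, reverse=True), start=1):
        running += cases
        if running / total >= share:
            return n_units
    return len(counts)


def districts_needed_by_week(cases, start, share=0.8):
    """Map week index -> minimum number of districts covering `share` of cases."""
    weekly = defaultdict(Counter)
    for day, district in cases:
        weekly[(day - start).days // 7][district] += 1
    return {
        week: min_units_for_share(list(per_district.values()), share)
        for week, per_district in sorted(weekly.items())
    }


# Hypothetical case records: (date of positive result, district label).
sample_cases = (
    [(date(2020, 4, 1), "District A")] * 6
    + [(date(2020, 4, 2), "District B")] * 3
    + [(date(2020, 4, 3), "District C")] * 1
    + [(date(2020, 4, 9), "District B")] * 5
)
needed = districts_needed_by_week(sample_cases, start=date(2020, 4, 1))
```

In this toy data, the first week needs two districts to reach 80% of cases and the second week needs one.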

Figure 1: Daily Test Positivity by District

A lack of available data, however, may mean that this is still too broad a generalisation. The data shared with us did not include pre-coded information on geographical location within each district, so for the first 379 positive cases (until April 20th) we attempted to match addresses to polling booths manually. As polling booths typically have around 1,000 voters, they provide one way to characterise a fairly small geographical area, and they hew to natural neighborhood boundaries (and even sub-neighborhood boundaries for larger neighborhoods).

We were able to complete this match for 347 of the 379 cases.[1] Out of an abundance of caution, we conducted this exercise twice and, in cases of doubt where addresses were not specific enough, we allowed the same address to be linked to two (or more) polling booths. In 30% of cases, we were able to match the address to a single polling booth, and in 68% of cases, to two. Taken together, about 97% of our addresses can be matched to within seven polling booths (about 7,000 voters), a fairly compact spatial area around the size of a medium-sized colony.

COVID-19 cases are highly clustered

Our matching exercise confirms that even below the level of the district, cases remain very densely clustered. Notably, while the 347 cases in our dataset were drawn from 126 different locations, 80% came from only 57 locations — and only 224 polling booths.

Take Jalandhar, a district that has had an unrelenting stream of COVID-19 cases since April. Within our dataset of 347 cases, we have 85 cases from Jalandhar, of which 80% are from 51 polling booths within the district. This is a very small cluster when one considers that there are 1,864 polling booths in the district of Jalandhar. Put simply, this means that 80% of positive cases occurred in less than 3% of all polling booths in the district.

This clustering phenomenon is observable for the whole of Punjab. Until April 29th, less than 2% of the 11,323 polling booths in Punjab accounted for 80% of cases.[2]
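The percentages quoted in this section follow directly from the booth counts. The short check below reproduces them; pairing the 224 polling booths mentioned earlier with the state-wide total of 11,323 is our reading of the text rather than something the note states explicitly.

```python
# Jalandhar: 80% of the district's matched cases fall in 51 of 1,864 booths.
jalandhar_booth_share = 51 / 1864

# Punjab: assuming the 224 booths cited above are the ones behind the
# "less than 2%" figure, out of 11,323 booths state-wide.
punjab_booth_share = 224 / 11323

print(f"Jalandhar: {jalandhar_booth_share:.2%} of booths")
print(f"Punjab:    {punjab_booth_share:.2%} of booths")
```

Both shares fall under the thresholds given in the text (3% for Jalandhar, 2% for Punjab).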

Does spatial clustering only reflect the extent of testing?

One possible explanation for the clustering of cases is that infections are only found where people are tested; if a district has few cases, it may simply be because few people were tested. An extreme version of this hypothesis is that every district has exactly the same number of cases, but the difference in detected cases is entirely due to differences in testing intensity. We cannot answer this question with certitude since no district in Punjab has undertaken the periodic structured sampling that would be required to distinguish testing intensity from the positivity rate.

We can make some progress by directly examining the intensity of testing and the number of positive cases. One instance where testing intensity clearly does explain the pattern is the coordinated spike across districts in early May. These spikes were due to the extensive testing of returnees from Nanded in Maharashtra. As these pilgrims came back to their respective districts, they were all systematically tested and a substantial proportion were found to be COVID-19 positive.

Nevertheless, this explanation cannot be the full story. Figure 2 plots the number of tests and the test positivity rate (using 7-day moving averages) for each district in Punjab. If it is the case that the sharp clusters we see reflect the intensity of testing, we should also see similar changes in the tests over these sharply defined time periods. We don’t. On the contrary, in virtually all districts, there has been a rise in testing over the period of our data, and we find widely varying test positivity curves over these two months. Therefore, at least some portion of this clustering is due to real variations in the infection rates, rather than testing intensity.[3]
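To illustrate the comparison made here, the sketch below computes a rolling test positivity rate as the ratio of positives to tests over a trailing 7-day window. The exact smoothing used for Figure 2 is not specified in the note, so this construction is an assumption, and the series are hypothetical.

```python
def rolling_positivity(positives, tests, window=7):
    """Positives divided by tests over a trailing window.

    Returns None for any day whose window contains no tests.
    """
    rates = []
    for i in range(len(tests)):
        start = max(0, i - window + 1)
        window_tests = sum(tests[start:i + 1])
        window_positives = sum(positives[start:i + 1])
        rates.append(window_positives / window_tests if window_tests else None)
    return rates


# Hypothetical district: testing is flat, yet positives rise in week two.
tests = [100] * 14
positives = [1] * 7 + [5] * 7
rates = rolling_positivity(positives, tests)
```

In this toy series the number of tests never changes, so the rise in positivity in the second week cannot be attributed to testing intensity. This is the kind of pattern the argument above relies on.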

Figure 2: Daily Number of Tests by District

Why does this clustering matter?

Revealing the clustered pattern of COVID-19 spread in Punjab matters because it helps distinguish between two scenarios. If infections across districts occur in a highly linked manner, then cases may ‘surge’ and then decline in all districts at the same time. But if infections are not tightly linked, then one district may surge while another district’s case numbers remain low; even within the ‘surging’ district, there may be many Tehsils or even Gram Panchayats that are (mostly) COVID-19 free. This leads to two policy implications:

  • Strategic investments in healthcare capacity: The two situations require very different policies. In the first scenario, GoP would have to simultaneously invest in building capacity in all districts - investing in healthcare facilities, ICU beds and ventilators etc. - to meet a surge that arrives at the same time. Under the second scenario, more capacity can be built centrally and around mobile services that can move from one district (or Tehsil) to another. For instance, ICU beds can be increased in strategically central districts with ambulance services to bring in COVID-19 patients from other locations as required. Or, the government could invest in capacity to quickly transport medical equipment from one district to another. This is not an easy task, as districts will be asked to care for patients from other regions, or move equipment to where it is needed most, but with clear planning ahead of time, this is an achievable objective.
  • Localised lockdowns: The localisation of the epidemic also implies that lockdowns, if needed, can be carried out in smaller areas to prevent the spread of the disease. Although we currently lack data on the precise geographical location of patients as well as data from a structured sampling scheme, our analysis thus far suggests that it will be possible to go to levels below the district for subsequent lockdowns. This kind of hyper-localised lockdown would allow the economy to function with fewer disruptions.

Why do we see this clustering of cases?

A final question is why we observe such clustering when epidemic models all suggest that a surge occurs the moment initial infections are seen. We first note that very similar patterns were observed with SARS-CoV-1; even within the single city of Hong Kong, some regions were hit very hard and others not at all. In fact, spatial clustering was a defining feature of that epidemic.

The current thinking is that spatial clustering occurs because some people who fall sick infect many others (‘super-spreaders’), while others infect very few. Research has not yet identified why this may be so: it could be biological, due to variation in viral shedding, or it could be social, through the number of contacts that people have. Regardless of the reason, what is of interest is that once we realise that people have different levels of infectivity, we would expect this to affect the spatial patterns of the epidemic. If an infection starts in a place with a person who has low infectivity, it will die out quickly. But if it starts with someone who has high infectivity, it will spread quickly and infect many others. And if it infects someone with high infectivity who in turn infects another person with high infectivity, then it will surge.
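The mechanism described above can be illustrated with a small simulation. The sketch below is not the model used in this note: it is a standard branching process in which each case draws an individual infectivity from a gamma distribution and then infects a Poisson number of others. The mean of 1.5, the two dispersion values, the cap on outbreak size and the cut-off of 10 cases for a "small" outbreak are all hypothetical choices made for illustration.

```python
import math
import random


def poisson(rng, rate):
    """Draw a Poisson variate (Knuth's method; fine for modest rates)."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def outbreak_size(rng, mean_infections, dispersion, max_cases=200):
    """Total cases descended from one seed infection, capped at max_cases.

    Smaller `dispersion` means more variation in infectivity between
    individuals, while the average stays at `mean_infections`.
    """
    active, total = 1, 1
    while active and total < max_cases:
        new_cases = 0
        for _ in range(active):
            # Gamma with shape=dispersion and scale=mean/dispersion has
            # mean equal to mean_infections.
            infectivity = rng.gammavariate(dispersion, mean_infections / dispersion)
            new_cases += poisson(rng, infectivity)
        active = new_cases
        total += new_cases
    return min(total, max_cases)


def share_small(rng, dispersion, runs=1000, small=10):
    """Fraction of seeded outbreaks that end with fewer than `small` cases."""
    sizes = [outbreak_size(rng, 1.5, dispersion) for _ in range(runs)]
    return sum(size < small for size in sizes) / runs


rng = random.Random(0)
high_variation = share_small(rng, dispersion=0.1)   # a few super-spreaders
low_variation = share_small(rng, dispersion=10.0)   # everyone roughly alike
```

With the same average infectivity, the high-variation setting should leave a much larger share of seeded outbreaks fizzling out at a handful of cases, while a minority grow large. That is the spatially patchy picture described in the text.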

Our preliminary work based on the contact data indeed suggests that there is variation in infectivity in Punjab, as 50% of the infected cases in that data came from only 31 original infections.

Implications of spatial clustering

Epidemiologists often refer to a number known as “R0”: the average number of individuals who become infected due to exposure to a single infected individual. The stress on an average R0 has consumed the thinking on this pandemic, but the emphasis is now slowly shifting to the variation around that average, that is, how much the number of people infected differs from one infected individual to the next. More variation, even if it is inherently unpredictable, is good news since peak-load demands on hospital care will be lower and lockdowns can be implemented at a geographically disaggregated level.

From the standpoint of testing for coronavirus, we know that the most efficient manner to test the population (i.e., understand the pattern of infection across the population with the fewest tests possible) is to administer more tests in areas that have the greatest risk of infection. The fact that at any point in time the vast majority of infections will occur in a small geographic area (a cluster effect) has important implications for how to allocate tests across the population. In particular, this implies that, rather than allocating tests across the state in some fixed manner, it is important to keep a reserve of tests that can be allocated to emerging “hotspots” and processed efficiently across labs in the state.

The key now will be to use all the evidence we have to quickly understand this dispersion and its implication for testing logistics and policy.

POSTSCRIPT: We have shown in this note that the underlying patterns of infection are spatially clustered over a period of low mobility (lockdown). Since the analysis of this data, the lockdown has been lifted and movement within and beyond Punjab has grown. This puts even greater pressure on governments to efficiently contain outbreaks at a disaggregated level, for which data and logistical demands are of utmost and urgent importance.

[1] In most instances, the excluded cases are migrants with addresses outside of Punjab, but a small number of exclusions are due to an inability to match the address.

[2] Testing reflects both testing due to contact tracing and the testing of special populations as designated by ICMR. We find greater clustering among contacts compared to the special populations: 50% of contacts are in only 13 locations while 20 of the 41 cases in the special populations come from 16 locations.

[3] A May ICMR sero-survey that randomly tested across 60 districts of the country and “hotspot” containment zones found significantly higher rates of infection (or previous infection) in the hotspot zones, adding credence to the idea that the true distribution of infection is highly spatially clustered.


    CPR's COVID-19 research group consists of the following:
    Jishnu Das (Centre for Policy Research/Georgetown University), Tyler McCormick (University of Washington), Partha Mukhopadhyay (Centre for Policy Research), Neelanjan Sircar (Centre for Policy Research/Ashoka University), Yamini Aiyar (Centre for Policy Research), Vidisha Mehta, Kanhu Charan Pradhan (Centre for Policy Research), Olivier Telle (Centre for Policy Research/Centre National de la Recherche Scientifique), Harish Sal (Centre for Policy Research), Benjamin Daniels (Georgetown University) and Shamindra Nath Roy (Centre for Policy Research).

    The views shared belong to individual faculty and researchers and do not represent an institutional stance on the issue.