Assessing the value of integrating national longitudinal shopping data into respiratory disease forecasting models – Nature Communications

Open-source software used

We used the following open-source software in the data collection and analysis:

  1. 1.

    Open source PostgreSQL 13:

  2. 2.


  3. 3.


  4. 4.


  5. 5.


  6. 6.


  7. 7.


  8. 8.


Shopping, mortality and Socio-demographic data

Data was collected for this study through working partnerships with: the UK National Health Service (NHS); a UK commercial high street retailer; and via open source data providers. Data sources and descriptions used in this study can be seen in Supplementary Table 2, which lists the data used for both dependent and independent variables. In Supplementary Table 2, the table column ‘Research relating variable to respiratory disease’ cites the journal papers providing scientific evidence why each variable type was included.

Data underpinning medication sales variables originates from two data-sets made available to the project by a national UK high street retailer, incorporating: 1. a 20% sample of timestamped commercial sales transactions (including over 5 million transactions of non-prescription medications for cough, decongestant and throat) from England and Wales recorded through 2,702,449 loyalty cards from November 2009 to April 2015; and 2. a data-set of all in-store weekly commercial sales transactions in England (over 2 billion) labelled by store location, covering the period between March 2016 to March 2020. Commercial sales data was sales units of all in-store transactions with store location only; no sales transactions were linked to individual customers, and no personal data was used in this study. Other independent variables used in the models are derived from: English indices of deprivation 201961; Nomis official census and labour market statistics62; Housing age data from the Valuation Office Agency 202063; ONS Census 201164; Land use in England from 2018 live tables from the Department for Levelling Up, Housing and Communities and Ministry of Communities and Local Government65; and Weather ERA5 data from the European Centre for Medium-Range Weather Forecasts66.

Data was collected or aggregated to the geographic regions of LTLAs (Local Tier Local Authorities). The LTLAs boundaries used in this study are the 314 local governmental areas across all of England as structured in 2019. LTLAs were used because, as geographic areas used by government data, they would enable direct inclusion of a wide range of demographic, environmental and socioeconomic data previously found to be significant to the risk of death from respiratory disease (Supplementary Table 2). LTLAs were chosen over other spatial divisions, including census areas MSOAs (Middle Layer Super Output Areas) and LSOAs (Lower Layer Super Output Areas), to find a balance between limiting data suppression of deaths done to maintain data de-identification, and maintaining a high enough level of spatial granularity for data to have a significant impact on predictions.

The output variable for the models, respiratory deaths in each of the England’s 314 Local Authorities (LTLAs), originates from two datasets supplied by the UK Office of National Statistics (ONS) and the National Commissioning Data Repository (NDCR) containing: 1. details of total weekly registered deaths from respiratory disease (as the underlying cause)(ICD 10 coding: J00 – J99) in England and Wales from 7th December 2009 to 13th April 201567; and 2. ONS data on all weekly registered deaths from respiratory disease (ICD 10 coding: J00 – J99) by 314 LTLAs in England from 18th March 2016 to 27th March 2020. The health data in this study was used under the terms of usual practice for research as defined in the UK Policy Framework for Health and Social Care Research, with the study designed to investigate the health issues in a population to improve population health68. To keep health data de-identified and extractable from the NDCR, death counts below 5 within an LTLA were suppressed and reported as 5 prior to receipt. Suppression was between 20% and 25%.

Open source data availability for UK LTLAs outside of England was limited, and consequently analysis was for local-level forecasting was restricted to 314 LTLAs in England alone. Even with this restriction other variables under consideration for inclusion such as mobility data, search engine trends, temporal pollen, traffic, and air pollution counts, could not be included as there was no access to these datasets at this level of geospatial granularity or for the 4 year period within the time limits of the project. Data matching, the time series of available target predictions and available dynamic datasets of sales and weather, gave the dataset its time-frame of the 18th March 2016 to 27th March 2020 for weekly deaths by respiratory disease.

Exploratory modelling / National-level Analysis

Preliminary exploration of the national UK relationship between sales data and respiratory deaths was undertaken before modelling at local levels to better understand potential forecast horizons. For this analysis a distinct preceding time period was considered, using ONS data on weekly registered deaths from respiratory disease (as the underlying cause) in England and Wales from 7th December 2009 to 13th April 201567 and medication sales data drawn with forecast horizons of 3, 10, 17, 24, 31 days (forecast horizon refers to the time lapse between the date of registered deaths being predicted and the date of reported sales). Independent variables consisted of ‘week number’ (1 to 52) and features created from commercial sales data, including absolute weekly sales units for: cough medication, dry cough, mucus cough, decongestant and throat healthcare products, reflecting the data partner’s internal product categories. Sales features were aggregated by week from over 5 million over-the-counter transactions recorded by 2,702,449 loyalty cards. Data was temporally stratified into a training (70%) and test set (30%), and out-of-sample performance examined against the held-out test data via RMSE (root mean squared error) and R2 (coefficient of determination). Several linear regression models were trained (one for each forecast horizon at which sales data was engineered) and evaluated to test for the optimal forecast horizon in days between sales and registered deaths (Table 1). Results then guided subsequent feature engineering stages of fully cross-validated local-level modelling experiments, as described in the following section.

Local-level respiratory-death forecasting

The central experiment of this study, with forecasting at local-level, was structured into two phases: modelling and variable importance analysis. Phase 1 investigated a range of random forest regressors to identify an optimized reference model, PADRUS, for forecasting respiratory deaths 17 days in advance (as indicated by exploratory national-level modelling). All available data used in this modelling covered the period between 18th March 2016 and 27th March 2020, with variables reflecting hypothesised associations with deaths from respiratory disease (see Supplementary Table 2 for a full feature list). In order to provide comparison two other models were generated – baseline, using just week number and LTLA population over 65, and – PADRUNOS, which used all variables apart from those derived from sales data. Phase 2 of the analysis sought to explain the impact of the inclusion and importance of different variables on the models’ predictions, including commercial sales data. Phase 2 applied the novel variable importance tool MCR for random forest regressor37,40.

Feature engineering

All raw data was aggregated to LTLA spatial resolutions across England, with 314 LTLAs being assigned weekly values for the dependent variable (target feature) and each of the 56 independent variables (input features). The data used for the dependent variable was all registered deaths from respiratory disease (ICD 10 coding: J00 – J99) by LTLA on a weekly time basis. Independent features used by models can be categorised into dynamic or static variables. Dynamic variables had a forecast horizon of 17 days. Static features, which were assumed to be constant over the period of analysis, were themselves grouped into variables sets including: demographics, indices of multiple deprivation, age of population, housing and land use. Dynamic (temporal) features were grouped into: week number, commercial sales, weather-related. These groupings were used for the Group-MCR (Model Class Reliance), and further details of 56 features were created can be seen in Supplementary Table 2 which lists any data manipulation from the original source (e.g. use of averages/percentages/aggregation approaches).

Feature engineering for sales data

Sales data variables were created from both the total sales from the week ending 24 days before the day of the prediction, and the total sales from the week ending 17 days before the prediction. These are two seven day non-overlapping periods. We refer to these as variables with a 17 or 24 day lag. Online sales were not included, and all data was pre-aggregated to store level. Based on this pre-aggregation, no individual customer information was known, nor could be reconstructed. Features derived from sales data were included as both absolute and relative values for each medication category, with relative values further represented as both a proportion of that weeks store sales, as well as in a form normalised by overall national sales, to account for changing footfall.

Complex data manipulation was needed for feature creation from commercial sales. Sales from the 2354 stores, distributed as they are across England, had to be assigned to each of the 314 English LTLAs. To help achieve this we first introduce some notation. The total number of unit sold across the whole UK in a particular week for product category, c is defined as: salesUK,c.

While total sales across the country can simply be counted, the number of units bought by customers from any particular LTLA, salesL,c, cannot, given that people will travel across LTLA boundaries to purchase medications. To estimate the proportion of sales that arise from customers living in a particular local authority a store catchment model must be be established. We implement a simple Gaussian kernel-based approach that allows us to infer the ‘influence’ each store has over an LTLA, and then use that to attribute sales to LTLAs proportionately.

We establish a two-dimensional Gaussian with variance, \({\sigma }_{s}^{2}\), over each store, s, with a mean, μs, centred at the store’s longitude and latitude. \({\sigma }_{s}^{2}\) was varied depending on a theorised catchment radius (metres) corresponding to a store’s type (Regional Mall 15,000 m, Hospital / Transport Hub 10,000 m, Retail Park / Shopping Centre 8000 m, High Street / Supermarket 5,000 m, Health Centre/Community 3000 m). These catchment ranges reflect 2σ (95% of the kernel) and were devised in collaboration with the data provider. The influence a store, s, had over any point, x, can therefore be assigned as:

$${influence}_{s}(x)=\frac{1}{{\sigma }_{s}\sqrt{2\pi }}{e}^{-{\left(x-{\mu }_{s}\right)}^{2}/2{\sigma }_{s}^{2}}$$


The ‘influence’ the store had over an LTLA is then determined by examining the value of the store’s probability density function at that LTLA’s population weighted centroid. Due to the fact that Gaussian functions have an infinite range, a store will be assigned a non-zero influence for every LTLA across the UK but, in order to simplify analysis, the influence assigned to a point by each store for a given LTLA was zeroed if it fell below a threshold of 3.449937e-318. The ‘total influence’ of each store can then be calculated by summing the influences for the population weighted centroid of every LTLA, L, from the 314 available, \({{{{{{{\mathcal{L}}}}}}}}\):

$$total\,influenc{e}_{s}=\mathop{\sum}\limits_{L\in {{{{{{{\mathcal{L}}}}}}}}}{{{{{{{{\rm{influence}}}}}}}}}_{s}(L)$$


The amount of sales that each store attributes to a given LTLA, L, can then simply be assigned as a proportion of the influence the store has over that LTLA compared to its total influence overall. If sales are then aggregated for each store then we can finally produce an estimate of category sales in that LTLA overall. Formally for any sales category, c, the absolute sales assigned to an LTLA, L, in particularly week, after aggregating all contributions of every store, s, is:

$$total\,sale{s}_{L,c}=\left(\mathop{\sum}\limits_{s\in S}\frac{{{{{{{{{\rm{influence}}}}}}}}}_{s}(L)}{{{{{{{{{\rm{total}}}}}}}}\,{{{{{{{\rm{influence}}}}}}}}}_{s}}\times {{{{{{{{\rm{sales}}}}}}}}}_{s,c}\right)$$


Total sales in any given week, however, does not always represent an informative variable. This is due to the uneven geographical distribution of the retailer’s stores across the country, and the different number of customer bases they serve. A more informative variable is the percentage of sales that a category, c, is contributing that week – in comparison to sales across all product categories, \({{{{{{{\mathcal{C}}}}}}}}\). To this end, we derive a local sales ratio variable for an LTLA, L, as follows:

$$sales\, ratio_{L,c}=\sum\limits_{ {c^{‘} \in {{{{{{\mathcal{C}}}}}}}}\atop{{c^{‘} \ne c}}} \frac{{{{{{{\rm{total}}}}}}\, {{{{{\rm{sales}}}}}}}_{L,c}}{{{{{{{\rm{total}}}}}}\,{{{{{\rm{sales}}}}}}}_{L,c^{‘}}}$$


Here \({{{{{{{\mathcal{C}}}}}}}}\) represents all product categories sold by the retailer (including but not restricted to non-prescription medication categories) and we are therefore normalising by overall sales in that LTLA. This prevents modelling bias towards regions where there is a higher population or number of retailer stores.

It is useful to compare the sales ratio occurring in an LTLA with the national average, as this allows to defend against both feature drift and seasonal fluctuations. To achieve this we define a local sales multiplier variable, which allows us to assess whether non-prescription medication sales are proportionally more dominant in one region compared with another (making this feature more invariant over time):



These three features total sales, sales ratio and local sales multiplier make up the central sales variables within our models (with versions for cough medicines, dry cough medicines, decongestant medicines and throat medicines – products selected due to reports in prior literature associating a rise in their sales with an increase in respiratory illnesses30,33).

Modelling procedure

The analysis of registered deaths from respiratory disease was framed as a regression task with the amount of deaths predicted (the target) a continuous output integer variable, y. The modelling task was prediction of weekly respiratory deaths 17 days in advance for each of the 314 LTLA areas in England from 18th March 2016 to 27th March 2020. For all models (PADRUS, PADRUNOS, baseline) data was temporally stratified into the same training set (45,844 data points from 18th March 2016 to 28th December 2018), approx. 70%) and test set (20,410 data points from 4th January 2019 to 27th March 2020, approx. 30%)

Random forest regressors were trained and evaluated, with meta-parameters for the model optimised using a time series cross-validation (TSCV, 4 splits) and grid search to prevent over-fitting. TSCV is an alternative to k-fold cross-validation, which is applied to prevent temporal data leakage and the risk of over-reported accuracy results. TSCV iteratively applies a ‘walk-forward’ testing process, in order to robust evaluate a models’ ability to generalise. Once optimal meta-parameters had been identified, the model was re-fit to the training dataset, and evaluated against the held-out test set. Three metrics were used to score the regression task: RMSE (root mean squared error), MAE (mean absolute error) and R2 (coefficient of determination).

To assess whether the PADRUS model created was valuable, two baseline models were created. The first, baseline, was created using the same methodology; a random forest regressor optimized using a time series cross-validation grid-search on training data. The baseline performed the same task of predicting weekly respiratory deaths 17 days in advance for each of the 314 LTLA areas in England from 18th March 2016 to 27th March 2020. Data was stratified using the same splits on the training and test data (45,844 training data points, 20,410 test data points), and model predictions on the held out test data were scored using RMSE, MAE and R2. The only difference between PADRUS and the baseline model is its feature inputs, with the baseline inputs consisted of only: the week number (1 to 52) for a dynamic input, and secondly the static input of the population of over 65s in each LTLA. The first feature was chosen because of the known seasonality of deaths by respiratory disease47, yet used alone the prediction accuracy levels were very low. Therefore the second feature was chosen due to the assumption that the size and age of population would greatly influence the number of deaths, with 90% of deaths from respiratory disease in Europe occurring in those aged 65 and over43. The second comparative model, PADRUNOS, was formed in exactly the same way, and used the exact same features as PADRUS except any that related to sales data (see §4.4.5)

Variable importance analyses

In order to understand the impact of the different features driving predictions in the PADRUS model, we conducted a range variable importance analyses. First standard permutation importance was used to score of each feature in producing the model’s predictions, relaying the expected increase in error after the a features contents are randomly shuffled. Second, a SHAP (SHapley Additive exPlanations) Analysis46 was conducted. SHAP carries out a more complex calculation of feature importance, with the Kernel Explainer SHAP tool used in this analysis applying weighted linear regression to determine the importance of each feature based on Shapley values, as derived from game theory46. Both tools were applied to the training data set on an arbitrary instance of the PADRUS model. The SHAP analysis is computationally expensive approach and therefore was run on two random samples of data points of 100 and 1000. This enabled a comparison of results between sample sizes to ensure a large enough sample had been evaluated.

By running these variable importance tools on one arbitrary model, misleading results can be given due to the stochastic nature of the models machine learning algorithms. This is a problem because different instances of an optimised model class can use different variables (features) and in different ways to achieve the same model predictive performance. To address the problem of standard variable importance tools only evaluating one instance of the PADRUS model, MCR (Model Class Reliance) was applied. MCR was developed by Fisher et al. to compute the feature importance bounds across all optimal models called the Rashomon set for Kernel (SVM) Regression (polynomial run-time)36. Smith, Mansilla and Goulding introduced a new technique that extends the computation of MCR to Random Forest classifiers and regressors37. MCR builds on permutation for a single model, computing the permutation feature importance bounds (MCR-, MCR+) for an input variable across all instances of PADRUS; calculating the minimum and maximum impact a variable could have on the predictions across all instances of the model (See Fig. 2). This initial MCR analysis evaluated each feature individually for its importance, for example the importance of ‘minimum temperature’. It did not assess how important a dataset type used to create a number of the models’ features was to predictions as a whole, for example ‘weather’. Ljevar et al. introduced a Grouped Feature approach to MCR for random forest40. Group-MCR was created in order to calculate the effects of variable groups, measuring the importance of a collection of features together on the predictions of random forest classifiers40. In order to evaluate the importance of groups of variables, and their impact in concert on the PADRUS model, we apply for the first time Group-MCR to a random forest regressor. Group-MCR is achieved through a modification of the random forest MCR algorithm, which reconsiders the definition of Model Reliance to be with respect to a group of variables rather than a single variable40.

Modelling without sales data

Although MCR can explain which variables need to be included in a model to achieve the maximum accuracy rates for predictions, it can not compute the difference in accuracy (the loss) if those variables were to be excluded from the model. In order to determine the loss in accuracy if sales data were to be left out of the model, the model PADRUNOS (Predicting the amount of deaths from respiratory disease using no sales) was created as a comparison to PADRUS. PADRUNOS was created inline with the baseline and PADRUS model methodology, as a random forest regressor optimized using a time series cross-validation grid-search on training data. MCR and Group-MCR was applied to PADRUNOS to calculate how the variables were used differently. The three metrics used to score the regression task were compared for both models PADRUS and PADRUNOS. These scores were examined overall, by week, by LTLA and by winter months. Wilcoxon signed-rank tests (two-sided) were used to compare the significance of model’s predictive accuracy scores.

Models for further understanding and reference

In order to determine the loss in accuracy if each set of grouped variables were to be left out of the model, and to enhance understanding of how the MCR variable importance tool functions by comparison, separate random forest regressor models were created each omitting a different variable group (See Supplementary Table 3). A further baseline random forest regressor model was also created using the variables week number and LTLA identifier only, to compare the accuracy of predicting on yearly historical averages of deaths only. A final comparative model was then created by incorporating the LTLA identifier variable into a random forest regressor alongside the 56 variables used in PADRUS (See Supplementary Table 3). The latter aimed to assess whether the LTLA identifier itself provided any additional predictive information beyond the 56 separate features that offered subject-specific data about each LTLA.

Ethics and data management

This work was carried out within an NHSX PhD Internship: Placement Programme69 and continued post-placement partnership. After reviewing the project proposal, NHS England (formerly NHSX) determined that the NHS Research Ethics Committee was not required based on criteria68,70,71. Health data in this study was used under the terms of usual practice for research as defined in the UK Policy Framework for Health and Social Care Research68, with the study designed to investigate the health issues in a population in order to improve population health. To keep health data de-identified and extractable from the National Commissioning Data Repository (NDCR), death counts below 5 within an LTLA were suppressed and reported as 5 prior to receipt. Health data was deleted following the study’s conclusion. Ethics approval for the study was additionally obtained from the University of Nottingham ethics panel (Faculty of Social Sciences NUBS ethics review reference: 201829095).

Both sales and health data were aggregated weekly to Local Authority areas for an ecological analysis, and abided by the retailers privacy policy for data use, and the lawful GDPR basis for processing consumption data was encompassed by: ‘the performance of a task in the public interest’ (GDPR 6(1)(e)) and ‘processing necessary for scientific research purposes’ (GDPR 9(2)(j)). Transactions logged via loyalty cards were geographically aggregated to a national level (England and Wales) and aggregated temporally to a weekly sales total. Sales units of in-store transactions were tagged with store location only; no sales transactions were linked to individual customers, and no personal data was used in this study. Strict data management protocols were abided to at all times, as per the N/Lab data management plan72 and the University of Nottingham’s ISO/IEC 27001 certification.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Latest news
Related news