Peer-Review Report by Ross Gore (Reviewer AM): https://med.jmirx.org/2020/1/e25572
Author Responses to Peer-Review Reports: https://med.jmirx.org/2020/1/e25573
Background: Pandemics including COVID-19 have disproportionately affected socioeconomically vulnerable populations.
Objective: Our objective was to create a repeatable modeling process to identify regional population centers with pandemic vulnerability.
Methods: Using readily available COVID-19 and socioeconomic variable data sets, we applied stepwise linear regression techniques to build predictive models during the early days of the COVID-19 pandemic. The models were validated later in the pandemic timeline using actual COVID-19 mortality rates in high population density states. Sample sizes ranged from 8 counties (Connecticut) to 82 counties (Michigan), with a mean of 43.
Results: The New York, New Jersey, Connecticut, Massachusetts, Louisiana, Michigan, and Pennsylvania models provided the strongest predictions of top counties in densely populated states with a high likelihood of disproportionate COVID-19 mortality rates. For all of these models, P values were less than .05.
Conclusions: The models have been shared with the Department of Health Commissioner of each state with strong model predictions, as input into a much needed “pandemic playbook” for local health care agencies allocating medical testing and treatment resources. We have also confirmed the utility of our models with pharmaceutical companies for use in decisions pertaining to vaccine trial and distribution locations.
Socioeconomic vulnerability can directly influence the severity of pandemics and their impact on mortalities in ways like access to health care, household overcrowding, and comorbidities. Prior studies of swine flu (H1N1) have pointed to these factors as contributors to the spread and severity of that pandemic . Other studies have identified national level correlations that are helpful, but not actionable at a local level where actual health care resource allocation decisions are made [ ].
Early and accurate decision-making for health care resource allocation is particularly critical in geographic locations with high population density. This research sought to create a repeatable modeling process that uses readily available data sources to identify the top counties in densely populated states with a high likelihood of disproportionate COVID-19 mortality rates.
Stepwise linear regression was used as the modeling technique. Similar epidemiological research has also used the stepwise linear regression approach, including Thomson et al’s 2006 research on environmental models to predict meningitis epidemics in Africa; Chung et al’s [ ] 2012 study of the West Nile encephalitis epidemic in Dallas, Texas; Fulton et al’s [ ] 2019 predictive models for hospital-based back surgery demand; and Yu et al’s [ ] 2005 study on SARS (severe acute respiratory syndrome).
Our objective was to create a repeatable modeling process to identify regional population centers with pandemic vulnerability.
Exploratory data research at a national level was performed using county level data (Federal Information Processing Standards [FIPS] for county identification). COVID-19 mortality data sets (deaths per 100,000 people) were created using the Johns Hopkins Dataset  and data from the US Census Bureau [ ]. Socioeconomic vulnerability data sets at the county level were created using subcomponents of the Centers for Disease Control and Prevention’s (CDC) social vulnerability index (SVI) [ ]. The full list of subcomponents can be found in .
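The dependent variable used throughout is cumulative deaths per 100,000 residents. As a minimal sketch of that calculation (the FIPS codes and counts below are hypothetical placeholders, not values from the study's data sets):

```python
def deaths_per_100k(deaths, population):
    """Cumulative deaths per 100,000 residents for one county."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * 100_000 / population

# Hypothetical county records keyed by FIPS code (illustrative values only).
counties = {
    "00001": {"deaths": 50, "population": 200_000},
    "00002": {"deaths": 12, "population": 30_000},
}

# Mortality rate per county, ready to be joined to socioeconomic variables by FIPS code.
rates = {
    fips: deaths_per_100k(rec["deaths"], rec["population"])
    for fips, rec in counties.items()
}
print(rates)
```

Normalizing by population in this way is what allows a small county with few deaths to rank above a large county with many.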
Scatterplots and trendlines were used to identify variables most correlated with COVID-19 mortalities (see samples in). Few, if any, social vulnerability variables correlated across all of the 3142 FIPS counties, but minority status correlated strongly in certain regions, particularly those with high mortality rates. These initial findings led the author to focus the next phase of research and modeling on state level rather than national level correlations. County-level dependent variable data sets were created using COVID-19 mortality and population data from the Corona Data Scraper website (data service that scrapes county level COVID-19 data on a daily basis) [ ] as well as from USAFacts [ ] for cumulative mortalities as of April 8, 2020, and May 8, 2020, respectively.
County-level independent variable data sets were created using socioeconomic data from the County Health Rankings website, a collaboration data set created by the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute . Given the year to year stability of most of these socioeconomic variables, the latest data available from the County Health Rankings website was used, and no attempt was made to augment the data set to try to match the time-series to either April 8, 2020, or May 8, 2020. The full list of independent variables available in this data set can be found in .
Cumulative COVID-19–specific mortality data (deaths per 100,000 people) by county for states with a high mortality rate (New York, New Jersey, Connecticut, Massachusetts, Louisiana, Michigan, and Pennsylvania) as of May 8 was used as the dependent variable . The full ranking of states by mortality rate can be found in .
The May 8 date was used to ensure that the dependent variable would be tuned to a timeframe at or around the peak in daily mortalities, when health care resources (testing, treatment, and tracing) were typically most needed. Mortality curves from the Institute for Health Metrics and Evaluation data set [ ] provide support for May 8 as the overall date for mortality predictions.
A stepwise linear regression technique was used to build each state level model. All relevant independent variables were initially used in the model (ie, include severe housing problems, but exclude violent deaths). Next, the variable with the lowest T-statistic was removed and the linear regression was rerun. This process was repeated until all T-statistics for the remaining independent variables were near a value of 2 or greater.
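The elimination loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the author's actual code: it uses NumPy, compares the absolute value of each T-statistic against a fixed cutoff of 2.0 (the text says "near a value of 2 or greater"), never removes the intercept, and runs on synthetic data with hypothetical variable names.

```python
import numpy as np

def ols_t_stats(X, y):
    """Fit OLS with an intercept and return (coefficients, t-statistics)."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]                  # observations minus parameters
    sigma2 = resid @ resid / dof                   # residual variance estimate
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta, beta / std_err

def backward_eliminate(X, y, names, t_min=2.0):
    """Drop the weakest variable and refit until every |t| is at least t_min."""
    cols = list(range(X.shape[1]))
    while cols:
        _, t = ols_t_stats(X[:, cols], y)
        abs_t = np.abs(t[1:])                      # skip the intercept's t-statistic
        weakest = int(np.argmin(abs_t))
        if abs_t[weakest] >= t_min:
            break                                  # all remaining variables pass the cutoff
        del cols[weakest]                          # remove the weakest variable, then refit
    return [names[c] for c in cols]

# Synthetic example: two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)
kept = backward_eliminate(X, y, ["severe_housing", "uninsured", "noise"])
print(kept)
```

Because each state's model is fit to a small number of counties (as few as 8), the degrees of freedom in `ols_t_stats` shrink quickly as variables are added, which is one practical reason to start from a curated list of relevant variables.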
Predictive models were completed for New York, New Jersey, Connecticut, Massachusetts, Louisiana, Michigan, and Pennsylvania, with statistically significant results. The final list of model variables, coefficients, variable correlations, sample sizes, and P values can be found in.
Models were validated by comparing predicted to actual county rankings by cumulative mortality rate (deaths per 100,000 people) through May 8 (see the two tables below for New York and New Jersey), and data visualizations using state-level maps were completed ( ). These validations were used to share model methods and results with each state’s Department of Health Commissioner and, where appropriate, with an outside agency.
|New York county||Deaths per 100,000 people (5/8 actuals), n||Deaths per 100,000 people (5/8 model), n|
|New Jersey county||Deaths per 100,000 people (5/8 actuals), n||Deaths per 100,000 people (5/8 model), n|
aIndicates low mortality rate relative to all other counties within the state.
Four further validations were completed. The first validation was to check model performance using COVID-19 mortality data on April 8, 2020, instead of May 8, 2020. This validation tested whether models using data available early in the pandemic would have been sufficient to make accurate predictions. The same variables were used, but coefficients were recalibrated with the April 8 data set. The April 8 and May 8 model outputs were compared to test for stability in the top counties predicted for high COVID-19 mortality rates. For New York, New Jersey, and Connecticut, the models proved to be stable. For Massachusetts, the April 8 model performance was not stable, but this was easily corrected by using case data in place of mortality data. This is an important finding as it validates the predictive power contained in early case data, which is more readily available at the start of a pandemic. The New York and Massachusetts model validation result summaries can be found in.
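One simple way to quantify the stability of the top counties between two model runs is the share of counties that appear in both top-k lists. The paper does not state the exact stability criterion used, so the metric below is an assumption, and the county labels and rates are hypothetical.

```python
def top_k(rates, k):
    """Return the k county names with the highest rate, highest first."""
    return sorted(rates, key=rates.get, reverse=True)[:k]

def top_k_overlap(rates_a, rates_b, k):
    """Fraction of the top-k counties under rates_a that are also top-k under rates_b."""
    shared = set(top_k(rates_a, k)) & set(top_k(rates_b, k))
    return len(shared) / k

# Hypothetical predicted deaths per 100,000 from two model runs.
april_model = {"A": 30.0, "B": 22.0, "C": 8.0, "D": 3.0}
may_model = {"A": 95.0, "B": 60.0, "C": 20.0, "D": 25.0}

print(top_k(may_model, 3))
print(top_k_overlap(april_model, may_model, 2))
print(top_k_overlap(april_model, may_model, 3))
```

The same comparison applies to predicted versus actual rankings, since the stated goal is to identify the top counties rather than to match exact mortality values.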
The second validation was to check model performance beyond May 8, 2020 (ie, using the May 8 model to predict July 31 mortalities). Results were less stable because most states began reopening in mid-May, creating differential effects by county. However, the models for New York, New Jersey, Connecticut, and Massachusetts continued to identify the counties with the highest cumulative mortality rates.
The third validation was to check the independent variables for multicollinearity, with a specific focus on the correlations between “Black” and variables such as “severe housing” and “uninsured.” Strong multicollinearity was seen in Connecticut, Massachusetts, Louisiana, and Michigan, partially explaining why these variables did not remain in the model after the stepwise regression process. Multicollinearity results for these states are presented in. Future models could consider composite variables to address this multicollinearity and to maintain combined effects such as “Black,” “severe housing,” and “uninsured.”
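A basic multicollinearity screen of this kind can be sketched as a pairwise Pearson correlation check. The 0.8 cutoff, the column names, and the values below are illustrative assumptions; the paper does not state which threshold or diagnostic was applied.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for every pair of columns with |r| >= threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical county-level percentages for three independent variables.
columns = {
    "black": [5, 10, 20, 40],
    "severe_housing": [10, 15, 22, 38],
    "uninsured": [7, 3, 8, 4],
}

flagged = correlated_pairs(columns)
print(flagged)
```

When two variables are flagged together, backward elimination tends to keep only one of them, which is consistent with the observation above that some variables dropped out of the final models.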
The fourth validation leveraged an out-of-sample methodology. For New York, New Jersey, Connecticut, and Massachusetts, only one half of the data points (ie, half of the counties in each state) were used to build the model. Coefficients were recalibrated and variables were removed if T-statistics fell below 2. In each state, the out-of-sample model continued to identify the top counties for cumulative mortality rates through May 8.
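The out-of-sample check can be illustrated with a sketch that fits a linear model on a random half of the counties and then ranks all counties with it. The random split, the top-5 comparison, and the synthetic data are assumptions made for illustration; the paper does not specify how the halves were chosen or how agreement was scored.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Apply fitted coefficients to a matrix of independent variables."""
    return beta[0] + X @ beta[1:]

# Synthetic "state" with 40 counties and two socioeconomic variables.
rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 2))
y = 50 + 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Build the model on half of the counties only.
idx = rng.permutation(n)
train, holdout = idx[: n // 2], idx[n // 2:]
beta = fit_ols(X[train], y[train])

# Rank all counties with the half-sample model and compare the top 5 to actuals.
pred = predict(beta, X)
top_pred = set(np.argsort(pred)[-5:].tolist())
top_actual = set(np.argsort(y)[-5:].tolist())
overlap = len(top_pred & top_actual)
print(beta, overlap)
```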
With models and validations completed, the Departments of Health for New York, New Jersey, Connecticut, Massachusetts, Louisiana, Michigan, and Pennsylvania were contacted (see the table of contacted agencies below). Additionally, agents of the New York Department of Health (Northwell Health and CORE), a third-party statistical modeling firm for Connecticut (COVIDACTNOW), and a third-party modeling firm for Pennsylvania (Mathematica) were contacted. Responses from these contacts were positive and, in some cases, occurred within an hour of outreach (Northwell Health). This response indicates the strong need for this type of health care resource allocation tool for pandemics and other health crises. In fact, the Pennsylvania Department of Health indicated this tool’s importance in a second wave of COVID-19.
The final data sets for our New York, New Jersey, Connecticut, Massachusetts, Louisiana, Michigan, and Pennsylvania models are contained in.
|DOH or outside agency||Models shared||Receipt accepted||Zoom session|
|New York DOH||✓||✓|
|Northwell Health (New York)||✓||✓|
|CORE (New York)||✓|
|New Jersey DOH||✓||✓|
The research described in this paper has shown the extent to which Black Americans, people living in crowded housing units, and households with less access to health care are at higher risk of severe illness during a pandemic. The results of other studies provide several possible explanations for these findings. Individuals with lower income are more likely to live in crowded housing units and multifamily homes . Lower income households are also structurally disadvantaged in their access to medical insurance and health care [ ]. Some studies have also pointed to a concept called “weathering” within the Black American community. Arline Geronimus, a professor of public health at the University of Michigan, showed through her research that among Black communities, coping with financial strain, discrimination, and barriers to good education elevates the stress response, contributing to obesity, diabetes, hypertension, and heart disease [ ].
The CDC conducted research on the co-occurrence of COVID-19 and certain ethnicities using a sample of 580 patients with lab-confirmed data . Their results showed more hospitalizations in Black patients than White patients. The researchers cited underlying medical conditions, work circumstances, and living conditions as major factors in COVID-19 mortalities in their sample. For example, members of racial and ethnic minorities were more likely to live in densely populated areas, which made social distancing more difficult and left them more susceptible to contracting and spreading COVID-19. They also lived farther away from grocery stores and medical facilities and were thus less able to receive necessary resources and medical attention. Other examples cited included Hispanic and Black American workers being employed in higher-risk industries and often lacking paid sick leave. The researchers also hypothesized that these types of workers were more likely to continue working despite being sick, thus exposing other workers to the disease. The CDC recommended at the conclusion of this research that public health officials communicate to different population groups about COVID-19 and provide more health care services to ethnic minority groups.
While such studies are insightful, we are not aware of any research that translates these impacts into predictive models that can be used to direct local health care resources to the communities most likely to need them, thereby reducing mortalities caused by an ongoing set of institutional inequities.
That said, some organizations have attempted to create health care resource allocation methods using descriptive statistics. The CDC created the SVI, allowing health care communities to see which factors contribute to socioeconomic vulnerability. The CDC SVI factors are grouped into four categories: Socioeconomic Status, Household Composition & Disability, Minority Status & Language, and Housing Type & Transportation. Although these factors are crucial inputs in identifying specific vulnerable communities, these four categories alone are not sufficient to create the types of predictive models that state and local health care agencies can use. One example of this insufficiency is a recent study at Emory University, where researchers identified correlations between COVID-19 mortalities and the SVI at a national level in the United States . While this study is valuable, it stopped short of recommending methods or processes to effectively distribute health care resources to specific counties in the United States, particularly in the early days of the pandemic. Another study from the Surgo Foundation stated that COVID-19 created new challenges for many communities tied to health and structural factors that were not completely captured by the CDC SVI [ ]. In addition to the four categories of socioeconomic factors provided by the CDC, the Surgo Foundation added two more: Epidemiologic Factors and Healthcare System Factors. They stated that underlying health conditions in addition to health care system factors have been proven to greatly increase a community’s vulnerability during a pandemic. The Surgo Foundation combined these two factors with the CDC SVI to create the COVID-19 Community Vulnerability Index (CCVI). The Surgo Foundation created heatmaps to show retrospectively which counties were most vulnerable as measured by their CCVI.
Similar to the Emory University study, however, this methodology did not create a predictive model to identify where the mortalities would be highest at peak periods in a pandemic. We also compared the Surgo Foundation heatmap to our own predictive model rankings and confirmed that our projections of the top counties by per capita mortalities were far closer to actual peaks. Results of this comparison are shown in .
During the early phases of the research described in this paper, we used the CDC’s SVI, similar to the Emory University study. We explored all of the subcategory factors in the SVI, but none showed strong correlations at a FIPS county level across the United States. We then grouped states together by region and found strong correlations in the most densely populated regions, particularly with the minority status subfactors. Similar to the Surgo Foundation study, we posited that the CDC’s SVI subfactors alone would be insufficient to build predictive models, so a far more complete independent variable data set of socioeconomic and health care data was sourced from the County Health Rankings website as discussed earlier. The key differentiator of our work is that a separate predictive model is built for each state, given the local differences in how each factor acts as a predictor of peak COVID-19 per capita mortalities. Our experience in meeting with the Pennsylvania Department of Health in early June 2020 confirms the uniqueness and usefulness of the approach given that they will be using the predictive modeling process for health care resource allocations in the event of a potential second wave of COVID-19 in Fall 2020. Other state-level departments of health and agents of these governmental functions were similarly intrigued by our predictive approach, including those in New York, New Jersey, and Connecticut. Finally, the Chief Information Officer of Johnson & Johnson has forwarded the author’s research to J&J’s Health and Human Services Group for possible use in decisions pertaining to vaccine trial and distribution locations.
A number of limitations must be acknowledged. The models employed in the analyses are reliant on the accuracy of the data sets compiled. COVID-19 mortality data in particular has been notoriously difficult for states to report accurately at a county level throughout the pandemic for reasons including mortality cause classification errors at the offices of the local coroner . This systemic undercounting could have created some correlations between our variables and reporting errors. That said, if reporting errors are similar across counties within a state, then these unwanted effects on our models are likely to be small since we created a different model for each state.
The models are only valid within the range of county-level data for the following states: New York, New Jersey, Connecticut, Massachusetts, Louisiana, Michigan, and Pennsylvania. Models would need to be rebuilt and validated for each additional state. In addition, models for states with lower levels of per capita mortalities were far less predictive, although health care resource allocations would be less critical in those regions.
In addition, we would note that while we did not find significant correlations between the subcomponents of the CDC SVI and COVID-19 mortalities, other data sets at the FIPS level (eg, American Community Survey data and census data) might yield different results. For example, there may be variables that are correlated with one another that also correlate with COVID-19 mortalities. Gore et al  showed the difficulties in teasing apart specific population demographic measures at a granular level into a linear regression model since so many of these variables correlate highly with one another.
Our modeling process can be used for the early identification of the communities most in need of health care resources during future pandemics or health crises. The COVID-19 models and the overall methodology have been received with enthusiasm by the New York, New Jersey, Connecticut, Massachusetts, Louisiana, Michigan, and Pennsylvania Departments of Health. All Commissioners responded positively, which validates our hypothesis that neither other researchers nor the Departments of Health themselves have developed a similar modeling process as a means to allocate scarce health care resources such as testing, treatment, tracing, education, and communication. We have also confirmed the utility of our models with pharmaceutical companies for use in vaccine trial and distribution location decisions.
Future modeling processes could also include building hierarchical models to improve county rankings by better accounting for effects of clustered variables similar to Fulton et al  in their 2019 models for predicting hospital-based back surgery by geography. The gradient boosting approach leveraged by Fulton et al [ ] may also be useful to examine states with lower population densities where our stepwise linear regression models proved to be weaker. Finally, the group-personalized regression approach pioneered by Palmius et al [ ] in their 2018 models for predicting mental health scores by group rather than for an overall population could also be explored.
Other research papers evaluated during the course of this research and other data sets referenced in this research can be found in the supplemental DOCX file.
The author wishes to acknowledge Dr Fathima Wakeel, Associate Professor at the Lehigh University College of Health, for her mentorship throughout this project and for her review of this manuscript. Dr Wakeel’s guidance was instrumental, particularly in directing the research toward a state level and in driving to actionable recommendations for state Departments of Health. The author also wishes to acknowledge Iwao Fusillo, Global Head of Data & Analytics for the National Football League and the author’s father, for helping her learn various techniques in data set creation, statistical analysis, and data visualization.
Conflicts of Interest
Supplemental materials. PDF File (Adobe PDF File), 1367 KB
Other data sets referenced in this research and research papers evaluated during the course of this research. DOCX File, 18 KB
- Quinn SC, Kumar S, Freimuth VS, Musa D, Casteneda-Angarita N, Kidwell K. Racial disparities in exposure, susceptibility, and access to health care in the US H1N1 influenza pandemic. Am J Public Health 2011 Feb;101(2):285-293
- Nayak A, Islam SJ, Mehta A, Ko YA, Patel SA, Goyal A, et al. Impact of Social Vulnerability on COVID-19 Incidence and Outcomes in the United States. medRxiv. Preprint posted online April 17, 2020
- Thomson MC, Molesworth AM, Djingarey MH, Yameogo KR, Belanger F, Cuevas LE. Potential of environmental models to predict meningitis epidemics in Africa. Trop Med Int Health 2006 Jun;11(6):781-788
- Chung WM, Buseman CM, Joyner SN, Hughes SM, Fomby TB, Luby JP, et al. The 2012 West Nile encephalitis epidemic in Dallas, Texas. JAMA 2013 Jul 17;310(3):297-307
- Fulton L, Kruse CS. Hospital-Based Back Surgery: Geospatial-Temporal, Explanatory, and Predictive Models. J Med Internet Res 2019 Oct 29;21(10):e14609
- Yu HYR, Ho SC, So KFE, Lo YL. The psychological burden experienced by Hong Kong midlife women during the SARS epidemic. Stress and Health 2005 Aug;21(3):177-184
- Johns Hopkins Coronavirus Resource Center. COVID-19 Map. URL: https://coronavirus.jhu.edu/map.html [accessed 2020-08-10]
- County Population Totals: 2010-2019. US Census Bureau. 2020 Jun 22. URL: https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-total.html [accessed 2020-08-10]
- Centers for Disease Control and Prevention. CDC's Social Vulnerability Index (SVI). URL: https://svi.cdc.gov/data-and-tools-download.html [accessed 2020-08-10]
- Corona Data Scraper. Davis L. URL: https://coronadatascraper.com/ [accessed 2020-08-10]
- US Coronavirus Cases and Deaths. USAFacts. 2020 Aug 7. URL: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/ [accessed 2020-08-10]
- How Healthy is your County? County Health Rankings. URL: https://www.countyhealthrankings.org/ [accessed 2020-08-10]
- Elflein J. Coronavirus deaths seven-day average by country. Statista. 2020 Jul 22. URL: https://www.statista.com/statistics/1111867/trailing-seven-day-average-number-of-covid-19-deaths-select-countries-worldwide/ [accessed 2020-08-10]
- COVID-19 Estimation Updates. Institute for Health Metrics and Evaluation. 2020 Aug 10. URL: http://www.healthdata.org/covid/updates [accessed 2020-08-10]
- Airgood-Obrycki W, Molinsky J. Estimating the Gap in Affordable and Available Rental Units for Families. Joint Center for Housing Studies, Harvard University. 2019 Apr. URL: https://www.jchs.harvard.edu/research-areas/working-papers/estimating-gap-affordable-and-available-rental-units-families [accessed 2020-08-10]
- Amadeo K. Health Care Inequality in America. The Balance. 2020 May 11. URL: https://www.thebalance.com/health-care-inequality-facts-types-effect-solution-4174842 [accessed 2020-08-10]
- Adamy J. Coronavirus, Economic Toll Threaten to Worsen Black Mortality Rates. The Wall Street Journal. 2020 Jun 13. URL: https://on.wsj.com/2I7YKe6 [accessed 2020-08-10]
- Health Equity Considerations and Racial and Ethnic Minority Groups. Centers for Disease Control and Prevention. 2020 Jul 24. URL: https://covid19.nhc.org/cdc-covid-19-in-racial-and-ethnic-minority-groups/ [accessed 2020-08-10]
- Bringing Greater Precision to the COVID-19 Response. Surgo Foundation. URL: https://precisionforcovid.org [accessed 2020-08-10]
- Ambrosio A. 'Probable' COVID-19 Death Reporting Varies by State. MedPage Today. 2020 Apr 23. URL: https://www.medpagetoday.com/infectiousdisease/covid19/86127 [accessed 2020-08-10]
- Gore R, Diallo S, Padilla J. You Are What You Tweet: Connecting the Geographic Variation in America's Obesity Rate to Twitter Content. PLoS One 2015;10(9):e0133505
- Palmius N, Saunders KEA, Carr O, Geddes JR, Goodwin GM, De Vos M. Group-Personalized Regression Models for Predicting Mental Health Scores From Objective Mobile Phone Data Streams: Observational Study. J Med Internet Res 2018 Oct 22;20(10):e10194
|CCVI: COVID-19 Community Vulnerability Index|
|CDC: Centers for Disease Control and Prevention|
|FIPS: Federal Information Processing Standards|
|SARS: severe acute respiratory syndrome|
|SVI: social vulnerability index|
Edited by G Eysenbach, E Meinert; submitted 13.07.20; peer-reviewed by R Gore; comments to author 04.08.20; revised version received 10.08.20; accepted 04.11.20; published 02.12.20
Copyright
©Tara Fusillo. Originally published in JMIRx Med (https://med.jmirx.org), 02.12.2020.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://med.jmirx.org/, as well as this copyright and license information must be included.