Modeling the Number of COVID-19 Confirmed Cases and Deaths in Puerto Rico: One-year Experience

Research Article

Austin Biom and Biostat. 2021; 6(1): 1038.

Modeling the Number of COVID-19 Confirmed Cases and Deaths in Puerto Rico: One-year Experience

Suarez EL¹*, Reyes JC¹, Mattei H², Lopez-Cepero A³ and Perez CM¹

¹Department of Biostatistics and Epidemiology, Graduate School of Public Health, University of Puerto Rico, San Juan, Puerto Rico

²Department of Social Sciences, Graduate School of Public Health, University of Puerto Rico, San Juan, Puerto Rico

³Department of Nutrition, Harvard T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA

*Corresponding author: Suarez EL, Department of Biostatistics and Epidemiology, Graduate school of Public Health, Medical Sciences Campus, University of Puerto Rico, P.O. Box 365067, San Juan, Puerto Rico 00936- 5067

Received: April 02, 2020; Accepted: April 23, 2021; Published: April 30, 2021


Aims: To describe and project the number of COVID-19 cases and deaths reported in Puerto Rico, according to age and sex.

Methods: We used surveillance data from March 8, 2020 to March 13, 2021 to describe and predict, by age and sex, the number of cases and deaths in Puerto Rico using Generalized Additive Models. The statistical modeling was performed in R software using the mgcv package.

Results: The analytic sample consisted of 95,208 confirmed cases and 2,080 deaths reported by the Puerto Rico Department of Health until the second week of March 2021. The risk of COVID-19 infection was highest among adults aged 20-59 years, as compared with those younger than 20 years (RR20-39 vs. <20: 2.35 [95% CI: 1.80-3.06] and (RR20-59 vs. <20: 2.30 [95% CI: 1.76-3.00]). However, the pattern in the risk of death showed an inverse relationship: the highest risk of death occurred in adults 60 years and over as compared with those younger than 60 years (RR≥80 vs. <60: 22.4 [95% CI: 18.9-26.5] and (RR60-79 vs. <60: 6.7 [95% CI: 5.6-7.9]). Although there were no significant differences in the risk of infection (p>0.1) by sex, males had a 70% (95% CI: 50-100%) greater risk of death than their female counterparts. The projected weekly number of confirmed cases of COVID-19 showed a downward trend; we expected approximately 510 confirmed cases of COVID-19 in the week ending March 27, 2021. Similarly, the projected weekly number of COVID-19 deaths showed a downward trend.

Conclusion: Future studies are needed to understand age and sex differences in COVID-19 infections and deaths. Increments in the number of COVID-19 cases in the short term are of great concern to justify more substantial preventive restrictions.

Keywords: COVID-19; Predictions; GAM; Puerto rico


COVID-19: Coronavirus Disease 19; SARS-CoV-2: Severe Acute Respiratory Syndrome Coronavirus 2; RR: Relative Risk; CI: Confidence Interval; GAM: Generalized Additive Models; EDF: Effective Degrees of Freedom


Coronavirus disease, COVID-19, is a highly infectious condition caused by the severe acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). As of March 30, 2021, the COVID-19 pandemic has resulted in over 128 million confirmed cases globally, of which nearly 2.8 million patients have died [1]. The United States of America (USA) is the world’s hardest-hit country, reaching more than 30 million COVID-19 cases and more than 550,000 deaths. In Puerto Rico, an unincorporated territory of the USA, the first confirmed cases of COVID-19 were identified between March 9-13, 2020 [2]. As of March 16, 2021, 95,330 confirmed cases and 2,085 deaths had been reported by the Puerto Rico Department of Health [3]. This epidemic has worsened the ongoing crises (i.e., financial crisis and aftermath of Hurricanes Irma and María and ongoing earthquakes) by creating parallel pandemics that exacerbate socioeconomic standing and residents' health [4]. Poverty may also worsen these challenges, with rates exceeding 43% [5] and debilitating the healthcare infrastructure [6]. Moreover, individuals residing in Puerto Rico have a significant burden of chronic conditions (i.e., obesity and type 2 diabetes) [7,8], putting them at higher risk of COVID-19 complications.

The exponential growth of COVID-19 cases, which fluctuates according to government restrictions, warrants the need to assessing its real burden of COVID-19. This information is critical to ensure adequate medical care and public health resources and mitigate adverse disease outcomes. The high numbers of COVID-19 cases threaten to disrupt healthcare systems further, resulting in severe reductions in health service delivery to detect other diseases [9-11]. Statistical modeling represents an excellent tool to assess the number of future COVID-19 cases, spatial and temporal dynamics of the infection, and evaluate public health interventions' effects. Generalized Additive Models (GAM) can model complex non-linear relationships when there are many predictors. Thus, GAMs are ideal for studying COVID-19 cases and deaths to control the observed fluctuations due to government restrictions at different time points. Using surveillance data collected in Puerto Rico, the present study's objective was to project the number of COVID-19 cases and deaths at two-weeks and estimate the risks of infection and death by age and sex in this vulnerable population.

Materials and Methods

The data on COVID-19 cases and deaths were extracted from the Puerto Rico Health Department Surveillance System, including age, sex, and municipality of residence [12]. People who were not residents in Puerto Rico at the time of infection or death or who had incomplete demographic data were excluded from analyses. The historical period to describe and project the weekly confirmed COVID-19 cases and deaths started on the week ending March 14, 2020 (week 11) and finished on week ending March 13, 2021 (week 63). The data was summarized by epidemiologic week that begins on a Sunday and ends on a Saturday.

To reach the aims of this study, we initially assumed that the weekly number of confirmed COVID-19 cases followed a Poisson distribution [13]; thus, the basic model for the projections was based on the following expression:

log(µi) = log(Pi) + Sexj + Agek + f(t)


μi indicates the expected number of confirmed COVID-19 cases in the i-th week.

Pi indicates the estimated population in the i-th week.

log(Pi) is an offset parameter.

Sexj indicates the sex effect when males are compared to females.

Agek indicates the age effect when k-th group is compared with the reference age group.

f(t) indicates a function of t (unit of time measured in weeks), which is a nonparametric function.

To obtain weekly population estimates for the analysis period, we interpolated the annual population estimates by age, sex, and municipality produced by the US Census (2020) for Puerto Rico with a cubic Hermite piecewise interpolation using the STATA command pchipolate [14]. Since the most recent estimate available was for July 1, 2019, we projected the total population to July 1, 2021, assuming the average population change observed two years before Hurricane María (2015 to 2017) would resume in 2019. The age and sex projected populations were obtained by maintaining their distributions and applying them to the projected population.

The purpose of f(t) in GAM is to control the trend and periodicity of a characteristic of interest (seasonality). GAM is a type of regression spline because it could divide the predictors' range of values into different regions, which are identified by different cut points called knots. The parameter estimation in GAM is based on minimizing the following expression:

λ is a non-constant or tuning parameter.

When using GAM, the assumption of linearity between predictors and response variable is relaxed. Categorical predictors, interactions terms, and probability distributions other than normal can be used. Additionally, mixed approaches can include autocorrelation estimates or hierarchical sampling structures. The definition of different time points determines the flexibility of the fit with GAM (identified by t*) under the following function:

where m+1 is called the order of the spline (the default value is m=2). The function Bjm (ti) is identified as B-spline and is a recursive function defined as follows:

βj is the coefficient for the jth knot that increases or decreases the function Bjm (t).

To choose the model that best fits the data, the Generalized Cross-Validation (GCV) score is computed by comparing the observed and the expected values as follows [15]:

where A is the influence matrix and  denotes the prediction of μi obtained from the model fitted to all data except for the observation yi. A better model is the one with the smallest νg.

To estimate the expected number of confirmed cases of COVID-19, we fitted the historical data of confirmed COVID-19 cases with GAM, including age and sex as categorical predictors, and the smooth term for weeks by age and sex; in addition, the natural log of the mid-year population estimates for 2020 by age and sex was included as an offset variable. Based on our a priori hypothesis that the expected number of confirmed COVID-19 cases and deaths vary by age and sex, we evaluated the age-sex interaction term. Since this interaction term was not statistically significant (p>0.05) based upon a likelihood ratio test, we did not include the interaction term between age and sex in our final models.

We used the function gam with the library mgcv in the R software to fit this model, combining several options to obtain the best fit. We used the option of corrAR1 to account for temporal autocorrelations. We used the option penalized-spline (ps), which constructs a base-model with a subsample of the original database and then uses the base-model to fit the final model. We determined the number of knots based on the lowest values of the GCV using the options k=-1 available in the gam function. For the risk of infection, the quasipoisson option in the probability distribution of the outcome was used due to the observed overdispersion after examining the closeness of the model's deviance to the degrees of freedom; however, for mortality risk, we used the option poisson because no overdispersion was observed. The model's variance was estimated using the Restricted Maximum Likelihood as a bias-reducing alternative to maximum likelihood. The previous modeling procedures were also performed for COVID-19 deaths. The Akaike Information Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC) were also used to compare and assess the models' fitness performance. The gam.check function in R was used to generate standard diagnostic plots (Q-Q plots and scatter plots) to examine the residuals' distribution. Running GAM in the library mgcv provides an estimation of the Effective Degrees of Freedom (EDF), according to the stratification required (e.g., age group and sex categories), to have an approximation of the overall trend of the data using a polynomial model [14].

To describe the magnitude of the association between the demographic characteristics (age and sex) and COVID-19 infection in Puerto Rico, we estimated the age-specific rates as follows [15]:

The annual population estimates for Puerto Rico were obtained from the Population Division of the US Census Bureau [16]. Then, we compared the rates of COVID-19 infection in different age groups, estimating the Relative Risk (RR) with 95% confidence intervals based on the parameters’ estimation of GAM, as follows:

where se(Agek) indicates the standard error of the age-effect under GAM.

A RRj vs k > 1.0 indicates that the risk of infection in in the jth age group is greater than the kth age group's risk. A indicates that the risk of infection in the jth age group is lower than in the kth age group.

A similar procedure was performed to compare the risk of death due to COVID-19 by demographic characteristics. The reference categories for modeling COVID-19 risks of infection and death were age groups <20 years and ≤60 years, respectively. Afterward, based on the same model, the weekly number of infected individuals and fatalities due to COVID-19 was projected. All statistical modeling was performed in R software version 2.14.1 [17], and the graphs were made in Stata (v. 16).