This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the JMIRx Med, is properly cited. The complete bibliographic information, a link to the original publication on http://med.jmirx.org/, as well as this copyright and license information must be included.

Infectious disease is one of the main issues that threatens human health worldwide. The 2019 outbreak of the new coronavirus SARS-CoV-2, which causes the disease COVID-19, has become a serious global pandemic. Many attempts have been made to forecast the spread of the disease using various methods, including time series models. Among the attempts to model the pandemic, to the best of our knowledge, no studies have used the singular spectrum analysis (SSA) technique to forecast confirmed cases.

The primary objective of this paper is to construct a reliable, robust, and interpretable model for describing, decomposing, and forecasting the number of confirmed cases of COVID-19 and predicting the peak of the pandemic in Saudi Arabia.

A modified singular spectrum analysis (SSA) approach was applied for the analysis of the COVID-19 pandemic in Saudi Arabia. We proposed this approach and developed it in our previous studies regarding the separability and grouping steps in SSA, which play important roles in reconstruction and forecasting. The modified SSA approach mainly enables us to identify the number of interpretable components required for separability, signal extraction, and noise reduction. The approach was examined using different levels of simulated and real data with different structures and signal-to-noise ratios. In this study, we examined the capability of the approach to analyze COVID-19 data. We then used vector SSA to predict new data points and the peak of the pandemic in Saudi Arabia.

In the first stage, the confirmed daily cases on the first 42 days (March 02 to April 12, 2020) were used and analyzed to identify the value of the number of required eigenvalues (

Our results confirm the impressive performance of modified SSA in analyzing COVID-19 data and selecting the value of

One of the main issues that threatens human health worldwide is infectious diseases. Recently, the 2019 outbreak of the new coronavirus, SARS-CoV-2, which causes the disease known as COVID-19, has led to a global pandemic [

The number of cases and deaths from SARS-CoV-2 globally are considered to be a serious problem [

In addition, several urgent queries related to transmission dynamics, mitigation, and control measures of COVID-19 have been raised, and researchers are attempting to use mathematical modeling to answer these important questions [

There are several standard epidemiological models for modelling epidemics, such as the susceptible, infectious, recovered (SIR) model [

The primary objective of this study is the construction of a reliable, robust, and interpretable model for describing, decomposing, and forecasting the number of confirmed COVID-19 cases and predicting the peak of the pandemic in Saudi Arabia. The rate of mortality in Saudi Arabia is low, less than 1% at the time of writing this paper (May 12, 2020). Therefore, we were only interested in new daily cases of people affected by SARS-CoV-2 in an attempt to detect its peak. The number of cumulative cases was more than 40,000 as of May 12, 2020.

Because our aim was to analyze the daily data series of COVID-19, we sought to use a promising, reliable, and capable method for analyzing time series. A number of methods can be used to perform such an analysis; however, several of these methods are parametric and thus have requirements such as linearity or nonlinearity of a particular form.

An alternative method is to use nonparametric approaches that are neutral with respect to problematic areas of specification, such as linearity, stationarity, and normality [

The SSA technique is considered to be a useful tool that can be applied to solve many problems, such as smoothing; finding trends in different resolutions; simultaneous extraction of cycles with small and large periods; extraction of seasonality components; extraction of periodicities with varying amplitudes; and simultaneous extraction of complex trends and periodicities [

Although signals can be affected by internal or external noise, which often has unknown characteristics, they can be identified if the signal and noise subspaces are accurately separated. It is known that removing noise from any signal is necessary for analyzing any time series and is helpful in properly decomposing signals [

The main idea of SSA is to analyze the main series into different components, then reconstruct the noise-free series for further analysis. This process depends upon two main choices: the window length

We proposed an approach in [

The remainder of this paper is structured as follows. The Methods section gives a short description of the modified SSA approach and its algorithm. In the Results section, we show that this approach can be used to decompose synthetic data into two main distinct subspaces, and we then discuss the implementation of the approach in decomposing and reconstructing series of COVID-19 daily cases. This section also presents the forecasting of the COVID-19 pandemic in Saudi Arabia using vector singular spectrum analysis (VSSA) of the signal extracted by modified SSA. The Discussion section draws the conclusion of the paper and suggests ideas for future work.

This section presents a short description of the modified SSA used in this manuscript (for more details, refer to [

Consider a one-dimensional series _{N}_{1}, …, _{N}_{1}, …, _{K} ,_{i}_{1}, …, _{i}_{+}_{L}_{–1})^{T}^{L}

A matrix ^{T}_{i}_{1} ≥_{L}_{1}, …, _{L}

_{1}+ ⋅⋅⋅ +

_{L}

where _{i}_{i}_{i}^{th}_{F}

Fundamental to the question of eigenvalue behavior, _{i}_{1}, …, ζ_{L}_{1} ≥ ⋅⋅⋅ ζ_{L}_{1} and to understand the behavior of each eigenvalue. This helps identify the value of _{1} that would be used to select the appropriate value of

It was proved in our previous work [_{c}_{c}_{L}_{i}_{S}_{c}_{–1}, ζ_{c}

In this research, we used the third and fourth central measure moments of the distribution, which are the skewness (

Moreover, the coefficient of variation (_{i}

In addition, the Spearman correlation _{S}_{i}_{j}

where _{n}_{n}_{n}_{n}_{n},_{i}_{j}_{i,n}_{j}

These measures of difference between the eigenvalues related to the signal and noise components can specify the cutoff point of separability, namely, the number of leading SVD components that are separated from the residual. Therefore, the final cutoff point of separability between the signal and noise components obtained by the suggested measures corresponds to the rank estimation.

The eigenvalues can be split into two groups by using the above criteria; the first group corresponds to the signal, and the second corresponds to the noise component. Furthermore, the Spearman correlation _{S}_{i}_{j}_{i}_{j}_{S}_{S}_{S}

Once _{i}

where

The algorithm consisted of two main stages. The steps in the first stage used the coefficients of skewness, kurtosis, variation, and correlation to help obtain the optimal value of

The steps in Stage 1 are outlined below:

Map a one-dimensional time series _{N}_{1}, …, _{N}_{1}, …, _{K}_{i}_{i}_{i}_{+}_{L}_{–1}) ∈ ^{L}

Compute the matrix ^{T}^{T}

Decompose matrix ^{T}_{i}_{L}_{i}_{L}_{1}, …, _{L}

Simulate the original series _{i}_{i}_{i}_{i}_{–1} – _{i}_{i}_{i}_{+1}|.

Compute the skewness coefficient for each eigenvalue, _{i}_{c}_{c}_{L}

Compute the coefficient of kurtosis for each eigenvalue, _{i}_{c}

Compute the coefficient of variation, _{i}_{i}_{c}_{–1} correspond to the signal, and the remaining eigenvalues, which have an almost U shape, correspond to the noise.

Compute the absolute values of the correlation matrix between the eigenvalues and represent them in a 20-grade grey scale from white to black corresponding to the values of the correlations from 0 to 1. This matrix also splits the eigenvalues into two groups; the eigenvalues from ζ_{i}_{r}

The steps in Stage 2 are outlined below:

Calculate the approximated signal matrix _{i}_{i}

Averaging over the diagonals of the matrix

The capabilities of modified SSA using different types of synthetic data, including series generated from chaotic map systems with different SNR ratios, are presented in [

Each eigenvalue or singular value contributes to the trajectory matrix decomposition. We can consider the ratio to be the characteristic of matrix _{i}

It should be noted that using the standard criteria in basic SSA, the weighted correlation (

It was shown in [_{S}

We therefore used modified SSA—in particular, some of the proven criteria on the distribution of ζ_{i}

Below, we provide a synthetic example to show the capability of the approach before applying it to the COVID-19 data; for more examples considering different types of series and evaluations with different criteria, refer to [

In the following example, a white noise process

_{t}

_{1}+

_{2}exp(

_{2}

where _{1}=10, _{2}=0.09, and

Realization of the simulated exponential trend series.

Based on observations of the _{i}_{i}_{c}_{=3}. Therefore, the number of eigenvalues required to extract the signal is

Left: w-correlation matrix for the seven reconstructed components of the simulated series. Right: logarithms of the seven eigenvalues of the simulated series. w-correlation: weighted correlation.

Kurt of ζ_{i} for the simulated series. Kurt: kurtosis.

Left: skew of ζ_{i} for the simulated series. Right: CVs of ζ_{i} for the simulated series. CV: coefficient of variation; skew: skewness.

In addition, the Spearman correlation coefficient between ζ_{i}_{i}_{+1} was calculated; _{i}_{i}_{+1}. For the correlation coefficient, the minimum value of _{S}_{i}_{i}_{+1} was used as another indicator for the cutoff point. The results were similar to those that emerged using other criteria and confirmed that the approach works properly. Different criteria, such as root mean square error and mean absolute error, were used in [

Left: Spearman correlation of (ζ_{i},ζ(_{i+1}). Right: matrix of Spearman correlation between (ζ_{i},ζ_{j}).

The correlation matrix also enables us to distinguish and separate the different components from each other. Therefore, the correlation matrix of ζ_{i}_{i}_{j}

The daily numbers of confirmed cases of COVID-19 in Saudi Arabia [

Time series of daily confirmed COVID-19 cases in Saudi Arabia (March 2 to April 12, 2020).

Starting with the first set of COVID-19 data, as mentioned earlier, because our aim was to extract the signal as a whole, we could choose any value for

The results based on these measures in extracting the signal for forecasting gave a curve with a likely peak. However, the predictions using various other choices for

_{i}_{i}_{+1} for _{c}_{=3}, and the minimum value of _{S}_{c}_{–1=3} and ζ_{c}_{=3}. In addition, the matrix of the Spearman correlation for ζ_{i}_{j}

Coefficients of skewness (top left) and kurtosis (top right) for each eigenvalue and the correlations between ζ_{i} and ζ_{i+1} (bottom left) and the results of the matrix correlations (bottom right) for L=7.

Plot of the first time series of daily COVID-19 cases in Saudi Arabia and the fitted curve.

After obtaining the reconstructed series, the next aim was to predict the data for daily new cases from April 13 to August 2020. There are two main forecasting methods in SSA: VSSA (VSSA) and recurrent singular spectrum analysis (RSSA). The VSSA forecasting algorithm is the most widely used in SSA [

To perform SSA forecasting, the basic requirement is that the series satisfies a linear recurrent formula (LRF). The series _{N}_{1}, …, _{N}

_{t}

_{1}

_{t}

_{-1}+

_{2}

_{2}+ ⋅⋅⋅ + a

_{L–1}y

_{t}

_{–}

_{L}

_{+1},

The coefficient vector _{1}, …, _{L}_{–1} is defined as follows:

where _{j}_{j}_{j}

Consider the following matrix:

^{∇}

^{∇T}+ (1 –

^{2})

^{T}

Let us now define the linear operator:

where _{1}, …, _{r}

where _{∆} is the vector of the last _{N}_{j}

where _{1}, …, _{K}_{+}_{h}_{+}_{L}_{–1}] and performing diagonal averaging, a new series

As discussed above, the best values for reconstruction were

The 2-sided KSPA test indicated that there was no statistically significant difference between the distribution of forecast errors at a 95% confidence level (

Consequently, we also concentrated only on the best values obtained,

Plot of the entire time series of daily COVID-19 cases in Saudi Arabia and the fitted curve.

Comparison of the two forecasting scenarios with actual observations. Pred: predicted.

A modified SSA approach was used in this research for the decomposition and forecasting of COVID-19 data in Saudi Arabia. The approach was examined in our previous research and was applied here to the analysis of COVID-19 data.

In the first stage, the first 42 values of confirmed daily cases (March 2 to April 12, 2020) were used and analyzed to identify the value of

All our results confirm the impressive performance of modified SSA in analyzing the COVID-19 data and selecting the value of

In future research, we will include more data and consider different window lengths

coefficient of variation

electroencephalogram

Kolmogorov-Smirnov predictive accuracy

kurtosis

recurrent singular spectrum analysis

susceptible, infectious, recovered

skewness

signal-to-noise ratio

singular spectrum analysis

singular value decomposition

vector singular spectrum analysis

weighted correlation

None declared.