Forecasting Indonesia Covid-19 Up To Q1 2022 using Time Series Model (ARIMA, SARIMA, Facebook Prophet)
I have one question in mind when seeing Covid had surging back in last June to August 2021 and having recovered pretty well in the last 2 months, my question was : ‘Are we safe enough to take domestic trips in Indonesia at least until the end of this year?’
So with this work, I am taking initiative to do forecasting of Indonesia Covid active cases in 2021 and early 2022. The dataset has been used from KawalCovid19.id in xls format, the date range is from 3rd March 2020 to 5th October 2021. ARIMA (Auto Regressive Integrated Moving Average) as one of time series modelling which relies heavily on statistical models will be widely explored in this work. As a forecast highlight comparison, SARIMA (ARIMA with Seasonality) and FB Prophet also be quickly introduced at the end of this passage.
Data Preprocessing
Like any other machine learning modelling, we need to ensure dataset reliability including no missing single point data, NaN value, right date range, no duplicates, and right data type (date-range, and float/int for values).
In this work, since the aim is to predict future active cases for next 6 months we only use active case (‘Kasus aktif’) and create pandas date range from 3/3/2020 to 10/5/2021 (3rd March 2020 to 5th October 2021).
Next we would like to see whether duplicated, missing date-range, or missing data were introduced. As a result there’ no duplicated data that do not need to be worried about after inspection, and no missing data/date range are found.
As subsequence work, we need to remove all commas in active case number and change into numeric (float/int) type so that we can utilise this data further.
Exploratory Data Analysis
It is always important to see the run-sequence-plot of data series before doing any modelling, so here is the series plot of Covid active case from April 2020 to early of Oct 2021. We can see in July 2021, the number of active cases started to spike and become all time high since covid hit the country. Interestingly in September the active case started to decline into stable phase we can see in October 2021 (thank God!).
For seasonal decomposition, we can directly use seasonal_decompose from statsmodel to know trend, seasonality and residual. In this dataset we have trend, and we can see random residual throughout year (this data series fits better in multiplication model), as shown in below plot:
If we also wrangle a bit using matplotlib boxplot we found that throughout years (2020 and 2021), July and August become the worst month of active case.
ARIMA Model
ARIMA (AutoRegressive Integrated Moving Average) is one of the forecasting methods for univariate time series data. It is actually a class of models that ‘explains’ a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that equation can be used to forecast f.
Any ‘non-seasonal’ time series that exhibits patterns and is not a random white noise can be modeled with ARIMA models. An ARIMA model is characterized by 3 terms: p, d, q, where:
- p is the order of the AR term — get from PACF
- q is the order of the MA term — get from ACF
- d is the number of differencing required to make the time series stationary
Now let’s take a look at our Covid active cases dataset.
To decide every order of those parameters, we can see directly from ACF/PACF plotting, or we can use Auto-Arima model.
The data series being used has a weak stationary as there are two evidence:
First: P-value of the ADFueller test shows less than 0,05 hence we can reject the null hypothesis of non-stationary series. However there are not constant series in rolling mean and in variance as below plot shows, so that this series be considered as weak stationary.
Next we need to define the order of each parameter p,d, and q.
As mentioned before, p is AR terms which have been derived from PACF of stationary series. Now let’s see the ACF & PACF of original series:
From the ACF and PACF above we do not see clear seasonality.
Meanwhile, high autocorrelation in ACF means we need to differencing so that it becomes stationary. After doing two order differencing we get ACF & PACF plot below:
With differing order (d)=2 we can see the ACF is no longer high values at certain period of lags, but with ACF directly went below 0 which indicates over-differencing, so we go tentatively with diff=1. We get:
Now after we choose d=1, p (order of significant PACF) = 1, and since the ACF has doubt in choosing the order we go with the simple one i=2. Hence p,d,q = (1,1,2).
Next thing to do is we need to check statistically whether the orders are appropriate in ARIMA terms using time series analysis library in statsmodel:
Choosing order p=1, and q=2 seems fine here, since P>|z| should ideally be less than 0.05 for the respective X to be significant. From the model with order above we get the residual density in normal distribution, which is good.
To ensure the model is statistically significant in the whole dataset, we need to do Cross Validation, method used is by fitting 85% as train set, and 15% as test set. Means we only train the 85% of the whole dataset, to forecast the rest 15% which has not trained before so that we could know whether the model will work.
Whoops! Does not feel right, isn’t it? The forecast seems to continually rise up, however actual is the contrast.
What to do? We need to re-create our p,d,q order by lowering down the differencing to d = 0 and try combinations of p and q order. Worry no more, we have AutoArima which work well with this!
After run the AutoArima, the algorithm find the best combination of p and q order to perform the lowest AIC score, as result shows below:
One thing to consider, always check the P|z| score which indicates statistical significant score from actual case, which should be less than 0.05. For this order as suggested by AutoArima algorithm, (3,0,2) seems really fine all close to zero.
Now let us check the time series plot of order (3,0,2):
Much better! Looks like the forecast now is heading to the right direction and seems just lagged from actual cases, even though still the actual case does not recover much in statistical significant bounds.
When we see the diagnostics of model, we get:
- Top left: The residual errors seem to fluctuate around a mean of zero.
- Top Right: The density plot suggests normal distribution with mean zero.
- Bottom left: All the dots should fall perfectly in line with the red line. Any significant deviations would imply the distribution is skewed.
- Bottom Right: The Correlogram, aka, ACF plot shows the residual errors are not autocorrelated. Any autocorrelation would imply that there is some pattern in the residual errors which are not explained in the model.
Now let us inspect the matrix performance and how well this model works to forecast.
From various matrix performance we see the correlation coefficient reach 73.8% of actual cases which is good! However MAPE (Mean of Absolute Percentage Error) reached a high percentage of 200% since the model still can not catch up with the fast changing of the deteriorating curve to south.
As a subsequent process, let us do forecast from this ARIMA (3,0,2) Model:
Green line above shows the prediction of next 6 months of active cases is flatten if we see throughout the year’s perspective. But if we see closely the prediction:
That graph says : this model forecasts slowing down the active cases for the next 6 months from November 2021 to April 2022. (We all hope the actual will be the same or even better!)
SARIMA Model
As mentioned above, the active case of covid here does not have clear seasonality throughout the year, but for learning purposes we will conduct a SARIMA model to compare with ARIMA.
SARIMA is an improvement model from ARIMA (AutoRegressive Integrated Moving Average)- forecasting method for univariate time series data. If the ARIMA model cannot handle seasonality, then SARIMA is built for handling the seasonality well.
SARIMA has trend elements and seasonal elements, specifically:
- Trend Elements
- AR denoted as ‘p’: Trend autoregression order — (hint: see from PACF)
- I denoted ‘d’: Trend difference order. — (hint: differencing until series be stationary)
- MA denoted as q: Trend moving average order. — (hint: see from ACF)
2. Seasonal Elements
There are four seasonal elements that are not part of ARIMA that must be configured; they are:
- P: Seasonal autoregressive order. — (hint: see PACF from shifted series as much as num of season)
- D: Seasonal difference order. — (hint: differencing until series be stationary)
- Q: Seasonal moving average order. — (hint: see ACF from shifted series as much as num of season)
- m: The number of time steps for a single seasonal period. — (hint: single period of season)
Back to the data series, from previous ARIMA we have d=1 (later we can compare to d=0) and D (seasonality order)=0 since there will no need any differencing on seasonality series.
Now we directly use AutoArima to choose p,q and P,Q,S:
Check the statistical significant of orders chosen:
Seeing P|z| above all orders can be considered as statistically significant since pvalue less than 0.05.
Now when we’re doing Cross Validation to ensure statistical significance results of the model. Cross validation method here we will be using one preceded year to forecast every one month forwards. So here, we include moving values in the cross validation.
As a result we found the forecast can almost follow the actual direction of the curve, merely having a spike in August as forecast which is not correct. But overall, the forecast seems to work well!
Now let us see the evaluation matrix of the model used:
MAPE 24%, means we can forecast 76% correctly and having 91% correlation coeff is very great!
Since this model sounds reliable. let us do the forecast of model SARIMA (1,1,0),(0,0,1,12) :
And if we take a closer look :
Seems no difference with the ARIMA model since we are not having any differencing order of seasonality in this series.
So it closes up the experimentation of forecasting Covid active case in Indonesia using ARIMA and SARIMA.
To sum up, from those two statistical Time Series Modelling, ARIMA and SARIMA predicting the active cases for next 6 months (from November 2021 until April 2022) to be lowering down. This assumes, only using past data to forecast. If new mutation of virus cannot be handled and society tend to neglect the daily health protocol, my hypothesis: this forecast will not say anything.
Wait, this article is not over yet! I wanted to show you how FB Prophet, forecast the Indonesia active case:
That graph shows, by seeing the trend captured there’s a possibility of this contagious virus will going back to its linear regression by rising up to its statistical significance of blue area in few months forwards.
So be aware!
Do anything we can do to mitigate this and STAY SAFE EVERYONE!
—
Feel free to visit the Python code of this work on my Github
Reference:
IBM Machine Learning Specialization, Time Series & Survival Analysis, 2020
ARIMA implementation
Python FB Prophet io Documentation
Box Jenkins Method