Application of Deep Learning LSTM and ARIMA Models in Time Series Forecasting: A Methods Case Study analyzing Canadian and Swedish Indoor Air Pollution Data

Research Article

Austin J Med Oncol. 2022; 9(1): 1073.

*Corresponding author: Selim Muhammad Khan, Cumming School of Medicine, University of Calgary, Canada

Received: November 05, 2022; Accepted: December 22, 2022; Published: December 28, 2022

Abstract

Time series analysis and forecasting are vital for understanding how a public health hazard evolves over time and which factors influence it; they also generate evidence for preventive action to avoid potential consequences. Many traditional time series analysis methods have been used in research. The new-generation deep learning LSTM (Long Short-Term Memory) model is promising because it mitigates the vanishing and exploding gradient problems that cause memory loss in recurrent neural networks, handles enormous volumes of data, and produces more precise nonlinear forecasts from multivariate inputs, whereas the traditional (S)ARIMA (seasonal autoregressive integrated moving average) models predict linearly from a single variable. Because the health hazard from the soil gas radon is multifactorial, current measures have proved ineffective, and the death toll from the risk is increasing, we applied this new method to pilot data gathered in Canada through the Evict Radon research consortium and to comparable data obtained from Sweden through Radonova. We conducted both deep learning LSTM and traditional (S)ARIMA modeling using Python in Jupyter Notebook and the econometric toolset of MATLAB 2020b. We identified the trends and seasonalities, filtered the data, trained and fitted the LSTM and (S)ARIMA models, and then forecasted radon levels for the two countries until 2100 AD. We compared and contrasted the two models to give emerging researchers a clear idea of the benefits and constraints of each. This methods case study has implications for model-based prediction from big time series data, not limited to the public health risk from indoor air pollution.

Keywords: Time Series Analysis; Deep Learning; Long Short-Term Memory; (S)ARIMA; Indoor Air Pollution; Radon Health Risk; Prediction

Highlights

• We applied the next generation deep learning LSTM model for time series forecasting of indoor radon health risk and compared its performance to the traditional ARIMA/SARIMA models.

• We present the model code and parameters, along with the additional benefits of the cutting-edge deep learning LSTM model over the traditional ones.

• Analytics skills in building and employing advanced models to forecast from big time series data can facilitate research, not limited to the health risk from indoor air pollution.

Learning Outcomes

By the end of this methods case, student researchers should be able to

• Understand and describe the building, training and application of deep neural networks (LSTM model) and ARIMA/SARIMA models to the time series data of any context for risk analysis and prediction.

• Compare and contrast the strengths and weaknesses of these two different analytic tools.

• Experiment with these tools and appraise the outcomes they generate in order to decide when to use each.

Introduction

A time series is a sequence in which one or more metrics are recorded as data points at regular intervals. With the massive production of high-dimensional data, the utility of and interest in time series analysis are increasing day by day. Depending on the frequency, a time series can be annual (annual incidence of a disease, exposure to a risk factor), quarterly (patient turnover or expenses of a hospital), monthly (emergency caseloads, patient admissions), weekly (number of deliveries, patient discharges), daily (infectious COVID-19 cases, recoveries), hourly (visitor traffic, outdoor consultations), minute-wise (inbound calls in an emergency room) or even second-wise (Twitter trends, web traffic) [1,6,9,15,16]. By analyzing such continuous time series data, we can forecast the future values of the series. Time series forecasts, such as the number of health-related cases expected in the days, months or years to come, have tremendous planning, management, and fiscal importance. In the health sector, such analyses drive essential policy, program development and implementation decisions. Errors in the forecasts can cost lives, which is why accurate forecasting matters.

Time series forecasting can be broadly divided into two types. If we use only the previous values of the time series to predict its future values, it is called univariate time series forecasting. The traditional ARIMA (Auto-Regressive Integrated Moving Average) model is a forecasting algorithm based on the idea that the information in the past values of the time series alone can be used to predict its future values. Multivariate time series forecasting, in contrast, uses predictors other than the series itself (exogenous variables). Deep neural networks such as Long Short-Term Memory (LSTM) recurrent neural networks can model problems with multiple input variables almost seamlessly. This is a great benefit in time series forecasting, where classical linear methods can be difficult to adapt to multivariate or multiple-input problems. This advanced tool can provide practical solutions for rapidly growing time series data with improved performance and computational efficiency [7]. We wanted to know how radon levels evolved from 1945 to date and how they can be predicted for risk prevention. In this methods case study, we examined the development and application of both the (S)ARIMA model and the LSTM model for multivariate time series forecasting with the Keras deep learning library [3,9]. Although the (S)ARIMA model has been applied in health risk analysis, we could not find any prior study or experiment that applied the LSTM model to the time series analysis of a public health risk, particularly the risk from exposure to indoor radon. Therefore, the main contribution of this paper is a performance comparison between the deep learning LSTM and (S)ARIMA models, which validates the proposed models as suitable for risk analysis with minimal data pre-processing and feature sophistication.
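To make the univariate/multivariate distinction concrete, the following minimal sketch shows one common way to reshape a time series into the three-dimensional (samples, time steps, features) input that recurrent layers such as a Keras LSTM expect. The function name `make_windows`, the window length and the toy values are illustrative assumptions, not the code used in this study.

```python
import numpy as np

def make_windows(series, n_lags, target_col=0):
    """Turn a (time, features) array into LSTM-style samples.

    Returns X with shape (samples, n_lags, features) and y with shape
    (samples,), where y is the target column one step after each window.
    """
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]  # univariate case: a single feature column
    n_samples = len(series) - n_lags
    if n_samples <= 0:
        raise ValueError("series must be longer than n_lags")
    # Each sample is a block of n_lags consecutive rows (all features).
    X = np.stack([series[i:i + n_lags] for i in range(n_samples)])
    # The label is the target column at the step right after the window.
    y = series[n_lags:, target_col]
    return X, y

# Toy data: 10 time steps, 2 features (hypothetical values, not study data).
toy = np.column_stack([np.arange(10.0), np.arange(10.0) * 2])
X, y = make_windows(toy, n_lags=3)
print(X.shape, y.shape)  # (7, 3, 2) (7,)
```

With a single column the same function yields the univariate case, with a feature dimension of one; adding columns for exogenous predictors gives the multivariate case without changing the model input convention.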

Project Overview and Context

Radon gas is an established category one carcinogen for lung cancer [4]. It is released from bedrock, enters residential buildings and can accumulate beyond the level considered hazardous for human exposure (>100 Bq/m3) [18]. The health hazards of radon came to the limelight over 70 years ago, when a high incidence of lung cancer was identified among uranium miners in the USA [8]. To date, very few countries have taken regulatory action to tackle the issue. The International Residential Code of 2010 requires an active radon control system to be operational in all new buildings in radon-prone areas. The Council of the European Union’s (2013) Basic Safety Standards Directive obligates member states to control the health risk with preemptive policies. Many European countries and US states have implemented building codes in compliance with the international standards [10]. The Canadian federal government has maintained a model National Building Code (NBC) since 1941. It has been revised every five to ten years with enhanced policy directions but has not imposed any legal requirements so far [11]. The NBC becomes law only when adopted by the provincial and territorial governments [13]. By 2019, most Canadian provinces and territories had revised their building codes to require builders to follow indoor radon control measures in new construction. However, no strict regulatory obligation is in place in Canada that requires radon testing and disclosure during property transactions. Such lenient policy allows residents’ exposure to the carcinogen; consequently, radon levels detected in newly built houses in Canada are now 31.5% higher than in those built before 1992 [14].

We applied LSTM deep learning and (S)ARIMA models to time series analysis of the pilot data collected on indoor radon gas to find the historical trends, seasonalities and forecasted levels till the end of this century, so that the influencing factors can be identified and compelling evidence can be generated to stir policy to prevent radon-induced lung cancer. The objective of this methods case study is to explore cutting-edge research tools for time series analysis that can be applied in health research beyond indoor air pollution, by testing the robustness of the two methods in producing outcomes with the highest precision and lowest errors.

Research Design

Evict Radon is an umbrella term for a range of interrelated public health research projects approved by the Research Ethics Boards of the University of Calgary (REB approval 17-2239) that apply various investigative methods to understand people’s exposure to indoor radon gas. Our study area spans the entire landscape of Canada and involves researchers at universities from coast to coast, comprising experts in fields ranging from radiation biology, genomics, building science and architecture, psychology, geology, public policy and communication to public and population health (see www.evictradon.org for more details). The project leads are committed to the Tri-Council Policy Statement on research ethics as well as the regional guidelines and regulations for research involving citizen science participants.

The project randomly targeted participants, obtained informed consent, and engaged adult citizen scientists who voluntarily purchased alpha track 90+ day radon test kits that were quality controlled by the investigators. We excluded participants who had lung cancer and tailored recruitment to collect a representative sample covering all sex, gender, race, age and income groups. We continue to collect data through an online survey questionnaire; the data are readily deidentified and stored in a format ready for analysis.

Research Practicalities

Data Collection: We collected Canadian data through Evict Radon, a consortium of researchers spread across Canada in active partnership with citizen scientists. We used the web and social media to contact study participants, who purchased radon test kits and completed a survey. The test kits were sent to the lab, and the test results were forwarded to the investigators to communicate to the participants. We gathered Swedish radon data through our partner, Radonova, in Sweden.

Data Processing: As the data came in large volume, were collected over an extensive timeline, and differed in the number of observations and types of variables, we preprocessed and equalized them by putting them on the same or similar scales and analyzed only the variables with matching timelines, so that the outcomes could be compared between the two countries. As data collection is still in progress, we picked a sample from the pilot data of both the Canadian and Swedish radon testing cohorts and arranged the years in which the houses were built into a time series from 1946 to 2020; this gave two simulated datetime series extending over 74 years to analyze for this case study.
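As an illustration of this kind of preprocessing, the sketch below groups individual readings by year of construction into a regular yearly series and rescales it to a common range. It is only a sketch under stated assumptions: the record layout, the use of a yearly mean, min-max scaling as the rescaling method, and the helper names `yearly_mean_series` and `min_max_scale` are all hypothetical choices, and the values are toy numbers rather than study data.

```python
from collections import defaultdict
from statistics import mean

def yearly_mean_series(records, start_year, end_year):
    """Average readings per construction year over a regular yearly grid.

    `records` is an iterable of (year_built, radon_value) pairs. Years
    without any reading are kept in the grid with the value None.
    """
    by_year = defaultdict(list)
    for year, value in records:
        by_year[year].append(value)
    return [
        (year, mean(by_year[year]) if year in by_year else None)
        for year in range(start_year, end_year + 1)
    ]

def min_max_scale(values):
    """Rescale values to [0, 1]; None entries (missing years) stay None."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    if hi == lo:  # constant series: avoid division by zero
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]

# Toy records (hypothetical): (year built, radon level in Bq/m3).
records = [(1946, 80.0), (1946, 120.0), (1947, 60.0), (1949, 140.0)]
series = yearly_mean_series(records, 1946, 1949)
scaled = min_max_scale([value for _, value in series])
print(series)
print(scaled)  # [0.5, 0.0, None, 1.0]
```

Applying the same grid and the same scaling rule to both cohorts is what makes the two national series directly comparable; how missing years are filled is a separate decision that this sketch deliberately leaves open.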

Methods

We conducted descriptive and time series analyses using both the Python Jupyter Notebook with the Keras deep learning library and the time series analysis and forecasting (TSAF) toolsets of the econometric platform in MATLAB 2020b. We computed descriptive statistics of the indoor radon (222Rn) test results, analyzed time series trends and seasonality, filtered the data to remove the trends and seasonalities and thereby obtained the random fluctuations, and trained appropriate models to forecast and compare radon levels in houses for the desired number of years into the future for both Canada and Sweden.
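The filtering step described above can be sketched as follows: estimate the trend with a centered moving average and subtract it, leaving the fluctuations around the trend. This is a generic illustration, not the study's pipeline; the function name `moving_average_trend`, the window of 5 and the toy values are assumptions made for the example.

```python
import numpy as np

def moving_average_trend(x, window):
    """Centered moving average; edge points where the window
    does not fit are returned as NaN."""
    x = np.asarray(x, dtype=float)
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    trend = np.full(len(x), np.nan)
    kernel = np.ones(window) / window
    # 'valid' keeps only positions where the full window overlaps x.
    trend[half:len(x) - half] = np.convolve(x, kernel, mode="valid")
    return trend

# Toy yearly radon-like values (hypothetical): upward trend plus wiggles.
years = np.arange(1946, 1966)
values = 60.0 + 2.0 * (years - 1946) + np.tile([4.0, -3.0, 1.0, -2.0], 5)

# Descriptive statistics of the raw series.
print("mean", values.mean(), "std", values.std(ddof=1),
      "min", values.min(), "max", values.max())

trend = moving_average_trend(values, window=5)
residual = values - trend  # fluctuations left after removing the trend
print("residual mean (ignoring edges):", np.nanmean(residual))
```

A moving average is only one of several ways to estimate a trend; the window length controls how much short-term variation is absorbed into the trend estimate.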

ARIMA/SARIMA Model

For time series prediction, the historical data should be stationary, meaning that the covariance of the variable of interest is a function of lag, not of time. Descriptive statistics and the inferential augmented Dickey-Fuller (adfuller) test (Table 2) showed that both the Canadian and Swedish datasets were non-stationary, meaning that both had trends and seasonalities (Figure 2). We therefore removed these trends and seasonalities through differencing and decomposition to obtain stationary data with random fluctuations of radon levels, suitable for an ARIMA (Auto-Regressive Integrated Moving Average) model that predicted the future trends (Figure 3). Because we identified seasonalities in our data, we also deseasonalized them and added a seasonal term to the model, which is then called SARIMA [3,12]. Details of the methods, with formulas and parameters, are described below; the code and calculations are available on request.
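A small sketch may help show what differencing does to a series with a trend and a seasonal pattern. The helper `difference`, the seasonal period of 4 and the toy series are assumptions for illustration only; in practice a stationarity test such as `adfuller` from `statsmodels.tsa.stattools` would be run on the differenced series, which is omitted here to keep the example dependent on NumPy alone.

```python
import numpy as np

def difference(x, lag=1):
    """Return x[t] - x[t - lag]; the result is `lag` points shorter than x."""
    x = np.asarray(x, dtype=float)
    if lag < 1 or lag >= len(x):
        raise ValueError("lag must satisfy 1 <= lag < len(x)")
    return x[lag:] - x[:-lag]

# Toy series (hypothetical): linear trend plus a repeating pattern of period 4.
t = np.arange(24)
season = np.tile([3.0, -1.0, -2.0, 0.0], 6)
series = 0.5 * t + season

d1 = difference(series, lag=1)  # removes the linear trend, keeps the seasonal shape
ds = difference(series, lag=4)  # seasonal difference: removes the repeating pattern
d_both = difference(ds, lag=1)  # both differences applied in sequence

print(len(d1), len(ds), len(d_both))
print(ds[:4])      # constant 2.0: the trend slope times the seasonal lag
print(d_both[:4])  # zeros: nothing systematic is left in this noise-free toy
```

The regular difference corresponds to the integration order d of an ARIMA model, and the seasonal difference to the seasonal order D of a SARIMA model; real data would leave random fluctuations rather than exact zeros.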