Specialized Deep Learning Architectures for Time Series Forecasting
Introduction
Why Should We Use Deep Learning for Forecasting?
Statistical algorithms have long been used for making forecasts with time series data. These classical algorithms, like Exponential Smoothing and ARIMA models, prescribe the data generation process and require manual selections to account for factors like trend, seasonality, and autocorrelation. However, modern data applications often deal with hundreds of thousands or even millions of related time series. For example, a demand forecasting algorithm at Amazon may have to consider sales data from millions of products, and an engagement forecasting algorithm at Instagram may have to model metrics from millions of posts. Traditional forecasting methods learn the characteristics of individual time series and hence do not scale well, because they fit a separate model for each time series and do not share parameters among them.
Deep learning provides a data-driven approach that makes a minimal set of assumptions to learn from multiple related time series. In the previous article, I did a detailed literature review on the state of statistical vs machine learning vs deep learning approaches for time series forecasting^{1}. It is important to note that deep learning methods are not necessarily free of inductive biases. While DL models make very few assumptions about the features, inductive biases creep into the modeling process through the architectural design of the DL model. This is why we see certain models perform better than others on certain tasks. For example, Convolutional Neural Networks work well with images due to their spatial inductive biases and translational equivariance. Hence, a careful design of the DL model based on the application domain is critical. Over the past few years, several new DL models have been developed for forecasting applications. In this article, we will go over some of the most popular DL models, understand their inductive biases, implement them in PyTorch, and compare their results on a dataset with multiple time series.
Basic Concepts
Before going any further, let’s look at some fundamental concepts that will help develop a better understanding of the models.
Types of Forecasting
We will focus on the following two types of forecasting applications:
 Point forecasting: Our goal is to predict a single value, which is often the conditional mean or median of future observations.
 Probabilistic forecasting: Our goal here will be to predict the full distribution. Probabilistic forecasting captures the uncertainty of the future and hence plays a key role in automating and optimizing operational business processes. For example: in retail, it helps to optimize procurement, and inventory planning; in logistics, it enables efficient labor resource planning, delivery vehicle deployments, etc.
We further divide probabilistic forecasting into two categories:
 Parametric probabilistic forecasting: We directly predict the parameters of the hypothetical target distribution; for example, we predict the mean and standard deviation under a Gaussian distribution assumption. Maximum likelihood estimation is commonly applied to estimate the corresponding parameters in this setting.
 Non-parametric probabilistic forecasting: We predict a set of values corresponding to each quantile point of interest. Such models are commonly trained to minimize quantile loss.
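As a minimal sketch of the non-parametric setting, the quantile (pinball) loss can be written in a few lines of PyTorch; the function name and tensor shapes here are illustrative:

```python
import torch

def quantile_loss(y, y_hat, q):
    # Pinball loss for quantile level q in (0, 1): under-prediction is
    # penalized by q and over-prediction by (1 - q), so the minimizer
    # is the q-th quantile of the target distribution.
    diff = y - y_hat
    return torch.maximum(q * diff, (q - 1.0) * diff).mean()
```

At q = 0.5 this reduces to half the mean absolute error, whose minimizer is the median.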
Covariates
Along with the time series data, a lot of models also incorporate covariates. Covariates provide additional information about an item being modeled or the time point corresponding to the item’s observed values. For example, covariates ensure that the information about the absolute or relative time is available to the model when we use the windowing approach during training.
 Item-dependent covariates can be product id, product brand, category, product image, product text, etc. One common way to incorporate them into the modeling process is by using embeddings. Some numeric covariates can also be repeated along the time dimension to be used together with the time series input.
 Time-dependent covariates can be product price, weekend and holiday indicators, day-of-the-week, etc., or binary features like price adjustment, censoring, etc. This information can usually be modeled using the corresponding numeric values.
Covariates can also be both time and item dependent, for example, product-specific promotions.
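To make this concrete, here is a minimal sketch (all dimensions and feature names are illustrative, not taken from any particular paper) of combining a static item-id embedding with time-dependent features into one model input:

```python
import torch
import torch.nn as nn

n_items, embed_dim, T = 370, 8, 24                  # illustrative sizes

item_embedding = nn.Embedding(n_items, embed_dim)   # item-dependent covariate
item_ids = torch.tensor([3, 7])                     # a batch of two series
time_feats = torch.randn(2, T, 3)                   # e.g. hour, weekday, month

# Repeat the static embedding along the time dimension and concatenate it
# with the time-dependent covariates to form the per-step model input.
static = item_embedding(item_ids).unsqueeze(1).expand(-1, T, -1)
model_input = torch.cat([static, time_feats], dim=-1)   # shape (2, T, 11)
```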
Models
1. DeepAR
DeepAR is an RNN-based probabilistic forecasting model proposed by Amazon that trains on a large number of related time series ^{2}. Even though it trains on all time series, it is still a univariate model, i.e. it predicts a univariate target. It learns seasonal behavior and dependencies without any manual feature engineering. It can also incorporate cold items with limited historical data. Assuming $t_{0}$ is the forecast creation time, i.e. the time step at which the forecast for the future horizon must be generated, our goal is to model the following conditional distribution:

$$P(z_{i,t_{0}:T} \mid z_{i,1:t_{0}-1}, x_{i,1:T})$$

Using the autoregressive recurrent network, we can further represent our model distribution as the product of likelihood factors

$$Q_{\Theta}(z_{i,t_{0}:T} \mid z_{i,1:t_{0}-1}, x_{i,1:T}) = \prod_{t=t_{0}}^{T} \ell\left(z_{i,t} \mid \theta(h_{i,t}, \Theta)\right)$$

parameterized by the output $h_{i,t} = h(h_{i,t-1}, z_{i,t-1}, x_{i,t}, \Theta)$ of an autoregressive recurrent network.
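A minimal sketch of one such likelihood factor under the Gaussian assumption; the class and layer sizes are my own illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepARStep(nn.Module):
    """One autoregressive step: h_t = rnn(h_{t-1}, [z_{t-1}, x_t]),
    then h_t is projected to Gaussian parameters (mu, sigma)."""
    def __init__(self, cov_dim, hidden=32):
        super().__init__()
        self.rnn = nn.LSTMCell(1 + cov_dim, hidden)
        self.mu = nn.Linear(hidden, 1)
        self.sigma = nn.Linear(hidden, 1)

    def forward(self, z_prev, x_t, state):
        h, c = self.rnn(torch.cat([z_prev, x_t], dim=-1), state)
        # SoftPlus keeps the scale parameter strictly positive.
        return self.mu(h), F.softplus(self.sigma(h)), (h, c)
```

Training would maximize the sum of $\log \mathcal{N}(z_{i,t} \mid \mu_t, \sigma_t)$ over all time steps and series.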
Handling Unique Data Challenges
Amazon’s website catalogs millions of products with very skewed sales velocity. The magnitudes of sales numbers among series also have a large variance, which makes common normalization techniques like input standardization or batch normalization less effective. Also, the erratic, intermittent, and bursty nature of the data in such demand forecasting violates the core assumptions of many classical techniques, such as Gaussian errors, stationarity, or homoscedasticity of the time series. The authors propose two solutions:

Scaling input time series: An average-based heuristic is used to scale the inputs. The autoregressive inputs $z_{i,t}$ are divided by the average value $v_{i} = 1 + \frac{1}{t_{0}} \sum_{t=1}^{t_{0}} z_{i,t}$, and at the end, the likelihood parameters are multiplied back by the same value.

Velocity-based sampling: Large magnitude variance can also lead to suboptimal results because an optimization procedure that picks training instances uniformly at random will visit the small number of time series with a large scale very infrequently, which results in underfitting those time series. To handle this, a weighted sampling scheme is adopted where the probability of selecting a sample with scale $v_{i}$ is proportional to $v_{i}$ (see the `WeightedSampler` class in the code).
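PyTorch's built-in `WeightedRandomSampler` can stand in for such a sampler; the three series and their scales below are synthetic:

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Three synthetic series whose magnitudes differ by orders of magnitude.
series = [torch.rand(100) * s for s in (1.0, 10.0, 100.0)]

# The paper's scale heuristic: v_i = 1 + mean of the observed series.
scales = torch.stack([1.0 + s.mean() for s in series])

# Sample series indices with probability proportional to their scale.
sampler = WeightedRandomSampler(weights=scales, num_samples=1000, replacement=True)
counts = torch.bincount(torch.tensor(list(sampler)), minlength=3)
```

The large-scale series is now visited far more often than uniform sampling would visit it.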
The paper also recommends using the following covariates: age (distance to the first observation in that time series), day-of-the-week and hour-of-the-week for hourly data, week-of-the-year for weekly data, month-of-the-year for monthly data, and a time series id (representing the product category) as an embedding. All covariates were standardized to zero mean and unit variance.
Likelihood model
DeepAR maximizes the log-likelihood but does not limit itself to assuming Gaussian noise. Any noise model can be chosen as long as the log-likelihood and its gradients w.r.t. the parameters can be evaluated. The authors recommend using a Gaussian distribution (parameterized by mean and standard deviation) for real-valued data, and a negative binomial likelihood (parameterized by mean and shape) for unbounded positive count data and long-tail data. Other possibilities include the Beta distribution for data in the unit interval, Bernoulli for binary data, and mixtures for complex marginal distributions. The paper also recommends using the SoftPlus activation function for parameters, like the standard deviation, that must always be positive.
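A sketch of the Gaussian case: a SoftPlus keeps the scale positive, and training minimizes the negative log-likelihood (the parameter and target values below are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

# Raw network outputs for the distribution parameters (illustrative values).
mu = torch.tensor([2.0, 3.0])
raw_sigma = torch.tensor([0.1, 0.5])
sigma = F.softplus(raw_sigma)        # SoftPlus keeps sigma strictly positive

# Training minimizes the negative log-likelihood of the observed targets.
z = torch.tensor([2.2, 2.5])
nll = -Normal(mu, sigma).log_prob(z).mean()
```

Swapping `Normal` for `torch.distributions.NegativeBinomial` gives a count-data variant.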
Model Architecture
The DeepAR model uses an encoder-decoder design but uses the same architecture for the encoder and decoder components. During training, at each time step $t$, the model takes the target value at the previous time step $z_{i,t-1}$, the covariates $x_{i,t}$, as well as the previous network output $h_{i,t-1}$. The model uses a teacher forcing approach, which has been shown to pose a few problems for tasks like NLP but hasn’t had any known adverse effect in the forecasting setting. For prediction, the history of the time series is first fed in for all time steps before the forecast creation time; then a sample is drawn (ancestral sampling) and fed back as the input for the next point, until the end of the prediction range.
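The prediction loop can be sketched as follows, with a toy stand-in for the trained network; repeating the loop many times yields sample paths from which forecast quantiles can be read off:

```python
import torch

torch.manual_seed(0)

def next_step_params(z_prev):
    # Toy stand-in for the trained network's output distribution parameters.
    return 0.9 * z_prev, torch.tensor(0.1)

# Ancestral sampling: draw a sample at each step and feed it back as input.
z = torch.tensor(5.0)
path = []
for _ in range(12):
    mu, sigma = next_step_params(z)
    z = torch.normal(mu, sigma)      # the sample becomes the next input
    path.append(z.item())
```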
The DeepAR model has been extended in several other research works. For example, DeepAR with quantile functions ^{3}, DeepAR with dilated causal convolutions ^{4}, and DeepAR for multivariate forecasting (TimeGrad) ^{5}.
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/c1f7e11c0cfa5f696fd9ccd392f9b8d0
You can read more about DeepAR in the original paper.
2. MQ-RNN
The Multi-horizon Quantile Recurrent Neural Network (MQ-RNN) is an RNN-based non-parametric probabilistic forecasting model proposed by Amazon ^{6}. It uses quantile loss to predict values for each desired quantile for every time step in the forecast horizon. One problem with recursive forecast generators like DeepAR is that they tend to accumulate errors from previous steps during recursive forecast generation. Some empirical research also suggests that directly forecasting values for the full forecast horizon is less biased, more stable, and retains the efficiency of parameter sharing ^{7}. MQ-RNN builds on this idea and uses direct multi-horizon forecasting instead of a one-step-ahead recursive forecasting approach. It also incorporates both static and temporal covariates and solves the following large-scale time series regression problem:

$$p\left(y_{t+1}, \ldots, y_{t+K} \mid y_{:t},\, x^{(h)}_{:t},\, x^{(f)}_{t+1:t+K},\, x^{(s)}\right)$$

where $x^{(h)}$ are the historical covariates, $x^{(f)}$ the known future covariates, and $x^{(s)}$ the static covariates.
Incorporating Future Covariate Values
This paper suggests that distant future information can have a retrospective impact on near-future horizons. For example, if a festival or a Black Friday sales event is coming up in the future, the anticipation of it can affect a customer’s buying decisions. As explained later, this future information is supplied to the two decoder MLP components.
Decoder Design
The model adopts the encoder-decoder framework. The encoder is a vanilla LSTM network that takes historical time series and covariate values. For the decoder structure, the model uses two MLP branches instead of a recursive decoder. As stated earlier, the model focuses on directly producing output for the full horizon at once; the MLP branches are used to achieve this goal. The design philosophy for the two decoders is as follows:

Global Decoder: The global MLP takes the encoder output and future covariates as input, and generates two kinds of context vectors as output: horizon-specific contexts and a horizon-agnostic context.
The idea behind the horizon-specific context is that it carries structural awareness of the temporal distance between the forecast creation time and a specific horizon, which is essential for aspects like seasonality mapping. The horizon-agnostic context, in contrast, is based on the idea that not all relevant information is time-sensitive. Note that in the standard encoder-decoder architecture, only the horizon-agnostic context exists, and horizon awareness is indirectly enforced by recursively feeding predictions into the RNN cell for the next time step.

 Local Decoder: The local MLP takes the global decoder’s outputs as well as the future covariates as input, and generates all required quantiles for each specific horizon.
The local MLP is key to aligning the forecasts with future seasonality and events, and to the capability of generating sharp, spiky forecasts. In cases where there is no meaningful future information, or sharp, spiky forecasts are not desired, the local MLP can be removed.
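The two decoder branches can be sketched as follows; for brevity each "MLP" here is a single linear layer, and all dimensions are illustrative rather than the paper's:

```python
import torch
import torch.nn as nn

K, n_q, hidden, fut_dim = 4, 3, 32, 5    # horizons, quantiles, sizes

# Global MLP: encoder state + all future covariates ->
# K horizon-specific contexts plus one horizon-agnostic context.
global_mlp = nn.Linear(hidden + K * fut_dim, K * hidden + hidden)
# Local MLP: contexts + that horizon's covariates -> its quantiles.
local_mlp = nn.Linear(hidden + hidden + fut_dim, n_q)

enc = torch.randn(2, hidden)             # encoder state at forecast creation
x_fut = torch.randn(2, K, fut_dim)       # known future covariates

out = global_mlp(torch.cat([enc, x_fut.flatten(1)], dim=-1))
horizon_ctx = out[:, :K * hidden].view(2, K, hidden)   # horizon-specific
agnostic_ctx = out[:, K * hidden:]                     # horizon-agnostic

# The local MLP is applied per horizon with shared parameters.
quantiles = torch.stack([
    local_mlp(torch.cat([horizon_ctx[:, k], agnostic_ctx, x_fut[:, k]], dim=-1))
    for k in range(K)
], dim=1)                                # (batch, K, n_q)
```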
Model Architecture
The loss function for MQ-RNN is the sum of the individual quantile losses, and the output is all the quantile forecasts. At the 0.5 quantile, the quantile loss is simply proportional to the Mean Absolute Error, and its minimizer is the median of the predictive distribution. MQ-RNN generates multi-horizon forecasts by placing a series of decoders, with shared parameters, at each recurrent layer (time point) in the encoder, and computes the loss against the corresponding targets. Each time series of arbitrary length can serve as a single sample in model training, hence allowing cold content with limited history to also be used in the model. The model also uses static information by replicating it across time. The authors further recommend trying different encoder structures for processing the sequential input, like dilated 1D causal convolution layers (MQ-CNN), a NARX-like LSTM, or WaveNet.
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/b0dfc7390fe4eff023226bbb6c0f538b
You can read more about MQ-RNN in the original paper and the official codebase.
3. Bayesian LSTM
Another clever usage of the encoder-decoder framework for time series forecasting was proposed by Uber ^{8}. The main objective of this proposal was to create a solution for quantifying forecasting uncertainty using Bayesian Neural Networks. This approach can also be used for anomaly detection at scale. The Bayesian perspective introduces uncertainty measurement to deep learning models: a prior (often Gaussian) is placed on the weight parameters, and the model aims to fit the optimal posterior distribution. Many methods, such as stochastic search, variational Bayes, and probabilistic backpropagation, have been proposed as approximate inference methods for Bayesian networks at scale. This paper adopts a Monte Carlo Dropout based approach suggested by prior research ^{9}. Specifically, stochastic dropouts are applied after each hidden layer, and the model output can be approximately viewed as a random sample from the posterior predictive distribution. Hence, the uncertainty in model predictions can be estimated by the sample variance of the predictions across a few repetitions.
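The MC dropout idea fits in a few lines; the network below is an arbitrary stand-in for the paper's LSTM-based prediction network:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                      nn.Dropout(0.5), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=100):
    """Keep dropout active at inference; the spread of repeated stochastic
    forward passes approximates the posterior predictive uncertainty."""
    model.train()                     # leaves the dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)

x = torch.randn(1, 10)
mean, std = mc_dropout_predict(model, x)
```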
Quantifying Uncertainty
This paper decomposes prediction uncertainty into 3 types:
 Uncertainty due to the model (ignorance of the model parameters). This uncertainty can be reduced by collecting more data, and it is estimated using MC dropout.
 Uncertainty due to data (train and test samples are from a different population). This can be accounted for by fitting a latent embedding space for all training time series using an encoderdecoder model and calculating variance using only the encoder representations.
 Inherent noise (irreducible). This noise can be estimated through the residual sum of squares evaluated on an independent held-out validation set.
Model Architecture
The encoder-decoder is a standard RNN-based framework that captures the inherent patterns in the time series during a pre-training step. After the encoder-decoder is pre-trained, the encoder is treated as an intelligent feature-extraction black box. A prediction network learns to take input from both the learned encoder, as well as any potential external features, to generate the final prediction. Before training, the raw data is log-transformed to remove exponential effects. Within each input sliding window, the first day’s value is subtracted from all values, so that trends are removed and the neural network is trained on the increments. Later, these transformations are reverted to obtain predictions at the original scale. In the paper, the learned embeddings (i.e. the last LSTM cell states in the encoder) are projected to lower dimensions using PCA for interpreting the extracted features.
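The preprocessing and its inversion can be sketched with NumPy (the values are synthetic):

```python
import numpy as np

raw = np.array([100.0, 150.0, 90.0, 300.0, 120.0])

# Log-transform to tame exponential effects, then subtract the window's
# first value so the network models increments rather than levels.
logged = np.log(raw)
detrended = logged - logged[0]

# Invert both transforms to recover values at the original scale.
restored = np.exp(detrended + logged[0])
```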
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/f4a55186706675a085157c64fd1e0634
You can read more about Bayesian LSTM in the original paper.
4. DeepTCN
The Deep Temporal Convolutional Network (DeepTCN) was proposed by Bigo Research ^{10}. Instead of using RNNs, DeepTCN utilizes stacked layers of dilated causal convolutions to capture the long-term correlations in time series data. RNNs can be remarkably difficult to train due to exploding or vanishing gradients ^{11}, and backpropagation through time (BPTT) often hampers efficient computation ^{12}. On the other hand, a Temporal Convolutional Network (TCN) is more robust to error accumulation due to its non-autoregressive nature, and it can also be efficiently trained in parallel.
Notes on TCN
A TCN is a 1D fully convolutional network (FCN) with causal convolutions. Causal means that there is no information leakage from the future to the past, i.e. the output at time $t$ can only be obtained from the inputs that are no later than $t$. Dilated convolutions enable receptive fields that grow exponentially with the depth of the network (as opposed to linearly). Dilation allows the filter to be applied over an area larger than its length by skipping input values with a certain step.
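A minimal sketch of a dilated causal convolution: left-padding only, so no future information leaks backward (channel counts and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution with left padding only, so the output at time t
    never depends on inputs later than t."""
    def __init__(self, ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(ch, ch, kernel_size, dilation=dilation)

    def forward(self, x):                       # x: (batch, ch, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# Stacking dilations 1, 2, 4, 8 with kernel size 2 gives a receptive
# field of 16, matching the encoder described below.
net = nn.Sequential(*[CausalConv1d(1, 2, d) for d in (1, 2, 4, 8)])
```

Because only the left side is padded, perturbing the input at time $t$ can never change the outputs before $t$.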
Recent studies have empirically shown that TCNs outperform RNNs across a broad range of sequence modeling tasks. These studies also recommend using TCNs, instead of RNNs, as the default starting point for sequence modeling tasks ^{13}.
Data Preparation
This research used time-independent covariates like product_id to incorporate series-level information that helps in capturing the scale and seasonality of each specific series. Other covariates included hour-of-the-day, day-of-the-week, and day-of-the-month for hourly data, day-of-the-year for daily data, and month-of-the-year for monthly data. One of the major concerns for this study was to effectively handle complex seasonal patterns, like holiday effects. This is done by using hand-crafted exogenous variables (covariates) such as holiday indicators, weather forecasts, etc. New products with little historical information used zero-padding to ensure the desired input sequence length.
Model Architecture
DeepTCN uses an encoder-decoder framework. The encoder is composed of stacked residual blocks based on dilated causal convolutions to capture temporal dependencies. The diagram above shows the encoder on the left, with {1, 2, 4, 8} as the dilation factors, a filter size of 2, and a receptive field of 16. More specifically, the encoder is a stack of dilated causal convolutions that models the historical observations and covariates. The diagram below shows the original DeepTCN encoder on the left, and the one I adapted from another paper^{13} for my implementation.
For the decoder, the paper proposed a modified resnet-v block that takes two inputs (the encoder output and the future covariates). The decoder architecture is shown in the figure below.
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/e0a56592c32844231d40e3e48a1bc64a
You can read more about DeepTCN in the original paper.
5. N-BEATS
N-BEATS is a univariate time series forecasting model proposed by Element AI (acquired by ServiceNow) ^{14}. This research came out around the time when ES-RNN, a hybrid Exponential Smoothing and RNN model, won the M4 competition in 2018. A popular narrative at that time suggested that hybrids of statistical and deep learning methods could be the way forward. But this paper challenged that notion by developing a pure-DL architecture for time series forecasting that was inspired by the signal processing domain. The architecture also allows for interpretable outputs by carefully injecting a suitable inductive bias into the model. However, there is no provision to include covariates in the model. N-BEATS was later extended by N-BEATSx ^{15} to incorporate exogenous variables. Another work, N-HiTS ^{16}, altered the architecture and achieved accuracy improvements while drastically cutting the compute cost of long-horizon forecasting.
Model Architecture
The above diagram shows the N-BEATS architecture from the most granular view on the left to the high-level view on the right. N-BEATS uses the residual principle to stack many basic blocks, and the paper shows that we can stack up to 150 layers and still facilitate efficient learning. Let’s look at each of the three columns individually.
 [Left] Block: Each block takes the lookback period data as input and generates two outputs: a backcast and a forecast. The backcast is the block’s own best reconstruction of the lookback period. The input to the block is first processed by four standard fully connected layers (with bias and activation), and the output of this FC stack is transformed by two separate linear layers (no bias or activation) into what the paper calls expansion coefficients for the backcast and forecast, $\theta^{b}$ and $\theta^{f}$, respectively. These expansion coefficients are then mapped to the outputs using a set of basis layers ($g^{b}$ and $g^{f}$).
 [Middle] Stacks: Different blocks are arranged in a stack. All the blocks in a stack share the same kind of basis layer. The blocks are arranged in a residual manner, such that the input to a block $l$ is $x_{l} = x_{l-1} - \hat{x}_{l-1}$, i.e. at each step, the backcast generated by a block is subtracted from that block’s input before being passed on to the next block. The forecast outputs of all the blocks in a stack are added up to make the stack forecast, and the residual backcast from the last block in a stack is the stack residual.
 [Right] Overall Architecture: On the right, we see the top-level view of the architecture. The stacks are chained together so that, for any stack, the stack residual output of the previous stack is the input, and each stack generates two outputs: the stack forecast and the stack residual. Finally, the N-BEATS forecast is the sum of all the stack forecasts.
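The doubly residual design above can be sketched compactly; layer sizes are illustrative, and the basis here is a free linear map as in the generic mode described next:

```python
import torch
import torch.nn as nn

class NBeatsBlock(nn.Module):
    """Generic-mode block: an FC stack followed by two linear basis
    projections producing a backcast and a forecast."""
    def __init__(self, backcast_len, forecast_len, hidden=64, theta_dim=8):
        super().__init__()
        layers, in_dim = [], backcast_len
        for _ in range(4):                      # four FC layers with ReLU
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.fc = nn.Sequential(*layers)
        self.theta_b = nn.Linear(hidden, theta_dim, bias=False)
        self.theta_f = nn.Linear(hidden, theta_dim, bias=False)
        # Identity-style basis: free linear maps from coefficients to outputs.
        self.basis_b = nn.Linear(theta_dim, backcast_len, bias=False)
        self.basis_f = nn.Linear(theta_dim, forecast_len, bias=False)

    def forward(self, x):
        h = self.fc(x)
        return self.basis_b(self.theta_b(h)), self.basis_f(self.theta_f(h))

def stack_forward(blocks, x, forecast_len):
    # Doubly residual stacking: subtract each backcast, sum each forecast.
    forecast = torch.zeros(x.size(0), forecast_len)
    for block in blocks:
        backcast, f = block(x)
        x = x - backcast
        forecast = forecast + f
    return forecast, x       # stack forecast and stack residual
```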
Basis Layer
A basis layer puts a constraint on the functional space, thereby limiting the output to representations expressible by the chosen basis functions. We can use any arbitrary function as the basis, which gives us a lot of flexibility. In the paper, N-BEATS operates in two modes: generic and interpretable. The generic mode doesn’t have any basis function constraining the function search (we can think of it as an identity basis), so in this mode the mapping is learned entirely by the model through a linear projection of the expansion coefficients. Fixing the basis function, in contrast, allows for human interpretability of what each stack signifies. The authors propose two specific basis functions that capture trend (a polynomial basis) and seasonality (a Fourier basis), so we can say that the forecast output of a stack represents trend or seasonality based on the chosen basis function.
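The two interpretable bases can be sketched as fixed matrices that map a few expansion coefficients to a trend or seasonal curve (sizes are illustrative):

```python
import torch

def trend_basis(theta_dim, length):
    # Polynomial basis: rows are t^0, t^1, ..., t^{p-1} on [0, 1).
    t = torch.arange(length, dtype=torch.float32) / length
    return torch.stack([t ** i for i in range(theta_dim)])

def seasonality_basis(theta_dim, length):
    # Fourier basis: a constant row plus alternating cos/sin harmonics.
    t = torch.arange(length, dtype=torch.float32) / length
    rows = [torch.ones(length)]
    for i in range(1, (theta_dim + 1) // 2):
        rows += [torch.cos(2 * torch.pi * i * t),
                 torch.sin(2 * torch.pi * i * t)]
    return torch.stack(rows)[:theta_dim]

# forecast = theta @ basis maps a few coefficients to a smooth curve.
theta = torch.randn(1, 3)
forecast = theta @ trend_basis(3, 6)    # shape (1, 6)
```

A low-degree polynomial basis forces a stack's forecast to be a slowly varying trend, while the Fourier basis forces a periodic, seasonal shape.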
PyTorch Code


Refer to this Gist for the complete code for this experiment: https://gist.github.com/reachsumit/9b31afe74c560c9f804081af3e1b4a1d
You can read more about N-BEATS in the original paper.
Experiment Results
I used UCI’s ElectricityLoadDiagrams20112014 dataset^{17} to run a quick experiment with minimal PyTorch implementations of these models. The dataset contains 370 time series sampled every 15 minutes, with a total of about 140K observations per series. The data was resampled to a 1-hour frequency. Three covariates (weekday, hour, month) and one time series id were used wherever allowed by the model architecture. Refer to the notebooks linked in the respective sections to see the code for each experiment. The following chart shows the RMSE values on the test set for each of the models. Results are sorted from best (left) to worst (right).
Summary
In this article, we defined the need for using deep learning for modern time series forecasting and then looked at some of the most popular deep learning algorithms designed for time series forecasting with different inductive biases in their model architecture. We implemented all of the algorithms in Python and compared their results on a toy dataset.
References

“Statistical vs Machine Learning vs Deep Learning Modeling for Time Series Forecasting”, https://blog.reachsumit.com/posts/2022/12/statsvsmlforts/ ↩︎

Flunkert, Valentin & Salinas, David & Gasthaus, Jan. (2017). DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. International Journal of Forecasting. 36. 10.1016/j.ijforecast.2019.07.001. ↩︎

Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Rangapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. Probabilistic forecasting with spline quantile function RNNs. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1901–1910, 2019. ↩︎

van den Oord, Aaron & Dieleman, Sander & Zen, Heiga & Simonyan, Karen & Vinyals, Oriol & Graves, Alex & Kalchbrenner, Nal & Senior, Andrew & Kavukcuoglu, Koray. (2016). WaveNet: A Generative Model for Raw Audio. ↩︎

Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In International Conference on Machine Learning, pages 8857–8868. PMLR, 2021. ↩︎

Wen, Ruofeng & Torkkola, Kari & Narayanaswamy, Balakrishnan. (2017). A MultiHorizon Quantile Recurrent Forecaster. ↩︎

Ben Taieb, Souhaib & Atiya, Amir. (2015). A Bias and Variance Analysis for MultistepAhead Time Series Forecasting. IEEE transactions on neural networks and learning systems. 27. 10.1109/TNNLS.2015.2411629. ↩︎

Zhu, Lingxue & Laptev, Nikolay. (2017). Deep and Confident Prediction for Time Series at Uber. ↩︎

Gal, Yarin & Ghahramani, Zoubin. (2015). Dropout as a Bayesian Approximation: Appendix. ↩︎

Chen, Yitian & Kang, Yanfei & Chen, Yixiong & Wang, Zizhuo. (2019). Probabilistic Forecasting with Temporal Convolutional Neural Network. ↩︎

Pascanu, Razvan & Mikolov, Tomas & Bengio, Y.. (2012). On the difficulty of training Recurrent Neural Networks. 30th International Conference on Machine Learning, ICML 2013. ↩︎

Werbos, Paul. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE. 78. 1550  1560. 10.1109/5.58337. ↩︎

Bai, Shaojie & Kolter, J. & Koltun, Vladlen. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. ↩︎ ↩︎

Oreshkin, Boris & Carpov, Dmitri & Chapados, Nicolas & Bengio, Yoshua. (2019). N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. ↩︎

Gutierrez, Kin & Challu, Cristian & Marcjasz, Grzegorz & Weron, Rafał & Dubrawski, Artur. (2022). Neural basis expansion analysis with exogenous variables: Forecasting electricity prices with N-BEATSx. International Journal of Forecasting. 10.1016/j.ijforecast.2022.03.001. ↩︎

Challu, Cristian & Gutierrez, Kin & Oreshkin, Boris & Garza, Federico & Mergenthaler, Max & Dubrawski, Artur. (2022). N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting. ↩︎

https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 ↩︎