Encoder-Decoder Model for Multistep Time Series Forecasting Using PyTorch

How to use encoder-decoder model for multi-step time series forecasting.


Encoder-decoder models have produced state-of-the-art results in sequence-to-sequence NLP tasks such as language translation. Multistep time series forecasting can also be treated as a seq2seq task, for which the encoder-decoder model can be used. This article presents an encoder-decoder model for a time series forecasting task from Kaggle, along with the steps involved in getting a top 10% result.

The solution code can be found in my GitHub repo. The model implementation is inspired by the PyTorch seq2seq translation tutorial, and the time series forecasting ideas came mainly from the winning solution of a similar Kaggle competition.


The dataset is from a past Kaggle competition, the Store Item Demand Forecasting Challenge. Given the past 5 years of sales data (2013 to 2017) for 50 items from 10 different stores, the task is to predict the sales of each item over the next 3 months (01/01/2018 to 31/03/2018). This is a multi-step, multi-site time series forecasting problem.

Kaggle Competition

The features provided are quite minimal: the date, the store id, the item id, and the sales.

There are 500 unique store-item combinations, meaning that we are forecasting 500 time-series.

Sales plot of 10 items chosen at random

Data Preprocessing

Feature Engineering

Deep learning models are good at uncovering features on their own, so feature engineering can be kept to a minimum.

The plot shows that the data has weekly and monthly seasonality and a yearly trend. To capture these, DateTime features are provided to the model. To better capture the yearly trend of each item's sales, the yearly autocorrelation is also provided.
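
As a rough illustration of these two feature groups, the sketch below derives calendar features from a date column and computes a lag-365 autocorrelation for one series. The column names (`date`, `sales`), the helper names, and the choice of a 365-day lag are assumptions made for this example, not details confirmed by the solution code.

```python
import numpy as np
import pandas as pd


def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar features derived from a datetime `date` column."""
    out = df.copy()
    out["day_of_week"] = out["date"].dt.dayofweek
    out["day_of_month"] = out["date"].dt.day
    out["month"] = out["date"].dt.month
    out["year"] = out["date"].dt.year
    return out


def yearly_autocorrelation(sales: pd.Series, lag: int = 365) -> float:
    """Pearson correlation between a series and itself shifted by `lag` days."""
    return sales.autocorr(lag=lag)


# Toy series (hypothetical data): three years of daily sales with a yearly cycle plus noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", periods=3 * 365, freq="D")
cycle = 10 * np.sin(2 * np.pi * np.arange(len(dates)) / 365)
toy = pd.DataFrame({"date": dates, "sales": 50 + cycle + rng.normal(0, 1, len(dates))})

toy = add_datetime_features(toy)
yearly_corr = yearly_autocorrelation(toy["sales"])
print(f"lag-365 autocorrelation: {yearly_corr:.2f}")
```

In a multi-series setting, the autocorrelation would be computed once per store-item series and attached to every row of that series as a static feature.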

Many of these features are cyclical in nature. To provide this information to the model, sine and cosine transformations are applied to the DateTime features. A detailed explanation of why this is beneficial can be found in Encoding cyclical continuous features — 24-hour time.

sine and cosine transformation of month feature
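
The transformation can be sketched as follows. The helper name `encode_cyclical` is hypothetical; the point of the example is that December and January end up as close to each other as January and February, which a raw month number (12 vs. 1) does not express.

```python
import numpy as np


def encode_cyclical(values, period):
    """Map a cyclical feature onto the unit circle as (sin, cos) components."""
    values = np.asarray(values, dtype=np.float64)
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)


months = np.arange(1, 13)  # 1 = January ... 12 = December
month_sin, month_cos = encode_cyclical(months, period=12)


def distance(i, j):
    """Euclidean distance between two months (0-based positions) in the encoded space."""
    return np.hypot(month_sin[i] - month_sin[j], month_cos[i] - month_cos[j])


print(distance(11, 0), distance(0, 1))  # December-January vs. January-February
```

The same function applies to day of week (period 7) or day of month, with the period chosen to match the cycle length.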

So the final set of features is as given below.

Data Scaling

Neural networks train best when all features are on a similar scale, so data scaling is necessary. The values of each time series are normalized independently. Yearly autocorrelation and year are also normalized.
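
A minimal sketch of per-series normalization is shown below. It assumes z-score standardization and columns named `store`, `item`, and `sales`; the article does not state which scaler the solution uses, so treat the choice as illustrative. The per-series mean and standard deviation are kept so that forecasts can be mapped back to the original scale.

```python
import pandas as pd


def normalize_per_series(df, keys=("store", "item"), value_col="sales"):
    """Standardize `value_col` within each series and keep the statistics."""
    out = df.copy()
    grouped = out.groupby(list(keys))[value_col]
    out["mean"] = grouped.transform("mean")
    out["std"] = grouped.transform("std")
    out[value_col + "_scaled"] = (out[value_col] - out["mean"]) / out["std"]
    return out


def denormalize(scaled, mean, std):
    """Invert the scaling, e.g. to convert forecasts back to sales units."""
    return scaled * std + mean


# Two toy series (hypothetical data) on very different scales.
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2],
    "item": [1, 1, 1, 1, 1, 1],
    "sales": [10.0, 12.0, 14.0, 100.0, 130.0, 160.0],
})
scaled = normalize_per_series(toy)
restored = denormalize(scaled["sales_scaled"], scaled["mean"], scaled["std"])
```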

Sequence Building

The encoder-decoder model takes a sequence as input and returns a sequence as output, so the flat dataframe must be converted into sequences.

moving window forecasting

The length of the output sequence is fixed as 90 days, to match our problem requirement. The length of the input sequence must be selected based on the problem complexity, and the computing resources available. For this problem, an input sequence length of 180 (6 months) is chosen. The sequence data is built by applying a sliding window to each time-series in the dataset.
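
The sliding window can be sketched for a single series as below. The function name `build_sequences` and the `stride` parameter are assumptions for illustration; the window lengths match the 180-day input and 90-day output described above.

```python
import numpy as np


def build_sequences(series, input_len=180, output_len=90, stride=1):
    """Slide a window over one series and return (inputs, targets) arrays."""
    series = np.asarray(series)
    total = input_len + output_len
    if len(series) < total:
        raise ValueError("series is shorter than one input + output window")
    xs, ys = [], []
    for start in range(0, len(series) - total + 1, stride):
        xs.append(series[start:start + input_len])           # encoder input
        ys.append(series[start + input_len:start + total])   # forecast target
    return np.stack(xs), np.stack(ys)


# One year of toy daily values (hypothetical data).
series = np.arange(365, dtype=np.float32)
x, y = build_sequences(series)
print(x.shape, y.shape)
```

Each target window starts on the day right after its input window ends, so no future value leaks into the encoder input.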

Dataset and Dataloader

PyTorch provides convenient abstractions, Dataset and DataLoader, to feed data into the model. The Dataset takes the sequence data as input and is responsible for constructing each data point fed to the model. It also handles the processing of the different types of features; this part is explained in detail below.

import torch
from torch.utils.data import Dataset


class StoreItemDataset(Dataset):
    def __init__(self, cat_columns=None, num_columns=None, embed_vector_size=None, decoder_input=True, ohe_cat_columns=False):
        super().__init__()
        self.sequence_data = None
        self.cat_columns = cat_columns if cat_columns is not None else []
        self.num_columns = num_columns if num_columns is not None else []
        self.cat_classes = {}
        self.cat_embed_shape = []
        self.cat_embed_vector_size = embed_vector_size if embed_vector_size is not None else {}
        # Whether __getitem__ also returns the decoder's input features
        self.pass_decoder_input = decoder_input
        self.ohe_cat_columns = ohe_cat_columns
        self.cat_columns_to_decoder = False

    def get_embedding_shape(self):
        return self.cat_embed_shape

    def load_sequence_data(self, processed_data):
        self.sequence_data = processed_data

    def process_cat_columns(self, column_map=None):
        column_map = column_map if column_map is not None else {}
        for col in self.cat_columns:
            self.sequence_data[col] = self.sequence_data[col].astype('category')
            if col in column_map:
                # Reuse the categories of another dataset (e.g. the training set);
                # column_map[col] is expected to contain '#NA#' for unseen values.
                self.sequence_data[col] = self.sequence_data[col].cat.set_categories(column_map[col]).fillna('#NA#')
            else:
                # add_categories returns a new Series (inplace=True was removed in pandas 2.0)
                self.sequence_data[col] = self.sequence_data[col].cat.add_categories('#NA#')
            self.cat_embed_shape.append((len(self.sequence_data[col].cat.categories), self.cat_embed_vector_size.get(col, 50)))

    def __len__(self):
        return len(self.sequence_data)

    def __getitem__(self, idx):
        row = self.sequence_data.iloc[[idx]]
        # x_sequence: (input length, n time-dependent features)
        x_inputs = [torch.tensor(row['x_sequence'].values[0], dtype=torch.float32)]
        y = torch.tensor(row['y_sequence'].values[0], dtype=torch.float32)
        decoder_input = None
        if self.pass_decoder_input:
            # Column 0 of y_sequence is the target; the remaining columns are decoder features
            decoder_input = torch.tensor(row['y_sequence'].values[0][:, 1:], dtype=torch.float32)
        # Static numerical features are repeated along the sequence and appended as columns
        for col in self.num_columns:
            num_tensor = torch.tensor([row[col].values[0]], dtype=torch.float32)
            x_inputs[0] = torch.cat((x_inputs[0], num_tensor.repeat(x_inputs[0].size(0)).unsqueeze(1)), axis=1)
            if decoder_input is not None:
                decoder_input = torch.cat((decoder_input, num_tensor.repeat(decoder_input.size(0)).unsqueeze(1)), axis=1)
        if len(self.cat_columns) > 0:
            if self.ohe_cat_columns:
                # One-hot encode each categorical column and repeat it along the sequence
                for ci, (num_classes, _) in enumerate(self.cat_embed_shape):
                    col_tensor = torch.zeros(num_classes, dtype=torch.float32)
                    col_tensor[row[self.cat_columns[ci]].cat.codes.values[0]] = 1.0
                    col_tensor_x = col_tensor.repeat(x_inputs[0].size(0), 1)
                    x_inputs[0] = torch.cat((x_inputs[0], col_tensor_x), axis=1)
                    if decoder_input is not None and self.cat_columns_to_decoder:
                        col_tensor_y = col_tensor.repeat(decoder_input.size(0), 1)
                        decoder_input = torch.cat((decoder_input, col_tensor_y), axis=1)
            else:
                # Pass integer category codes as a separate input (e.g. for embedding layers)
                cat_tensor = torch.tensor(
                    [row[col].cat.codes.values[0] for col in self.cat_columns],
                    dtype=torch.long
                )
                x_inputs.append(cat_tensor)
        if self.pass_decoder_input:
            # The decoder input is always the last element of the input tuple
            x_inputs.append(decoder_input)
            y = torch.tensor(row['y_sequence'].values[0][:, 0], dtype=torch.float32)
        if len(x_inputs) > 1:
            return tuple(x_inputs), y
        return x_inputs[0], y

The data points from the Dataset are batched together and fed to the model using the DataLoader.
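
To show what batching looks like in practice, here is a small self-contained sketch. `ToySequenceDataset` is a hypothetical stand-in that returns random tensors with the same structure as the dataset above (an encoder input, a decoder input, and a target); the feature counts and batch size are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToySequenceDataset(Dataset):
    """Random stand-in data shaped like ((encoder input, decoder input), target)."""

    def __init__(self, n_samples=10, input_len=180, output_len=90, n_features=3):
        self.x = torch.randn(n_samples, input_len, n_features)
        self.decoder_input = torch.randn(n_samples, output_len, n_features - 1)
        self.y = torch.randn(n_samples, output_len)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return (self.x[idx], self.decoder_input[idx]), self.y[idx]


loader = DataLoader(ToySequenceDataset(), batch_size=4, shuffle=True)

# The default collate function stacks each element of the nested structure
# along a new leading batch dimension.
(xb, decoder_xb), yb = next(iter(loader))
print(xb.shape, decoder_xb.shape, yb.shape)
```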

Model Architecture

An encoder-decoder model is a form of recurrent neural network (RNN) used to solve sequence-to-sequence problems. The encoder-decoder model can be intuitively understood as follows.

The encoder-decoder model consists of two networks, the encoder and the decoder. The encoder network learns (encodes) a representation of the input sequence that captures its characteristics or context, and outputs a vector. This vector is known as the context vector. The decoder network receives the context vector and learns to read and extract (decode) the output sequence from it.

In both the encoder and the decoder, the task of encoding and decoding the sequence is handled by a series of recurrent cells. The recurrent cell used in the solution is a Gated Recurrent Unit (GRU), which mitigates the short-term memory problem of plain RNNs. More information on this can be found in Illustrated Guide to LSTM’s and GRU’s.

The detailed architecture of the model used in the solution is given below.

Encoder decoder architecture


The input to the encoder network has the shape (sequence length, n_values), so each item in the sequence is made of n values. In constructing these values, different types of features are treated differently.

Time-dependent features: features that vary with time, such as sales and the DateTime features. In the encoder, each sequential time-dependent value is fed into an RNN cell.

Numerical features: static features that do not vary with time, such as the yearly autocorrelation of the series. These features are repeated across the length of the sequence and fed into the RNN. The repeating and merging of the values is handled in the Dataset.

Categorical features: features such as store id and item id can be handled in multiple ways; the implementation of each method can be found in encoders.py. For the final model, the categorical variables were one-hot encoded, repeated across the sequence, and fed into the RNN. This is also handled in the Dataset.
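
The repeat-and-merge step for static features can be sketched in isolation as follows. The function `append_static_features` and the example sizes (5 time-dependent features, 10 stores, 50 items) are assumptions chosen for illustration.

```python
import torch


def append_static_features(x_seq, numeric_values, cat_codes, cat_sizes):
    """Repeat static features along the sequence and append them as extra columns.

    x_seq: (sequence length, n time-dependent features)
    numeric_values: list of static numeric values
    cat_codes / cat_sizes: integer code and number of classes per categorical column
    """
    seq_len = x_seq.size(0)
    parts = [x_seq]
    if numeric_values:
        num = torch.tensor(numeric_values, dtype=torch.float32)
        parts.append(num.repeat(seq_len, 1))       # (seq_len, n numeric)
    for code, n_classes in zip(cat_codes, cat_sizes):
        one_hot = torch.zeros(n_classes, dtype=torch.float32)
        one_hot[code] = 1.0
        parts.append(one_hot.repeat(seq_len, 1))   # (seq_len, n_classes)
    return torch.cat(parts, dim=1)


x_seq = torch.randn(180, 5)
merged = append_static_features(x_seq, numeric_values=[0.8], cat_codes=[2, 7], cat_sizes=[10, 50])
print(merged.shape)  # 5 time-dependent + 1 numeric + 10 + 50 one-hot columns
```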

The input sequence with these features is fed into the recurrent network — GRU. The code of the encoder network used is given below.

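import torch
from torch import nn


class RNNEncoder(nn.Module):
    def __init__(self, rnn_num_layers=1, input_feature_len=1, sequence_len=168, hidden_size=100, bidirectional=False, device='cpu', rnn_dropout=0.2):
        super().__init__()
        self.sequence_len = sequence_len
        self.hidden_size = hidden_size
        self.input_feature_len = input_feature_len
        self.num_layers = rnn_num_layers
        self.rnn_directions = 2 if bidirectional else 1
        # The GRU arguments are reconstructed from the constructor parameters;
        # batch_first=True matches the (batch, sequence, features) indexing in forward().
        self.gru = nn.GRU(
            num_layers=rnn_num_layers,
            input_size=input_feature_len,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=bidirectional,
            dropout=rnn_dropout if rnn_num_layers > 1 else 0.0  # dropout only applies between layers
        )
        self.device = device

    def forward(self, input_seq):
        # Initial hidden state: (layers * directions, batch, hidden)
        ht = torch.zeros(self.num_layers * self.rnn_directions, input_seq.size(0), self.hidden_size, device=self.device)
        if input_seq.ndim < 3:
            # A univariate sequence (batch, sequence) gets a trailing feature dimension
            input_seq = input_seq.unsqueeze(2)
        gru_out, hidden = self.gru(input_seq, ht)
        if self.rnn_directions * self.num_layers > 1:
            if self.rnn_directions > 1:
                # Sum the forward and backward outputs at every time step
                gru_out = gru_out.view(input_seq.size(0), input_seq.size(1), self.rnn_directions, self.hidden_size)
                gru_out = torch.sum(gru_out, axis=2)
            hidden = hidden.view(self.num_layers, self.rnn_directions, input_seq.size(0), self.hidden_size)
            # Keep the last layer and sum over directions: (batch, hidden)
            hidden = hidden[-1].sum(axis=0)
        else:
            hidden = hidden.squeeze(0)
        return gru_out, hidden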
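
The reshaping in the bidirectional branch relies on how `nn.GRU` lays out its outputs. The standalone sketch below, with arbitrary sizes, checks that layout: the forward and backward directions are concatenated on the last dimension of the output, and the final hidden state can be viewed as (layers, directions, batch, hidden).

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, n_features, hidden_size, num_layers = 4, 180, 6, 100, 2

gru = nn.GRU(
    input_size=n_features,
    hidden_size=hidden_size,
    num_layers=num_layers,
    batch_first=True,
    bidirectional=True,
)
x = torch.randn(batch, seq_len, n_features)
gru_out, hidden = gru(x)
# gru_out: (batch, seq_len, 2 * hidden_size); hidden: (num_layers * 2, batch, hidden_size)

# Split the two directions and sum them, as the encoder does.
summed_out = gru_out.view(batch, seq_len, 2, hidden_size).sum(dim=2)

# Last layer's hidden state for both directions, summed into one context vector.
last_layer = hidden.view(num_layers, 2, batch, hidden_size)[-1]
context = last_layer.sum(dim=0)

print(summed_out.shape, context.shape)
```

The forward direction's final state equals the output at the last time step, while the backward direction's final state equals the output at the first time step, which is why the hidden state rather than `gru_out[:, -1]` is used as the context.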


The decoder receives the context vector from the encoder. In addition, the inputs to the decoder are the future DateTime features and lag features. The lag feature used in the model was the previous year's value. The intuition behind lag features is that, since the input sequence is limited to 180 days, providing important data points from beyond this timeframe helps the model.
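
A previous-year lag feature can be sketched with a grouped shift, as below. This assumes one row per day for every series with no missing dates and columns named `store`, `item`, `date`, and `sales`; the 365-day offset is an assumption for "previous year's value".

```python
import numpy as np
import pandas as pd


def add_yearly_lag(df, keys=("store", "item"), value_col="sales", lag_days=365):
    """Add the value observed `lag_days` rows earlier within each series."""
    out = df.sort_values(list(keys) + ["date"]).copy()
    out[f"{value_col}_lag_{lag_days}"] = out.groupby(list(keys))[value_col].shift(lag_days)
    return out


# Two years of toy daily data for one store-item pair (hypothetical data).
dates = pd.date_range("2013-01-01", periods=730, freq="D")
toy = pd.DataFrame({
    "store": 1,
    "item": 1,
    "date": dates,
    "sales": np.arange(730, dtype=np.float64),
})
lagged = add_yearly_lag(toy)
```

The first 365 rows of each series have no lag value, which is one reason a model using this feature cannot train on the first year of data.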

Unlike the encoder, in which a recurrent network (GRU) is used directly, the decoder is built by looping through a decoder cell. This is because the forecast obtained from each decoder cell is passed as an input to the next decoder cell. Each decoder cell is made of a GRUCell whose output is fed into a fully connected layer that provides the forecast. The forecasts from the decoder cells are combined to form the output sequence.

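class DecoderCell(nn.Module):
    def __init__(self, input_feature_len, hidden_size, dropout=0.2):
        super().__init__()
        # GRUCell arguments reconstructed from the constructor parameters
        self.decoder_rnn_cell = nn.GRUCell(
            input_size=input_feature_len,
            hidden_size=hidden_size,
        )
        # Maps the hidden state to a single forecast value
        self.out = nn.Linear(hidden_size, 1)
        self.attention = False
        self.dropout = nn.Dropout(dropout)

    def forward(self, prev_hidden, y):
        # y: (batch, input_feature_len), prev_hidden: (batch, hidden_size)
        rnn_hidden = self.decoder_rnn_cell(y, prev_hidden)
        output = self.out(rnn_hidden)
        # Dropout is applied to the hidden state passed on to the next step
        return output, self.dropout(rnn_hidden)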
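
The looping described above can be illustrated with a stripped-down version of the decoding loop. All sizes, and the random tensors standing in for the encoder's context vector and the future features, are assumptions made for this sketch; it omits dropout and teacher forcing.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, hidden_size, n_future_features, horizon = 4, 16, 3, 5

# One recurrent cell plus a linear head, reused at every forecast step.
cell = nn.GRUCell(input_size=1 + n_future_features, hidden_size=hidden_size)
head = nn.Linear(hidden_size, 1)

context = torch.randn(batch, hidden_size)                    # stand-in for the encoder's hidden state
future_features = torch.randn(batch, horizon, n_future_features)
y_prev = torch.randn(batch, 1)                               # last observed (scaled) value

forecasts = []
hidden = context
for step in range(horizon):
    # Each step sees the previous forecast and the known features of the day being forecast.
    step_input = torch.cat((y_prev, future_features[:, step]), dim=1)
    hidden = cell(step_input, hidden)
    y_prev = head(hidden)
    forecasts.append(y_prev)

forecast = torch.cat(forecasts, dim=1)  # (batch, horizon)
print(forecast.shape)
```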

Encoder-Decoder Model

The Encoder-decoder model is built by wrapping the encoder and decoder cell into a Module that handles the communication between the two.

class EncoderDecoderWrapper(nn.Module):
    def __init__(self, encoder, decoder_cell, output_size=3, teacher_forcing=0.3, sequence_len=336, decoder_input=True, device='cpu'):
        super().__init__()
        self.encoder = encoder
        self.decoder_cell = decoder_cell
        self.output_size = output_size
        self.teacher_forcing = teacher_forcing
        self.sequence_length = sequence_len
        self.decoder_input = decoder_input
        self.device = device

    def forward(self, xb, yb=None):
        decoder_input = None
        if self.decoder_input:
            # The Dataset puts the decoder features last in the input tuple
            decoder_input = xb[-1]
            input_seq = xb[0]
            if len(xb) > 2:
                encoder_output, encoder_hidden = self.encoder(input_seq, *xb[1:-1])
            else:
                encoder_output, encoder_hidden = self.encoder(input_seq)
        else:
            if isinstance(xb, (list, tuple)) and len(xb) > 1:
                input_seq = xb[0]
                encoder_output, encoder_hidden = self.encoder(*xb)
            else:
                input_seq = xb
                encoder_output, encoder_hidden = self.encoder(input_seq)
        prev_hidden = encoder_hidden
        outputs = torch.zeros(input_seq.size(0), self.output_size, device=self.device)
        # The first decoder step starts from the last observed target value
        y_prev = input_seq[:, -1, 0].unsqueeze(1)
        for i in range(self.output_size):
            # Teacher forcing: occasionally feed the true previous value instead of the forecast.
            # The original listing indexed yb[:, i], which is the value this step predicts;
            # yb[:, i - 1] avoids leaking the target.
            if (yb is not None) and (i > 0) and (torch.rand(1) < self.teacher_forcing):
                y_prev = yb[:, i - 1].unsqueeze(1)
            if decoder_input is not None:
                step_decoder_input = torch.cat((y_prev, decoder_input[:, i]), axis=1)
            else:
                step_decoder_input = y_prev
            rnn_output, prev_hidden = self.decoder_cell(prev_hidden, step_decoder_input)
            y_prev = rnn_output
            outputs[:, i] = rnn_output.squeeze(1)
        return outputs

Model Training

The performance of the model highly depends on the training decisions taken around optimization, learning rate schedule, etc. I’ll briefly cover each of them.

  1. Validation Strategy: A cross-sectional train-validation-test split does not work since the data is time dependent. A time-dependent train-validation-test split poses a different problem: the model is not trained on the most recent data (the validation period), which hurts its performance on the test data.
    To combat this, a model is trained on 3 years of past data, from 2014 to 2016, and predicts the first 3 months of 2017; this model is used for validation and experimentation. The final model is trained on data from 2014 to 2017 and predicts the first 3 months of 2018. The final model is trained in blind mode without validation, based on learnings from training the validation model.
  2. Optimizer: The optimizer used is AdamW, which has provided state-of-the-art results in many learning tasks. A more detailed analysis of AdamW can be found in Fastai. Another optimizer explored is COCOBOptimizer, which does not require a learning rate to be set explicitly. On training with COCOBOptimizer, I observed that it converged faster than AdamW, especially in the initial iterations. But the best result was obtained using AdamW with one-cycle learning.
  3. Learning Rate Scheduling: A 1cycle learning rate scheduler was used. The maximum learning rate in the cycle was determined using the learning rate finder for cyclic learning. The implementation of the learning rate finder used is from the library pytorch-lr-finder.
  4. The loss function used was mean squared error (MSE) loss, which is different from the competition metric, SMAPE. MSE loss provided more stable convergence than SMAPE.
  5. Separate optimizer and scheduler pairs were used for the encoder and decoder networks, which improved the result.
  6. In addition to weight decay, dropout was used in both the encoder and the decoder to combat overfitting.
  7. A wrapper was built to handle the training process with the capability to handle multiple optimizers and schedulers, checkpointing, and Tensorboard integration. The code for this can be found in trainer.py.
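
Several of the points above (AdamW, the 1cycle schedule, separate optimizer and scheduler pairs, MSE for training and SMAPE for evaluation) can be sketched together as follows. This is not the code from trainer.py: the tiny stand-in networks, the random data, the learning rates, and the weight decay are all hypothetical values chosen so the example runs quickly.

```python
import torch
from torch import nn


def smape(forecast, actual, eps=1e-8):
    """Symmetric mean absolute percentage error, in percent."""
    denom = (forecast.abs() + actual.abs()) / 2 + eps
    return 100 * torch.mean((forecast - actual).abs() / denom)


# Stand-in networks; a real setup would use the encoder and decoder cell above.
encoder = nn.GRU(input_size=6, hidden_size=32, batch_first=True)
decoder = nn.Linear(32, 90)

epochs, steps_per_epoch = 2, 5
total_steps = epochs * steps_per_epoch

# One AdamW optimizer and one 1cycle scheduler per network.
optimizers = {
    "encoder": torch.optim.AdamW(encoder.parameters(), lr=1e-3, weight_decay=1e-2),
    "decoder": torch.optim.AdamW(decoder.parameters(), lr=1e-3, weight_decay=1e-2),
}
schedulers = {
    name: torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=3e-3, total_steps=total_steps)
    for name, opt in optimizers.items()
}
loss_fn = nn.MSELoss()  # training loss; SMAPE is only used for evaluation here

for _ in range(total_steps):
    x, y = torch.randn(8, 180, 6), torch.randn(8, 90)  # random stand-in batch
    _, hidden = encoder(x)
    pred = decoder(hidden[-1])
    loss = loss_fn(pred, y)
    for opt in optimizers.values():
        opt.zero_grad()
    loss.backward()
    for name, opt in optimizers.items():
        opt.step()
        schedulers[name].step()  # the 1cycle schedule advances once per batch

with torch.no_grad():
    score = smape(pred, y)
```

Keeping the pairs in dictionaries makes it easy to give the encoder and decoder different maximum learning rates, for example ones found by separate learning rate finder runs.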


The following plot shows the forecast made by the model for the first 3 months of 2018, for a single item from a store.

The model can be better evaluated by plotting the mean sales of all items, and the mean forecast to remove the noise. The following plot is from the forecast of the validation model for a particular date, therefore the forecast can be compared with the actual sales data.

The result from the encoder-decoder model would have provided a top 10% rank in the competition’s leaderboard.

I did minimal hyperparameter tuning for achieving this result, so there is more scope for improvement. Further improvements to the model can also be made by exploring attention mechanisms, to further boost the memory of the model.

Thanks for reading, let me know your thoughts. Have a good day! 😄

References


NLP From scratch: Translation with a sequence to sequence network and attention

Web traffic time series forecasting solution

Encoding cyclical continuous features — 24-hour time

Illustrated Guide to LSTM’s and GRU’s

AdamW and Super-convergence is now the fastest way to train neural nets

Training Deep Networks without Learning Rates Through Coin Betting

Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates